Quickstart

Welcome to the basic use case of hover!

😎 Let's say we want to label some data and call it a day.

Running Python right here

Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.

Behind the scenes is a Binder-hosted Python environment.

To download a notebook file instead, visit here.

Dependencies for local environments

When you run the code locally, you may need to install additional packages.

To run the text embedding code on this page, you need:

```shell
pip install spacy
python -m spacy download en_core_web_md
```

To render bokeh plots in Jupyter, you need:

```shell
pip install jupyter_bokeh
```

If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```

Ingredient 1 / 3: Raw Data

Start with a spreadsheet loaded in pandas.

We turn it into a SupervisableDataset designed for labeling:

```python
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

example_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
# for a fast, low-memory demonstration, sample the data
df_raw = pd.read_csv(example_csv_path).sample(1000)
df_raw["text"] = df_raw["text"].astype(str)

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data is available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"].head(5)
```
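This example keeps everything in "raw", but if you already have labeled examples, you can route them to the other subsets before constructing the dataset. A minimal sketch, assuming a hypothetical boolean column is_labeled in your own data:

```python
# hypothetical: send labeled rows to "train", leave the rest in "raw"
df_raw["SUBSET"] = df_raw["is_labeled"].map(lambda flag: "train" if flag else "raw")
```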
FAQ
What if I have multiple features?

feature_key refers to the field that will be vectorized later on, which can be a JSON string enclosing multiple features.

For example, suppose our data entries look like this:

{"f1": "foo", "f2": "bar", "non_feature": "abc"}

We can put f1 and f2 in a JSON and convert the entries like this:

```python
# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```
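A minimal sketch of this conversion in pandas, assuming hypothetical columns f1 and f2:

```python
import json

# hypothetical columns "f1" and "f2"; json.dumps gives a consistent string encoding
df_raw["feature"] = df_raw.apply(
    lambda row: json.dumps({"f1": row["f1"], "f2": row["f2"]}),
    axis=1,
)
# then construct the dataset with feature_key="feature"
```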

Can I use audio or image data?

Yes! Please check out the "Guides" section of the documentation.
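Whatever the data type, the contract is the same as for text: a function mapping one feature (such as a file path or URL) to a fixed-length numpy array. A rough sketch for audio, assuming local files and using librosa, which is not a hover dependency:

```python
import librosa
from functools import lru_cache

@lru_cache(maxsize=int(1e+4))
def audio_vectorizer(path):
    # load the waveform at a fixed sample rate
    waveform, sample_rate = librosa.load(path, sr=16000)
    # average the MFCC frames into a single fixed-length vector
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=20)
    return mfcc.mean(axis=1)
```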

Ingredient 2 / 3: Embedding

A pre-trained embedding lets us group data points semantically.

In particular, let's define a data -> embedding vector function.

```python
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")
```
Tips
Caching

dataset by itself stores the original features but not the corresponding vectors.

To avoid vectorizing the same feature again and again, we could simply do:

```python
from functools import cache

@cache
def vectorizer(feature):
    # put code here
    ...
```

If you'd like to limit the size of the cache, something like @lru_cache(maxsize=10000) could help.

Check out functools for more options.
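Functions wrapped in lru_cache (or cache) also expose hit/miss statistics, which is handy for verifying that caching actually kicks in:

```python
# after vectorizing the dataset, inspect the cache statistics
print(vectorizer.cache_info())
# e.g. CacheInfo(hits=..., misses=..., maxsize=10000, currsize=...)
```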

Vectorizing multiple features

Suppose we have multiple features enclosed in a JSON:

```python
# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

Also, suppose we have individual vectorizers like this:

```python
def vectorizer_1(feature_1):
    # put code here
    ...

def vectorizer_2(feature_2):
    # put code here
    ...
```

Then we can define a composite vectorizer:

```python
import json
import numpy as np

def vectorizer(feature_json):
    data_dict = json.loads(feature_json)
    vectors = []
    for field, func in [
        ("f1", vectorizer_1),
        ("f2", vectorizer_2),
    ]:
        vectors.append(func(data_dict[field]))

    return np.concatenate(vectors)
```
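A quick usage check with hypothetical stand-in vectorizers that just return fixed-size arrays:

```python
# stand-ins for real vectorizers, for illustration only
vectorizer_1 = lambda feature: np.ones(3)
vectorizer_2 = lambda feature: np.zeros(2)

vec = vectorizer('{"f1": "foo", "f2": "bar"}')
print(vec.shape)  # (5,) -- the two sub-vectors concatenated
```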

Ingredient 3 / 3: 2D Embedding

We compute a 2D version of the pre-trained embedding to visualize the whole dataset.

Hover has built-in methods for calling umap or ivis.

Dependencies (when in your own environment)

The libraries for this step are not directly required by hover:

  • for umap: pip install umap-learn
  • for ivis: pip install ivis[cpu] or pip install ivis[gpu]

umap-learn is installed in this demo environment.

```python
# any kwargs will be passed on to the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

# what we did adds 'embed_2d_0' and 'embed_2d_1' columns to the DataFrames in dataset.dfs
dataset.dfs["raw"].head(5)
```
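Because extra keyword arguments are forwarded to the reducer, you can tune the projection. A sketch assuming umap-learn's standard parameters:

```python
# n_neighbors and min_dist are standard umap-learn parameters
reducer = dataset.compute_nd_embedding(
    vectorizer,
    "umap",
    dimension=2,
    n_neighbors=15,
    min_dist=0.1,
)
```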

✨ Apply Labels

We are ready for the annotation interface!

```python
from hover.recipes.stable import simple_annotator

interactive_plot = simple_annotator(dataset)

# ---------- SERVER MODE: for the documentation page ----------
# because this tutorial is remotely hosted, explicit serving is needed to expose the plot to you
from local_lib.binder_helper import binder_proxy_app_url
from bokeh.server.server import Server
server = Server({'/my-app': interactive_plot}, port=5007, allow_websocket_origin=['*'], use_xheaders=True)
server.start()
# visit the URL printed in the cell output to see the interactive plot;
# locally you would just open "http://localhost:5007/my-app"
binder_proxy_app_url('my-app', port=5007)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ----------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='http://localhost:8888')
```
Tips: annotation interface basics
Text guide

There should be a SupervisableDataset board on the left and a BokehDataAnnotator on the right, each with a few buttons.

  • push: push Dataset updates to the bokeh plots.
  • commit: add data entries selected in the Annotator to a specified subset.
  • dedup: deduplicate across subsets by feature (last in gets kept).
  • export: save your data (all subsets) in a specified format.
  • raw/train/dev/test: choose which subsets to display or hide.
  • apply: apply the label input to the selected points in the raw subset only.

We've essentially put the data into neighborhoods based on the vectorizer, but the quality (homogeneity of labels) of such neighborhoods can vary.

  • hover over any data point to see its tooltip.
  • take advantage of different selection tools to apply labels at appropriate scales.
  • the search widget may prove useful.
    • note that it does not select points but highlights them.