
Quickstart

Welcome to the simplest use case of hover!

😎 Let's say we want to label some data and call it a day.

Ingredient 1 / 3: Some Data

Suppose that we have a list of data entries, each in the form of a dictionary.

We can first create a SupervisableDataset based on those entries:

from hover.core.dataset import SupervisableTextDataset
import pandas as pd

df_raw = pd.read_csv("https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv")

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"].head(5)
FAQ
What if I have multiple features?

feature_key refers to the field that will be vectorized later on; it can be a JSON string that encloses multiple features.

For example, suppose our data entries look like this:

{"f1": "foo", "f2": "bar", "non_feature": "abc"}

We can put f1 and f2 in a JSON and convert the entries like this:

# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
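
As a rough sketch of that conversion (assuming the entries live in a pandas DataFrame; the column names here are only for illustration):

import json
import pandas as pd

# hypothetical entries with two feature fields and one non-feature field
df = pd.DataFrame([{"f1": "foo", "f2": "bar", "non_feature": "abc"}])

# pack f1 and f2 into a single JSON string column to use as feature_key
df["feature"] = df.apply(
    lambda row: json.dumps({"f1": row["f1"], "f2": row["f2"]}),
    axis=1,
)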

Can I use audio or image data?

In the not-too-far future, yes!

Some mechanisms can get tricky with audio/image data, but we are working on it.

Ingredient 2 / 3: Vectorizer

To put our dataset sensibly on a 2-D "map", we will use a vectorizer for feature extraction, and then perform dimensionality reduction.

Here's one of many ways to define a vectorizer:

import spacy
import re

nlp = spacy.load("en_core_web_md")

def vectorizer(text):
    # collapse runs of whitespace into single spaces
    clean_text = re.sub(r"[\s]+", r" ", text)
    # skip pipeline components for speed; use spaCy's document vector
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")
Tips
Caching

dataset by itself stores the original features but not the corresponding vectors.

To avoid vectorizing the same feature again and again, we could simply do:

from functools import cache

@cache
def vectorizer(feature):
    # same logic as before, now cached per unique feature string
    clean_text = re.sub(r"[\s]+", r" ", feature)
    return nlp(clean_text, disable=nlp.pipe_names).vector

If you'd like to limit the size of the cache, something like @lru_cache(maxsize=10000) could help.

Check out functools for more options.
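
For instance, a bounded version of the same cached vectorizer (a sketch; the maxsize value is arbitrary) would look like:

from functools import lru_cache

@lru_cache(maxsize=10000)
def vectorizer(feature):
    clean_text = re.sub(r"[\s]+", r" ", feature)
    return nlp(clean_text, disable=nlp.pipe_names).vector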

Vectorizing multiple features

Suppose we have multiple features enclosed in a JSON:

# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}

Also, suppose we have individual vectorizers like this:

def vectorizer_1(feature_1):
    # put code here; should return a 1-D numpy array
    ...

def vectorizer_2(feature_2):
    # put code here; should return a 1-D numpy array
    ...

Then we can define a composite vectorizer:

import json
import numpy as np

def vectorizer(feature_json):
    data_dict = json.loads(feature_json)
    vectors = []
    for field, func in [
        ("f1", vectorizer_1),
        ("f2", vectorizer_2),
    ]:
        vectors.append(func(data_dict[field]))

    return np.concatenate(vectors)

Ingredient 3 / 3: Reduction

The dataset has built-in high-level support for dimensionality reduction.

Currently we can use umap or ivis.

Optional dependencies

The corresponding libraries do not ship with hover by default, and may need to be installed:

  • for umap: pip install umap-learn
  • for ivis: pip install ivis[cpu] or pip install ivis[gpu]

umap-learn is installed in this demo environment.

# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
dataset.compute_2d_embedding(vectorizer, "umap")

# this adds 'x' and 'y' columns to the DataFrames in dataset.dfs
# one could alternatively pre-compute these columns using any approach
dataset.dfs["raw"].head(5)

✨ Apply Labels

Now we are ready to visualize and annotate!

Known issue

If you are running this code block on this documentation page:

  • JavaScript output (which contains the visualization) will fail to render due to JupyterLab's security restrictions.
  • please run this tutorial locally to view the output.
Advanced: help wanted

Some context:

  • the code blocks here are embedded using Juniper.
  • the environment is configured in the Binder repo.

What we've tried:

  • 1 Bokeh's extension with JupyterLab
    • 1.1 cannot render the Bokeh plots remotely with show(handle), with or without the extension
      • 1.1.1 JavaScript console suggests that bokeh.main.js would fail to load.
  • 2 JavaScript magic cell
    • 2.1 such magic is functional in a custom notebook on the Jupyter server.
    • 2.2 such magic is blocked by JupyterLab if run on the documentation page.

Tentative clues:

  • 2.1 & 2.2 suggests that somehow JupyterLab behaves differently between Binder itself and Juniper.
  • Juniper by default trusts the cells.
  • making Javascript magic work on this documentation page would be a great step.
from hover.recipes import simple_annotator
from bokeh.io import show, output_notebook

# 'handle' is a function that renders elements in bokeh documents
handle = simple_annotator(dataset)

output_notebook()
show(handle, notebook_url='http://localhost:8888')
Tips: annotation interface basics
Text guide

There should be a SupervisableDataset board on the left and a BokehDataAnnotator on the right, each with a few buttons.

  • push: push Dataset updates to the bokeh plots.
  • commit: add data entries selected in the Annotator to a specified subset.
  • dedup: deduplicate across subsets by feature (last in gets kept).
  • raw/train/dev/test: choose which subsets to display or hide.
  • apply: apply the label input to the selected points in the raw subset only.
  • export: save your data (all subsets) in a specified format.

We've essentially put the data into neighborhoods based on the vectorizer, but the quality (homogeneity of labels) of such neighborhoods can vary.

  • hover over any data point to see its tooltip.
  • take advantage of different selection tools to apply labels at appropriate scales.
  • the search widget might turn out to be useful.
    • note that it does not select points but highlights them.