Using Recipes

The most common way to use hover is through built-in recipes, as in the quickstart.

Let's explore another recipe -- an active learning example.

Running Python right here

Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.

Behind the scenes is a Binder-hosted Python environment.

Dependencies for local environments

When you run the code locally, you may need to install additional packages.

To run the text embedding code on this page, you need:

pip install spacy
python -m spacy download en_core_web_md

To render bokeh plots in Jupyter, you need:

pip install jupyter_bokeh

If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```
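Before running the examples locally, it can help to check which of the optional packages are already importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this illustration, and the module names are assumed to match the packages installed by the commands above.

```python
from importlib.util import find_spec

# Top-level module names assumed to correspond to the packages above.
OPTIONAL_MODULES = ["spacy", "en_core_web_md", "jupyter_bokeh"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(OPTIONAL_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional dependencies are importable.")
```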

Fundamentals

Hover recipes are functions that take a SupervisableDataset and return an annotation interface.

The SupervisableDataset is assumed to have some data and embeddings.
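To make the "dataset in, interface out" idea concrete, here is a toy sketch that does not use hover at all. The dataset is a plain dict of subsets, the column names `text`, `x`, and `y` are hypothetical, and the returned dict merely stands in for a real annotation interface.

```python
def toy_recipe(dataset, **kwargs):
    """Hypothetical recipe: validate the dataset, then describe an interface.

    `dataset` maps subset names to lists of row dicts. Each row is expected
    to carry the raw data ("text") and a 2D embedding ("x", "y").
    """
    for subset, rows in dataset.items():
        for row in rows:
            if not {"text", "x", "y"} <= row.keys():
                raise ValueError(
                    f"subset {subset!r} has a row without data or a 2D embedding"
                )
    # A real recipe would build plots here; this sketch only describes them.
    return {
        "plots": ["annotator"],
        "subsets": sorted(dataset),
        "figure_kwargs": kwargs,
    }


toy_dataset = {
    "raw": [{"text": "hello", "x": 0.1, "y": 0.2}],
    "train": [{"text": "world", "x": 0.3, "y": 0.4}],
}
print(toy_recipe(toy_dataset, width=400))
```

The point of the check is that a recipe cannot draw anything until every row has both its data and its embedding.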

Recap: Data & Embeddings

Let's prepare a dataset with embeddings. This is almost the same as in the quickstart:

from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for fast, low-memory demonstration purpose, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")

import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")

# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)
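The `lru_cache` decorator on the vectorizer means that repeated texts are embedded only once. The sketch below illustrates that behavior with a toy stand-in for the spaCy model (letter counts instead of real word vectors); the names `toy_vectorizer` and `calls` are made up for this example.

```python
from functools import lru_cache

# Counts how many times the function body actually runs.
calls = {"count": 0}


@lru_cache(maxsize=int(1e4))
def toy_vectorizer(text):
    """Stand-in embedding: counts of the letters a-z in the text."""
    calls["count"] += 1
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    # Return an immutable tuple so the cached value cannot be modified.
    return tuple(counts)


toy_vectorizer("hello world")
toy_vectorizer("hello world")  # served from the cache, body does not run
toy_vectorizer("goodbye")

print(calls["count"])
print(toy_vectorizer.cache_info())
```

Caching works here because strings are hashable. Note that a cached function returns the same object on every hit, so callers should avoid mutating the result in place.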

Recipe-Specific Ingredient

Each recipe has its own functionality and potentially a different signature.

To utilize active learning, we need to specify how to get a model in the loop.

hover treats the vectorizer as a "frozen" embedding and follows it with a neural network, which infers its own dimensionality from the vectorizer and the output classes.

This architecture, named VectorNet, is the default basis of active learning in hover.
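As a rough illustration of how a model can infer its own dimensionality, the sketch below probes a vectorizer for its output width and sizes a linear softmax layer from the number of classes. It is a toy in pure Python and not hover's implementation; `ToyVectorClassifier` and `toy_vectorizer` are hypothetical names, and it accepts a single string only.

```python
import math
import random


def toy_vectorizer(text):
    """Frozen embedding (illustrative only): length, vowel count, space count."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels), float(text.count(" "))]


class ToyVectorClassifier:
    """A frozen vectorizer followed by one trainable linear layer."""

    def __init__(self, vectorizer, classes, seed=0):
        self.vectorizer = vectorizer
        self.classes = list(classes)
        # Input width: embed a dummy string and measure the vector.
        self.in_dim = len(vectorizer("sample"))
        # Output width: one logit per class.
        self.out_dim = len(self.classes)
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
            for _ in range(self.out_dim)
        ]

    def predict_proba(self, text):
        """text -> vector -> class probabilities (empty if no classes yet)."""
        if not self.classes:
            return []
        vec = self.vectorizer(text)
        logits = [sum(w * v for w, v in zip(row, vec)) for row in self.weights]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = ToyVectorClassifier(toy_vectorizer, ["sports", "science"])
print(model.in_dim, model.out_dim)
print(model.predict_proba("hello there"))
```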

Custom models

It is possible to use a model other than VectorNet or its subclass.

You will need to implement the required methods with the same signatures as VectorNet.

from hover.core.neural import VectorNet
from hover.utils.common_nn import LogisticRegression

# Create a model with vectorizer-NN architecture.
# model.pt will point to a PyTorch state dict (to be created)
# the label classes in the dataset can change, and vecnet can adjust to that
vecnet = VectorNet(vectorizer, LogisticRegression, "model.pt", dataset.classes)

# predict_proba accepts individual strings or list
# text -> vector -> class probabilities
# if no classes right now, will see an empty list
print(vecnet.predict_proba(text))
print(vecnet.predict_proba([text]))

Note how vecnet is built from dataset.classes, which means the model architecture will adapt when we add classes during annotation.
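One way to picture "adapting to new classes" is resizing the output layer while keeping what was learned for existing classes. The function below is a sketch under that assumption; hover may handle the adjustment differently (for example by rebuilding the network), and `adjust_output_layer` is a hypothetical helper.

```python
def adjust_output_layer(weights, old_classes, new_classes, in_dim):
    """Return one weight row per class in `new_classes`.

    Rows for classes that already existed are copied over; rows for
    newly added classes start at zero.
    """
    row_by_class = dict(zip(old_classes, weights))
    return [
        list(row_by_class.get(label, [0.0] * in_dim))
        for label in new_classes
    ]


old_classes = ["sports", "science"]
old_weights = [[0.5, -0.2], [0.1, 0.3]]

# A new label appears during annotation.
new_classes = ["sports", "science", "politics"]
new_weights = adjust_output_layer(old_weights, old_classes, new_classes, in_dim=2)
print(new_weights)
```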

Apply Labels

Now we invoke the active_learning recipe.

Tips: how recipes work programmatically

In general, a recipe is a function taking a SupervisableDataset and other arguments based on its functionality.

Here are a few common recipes:

active_learning, simple_annotator, and linked_annotator.

active_learning: Display the dataset for annotation, putting a classification model in the loop.

Currently works most smoothly with VectorNet.

| Param      | Type                  | Description                    |
| :--------- | :-------------------- | :----------------------------- |
| `dataset`  | `SupervisableDataset` | the dataset to link to         |
| `vecnet`   | `VectorNet`           | model to use in the loop       |
| `**kwargs` |                       | forwarded to each Bokeh figure |

Expected visual layout:

| SupervisableDataset | BokehSoftLabelExplorer    | BokehDataAnnotator | BokehDataFinder   |
| :------------------ | :------------------------ | :----------------- | :---------------- |
| manage data subsets | inspect model predictions | make annotations   | search and filter |

Source code in hover/recipes/experimental.py

@servable(title="Active Learning")
def active_learning(dataset, vecnet, **kwargs):
    """
    ???+ note "Display the dataset for annotation, putting a classification model in the loop."
        Currently works most smoothly with `VectorNet`.

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `vecnet`  | `VectorNet` | model to use in the loop          |
        | `**kwargs` |         | forwarded to each Bokeh figure       |

        Expected visual layout:

        | SupervisableDataset | BokehSoftLabelExplorer    | BokehDataAnnotator | BokehDataFinder     |
        | :------------------ | :------------------------ | :----------------- | :------------------ |
        | manage data subsets | inspect model predictions | make annotations   | search and filter   |
    """
    layout, _ = _active_learning(dataset, vecnet, **kwargs)
    return layout

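simple_annotator: Display the dataset on a 2D map for annotation.

| Param      | Type                  | Description                            |
| :--------- | :-------------------- | :------------------------------------- |
| `dataset`  | `SupervisableDataset` | the dataset to link to                 |
| `**kwargs` |                       | kwargs to forward to each Bokeh figure |

Expected visual layout:

| SupervisableDataset | BokehDataAnnotator |
| :------------------ | :----------------- |
| manage data subsets | make annotations   |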

Source code in hover/recipes/stable.py

@servable(title="Simple Annotator")
def simple_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset on a 2D map for annotation."

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `**kwargs` |       | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataAnnotator |
        | :------------------ | :----------------- |
        | manage data subsets | make annotations   |
    """
    layout, _ = _simple_annotator(dataset, **kwargs)
    return layout

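linked_annotator: Display the dataset on a 2D map in two views, one for search and one for annotation.

| Param      | Type                  | Description                            |
| :--------- | :-------------------- | :------------------------------------- |
| `dataset`  | `SupervisableDataset` | the dataset to link to                 |
| `**kwargs` |                       | kwargs to forward to each Bokeh figure |

Expected visual layout:

| SupervisableDataset | BokehDataFinder     | BokehDataAnnotator |
| :------------------ | :------------------ | :----------------- |
| manage data subsets | search -> highlight | make annotations   |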

Source code in hover/recipes/stable.py

@servable(title="Linked Annotator")
def linked_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset on a 2D map in two views, one for search and one for annotation."

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `**kwargs` |       | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataFinder     | BokehDataAnnotator |
        | :------------------ | :------------------ | :----------------- |
        | manage data subsets | search -> highlight | make annotations   |
    """
    layout, _ = _linked_annotator(dataset, **kwargs)
    return layout

The recipe returns a handler function that Bokeh can use to render the annotation interface in multiple settings.

from hover.recipes.experimental import active_learning

interactive_plot = active_learning(dataset, vecnet)

# ---------- SERVER MODE: for the documentation page ----------
# because this tutorial is remotely hosted, we need explicit serving to expose the plot to you
from local_lib.binder_helper import binder_proxy_app_url
from bokeh.server.server import Server
server = Server({'/my-app': interactive_plot}, port=5007, allow_websocket_origin=['*'], use_xheaders=True)
server.start()
# visit the URL printed in the cell output to see the interactive plot
# locally, you would just open "http://localhost:5007/my-app"
binder_proxy_app_url('my-app', port=5007)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='https://localhost:8888')
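The same object can be served or shown in a notebook because it is a function that takes a Bokeh document and adds content to it. The sketch below imitates that pattern without importing Bokeh. `FakeDocument` and `toy_servable` are stand-ins invented for this illustration, and the actual `servable` decorator in hover may do more than this.

```python
from functools import wraps


class FakeDocument:
    """Minimal stand-in for a Bokeh document: holds roots and a title."""

    def __init__(self):
        self.roots = []
        self.title = ""

    def add_root(self, layout):
        self.roots.append(layout)


def toy_servable(title):
    """Turn a layout-building function into one that returns a handler."""

    def decorator(build_layout):
        @wraps(build_layout)
        def recipe(*args, **kwargs):
            def handle(doc):
                # The layout is built only when a document asks for it.
                doc.add_root(build_layout(*args, **kwargs))
                doc.title = title

            return handle

        return recipe

    return decorator


@toy_servable(title="Toy Annotator")
def toy_annotator(dataset):
    return {"figure": "scatter", "points": len(dataset)}


handle = toy_annotator(["a", "b", "c"])  # nothing is rendered yet
doc = FakeDocument()
handle(doc)  # a server or notebook would make this call
print(doc.title, doc.roots)
```

Deferring the layout construction to the handler lets each session build its own plots, which is why one recipe call can back both the server and the notebook modes.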

Tips: annotation interface with multiple plots

Video guide: leveraging linked selection

Video guide: active learning

Text guide: active learning

Inspecting model predictions allows us to:

- get an idea of how the current set of annotations will likely teach the model.
- locate the most valuable samples for further annotation.
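One generic way to rank samples by how valuable they might be is uncertainty sampling: prefer the samples whose predicted class probabilities have the highest entropy. The sketch below shows the idea on made-up predictions. It is a common active learning heuristic, not necessarily the scoring that hover displays.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, k=2):
    """Return the k texts whose predicted distributions are least confident."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, class probabilities) pairs from a model in the loop.
predictions = [
    ("clearly about hockey", [0.95, 0.03, 0.02]),
    ("could be space or electronics", [0.40, 0.35, 0.25]),
    ("probably about cars", [0.70, 0.20, 0.10]),
]

print(most_uncertain(predictions, k=2))
```

Samples near the top of this ranking are the ones the model is least sure about, so labeling them tends to be more informative than labeling samples it already predicts confidently.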