Using Recipes

The most common way to use hover is through built-in recipes, like the one in the quickstart.

🎡 Let's explore another recipe -- an active learning example.

Running Python right here

Think of this page as something close to a Jupyter notebook: you can edit the code and press Shift+Enter to execute it.

Behind the scenes is a Binder-hosted Python environment that runs the code.

Recap: Data & Embeddings

This is exactly the same as in the quickstart:

from hover.core.dataset import SupervisableTextDataset
import pandas as pd

example_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
# for fast demonstration purposes, sample the data
df_raw = pd.read_csv(example_csv_path).sample(2000)

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data is available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"].head(5)
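If you did bring some labeled data, you could route those rows into the labeled subsets instead of "raw" before constructing the dataset. A minimal sketch under that assumption, reusing the same SUBSET column convention (the df_seed / dataset_mixed names are made up for illustration):

import numpy as np

# hypothetical: pretend 10% of the rows carry trusted labels
df_seed = df_raw.sample(frac=0.1, random_state=42).copy()
df_seed["SUBSET"] = np.random.choice(
    ["train", "dev", "test"], size=len(df_seed), p=[0.8, 0.1, 0.1]
)
df_rest = df_raw.drop(df_seed.index)

dataset_mixed = SupervisableTextDataset.from_pandas(
    pd.concat([df_seed, df_rest]), feature_key="text", label_key="label"
)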


import spacy
import re

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")


# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
dataset.compute_2d_embedding(vectorizer, "umap")

# this call adds 'x' and 'y' columns to the DataFrames in dataset.dfs
# one could alternatively pre-compute these columns by any approach (see the sketch below)
dataset.dfs["raw"].head(5)
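As the comment above notes, you could also pre-compute the coordinates yourself and assign them directly. A hedged sketch using umap-learn, assuming the plots read the same 'x' and 'y' columns that compute_2d_embedding produces:

import numpy as np
import umap

# vectors for the "raw" subset -- the only non-empty one in this example
texts = dataset.dfs["raw"]["text"].tolist()
vectors = np.array([vectorizer(t) for t in texts])

# reduce to 2D and assign the coordinate columns manually
coords = umap.UMAP(n_components=2).fit_transform(vectors)
dataset.dfs["raw"]["x"] = coords[:, 0]
dataset.dfs["raw"]["y"] = coords[:, 1]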

Use Callbacks to Train a Model

To use active learning, we need to specify how to get a model into the loop.

hover treats the vectorizer as a "frozen" embedding and stacks a neural network on top, which infers its input dimension from the vectorizer and its output dimension from the set of classes.

  • This architecture, named VectorNet, is the default basis of active learning in hover.
Custom models

It is possible to use a model other than VectorNet or its subclasses.

Simply implement, with the same signatures as VectorNet, the methods that the recipes invoke on the model, such as predict_proba() for inference and train() for fitting, as sketched below:
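For illustration, here is a hypothetical skeleton of such a custom model; the method set and signatures are assumptions inferred from how VectorNet is used in this guide (predict_proba accepting a single string or a list), so verify them against VectorNet's actual interface. The default VectorNet path follows right after.

import numpy as np

class MyCustomModel:
    """Hypothetical drop-in replacement for VectorNet (sketch only)."""

    def __init__(self, vectorizer, classes):
        self.vectorizer = vectorizer
        self.classes = classes

    def predict_proba(self, inp):
        # mirror VectorNet usage below: accept a single string or a list
        single = isinstance(inp, str)
        texts = [inp] if single else inp
        # placeholder output: uniform probabilities over the known classes
        num_classes = max(len(self.classes), 1)
        proba = np.full((len(texts), num_classes), 1.0 / num_classes)
        return proba[0] if single else proba

    def train(self, *args, **kwargs):
        # fit on the labeled subsets; left loose here because the real
        # signature should match VectorNet's
        raise NotImplementedError

def my_model_callback(dataset, vectorizer):
    # same (dataset, vectorizer) -> model shape as vecnet_callback below
    return MyCustomModel(vectorizer, dataset.classes)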

from hover.core.neural import VectorNet
from hover.utils.common_nn import LogisticRegression

def vecnet_callback(dataset, vectorizer):
    """
    Create a model with vectorizer-NN architecture.
    """
    # model.pt will point to a PyTorch state dict (to be created)
    # which gets cumulatively updated when we train the model
    vecnet = VectorNet(vectorizer, LogisticRegression, "model.pt", dataset.classes)
    return vecnet

vecnet = vecnet_callback(dataset, vectorizer)

# predict_proba accepts a single string or a list of strings
# text -> vector -> class probabilities
print(vecnet.predict_proba(text))
print(vecnet.predict_proba([text]))

Note how the callback reads dataset.classes dynamically, which means the model architecture will adapt as we add classes during annotation.

✨ Apply Labels

Now we invoke the active_learning recipe.

Tips: how recipes work programmatically

In general, a recipe is a function that takes a SupervisableDataset plus additional arguments specific to its functionality.

Here are a few common recipes:

active_learning: display the dataset for annotation, putting a classification model in the loop.

Currently works most smoothly with VectorNet.

| Param             | Type                  | Description                                  |
| :---------------- | :-------------------- | :------------------------------------------- |
| dataset           | SupervisableDataset   | the dataset to link to                       |
| vectorizer        | callable              | the feature -> vector function               |
| vecnet_callback   | callable              | the (dataset, vectorizer) -> VecNet function |
| **kwargs          |                       | kwargs to forward to each Bokeh figure       |

Expected visual layout:

| SupervisableDataset | BokehSoftLabelExplorer    | BokehDataAnnotator | BokehDataFinder     |
| :------------------ | :------------------------ | :----------------- | :------------------ |
| manage data subsets | inspect model predictions | make annotations   | search -> highlight |

Source code in hover/recipes/experimental.py
@servable(title="Active Learning")
def active_learning(dataset, vectorizer, vecnet_callback, **kwargs):
    """
    ???+ note "Display the dataset for annotation, putting a classification model in the loop."
        Currently works most smoothly with `VectorNet`.

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `vectorizer` | `callable` | the feature -> vector function  |
        | `vecnet_callback` | `callable` | the (dataset, vectorizer) -> `VecNet` function|
        | `**kwargs` |       | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehSoftLabelExplorer    | BokehDataAnnotator | BokehDataFinder     |
        | :------------------ | :------------------------ | :----------------- | :------------------ |
        | manage data subsets | inspect model predictions | make annotations   | search -> highlight |
    """
    layout, _ = _active_learning(dataset, vectorizer, vecnet_callback, **kwargs)
    return layout
simple_annotator: display the dataset on a 2D map for annotation.

| Param      | Type                  | Description                            |
| :--------- | :-------------------- | :------------------------------------- |
| dataset    | SupervisableDataset   | the dataset to link to                 |
| **kwargs   |                       | kwargs to forward to each Bokeh figure |

Expected visual layout:

| SupervisableDataset | BokehDataAnnotator |
| :------------------ | :----------------- |
| manage data subsets | make annotations   |

Source code in hover/recipes/stable.py
@servable(title="Simple Annotator")
def simple_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset with on a 2D map for annotation."

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `**kwargs` |       | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataAnnotator |
        | :------------------ | :----------------- |
        | manage data subsets | make annotations   |
    """
    layout, _ = _simple_annotator(dataset, **kwargs)
    return layout
linked_annotator: display the dataset on a 2D map in two views, one for search and one for annotation.

| Param      | Type                  | Description                            |
| :--------- | :-------------------- | :------------------------------------- |
| dataset    | SupervisableDataset   | the dataset to link to                 |
| **kwargs   |                       | kwargs to forward to each Bokeh figure |

Expected visual layout:

| SupervisableDataset | BokehDataFinder     | BokehDataAnnotator |
| :------------------ | :------------------ | :----------------- |
| manage data subsets | search -> highlight | make annotations   |

Source code in hover/recipes/stable.py
@servable(title="Linked Annotator")
def linked_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset on a 2D map in two views, one for search and one for annotation."

        | Param     | Type     | Description                          |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to  |
        | `**kwargs` |       | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataFinder     | BokehDataAnnotator |
        | :------------------ | :------------------ | :----------------- |
        | manage data subsets | search -> highlight | make annotations   |
    """
    layout, _ = _linked_annotator(dataset, **kwargs)
    return layout

The recipe returns a handle function which Bokeh can use to render the annotation interface in multiple settings.
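For instance, besides being shown in a notebook as demonstrated below, the same handle could back a standalone Bokeh server app. A minimal sketch, assuming the handle follows Bokeh's application-function convention of taking a Document:

# app.py -- a sketch; launch with: bokeh serve app.py
from bokeh.io import curdoc
from hover.recipes.experimental import active_learning

# assumes dataset, vectorizer, vecnet_callback are built as in the cells above
handle = active_learning(dataset, vectorizer, vecnet_callback)
handle(curdoc())  # attach the annotation interface to the served document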

In-browser limitation

If running the code in your browser:

  • The annotation interface here is for demonstration only.
    • Due to event listener limitations on this page, it cannot trigger certain callbacks.
  • For a truly interactive example, please visit the Binder app or the Binder repo.
  • In a real Jupyter Lab/Notebook environment, the interface is fully functional.
from hover.recipes.experimental import active_learning, _active_learning
from bokeh.io import show, output_notebook

output_notebook()

# ---------- DEMO CODE: for this documentation page ----------
# because this demo is remotely hosted, we need to handle proxy
def remote_jupyter_proxy_url(port):
    """
    Callable to configure Bokeh's show method when using a proxy (JupyterHub).
    """
    import os
    import urllib

    base_url = 'https://hub.gke2.mybinder.org/user/'
    host = urllib.parse.urlparse(base_url).netloc

    if port is None:
        return host

    service_url_path = os.environ['JUPYTERHUB_SERVICE_PREFIX']
    proxy_url_path = 'proxy/%d' % port

    user_url = urllib.parse.urljoin(base_url, service_url_path)
    full_url = urllib.parse.urljoin(user_url, proxy_url_path)
    return full_url

# static plot for demonstrating the annotation interface
static_plot, plot_objects = _active_learning(dataset, vectorizer, vecnet_callback)
show(static_plot, notebook_url=remote_jupyter_proxy_url)

# ---------- REAL CODE: for your actual Jupyter environment ---------
# the real annotation interface enables Python callbacks
# interactive_plot = active_learning(dataset, vectorizer, vecnet_callback)
# show(interactive_plot, notebook_url='https://localhost:port')
Tips: annotation interface with multiple plots
Video guide: leveraging linked selection

Video guide: active learning

Text guide: active learning

Inspecting model predictions allows us to

  • get an idea of how the current set of annotations will likely teach the model.
  • locate the most valuable samples for further annotation.