Using Recipes
The most common usage of hover is through built-in recipes like in the quickstart. Let's explore another recipe: an active learning example.
Running Python right here
Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.
Behind the scenes is a Binder-hosted Python environment.
To download a notebook file instead, visit here.
Dependencies for local environments
When you run the code locally, you may need to install additional packages.
To run the text embedding code on this page, you need:
pip install spacy
python -m spacy download en_core_web_md
To render bokeh plots in Jupyter, you need:
pip install jupyter_bokeh
If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```
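Before running the cells locally, it can help to check which of these packages are already importable. The sketch below uses only the standard library; the module names are assumptions derived from the install commands above (`spacy`, the `en_core_web_md` model package, and `jupyter_bokeh`), and `missing_packages` is a hypothetical helper, not part of hover.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Module names assumed from the install commands above.
required = ["spacy", "en_core_web_md", "jupyter_bokeh"]
print("Missing:", missing_packages(required) or "none")
```

Anything listed as missing can then be installed with the corresponding command above.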
Fundamentals
Hover recipes are functions that take a SupervisableDataset and return an annotation interface.
The SupervisableDataset is assumed to have some data and embeddings.
Recap: Data & Embeddings
Let's prepare a dataset with embeddings. This is almost the same as in the quickstart:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for fast, low-memory demonstration purpose, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")
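The vectorizer above is wrapped in `lru_cache`, so repeated calls with the same text skip the embedding step. The sketch below illustrates that caching behavior with a hash-based stand-in, so it runs without spaCy. `toy_vectorizer` is a hypothetical name and does not produce meaningful embeddings.

```python
import hashlib
import re
from functools import lru_cache


@lru_cache(maxsize=int(1e4))
def toy_vectorizer(text):
    """Stand-in for an embedding: maps text to a fixed-length tuple of floats."""
    clean_text = re.sub(r"\s+", " ", str(text))
    digest = hashlib.sha256(clean_text.encode("utf-8")).digest()
    # 8 bytes -> 8 floats in [0, 1]; a real vectorizer would return an np.array
    return tuple(byte / 255 for byte in digest[:8])


first = toy_vectorizer("hello   world")
second = toy_vectorizer("hello   world")  # same argument, served from the cache
info = toy_vectorizer.cache_info()
print(f"hits={info.hits}, misses={info.misses}")
```

Note that the cache is keyed on the raw argument, before whitespace cleanup, and that arguments must be hashable, which plain strings are.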
# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)
Recipe-Specific Ingredient
Each recipe has different functionality and potentially a different signature.
To utilize active learning, we need to specify how to get a model in the loop.
hover considers the vectorizer as a "frozen" embedding and follows up with a neural network, which infers its own dimensionality from the vectorizer and the output classes.
- This architecture, named VectorNet, is the (default) basis of active learning in hover.
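To make the "frozen vectorizer plus network" idea concrete, here is a minimal pure-Python sketch. It is not hover's VectorNet: `TinyVectorModel` and `toy_vectorizer` are hypothetical names, the network is a single randomly initialized linear layer with softmax, and no training is shown. It only illustrates how input and output sizes can be inferred from the vectorizer and the class list.

```python
import math
import random


def toy_vectorizer(text):
    """Frozen embedding stand-in: three hand-made features of the text."""
    text = str(text)
    return [len(text) / 100.0, text.count(" ") / 10.0, float(text[:1].isupper())]


class TinyVectorModel:
    """Frozen vectorizer followed by one linear layer with softmax."""

    def __init__(self, vectorizer, classes, seed=0):
        self.vectorizer = vectorizer
        self.classes = list(classes)
        # infer the input size by vectorizing a dummy input
        self.in_dim = len(vectorizer("dummy input"))
        # infer the output size from the label classes
        self.out_dim = len(self.classes)
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-1, 1) for _ in range(self.in_dim)]
            for _ in range(self.out_dim)
        ]

    def predict_proba(self, text):
        """text -> vector -> class probabilities (empty list if no classes)."""
        vec = self.vectorizer(text)
        logits = [sum(w * x for w, x in zip(row, vec)) for row in self.weights]
        if not logits:
            return []
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        return [value / total for value in exps]


model = TinyVectorModel(toy_vectorizer, ["sports", "politics", "tech"])
probs = model.predict_proba("Hello active learning")
print(model.in_dim, model.out_dim, probs)
```

The vectorizer is never updated here; only the layer after it would be trained, which is what "frozen" refers to.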
Custom models
It is possible to use a model other than VectorNet or its subclass.
You will need to implement the following methods with the same signatures as VectorNet:
from hover.core.neural import VectorNet
from hover.utils.common_nn import LogisticRegression

# Create a model with vectorizer-NN architecture.
# model.pt will point to a PyTorch state dict (to be created)
# the label classes in the dataset can change, and vecnet can adjust to that
vecnet = VectorNet(vectorizer, LogisticRegression, "model.pt", dataset.classes)

# predict_proba accepts individual strings or list
# text -> vector -> class probabilities
# if no classes right now, will see an empty list
print(vecnet.predict_proba(text))
print(vecnet.predict_proba([text]))
Note how VectorNet takes dataset.classes, which means the model architecture will adapt when we add classes during annotation.
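One way an output layer could adapt to a growing class list is sketched below. This is an assumed strategy for illustration only; the text does not describe how hover performs the adjustment internally, and `adapt_head` is a hypothetical helper. Known classes keep their weight rows and newly added classes start from zeros.

```python
def adapt_head(weights, old_classes, new_classes, in_dim):
    """Rebuild a linear head for new_classes, reusing rows of known classes."""
    row_of = dict(zip(old_classes, weights))
    # known classes keep their weights; unseen classes start from zeros
    return [list(row_of.get(label, [0.0] * in_dim)) for label in new_classes]


old_classes = ["sports", "politics"]
weights = [[0.1, 0.2], [0.3, 0.4]]  # one row per class, one column per input feature
new_classes = ["sports", "politics", "tech"]

new_weights = adapt_head(weights, old_classes, new_classes, in_dim=2)
print(new_weights)  # [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```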
Apply Labels
Now we invoke the active_learning recipe.
Tips: how recipes work programmatically
In general, a recipe is a function taking a SupervisableDataset and other arguments based on its functionality.
Here are a few common recipes:
active_learning(dataset, vecnet, **kwargs)
Display the dataset for annotation, putting a classification model in the loop.
Currently works most smoothly with VectorNet.
Param | Type | Description |
---|---|---|
dataset | SupervisableDataset | the dataset to link to |
vecnet | VectorNet | model to use in the loop |
**kwargs | | forwarded to each Bokeh figure |
Expected visual layout:
SupervisableDataset | BokehSoftLabelExplorer | BokehDataAnnotator | BokehDataFinder |
---|---|---|---|
manage data subsets | inspect model predictions | make annotations | search and filter |
Source code in hover/recipes/experimental.py
@servable(title="Active Learning")
def active_learning(dataset, vecnet, **kwargs):
    """
    ???+ note "Display the dataset for annotation, putting a classification model in the loop."
        Currently works most smoothly with `VectorNet`.

        | Param | Type | Description |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to |
        | `vecnet` | `VectorNet` | model to use in the loop |
        | `**kwargs` | | forwarded to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehSoftLabelExplorer | BokehDataAnnotator | BokehDataFinder |
        | :------------------ | :------------------------ | :----------------- | :------------------ |
        | manage data subsets | inspect model predictions | make annotations | search and filter |
    """
    layout, _ = _active_learning(dataset, vecnet, **kwargs)
    return layout
simple_annotator(dataset, **kwargs)
Display the dataset on a 2D map for annotation.
Param | Type | Description |
---|---|---|
dataset | SupervisableDataset | the dataset to link to |
**kwargs | | kwargs to forward to each Bokeh figure |
Expected visual layout:
SupervisableDataset | BokehDataAnnotator |
---|---|
manage data subsets | make annotations |
Source code in hover/recipes/stable.py
@servable(title="Simple Annotator")
def simple_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset with on a 2D map for annotation."

        | Param | Type | Description |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to |
        | `**kwargs` | | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataAnnotator |
        | :------------------ | :----------------- |
        | manage data subsets | make annotations |
    """
    layout, _ = _simple_annotator(dataset, **kwargs)
    return layout
linked_annotator(dataset, **kwargs)
Display the dataset on a 2D map in two views, one for search and one for annotation.
Param | Type | Description |
---|---|---|
dataset | SupervisableDataset | the dataset to link to |
**kwargs | | kwargs to forward to each Bokeh figure |
Expected visual layout:
SupervisableDataset | BokehDataFinder | BokehDataAnnotator |
---|---|---|
manage data subsets | search -> highlight | make annotations |
Source code in hover/recipes/stable.py
@servable(title="Linked Annotator")
def linked_annotator(dataset, **kwargs):
    """
    ???+ note "Display the dataset on a 2D map in two views, one for search and one for annotation."

        | Param | Type | Description |
        | :-------- | :------- | :----------------------------------- |
        | `dataset` | `SupervisableDataset` | the dataset to link to |
        | `**kwargs` | | kwargs to forward to each Bokeh figure |

        Expected visual layout:

        | SupervisableDataset | BokehDataFinder | BokehDataAnnotator |
        | :------------------ | :------------------ | :----------------- |
        | manage data subsets | search -> highlight | make annotations |
    """
    layout, _ = _linked_annotator(dataset, **kwargs)
    return layout
The recipe returns a handle function which bokeh can use to visualize an annotation interface in multiple settings.
from hover.recipes.experimental import active_learning

interactive_plot = active_learning(dataset, vecnet)

# ---------- SERVER MODE: for the documentation page ----------
# because this tutorial is remotely hosted, we need explicit serving to expose the plot to you
from local_lib.binder_helper import binder_proxy_app_url
from bokeh.server.server import Server

server = Server({'/my-app': interactive_plot}, port=5007, allow_websocket_origin=['*'], use_xheaders=True)
server.start()
# visit this URL printed in cell output to see the interactive plot; locally you would just do "https://localhost:5007/my-app"
binder_proxy_app_url('my-app', port=5007)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='https://localhost:8888')
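A "handle function" in this sense is a callable that receives a document and attaches a layout to it, which is the shape Bokeh expects when an application is built from a function. The sketch below mimics that pattern without Bokeh or hover. `FakeDocument`, `toy_servable`, and `toy_annotator` are hypothetical stand-ins, and hover's actual `servable` decorator may do more than this.

```python
from functools import wraps


class FakeDocument:
    """Minimal stand-in for a Bokeh document: it only collects roots and a title."""

    def __init__(self):
        self.roots = []
        self.title = ""

    def add_root(self, layout):
        self.roots.append(layout)


def toy_servable(title):
    """Hypothetical decorator: turn a layout-building function into a handle factory."""

    def decorator(build_layout):
        @wraps(build_layout)
        def recipe(*args, **kwargs):
            def handle(doc):
                # the layout is built lazily, once a document is available
                doc.add_root(build_layout(*args, **kwargs))
                doc.title = title

            return handle

        return recipe

    return decorator


@toy_servable(title="Toy Annotator")
def toy_annotator(dataset, **kwargs):
    # a real recipe would return a Bokeh layout; a dict stands in for it here
    return {"dataset": dataset, "figure_kwargs": kwargs}


handle = toy_annotator(["some", "data"], width=400)
doc = FakeDocument()
handle(doc)
print(doc.title, doc.roots)
```

Because the recipe returns a function rather than a finished layout, the same object can be handed to a server (one fresh layout per session) or to a notebook display call.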
Text guide: active learning
Inspecting model predictions allows us to:
- get an idea of how the current set of annotations will likely teach the model;
- locate the most valuable samples for further annotation.
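One common heuristic for "most valuable" is uncertainty sampling: prefer the samples whose predicted class probabilities are closest to uniform. The sketch below ranks samples by prediction entropy. It is a generic illustration rather than a description of hover's interface, where this inspection is done visually; the sample ids and probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(predictions):
    """Return sample ids sorted from most to least uncertain prediction."""
    return sorted(predictions, key=lambda key: entropy(predictions[key]), reverse=True)


# made-up class probabilities for three samples
predictions = {
    "doc_a": [0.98, 0.01, 0.01],
    "doc_b": [0.40, 0.35, 0.25],
    "doc_c": [0.70, 0.20, 0.10],
}
print(rank_by_uncertainty(predictions))  # ['doc_b', 'doc_c', 'doc_a']
```

Here `doc_b` comes first because the model spreads its probability almost evenly across the classes, so a label for it is likely to be the most informative.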