Quickstart
Welcome to the basic use case of hover! Let's say we want to label some data and call it a day.
Running Python right here
Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.
Behind the scenes is a Binder-hosted Python environment.
To download a notebook file instead, visit here.
Dependencies for local environments
When you run the code locally, you may need to install additional packages.
To run the text embedding code on this page, you need:
pip install spacy
python -m spacy download en_core_web_md
To render bokeh plots in Jupyter, you need:
pip install jupyter_bokeh
If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```
Ingredient 1 / 3: Raw Data
Start with a spreadsheet loaded in pandas. We turn it into a SupervisableDataset designed for labeling:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

example_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
# for a fast, low-memory demonstration, sample the data
df_raw = pd.read_csv(example_csv_path).sample(1000)
df_raw["text"] = df_raw["text"].astype(str)

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data is available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"].head(5)
FAQ
What if I have multiple features?
feature_key refers to the field that will be vectorized later on, which can be a JSON string that encloses multiple features.
For example, suppose our data entries look like this:
{"f1": "foo", "f2": "bar", "non_feature": "abc"}
We can put f1 and f2 in a JSON and convert the entries like this:
# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
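The conversion above can be sketched with the standard library alone. This is only an illustration: `pack_features` and its parameters are hypothetical names, not part of hover.

```python
import json

# Hypothetical helper (not part of hover): pack selected fields into one JSON string.
def pack_features(entry, feature_fields=("f1", "f2"), feature_key="feature"):
    # the fields to be vectorized later go into a single JSON string
    features = {field: entry[field] for field in feature_fields}
    # everything else is kept as-is
    others = {key: value for key, value in entry.items() if key not in feature_fields}
    return {feature_key: json.dumps(features), **others}

entry = {"f1": "foo", "f2": "bar", "non_feature": "abc"}
packed = pack_features(entry)
print(packed)
# {'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

Applying such a function to every row before building the dataset leaves a single string column to pass as feature_key.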
Can I use audio or image data?
Yes! Please check out the "Guides" section of the documentation.
Ingredient 2 / 3: Embedding
A pre-trained embedding lets us group data points semantically.
In particular, let's define a data -> embedding vector function.
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"].loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")
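The whitespace cleanup inside the vectorizer can be tried on its own, without loading a spaCy model. The sketch below reuses the same regular expression; the function name `normalize_whitespace` is made up for this illustration.

```python
import re

# Hypothetical standalone version of the cleanup step used in the vectorizer.
def normalize_whitespace(text):
    # any run of whitespace (spaces, tabs, newlines) becomes a single space
    return re.sub(r"[\s]+", r" ", str(text))

sample = "Hello,\n\n  world!\tBye"
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
# 'Hello, world! Bye'
```

Note that leading or trailing whitespace is collapsed to one space rather than stripped.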
Tips
Caching
dataset by itself stores the original features but not the corresponding vectors.
To avoid vectorizing the same feature again and again, we could simply do:
from functools import cache

@cache
def vectorizer(feature):
    # put code here
If you'd like to limit the size of the cache, something like @lru_cache(maxsize=10000) could help.
Check out functools for more options.
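To see the cache at work, here is a small sketch with a toy function standing in for a real embedding. `toy_vectorizer` and the `calls` list are made up for this illustration.

```python
from functools import lru_cache

calls = []  # records every time the function body actually runs

@lru_cache(maxsize=10000)
def toy_vectorizer(feature):
    # stand-in for an expensive embedding call; NOT a real embedding
    calls.append(feature)
    return (len(feature), feature.count(" "))

first = toy_vectorizer("hello world")
second = toy_vectorizer("hello world")  # served from the cache, body not re-run
info = toy_vectorizer.cache_info()
print(len(calls), info.hits, info.misses)
# 1 1 1
```

The functools caches require hashable arguments, so string features (including JSON strings) work directly.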
Vectorizing multiple features
Suppose we have multiple features enclosed in a JSON:
# could also keep f1 and f2 around
{'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
Also, suppose we have individual vectorizers like this:
def vectorizer_1(feature_1):
    # put code here

def vectorizer_2(feature_2):
    # put code here
Then we can define a composite vectorizer:
import json
import numpy as np

def vectorizer(feature_json):
    data_dict = json.loads(feature_json)
    vectors = []
    for field, func in [
        ("f1", vectorizer_1),
        ("f2", vectorizer_2),
    ]:
        vectors.append(func(data_dict[field]))

    return np.concatenate(vectors)
Ingredient 3 / 3: 2D Embedding
We compute a 2D version of the pre-trained embedding to visualize the whole dataset.
Hover has built-in methods for calling umap or ivis.
Dependencies (when in your own environment)
The libraries for this step are not directly required by hover:
- for umap: pip install umap-learn
- for ivis: pip install ivis[cpu] or pip install ivis[gpu]
umap-learn is installed in this demo environment.
# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

# what we did adds 'embed_2d_0' and 'embed_2d_1' columns to the DataFrames in dataset.dfs
dataset.dfs["raw"].head(5)
Apply Labels
We are ready for the annotation interface!
from hover.recipes.stable import simple_annotator

interactive_plot = simple_annotator(dataset)

# ---------- SERVER MODE: for the documentation page ----------
# because this tutorial is remotely hosted, we need explicit serving to expose the plot to you
from local_lib.binder_helper import binder_proxy_app_url
from bokeh.server.server import Server

server = Server({'/my-app': interactive_plot}, port=5007, allow_websocket_origin=['*'], use_xheaders=True)
server.start()
# visit this URL printed in cell output to see the interactive plot; locally you would just do "https://localhost:5007/my-app"
binder_proxy_app_url('my-app', port=5007)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='https://localhost:8888')
Tips: annotation interface basics
There should be a SupervisableDataset board on the left and a BokehDataAnnotator on the right, each with a few buttons.
- push: push Dataset updates to the bokeh plots.
- commit: add data entries selected in the Annotator to a specified subset.
- dedup: deduplicate across subsets by feature (last in gets kept).
- export: save your data (all subsets) in a specified format.
- raw/train/dev/test: choose which subsets to display or hide.
- apply: apply the label input to the selected points in the raw subset only.
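As a rough illustration of what "last in gets kept" means for dedup, here is a plain-Python sketch. It is not hover's implementation: `dedup_keep_last` is a hypothetical function, and it assumes "last in" means the entry that appears later in iteration order.

```python
# Hypothetical sketch of keep-last deduplication by feature (NOT hover's code).
def dedup_keep_last(entries, feature_key="feature"):
    latest = {}
    for entry in entries:
        # drop any earlier entry with the same feature, then re-insert at the end
        latest.pop(entry[feature_key], None)
        latest[entry[feature_key]] = entry
    return list(latest.values())

entries = [
    {"feature": "a", "SUBSET": "raw"},
    {"feature": "b", "SUBSET": "raw"},
    {"feature": "a", "SUBSET": "train"},  # same feature as the first entry
]
deduped = dedup_keep_last(entries)
print(deduped)
# [{'feature': 'b', 'SUBSET': 'raw'}, {'feature': 'a', 'SUBSET': 'train'}]
```

Under this assumption, an entry committed to another subset replaces its earlier copy with the same feature.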
We've essentially put the data into neighborhoods based on the vectorizer, but the quality (homogeneity of labels) of such neighborhoods can vary.
- hover over any data point to see its tooltip.
- take advantage of different selection tools to apply labels at appropriate scales.
- the search widget might turn out useful.
- note that it does not select points but highlights them.