Dataset Mechanisms
`SupervisableDataset` holds your data throughout the labeling process. Let's take a look at its core mechanisms.
This page addresses single components of `hover`
We are using code snippets to pick out parts of the annotation interface, so that the documentation can explain what they do.
- Please be aware that this is NOT how one would typically use `hover`.
- Typical usage deals with recipes where the individual parts have been tied together.
Dependencies for local environments
When you run the code locally, you may need to install additional packages.
To render `bokeh` plots in Jupyter, you need:
```shell
pip install jupyter_bokeh
```
If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```
Data Subsets
We place unlabeled data and labeled data in different subsets: "raw", "train", "dev", and "test". Unlabeled data starts in the "raw" subset and can be transferred to the other subsets once it is labeled.
`SupervisableDataset` uses a "population table", `dataset.pop_table`, to show the size of each subset:
```python
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for fast, low-memory demonstration purposes, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")
```
```python
from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

# special configuration for this remotely hosted tutorial
from local_lib.binder_helper import remote_jupyter_proxy_url
notebook_url = remote_jupyter_proxy_url

show(dataset.pop_table, notebook_url=notebook_url)
```
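To get a feel for what a population table summarizes, here is a minimal pandas-only sketch that counts rows per subset. It does not use `hover` and is not how `pop_table` is implemented; `toy_df`, `subset_order`, and `subset_sizes` are hypothetical names made up for this illustration.

```python
import pandas as pd

# Toy data standing in for the tutorial dataframe (values are made up).
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "SUBSET": ["raw", "raw", "raw", "train", "train", "dev"],
})

# The four subsets named in this section, in the order they are introduced.
subset_order = ["raw", "train", "dev", "test"]

# Count rows per subset; subsets with no rows are reported as 0.
subset_sizes = toy_df["SUBSET"].value_counts().reindex(subset_order, fill_value=0)
print(subset_sizes.to_dict())
```

Running the same `value_counts()` call on the `SUBSET` column of your own dataframe is a quick sanity check before handing it to `SupervisableDataset`.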
Transfer Data Between Subsets
`COMMIT` and `DEDUP` are the mechanisms that `hover` uses to transfer data between subsets.
- `COMMIT` copies selected points (to be discussed later) to a destination subset
    - labeled-raw-only: `COMMIT` automatically detects which points are in the raw set with a valid label. Other points will not get copied.
    - keep-last: you can commit the same point to the same subset multiple times and the last copy will be kept. This can be useful for revising labels before `DEDUP`.
- `DEDUP` removes duplicates (identified by feature value) across subsets
    - priority rule: test > dev > train > raw, i.e. test set data always gets kept during deduplication
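The rules above can be sketched with plain pandas. This is only an illustration of the described semantics under stated assumptions, not `hover`'s actual implementation: the `commit` and `dedup` functions, the `dfs` dictionary layout, and the `"ABSTAIN"` placeholder for "no valid label" are all hypothetical names chosen for this sketch.

```python
import pandas as pd

ABSTAIN = "ABSTAIN"  # assumed placeholder meaning "no valid label yet"
PRIORITY = ["test", "dev", "train", "raw"]  # highest priority first


def commit(dfs, selected, dest):
    """Copy selected points that are raw AND labeled into subset `dest`."""
    # labeled-raw-only: everything else in the selection is ignored
    mask = (selected["SUBSET"] == "raw") & (selected["label"] != ABSTAIN)
    committed = selected.loc[mask, ["text", "label"]]
    merged = pd.concat([dfs[dest], committed], ignore_index=True)
    # keep-last: a later commit of the same point replaces the earlier copy
    dfs[dest] = merged.drop_duplicates("text", keep="last").reset_index(drop=True)


def dedup(dfs):
    """Remove duplicate feature values across subsets by priority."""
    seen = set()
    for subset in PRIORITY:
        df = dfs[subset]
        dfs[subset] = df[~df["text"].isin(seen)].reset_index(drop=True)
        seen.update(dfs[subset]["text"])


dfs = {
    "raw": pd.DataFrame({"text": ["a", "b", "c"], "label": ["sports", ABSTAIN, "tech"]}),
    "train": pd.DataFrame({"text": ["x"], "label": ["tech"]}),
    "dev": pd.DataFrame({"text": ["y"], "label": ["sports"]}),
    "test": pd.DataFrame({"text": ["z"], "label": ["tech"]}),
}

# A selection made on overlapping plots can mix raw and non-raw points.
selected = pd.concat(
    [dfs["raw"].assign(SUBSET="raw"), dfs["train"].assign(SUBSET="train")],
    ignore_index=True,
)
commit(dfs, selected, "train")  # copies "a" and "c"; skips "b" and "x"

# Revise the label of "a" and commit again: the last copy wins.
revised = selected.copy()
revised.loc[revised["text"] == "a", "label"] = "politics"
commit(dfs, revised, "train")

dedup(dfs)  # "a" and "c" now live in train, so they leave raw
print(dfs["train"])
print(dfs["raw"])
```

Note that in this sketch the committed points stay in "raw" until `dedup` runs, which mirrors the description of `COMMIT` as a copy rather than a move.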
FAQ
Why does COMMIT only work on the raw subset?
Most selections will happen through plots, where different subsets are on top of each other. This means selections can contain both unlabeled and labeled points.
Way too often we find ourselves trying to view both the labeled and the unlabeled, but only moving the unlabeled "raw" points. So it's handy that COMMIT picks those points only.
These mechanisms correspond to buttons in `hover`'s annotation interface, which you have encountered in the quickstart:
Showcase widgets here are not interactive
Plotted widgets on this page are not interactive, but only for illustration.
Widgets will be interactive when you actually use them (in your local environment or server apps like in the quickstart).
- be sure to use a whole `recipe` rather than individual widgets.
- if you really want to plot interactive widgets on their own, try `from hover.utils.bokeh_helper import show_as_interactive as show` instead of `from bokeh.io import show`.
    - this works in your own environment but still not on the documentation page.
    - `show_as_interactive` is a simple tweak of `bokeh.io.show` by turning standalone LayoutDOM to an application.
```python
from bokeh.layouts import row, column

show(column(
    row(
        dataset.data_committer,
        dataset.dedup_trigger,
    ),
    dataset.pop_table,
), notebook_url=notebook_url)
```
Of course, so far we have nothing to move, because there's no data selected. We shall now discuss selections.
Selection
`hover` labels data points in bulk, which requires selecting groups of homogeneous data, i.e. points that are semantically similar or will receive the same label. Being able to skim through what you selected gives you confidence about homogeneity.
Normally, selection happens through a plot (`explorer`), as we have seen in the quickstart. For the purpose here, we will "cheat" and assign the selection programmatically:
```python
dataset._callback_update_selection(dataset.dfs["raw"].loc[:10])

show(dataset.sel_table, notebook_url=notebook_url)
```
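A small pandas detail in the snippet above: `.loc[:10]` slices by index label and includes the endpoint. Assuming `dataset.dfs["raw"]` carries a default 0-based integer index (an assumption, not something this page confirms), the selection would hold 11 rows rather than 10. The sketch below shows the difference on a toy dataframe; `toy_raw` is a made-up stand-in.

```python
import pandas as pd

# Toy dataframe with the default integer index 0..19.
toy_raw = pd.DataFrame({"text": [f"doc {i}" for i in range(20)]})

by_label = toy_raw.loc[:10]      # label-based slice, includes index label 10
by_position = toy_raw.iloc[:10]  # position-based slice, stops before position 10

print(len(by_label), len(by_position))
```

If you want exactly N rows regardless of the index, `.iloc[:N]` or `.head(N)` is the safer choice.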
Edit Data Within a Selection
Often the points selected are not perfectly homogeneous, i.e. some outliers belong to a different label from the selected group overall. It would be helpful to `EVICT` them, and `SupervisableDataset` has a button for it.
Sometimes you may also wish to edit data values on the fly. In `hover` this is called `PATCH`, and there is also a button for it.
- by default, labels can be edited but feature values cannot.
Let's plot the aforementioned buttons along with the selection table. Toggle any number of rows in the table, then click the button to `EVICT` or `PATCH` those rows:
```python
show(column(
    row(
        dataset.selection_evictor,
        dataset.selection_patcher,
    ),
    dataset.sel_table,
), notebook_url=notebook_url)
```
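Since the widgets on this page are static, here is a rough pandas sketch of what `EVICT` and `PATCH` amount to conceptually. It assumes that `EVICT` only removes rows from the current selection (the dataset itself keeps them) and that `PATCH` writes edited labels from the selection back into the dataset; both readings, and all variable names, are assumptions for illustration rather than `hover`'s actual code.

```python
import pandas as pd

# Toy "raw" subset; "ABSTAIN" is an assumed placeholder for "no label yet".
raw = pd.DataFrame({
    "text": ["goal scored", "new gpu", "match replay", "penalty kick"],
    "label": ["ABSTAIN"] * 4,
})

# Start with every row selected.
selection = raw.loc[[0, 1, 2, 3]].copy()

# EVICT: drop the toggled outlier from the selection only.
toggled_for_evict = [1]
selection = selection.drop(index=toggled_for_evict)

# Edit labels in the selection table (features stay untouched by default).
toggled_for_patch = [0, 2]
selection.loc[toggled_for_patch, "label"] = "sports"

# PATCH: write the edited labels back to the matching rows of the dataset.
raw.loc[toggled_for_patch, "label"] = selection.loc[toggled_for_patch, "label"]

print(selection)
print(raw)
```

After this runs, the evicted row is still present in `raw`, and only the two patched rows carry the new label.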