Dataset Mechanisms

SupervisableDataset holds your data throughout the labeling process.

🚤 Let's take a look at its core mechanisms.

Running Python right here

Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.

Behind the scenes is a Binder-hosted Python environment.

To download a notebook file instead, visit here.

This page addresses single components of hover

We are using code snippets to pick out parts of the annotation interface, so that the documentation can explain what they do.

  • Please be aware that this is NOT how one would typically use hover.
  • Typical usage deals with recipes where the individual parts have been tied together.

Dependencies for local environments

When you run the code locally, you may need to install additional packages.

To render bokeh plots in Jupyter, you need:

pip install jupyter_bokeh

If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```

Data Subsets

We place unlabeled data and labeled data in different subsets: "raw", "train", "dev", and "test". Unlabeled data starts in the "raw" subset and can be transferred to other subsets once it is labeled.

SupervisableDataset uses a "population table", dataset.pop_table, to show the size of each subset:

from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# sample the data for a fast, low-memory demonstration
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")


from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

# special configuration for this remotely hosted tutorial
from local_lib.binder_helper import remote_jupyter_proxy_url
notebook_url = remote_jupyter_proxy_url

show(dataset.pop_table, notebook_url=notebook_url)
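
As a rough mental model (not hover's implementation), the population table summarizes how many rows each subset holds. The standalone sketch below counts rows per subset from a few made-up records. Only the "SUBSET" key and the four subset names come from the example above; the records and the `population` helper are hypothetical.

```python
from collections import Counter

# Subset names used on this page.
SUBSETS = ["raw", "train", "dev", "test"]

# Hypothetical records; only the "SUBSET" key mirrors the column set above.
records = [
    {"text": "doc A", "SUBSET": "raw"},
    {"text": "doc B", "SUBSET": "raw"},
    {"text": "doc C", "SUBSET": "train"},
    {"text": "doc D", "SUBSET": "dev"},
]


def population(rows):
    """Count rows per subset, reporting 0 for subsets that are empty."""
    counts = Counter(row["SUBSET"] for row in rows)
    return {subset: counts.get(subset, 0) for subset in SUBSETS}


pop = population(records)
print(pop)
```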


Transfer Data Between Subsets

COMMIT and DEDUP are the mechanisms that hover uses to transfer data between subsets.

  • COMMIT copies selected points (to be discussed later) to a destination subset
    • labeled-raw-only: COMMIT automatically detects which points are in the raw set with a valid label. Other points will not get copied.
    • keep-last: you can commit the same point to the same subset multiple times and the last copy will be kept. This can be useful for revising labels before DEDUP.
  • DEDUP removes duplicates (identified by feature value) across subsets
    • priority rule: test > dev > train > raw, i.e. test set data always gets kept during deduplication
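
The priority rule for DEDUP can be illustrated with a minimal standalone sketch. This is an assumption-laden model rather than hover's actual code: points are plain dictionaries, the feature value lives under a "text" key, and the `dedup` helper is hypothetical.

```python
# Priority from the rule above, lowest first: test > dev > train > raw.
PRIORITY = ["raw", "train", "dev", "test"]


def dedup(points):
    """Keep one copy per feature value, preferring the highest-priority subset."""
    rank = {name: i for i, name in enumerate(PRIORITY)}
    best = {}
    for point in points:
        kept = best.get(point["text"])
        # Replace the kept copy only when the new one has strictly higher priority;
        # among copies in the same subset, the first one seen stays.
        if kept is None or rank[point["SUBSET"]] > rank[kept["SUBSET"]]:
            best[point["text"]] = point
    return list(best.values())


points = [
    {"text": "doc A", "SUBSET": "raw"},
    {"text": "doc A", "SUBSET": "train"},
    {"text": "doc B", "SUBSET": "test"},
    {"text": "doc B", "SUBSET": "dev"},
    {"text": "doc C", "SUBSET": "raw"},
]
deduped = dedup(points)
print([(p["text"], p["SUBSET"]) for p in deduped])
```

In this toy data, "doc A" survives only in "train" and "doc B" only in "test", while "doc C" has no duplicate and stays in "raw".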

FAQ
Why does COMMIT only work on the raw subset?

Most selections will happen through plots, where different subsets are on top of each other. This means selections can contain both unlabeled and labeled points.

Very often we want to view labeled and unlabeled points together but move only the unlabeled "raw" points. So it's handy that COMMIT picks only those points.
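
The labeled-raw-only and keep-last behaviors of COMMIT can be sketched in plain Python. This is a simplified model under stated assumptions, not hover's implementation: the `commit` helper, the dictionary layout, and the `UNLABELED` placeholder are all hypothetical, and hover may mark unlabeled points differently.

```python
# Hypothetical placeholder for "no valid label yet".
UNLABELED = None


def commit(subsets, selected, dest):
    """Copy labeled raw points from a selection into the `dest` subset."""
    # labeled-raw-only: skip points outside "raw" and points without a label
    eligible = [
        p for p in selected
        if p["SUBSET"] == "raw" and p["label"] is not UNLABELED
    ]
    # keep-last: a later commit of the same feature value replaces the earlier copy
    by_text = {p["text"]: p for p in subsets[dest]}
    for p in eligible:
        by_text[p["text"]] = {**p, "SUBSET": dest}
    subsets[dest] = list(by_text.values())
    return len(eligible)


subsets = {
    "raw": [
        {"text": "doc A", "label": "sports", "SUBSET": "raw"},
        {"text": "doc B", "label": UNLABELED, "SUBSET": "raw"},
    ],
    "train": [{"text": "doc C", "label": "politics", "SUBSET": "train"}],
}

# A selection made on overlapping plots can mix raw and non-raw points.
selection = subsets["raw"] + subsets["train"]
first = commit(subsets, selection, "train")  # only "doc A" qualifies

# Revise the label and commit again; the newer copy replaces the older one.
subsets["raw"][0]["label"] = "hockey"
second = commit(subsets, selection, "train")
print([(p["text"], p["label"]) for p in subsets["train"]])
```

Note that the committed point is copied, so it still exists in "raw" afterwards; in this model, removing that leftover copy is the job of DEDUP.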

These mechanisms correspond to buttons in hover's annotation interface, which you have encountered in the quickstart:

Showcase widgets here are not interactive

Plotted widgets on this page are for illustration only and are not interactive.

Widgets will be interactive when you actually use them (in your local environment or server apps like in the quickstart).

  • be sure to use a whole recipe rather than individual widgets.
  • if you really want to plot interactive widgets on their own, try from hover.utils.bokeh_helper import show_as_interactive as show instead of from bokeh.io import show.
    • this works in your own environment but still not on the documentation page.
    • show_as_interactive is a simple tweak of bokeh.io.show that turns a standalone LayoutDOM into an application.

from bokeh.layouts import row, column

show(column(
    row(
        dataset.data_committer,
        dataset.dedup_trigger,
    ),
    dataset.pop_table,
), notebook_url=notebook_url)


Of course, so far we have nothing to move, because there's no data selected. We shall now discuss selections.

Selection

hover labels data points in bulk, which requires selecting groups of homogeneous data, i.e. points that are semantically similar or that should receive the same label. Being able to skim through what you selected gives you confidence about homogeneity.

Normally, selection happens through a plot (explorer), as we have seen in the quickstart. For the purpose here, we will "cheat" and assign the selection programmatically:

dataset._callback_update_selection(dataset.dfs["raw"].loc[:10])

show(dataset.sel_table, notebook_url=notebook_url)


Edit Data Within a Selection

Often the points selected are not perfectly homogeneous, i.e. some outliers belong to a different label from the selected group overall. It would be helpful to EVICT them, and SupervisableDataset has a button for it.

Sometimes you may also wish to edit data values on the fly. In hover this is called PATCH, and there is also a button for it.

  • by default, labels can be edited but feature values cannot.
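
Conceptually, EVICT shrinks the selection and PATCH writes edited values back, with only labels editable by default. The sketch below models that idea on a list of dictionaries. The `evict` and `patch` helpers, their parameters, and the `editable` default are hypothetical illustrations, not hover's API.

```python
def evict(selection, rows):
    """Return the selection without the rows at the given positions."""
    drop = set(rows)
    return [p for i, p in enumerate(selection) if i not in drop]


def patch(selection, rows, updates, editable=("label",)):
    """Apply `updates` to the chosen rows, refusing keys outside `editable`."""
    blocked = set(updates) - set(editable)
    if blocked:
        raise ValueError(f"not editable: {sorted(blocked)}")
    for i in rows:
        selection[i].update(updates)
    return selection


selection = [
    {"text": "doc A", "label": "sports"},
    {"text": "doc B", "label": "sports"},
    {"text": "doc C", "label": "sports"},
]

# Drop the outlier at position 2, then relabel the point at position 1.
selection = evict(selection, rows=[2])
patch(selection, rows=[1], updates={"label": "hockey"})
print(selection)
```

Attempting `patch(selection, rows=[0], updates={"text": "edited"})` raises `ValueError` in this sketch, which mirrors the default that feature values are not editable.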

Let's plot the aforementioned buttons along with the selection table. Toggle any number of rows in the table, then click the button to EVICT or PATCH those rows:

show(column(
    row(
        dataset.selection_evictor,
        dataset.selection_patcher,
    ),
    dataset.sel_table,
), notebook_url=notebook_url)