Custom Labeling Functions

Suppose we have some custom functions for labeling or filtering data, a setup that resembles Snorkel's typical scenario.

🚤 Let's see how these functions can be combined with hover.

Running Python right here

Think of this page as almost a Jupyter notebook. You can edit code and press Shift+Enter to execute.

Behind the scenes is a Binder-hosted Python environment.

To download a notebook file instead, visit here.

This page addresses single components of hover

We are using code snippets to pick out parts of the annotation interface, so that the documentation can explain what they do.

  • Please be aware that this is NOT how one would typically use hover.
  • Typical usage deals with recipes where the individual parts have been tied together.

Dependencies for local environments

When you run the code locally, you may need to install additional packages.

To run the text embedding code on this page, you need:

pip install spacy
python -m spacy download en_core_web_md

To use snorkel labeling functions, you need:

pip install snorkel

To render bokeh plots in Jupyter, you need:

pip install jupyter_bokeh

If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
```

Preparation

As always, start with a ready-for-plot dataset:

from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for fast, low-memory demonstration purpose, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")


import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

# any kwargs will be passed on to the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)
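
The caching pattern used by the vectorizer above can be tried without spaCy. The sketch below is only an illustration: `toy_vectorizer` is a hypothetical stand-in that returns a tuple of counts instead of a real embedding. It shows that `lru_cache` keys on the raw argument, so a repeated text is served from the cache rather than recomputed.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=1000)
def toy_vectorizer(text):
    """Hypothetical stand-in for an embedding: (character count, word count)."""
    # collapse every whitespace run to a single space, like the vectorizer above
    clean_text = re.sub(r"\s+", " ", str(text))
    return (len(clean_text), len(clean_text.split()))


first = toy_vectorizer("hello   world\n\nagain")   # computed on the first call
second = toy_vectorizer("hello   world\n\nagain")  # same argument: cache hit
info = toy_vectorizer.cache_info()
print(first, "hits:", info.hits, "misses:", info.misses)
```

Because the cache key is the raw string, two texts that differ only in whitespace are still cached separately, even though they produce the same result after cleaning.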


Labeling Functions

Labeling functions are functions that take a pd.DataFrame row and return a label or abstain.

Inside the function one can do many things, but let's start with simple keywords wrapped in regex:

About the decorator @labeling_function

hover.utils.snorkel_helper.labeling_function(targets, label_encoder=None, **kwargs)

Hover's flavor of the Snorkel labeling_function decorator.

Unlike Snorkel's own decorator, hover expects the decorated function to return the original string label rather than its integer encoding, because hover encodes labels dynamically. In addition, the decorator:

  • assigns a UUID for easy identification
  • keeps track of LF targets

| Param | Type | Description |
| :-- | :-- | :-- |
| targets | list of str | labels that the labeling function is intended to create |
| label_encoder | dict | {decoded_label -> encoded_label} mapping, if you also want an original snorkel-style labeling function linked as a .snorkel attribute |
| **kwargs | | forwarded to snorkel's labeling_function() |
Source code in hover/utils/snorkel_helper.py
def labeling_function(targets, label_encoder=None, **kwargs):
    """
    ???+ note "Hover's flavor of the Snorkel labeling_function decorator."
        However, due to the dynamic label encoding nature of hover,
        the decorated function should return the original string label, not its encoding integer.

        - assigns a UUID for easy identification
        - keeps track of LF targets

        | Param           | Type   | Description                          |
        | :-------------- | :----- | :----------------------------------- |
        | `targets`       | `list` of `str` | labels that the labeling function is intended to create |
        | `label_encoder` | `dict` | {decoded_label -> encoded_label} mapping, if you also want an original snorkel-style labeling function linked as a `.snorkel` attribute |
        | `**kwargs`      |        | forwarded to `snorkel`'s `labeling_function()` |
    """
    # lazy import so that the package does not require snorkel
    # Feb 3, 2022: snorkel's dependency handling is too strict
    # for other dependencies like NumPy, SciPy, SpaCy, etc.
    # Let's cite Snorkel and lazy import or copy functions.
    # DO NOT explicitly depend on Snorkel without confirming
    # that all builds/tests pass by Anaconda standards, else
    # we risk having to drop conda support.
    from snorkel.labeling import (
        labeling_function as snorkel_lf,
        LabelingFunction as SnorkelLF,
    )

    def wrapper(func):
        # set up kwargs for Snorkel's LF
        # a default name that can be overridden
        snorkel_kwargs = {"name": func.__name__}
        snorkel_kwargs.update(kwargs)

        # return value of hover's decorator
        lf = SnorkelLF(f=func, **snorkel_kwargs)

        # additional attributes
        lf.uuid = uuid.uuid1()
        lf.targets = targets[:]

        # link a snorkel-style labeling function if applicable
        if label_encoder:
            lf.label_encoder = label_encoder

            def snorkel_style_func(x):
                return lf.label_encoder[func(x)]

            lf.snorkel = snorkel_lf(**kwargs)(snorkel_style_func)
        else:
            lf.label_encoder = None
            lf.snorkel = None

        return lf

    return wrapper
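
To make the decorator pattern in the source above easier to follow, here is a simplified sketch that does not depend on Snorkel. It is not hover's implementation: `simple_labeling_function`, the `.encoded` attribute, and the `ABSTAIN` placeholder are hypothetical names chosen for illustration. The sketch mirrors only the two ideas from the listing: attaching a UUID and targets to the function, and optionally linking a variant that returns encoded integers.

```python
import uuid

ABSTAIN = "ABSTAIN"  # hypothetical placeholder; hover provides ABSTAIN_DECODED


def simple_labeling_function(targets, label_encoder=None):
    """Simplified stand-in for hover's decorator, without Snorkel."""

    def wrapper(func):
        # metadata attached to the function itself
        func.uuid = uuid.uuid1()
        func.targets = targets[:]  # copy, so later edits to `targets` do not leak in
        func.label_encoder = label_encoder

        if label_encoder:
            # integer-returning variant, analogous to the .snorkel attribute
            def encoded(x):
                return label_encoder[func(x)]

            func.encoded = encoded
        else:
            func.encoded = None
        return func

    return wrapper


encoder = {ABSTAIN: -1, "rec.autos": 0}


@simple_labeling_function(targets=["rec.autos"], label_encoder=encoder)
def mentions_car(text):
    return "rec.autos" if "car" in text.lower() else ABSTAIN


print(mentions_car("My car broke down"), mentions_car.encoded("My car broke down"))
print(mentions_car("Nice weather"), mentions_car.encoded("Nice weather"))
```

Note that the encoder in this sketch includes an entry for the abstain value. The integer-returning variant looks up whatever the original function returns, so a missing key would raise a `KeyError`.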

from hover.utils.snorkel_helper import labeling_function
from hover.module_config import ABSTAIN_DECODED as ABSTAIN
import re


@labeling_function(targets=["rec.autos"])
def auto_keywords(row):
    flag = re.search(
        r"(?i)(diesel|gasoline|automobile|vehicle|drive|driving)", row.text
    )
    return "rec.autos" if flag else ABSTAIN


@labeling_function(targets=["rec.sport.baseball"])
def baseball_keywords(row):
    flag = re.search(r"(?i)(baseball|stadium|\ bat\ |\ base\ )", row.text)
    return "rec.sport.baseball" if flag else ABSTAIN


@labeling_function(targets=["sci.crypt"])
def crypt_keywords(row):
    flag = re.search(r"(?i)(crypt|math|encode|decode|key)", row.text)
    return "sci.crypt" if flag else ABSTAIN


@labeling_function(targets=["talk.politics.guns"])
def guns_keywords(row):
    flag = re.search(r"(?i)(gun|rifle|ammunition|violence|shoot)", row.text)
    return "talk.politics.guns" if flag else ABSTAIN


@labeling_function(targets=["misc.forsale"])
def forsale_keywords(row):
    flag = re.search(r"(?i)(sale|deal|price|discount)", row.text)
    return "misc.forsale" if flag else ABSTAIN


LABELING_FUNCTIONS = [
    auto_keywords,
    baseball_keywords,
    crypt_keywords,
    guns_keywords,
    forsale_keywords,
]


# we will come back to this block later on
# LABELING_FUNCTIONS.pop(-1)
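
Before using keyword patterns like the ones above, it can help to try them on a few hand-written rows. The sketch below is a hedged illustration that needs neither hover nor pandas: `SimpleNamespace` objects stand in for DataFrame rows with a `.text` attribute, and `ABSTAIN` is a hypothetical placeholder for hover's abstain value. It compares a pattern that pads short words with literal spaces against one that uses `\b` word boundaries.

```python
import re
from types import SimpleNamespace

ABSTAIN = "ABSTAIN"  # hypothetical placeholder for hover's ABSTAIN_DECODED


def baseball_spaces(row):
    # literal spaces around short words, similar in spirit to the pattern above
    flag = re.search(r"(?i)(baseball|stadium| bat | base )", row.text)
    return "rec.sport.baseball" if flag else ABSTAIN


def baseball_boundaries(row):
    # \b also matches next to punctuation and at the start or end of the text
    flag = re.search(r"(?i)\b(baseball|stadium|bat|base)\b", row.text)
    return "rec.sport.baseball" if flag else ABSTAIN


rows = [
    SimpleNamespace(text="He swung the bat."),
    SimpleNamespace(text="The database crashed."),
    SimpleNamespace(text="Stole second base easily"),
]

for row in rows:
    print(row.text, "->", baseball_spaces(row), "|", baseball_boundaries(row))
```

The space-padded pattern misses "bat." because the word is followed by punctuation, while both patterns correctly ignore "database". Which variant is preferable depends on the data.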


Using a Function to Apply Labels

Hover's SnorkelExplorer (snorkel for short) can take the labeling functions above and apply them to the areas of data that you choose. The widget below is responsible for labeling:

Showcase widgets here are not interactive

Plotted widgets on this page are not interactive, but only for illustration.

Widgets will be interactive when you actually use them (in your local environment or server apps like in the quickstart).

  • be sure to use a whole recipe rather than individual widgets.
  • if you really want to plot interactive widgets on their own, try from hover.utils.bokeh_helper import show_as_interactive as show instead of from bokeh.io import show.
    • this works in your own environment but still not on the documentation page.
    • show_as_interactive is a simple tweak of bokeh.io.show by turning standalone LayoutDOM to an application.

from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

# special configuration for this remotely hosted tutorial
from local_lib.binder_helper import remote_jupyter_proxy_url
notebook_url = remote_jupyter_proxy_url

from hover.recipes.subroutine import standard_snorkel

snorkel_plot = standard_snorkel(dataset)
snorkel_plot.subscribed_lf_list = LABELING_FUNCTIONS
show(snorkel_plot.lf_apply_trigger, notebook_url=notebook_url)


Using a Function to Apply Filters

Any function that labels is also a function that filters. The filter condition is "keep if did not abstain". The widget below handles filtering:

show(snorkel_plot.lf_filter_trigger, notebook_url=notebook_url)


Unlike the toggled filters for finder and softlabel, filtering with functions happens on a per-click basis. In other words, this filter does not persist when you select another area.
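
The "keep if did not abstain" condition can be sketched in plain Python. This is a conceptual illustration, not hover's internal code: `filter_by_function` is a hypothetical helper, `SimpleNamespace` objects stand in for DataFrame rows, and `ABSTAIN` is a placeholder for hover's abstain value.

```python
from types import SimpleNamespace

ABSTAIN = "ABSTAIN"  # hypothetical placeholder for hover's ABSTAIN_DECODED


def forsale_keywords(row):
    text = row.text.lower()
    hit = any(word in text for word in ("sale", "deal", "price", "discount"))
    return "misc.forsale" if hit else ABSTAIN


def filter_by_function(rows, func, abstain=ABSTAIN):
    """Keep only the rows on which the labeling function did not abstain."""
    return [row for row in rows if func(row) != abstain]


selection = [
    SimpleNamespace(text="Bike for sale, good price"),
    SimpleNamespace(text="How do I change a tire?"),
    SimpleNamespace(text="Huge discount on monitors"),
]

kept = filter_by_function(selection, forsale_keywords)
print([row.text for row in kept])
```

The label that the function returns is irrelevant to the filter; only the distinction between abstaining and not abstaining matters.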

Dynamic List of Functions

Python lists are mutable, and we are going to take advantage of that for improvising and editing labeling functions on the fly.
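
Why in-place edits matter can be seen in a small sketch. It assumes that the consumer of the list keeps a reference to the same list object; `FunctionHolder` is a hypothetical stand-in for such a consumer, not a hover class. In-place operations such as `append` and `pop` are visible through that reference, while rebinding the variable to a new list is not.

```python
class FunctionHolder:
    """Hypothetical stand-in for an app that keeps a reference to the list."""

    def __init__(self, functions):
        self.functions = functions  # the same list object, not a copy

    def refresh(self):
        return [func.__name__ for func in self.functions]


def lf_a(row):
    return "A"


def lf_b(row):
    return "B"


def lf_c(row):
    return "C"


lfs = [lf_a, lf_b]
holder = FunctionHolder(lfs)

lfs.append(lf_c)  # in-place edit: the holder sees it
after_append = holder.refresh()

lfs.pop(0)  # in-place edit: the holder sees it
after_pop = holder.refresh()

lfs = [lf_a]  # rebinding creates a new list: the holder still has the old one
after_rebind = holder.refresh()

print(after_append, after_pop, after_rebind)
```

Under this assumption, edits should use list methods (`append`, `pop`, `remove`, slice assignment) rather than assigning a brand-new list to the same variable name.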

Run the block below and open the resulting URL to launch a recipe.

  • labeling functions are evaluated against the dev set.
    • hence you are advised to send the labels produced by these functions to the train set, not the dev set.
  • come back and edit the list of labeling functions in-place in one of the code cells above.
    • then go to the launched app and refresh the functions!

from hover.recipes.experimental import snorkel_crosscheck

interactive_plot = snorkel_crosscheck(dataset, LABELING_FUNCTIONS)

# ---------- SERVER MODE: for the documentation page ----------
# because this tutorial is remotely hosted, we need explicit serving to expose the plot to you
from local_lib.binder_helper import binder_proxy_app_url
from bokeh.server.server import Server
server = Server({'/my-app': interactive_plot}, port=5007, allow_websocket_origin=['*'], use_xheaders=True)
server.start()
# visit the URL printed in the cell output to see the interactive plot; locally you would just visit "http://localhost:5007/my-app"
binder_proxy_app_url('my-app', port=5007)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='localhost:8888')

What's really cool is that in your local environment, this update-and-refresh operation can be done entirely within a notebook. So now you can

  • interactively evaluate and revise labeling functions
  • visually assign specific data regions to apply those functions

which helps make labeling functions more accurate and more widely applicable.
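
As a rough idea of what evaluating a labeling function against a dev set can mean, the sketch below computes two common quantities: coverage (the fraction of rows where the function did not abstain) and accuracy on the covered rows. It is an assumption-laden illustration, not the metric code that hover or Snorkel uses: `evaluate` is a hypothetical helper, the rows are hand-written `SimpleNamespace` objects, and `ABSTAIN` is a placeholder.

```python
from types import SimpleNamespace

ABSTAIN = "ABSTAIN"  # hypothetical placeholder for hover's ABSTAIN_DECODED


def guns_keywords(row):
    return "talk.politics.guns" if "gun" in row.text.lower() else ABSTAIN


def evaluate(func, rows, abstain=ABSTAIN):
    """Return (coverage, accuracy on covered rows) for one labeling function."""
    predictions = [func(row) for row in rows]
    fired = [(pred, row.label) for pred, row in zip(predictions, rows) if pred != abstain]
    coverage = len(fired) / len(rows) if rows else 0.0
    # accuracy is undefined when the function abstained everywhere
    accuracy = sum(pred == label for pred, label in fired) / len(fired) if fired else None
    return coverage, accuracy


dev_rows = [
    SimpleNamespace(text="New gun laws debated", label="talk.politics.guns"),
    SimpleNamespace(text="Burgundy paint for sale", label="misc.forsale"),
    SimpleNamespace(text="Pitcher threw a no-hitter", label="rec.sport.baseball"),
    SimpleNamespace(text="Rifle ranges reopen", label="talk.politics.guns"),
]

coverage, accuracy = evaluate(guns_keywords, dev_rows)
print("coverage:", coverage, "accuracy:", accuracy)
```

Here the substring check fires on "Burgundy" by accident and misses the row about rifles, which is the kind of issue that revising a function and re-evaluating it is meant to surface.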