How to annotate data using Label Studio with ZenML
Label Studio is one of the leading open-source annotation platforms available to data scientists and ML practitioners. It is used to create or edit datasets that you can then use as part of training or validation workflows. It supports a broad range of annotation types, including:
- Computer Vision (image classification, object detection, semantic segmentation)
- Audio & Speech (classification, speaker diarization, emotion recognition, audio transcription)
- Text / NLP (classification, NER, question answering, sentiment analysis)
- Time Series (classification, segmentation, event recognition)
- Multi Modal / Domain (dialogue processing, OCR, time series with reference)
If you need to label data as part of your ML workflow, that is the point at which you could consider adding in the optional annotator stack component as part of your ZenML stack.
We currently support the use of annotation at the various stages described in the main annotators docs page, and also offer custom utility functions to generate Label Studio label config files for image classification and object detection. (More will follow in due course.)
The Label Studio integration currently is built to support workflows using the following three cloud artifact stores: AWS S3, GCP/GCS and Azure Blob Storage. Purely local stacks will currently not work if you want to do add the annotation stack component as part of your stack.
The Label Studio Annotator flavor is provided by the Label Studio ZenML integration, you need to install it, to be able to register it as an Annotator and add it to your stack:
zenml integration install label_studio
The following instructions below offer a general guide to the parts that need attention when deploying / using the Label Studio stack component and integration. The
label_studio_annotationexample offers a detailed guide for each cloud provider on how to set it up.
Before registering a
label_studioflavor stack component as part of your stack, you'll need to have registered a cloud artifact store and a secrets manager to handle authentication with Label Studio as well as any secrets required for the Artifact Store. (See the docs on how to register and setup a cloud artifact store as well as a secrets manager.)
Be sure to register an secret schema for whichever artifact store you choose, and then you should make sure to pass the name of that secret into the artifact store as the
--authentication_secretas described in this guide, for example in the case of AWS.
You will next need to obtain your Label Studio API key. This will give you access to the web annotation interface. (The following steps apply to a local instance of Label Studio, but feel free to obtain your API key directly from your deployed instance if that's what you are using.)
# choose a username and password for your label-studio account
label-studio reset_password --username <USERNAME> --password <PASSWORD>
# start a temporary / one-off label-studio instance to get your API key
label-studio start -p 8094
At this point you should register the API key with your secrets manager under a custom secret name, making sure to replace the two parts in
<>with whatever you choose:
zenml secrets-manager secret register <LABEL_STUDIO_SECRET_NAME> --api_key="<your_label_studio_api_key>"
Then register your annotator with ZenML:
zenml annotator register label_studio --flavor label_studio --authentication_secret="<LABEL_STUDIO_SECRET_NAME>"
# for deployed instances of Label Studio, you can also pass in the URL as follows, for example:
# zenml annotator register label_studio --flavor label_studio --authentication_secret="<LABEL_STUDIO_SECRET_NAME>" --instance_url="<your_label_studio_url>" --port=80
When using a deployed instance of Label Studio, the instance URL must be specified without any trailing
/at the end. You should specify the port, for example port 80 for a standard http connection.
Finally, add all these components to a stack and set it as your active stack. For example:
zenml stack copy annotation
zenml stack update annotation -x <YOUR_SECRETS_MANAGER> -a <YOUR_CLOUD_ARTIFACT_STORE>
# this must be done separately so that the other required stack components are first registered
zenml stack update annotation -an <YOUR_LABEL_STUDIO_ANNOTATOR>
zenml stack set annotation
# optionally also
zenml stack describe
Now if you run a simple CLI command like
zenml annotator dataset listthis should work without any errors. You're ready to use your annotator in your ML workflow!
ZenML assumes that users have registered a cloud artifact store, a secrets manager and an annotator as described above. ZenML currently only supports this setup, but we will add in the fully local stack option in the future.
ZenML supports access to your data and annotations via the
zenml annotator ...CLI command.
You can access information about the datasets you're using with the
zenml annotator dataset list. To work on annotation for a particular dataset, you can run
zenml annotator dataset annotate <dataset_name>.
Our full continuous annotation / training example is the best place to see how all the pieces of making this integration work fit together. What follows is an overview of some key components to the Label Studio integration and how it can be used.
Our Label Studio annotator component inherits from the
BaseAnnotatorclass. There are some methods that are core methods that must be defined, like being able to register or get a dataset. Most annotators handle things like the storage of state and have their own custom features, so there are quite a few extra methods specific to Label Studio.
The core Label Studio functionality that's currently enabled includes a way to register your datasets, export any annotations for use in separate steps as well as to start the annotator daemon process. (Label Studio requires a server to be running in order to use the web interface, and ZenML handles the provisioning of this server locally using the details you passed in when registering the component, unless you've specified that you want to use a deployed instance.)
ZenML offers some standard steps (and their associated config objects) which will get you up and running with the Label Studio integration quickly. These include:
LabelStudioDatasetRegistrationConfig- a step config object to be used when registering a dataset with Label studio using the
LabelStudioDatasetSyncConfig- a step config object to be used when registering a dataset with Label studio using the
sync_new_data_to_label_studiostep. Note that this requires a secret schema to have been pre-registered with your artifact store as being the one that holds authentication secrets specific to your particular cloud provider. (Label Studio provides some documentation on what permissions these secrets require here.)
get_or_create_datasetstep - This takes a
LabelStudioDatasetRegistrationConfigconfig object which includes the name of the dataset. If it exists, this step will return the name, but if it doesn't exist then ZenML will register the dataset along with the appropriate label config with Label Studio.
get_labeled_datastep - This step will get all labeled data available for a particular dataset. Note that these are output in a Label Studio annotation format, which will subsequently converted into a format appropriate for your specific use case.
sync_new_data_to_label_studiostep - This step is for ensuring that ZenML is handling the annotations and the files being used are stored and synced with the ZenML cloud artifact store. This is an important step as part of a continuous annotation workflow since you want all the subsequent steps of your workflow to remain in sync with whatever new annotations are being made or have been created.
Label Studio requires the use of what it calls 'label config' when you are creating/registering your dataset. These are strings containing HTML-like syntax that allow you to define a custom interface for your annotation. ZenML provides three helper functions that will construct these label config strings in the case of object detection, image classification and OCR. See the
integrations.label_studio.label_config_generatorsmodule for those three functions.