Label Studio
How to annotate data using Label Studio with ZenML
Label Studio is one of the leading open-source annotation platforms available to data scientists and ML practitioners. It is used to create or edit datasets that you can then use as part of training or validation workflows. It supports a broad range of annotation types, including:
Computer Vision (image classification, object detection, semantic segmentation)
Audio & Speech (classification, speaker diarization, emotion recognition, audio transcription)
Text / NLP (classification, NER, question answering, sentiment analysis)
Time Series (classification, segmentation, event recognition)
Multi Modal / Domain (dialogue processing, OCR, time series with reference)
When would you want to use it?
If you need to label data as part of your ML workflow, consider adding the optional annotator stack component to your ZenML stack.
The Label Studio integration is currently built to support workflows using the following three cloud artifact stores: AWS S3, GCP/GCS and Azure Blob Storage. Purely local stacks will currently not work if you want to add the annotator stack component to your stack.
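As a rough illustration of that constraint, the sketch below checks whether an artifact store path points at one of the three supported cloud providers. The function name and the URI prefixes are assumptions made for this example (they follow common conventions for S3, GCS and Azure Blob Storage paths) and are not taken from the ZenML codebase.

```python
# Assumed URI prefixes for the three supported cloud artifact stores.
# These follow common conventions and are illustrative only.
CLOUD_PREFIXES = ("s3://", "gs://", "az://", "abfs://")


def is_cloud_artifact_store_path(path: str) -> bool:
    """Return True if the path looks like an S3, GCS or Azure Blob URI."""
    return path.startswith(CLOUD_PREFIXES)
```

A local path such as /home/user/artifacts would fail this check, which mirrors the note that purely local stacks are not supported with the annotator.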
COMING SOON: The Label Studio Integration supports the use of annotations in an ML workflow, but we do not currently handle the universal conversion between data formats as part of the training workflow. Our initial use case was built to support image classification and object detection, but we will add helper steps and functions for other use cases in due course. We will update the docs when we enable this functionality.
How to deploy it?
The Label Studio Annotator flavor is provided by the Label Studio ZenML integration. You need to install this integration before you can register the annotator and add it to your stack:
You will next need to obtain your Label Studio API key. This will give you access to the web annotation interface. (The following steps apply to a local instance of Label Studio, but feel free to obtain your API key directly from your deployed instance if that's what you are using.)
At this point you should register the API key under a custom secret name, making sure to replace the two parts in <> with whatever you choose:
Then register your annotator with ZenML:
When using a deployed instance of Label Studio, the instance URL must be specified without a trailing slash (/). You should also specify the port, for example port 80 for a standard HTTP connection.
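The two requirements for deployed instances (no trailing slash, an explicit port) can be sketched with a small helper. This is a hypothetical function for illustration, not part of ZenML. It assumes the URL uses http or https, and falls back to the conventional default port (80 or 443) when the URL does not contain one.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Conventional default ports per scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_instance_url(url: str) -> Tuple[str, int]:
    """Strip trailing slashes and work out which port to specify.

    Hypothetical helper: returns the cleaned URL and the port to use.
    """
    cleaned = url.strip().rstrip("/")
    parts = urlsplit(cleaned)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"Expected an http(s) URL with a host, got: {url!r}")
    # Prefer a port given in the URL, otherwise use the scheme default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return cleaned, port
```

The example hostnames used to try this out (such as labelstudio.example.com) are placeholders. Substitute the address of your own deployment.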
Finally, add all these components to a stack and set it as your active stack. For example:
Now if you run a simple CLI command like zenml annotator dataset list, it should work without any errors. You're ready to use your annotator in your ML workflow!
How do you use it?
ZenML assumes that users have registered a cloud artifact store and an annotator as described above. ZenML currently only supports this setup, but we will add in the fully local stack option in the future.
ZenML supports access to your data and annotations via the zenml annotator ... CLI command.
You can access information about the datasets you're using with the zenml annotator dataset list command. To work on annotation for a particular dataset, you can run zenml annotator dataset annotate <dataset_name>.
Label Studio Annotator Stack Component
Our Label Studio annotator component inherits from the BaseAnnotator class. Some core methods must be defined, such as the ones for registering or getting a dataset. Most annotators handle things like the storage of state and have their own custom features, so there are quite a few extra methods specific to Label Studio.
The core Label Studio functionality that's currently enabled includes a way to register your datasets, export any annotations for use in separate steps, and start the annotator daemon process. (Label Studio requires a server to be running in order to use the web interface, and ZenML handles the provisioning of this server locally using the details you passed in when registering the component, unless you've specified that you want to use a deployed instance.)
Standard Steps
ZenML offers some standard steps (and their associated config objects) which will get you up and running with the Label Studio integration quickly. These include:
LabelStudioDatasetRegistrationConfig - A step config object to be used when registering a dataset with Label Studio using the get_or_create_dataset step.
get_or_create_dataset step - This takes a LabelStudioDatasetRegistrationConfig config object which includes the name of the dataset. If the dataset exists, this step will return the name, but if it doesn't exist then ZenML will register the dataset along with the appropriate label config with Label Studio.
get_labeled_data step - This step will get all labeled data available for a particular dataset. Note that these are output in a Label Studio annotation format, which will subsequently be converted into a format appropriate for your specific use case.
sync_new_data_to_label_studio step - This step ensures that ZenML is handling the annotations and that the files being used are stored in and synced with the ZenML cloud artifact store. This is an important step as part of a continuous annotation workflow since you want all the subsequent steps of your workflow to remain in sync with whatever new annotations are being made or have been created.
Helper Functions
Label Studio requires the use of what it calls 'label config' when you are creating/registering your dataset. These are strings containing HTML-like syntax that allow you to define a custom interface for your annotation. ZenML provides three helper functions that will construct these label config strings in the case of object detection, image classification and OCR. See the integrations.label_studio.label_config_generators module for those three functions.
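To give a sense of what such a label config string looks like, here is a minimal sketch that builds one for image classification using Label Studio's View, Image, Choices and Choice tags. The function is hypothetical and is not one of the ZenML helpers. The control names image and choice are arbitrary picks for this example.

```python
import xml.etree.ElementTree as ET
from typing import List


def image_classification_label_config(labels: List[str]) -> str:
    """Build a Label Studio label config string for image classification.

    Hypothetical helper for illustration only.
    """
    if not labels:
        raise ValueError("At least one label is required.")
    view = ET.Element("View")
    # The image to annotate is read from the task's "image" data field.
    ET.SubElement(view, "Image", {"name": "image", "value": "$image"})
    # The choices control is attached to the image via toName.
    choices = ET.SubElement(view, "Choices", {"name": "choice", "toName": "image"})
    for label in labels:
        ET.SubElement(choices, "Choice", {"value": label})
    return ET.tostring(view, encoding="unicode")


label_config = image_classification_label_config(["cat", "dog"])
```

Building the string through an XML library rather than by concatenation means label names containing characters like & or < are escaped automatically.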