Prodigy
Annotating data using Prodigy.
Last updated
Annotating data using Prodigy.
Last updated
Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.
Prodigy is a paid annotation tool. You will need a Prodigy is a paid tool. A license is required to download and use it with ZenML.
The Prodigy Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation.
If you need to label data as part of your ML workflow, that is the point at which you could consider adding the optional annotator stack component as part of your ZenML stack.
The Prodigy Annotator flavor is provided by the Prodigy ZenML integration. You need to install it to be able to register it as an Annotator and add it to your stack:
Note that you'll need to install Prodigy separately since it requires a license. Please visit the Prodigy docs for information on how to install it. Currently Prodigy also requires the urllib3<2
dependency, so make sure to install that.
Then register your annotator with ZenML:
See https://prodi.gy/docs/install#config for more on custom Prodigy config files. Passing a custom_config_path
allows you to override the default Prodigy config.
Finally, add all these components to a stack and set it as your active stack. For example:
Now if you run a simple CLI command like zenml annotator dataset list
this should work without any errors. You're ready to use your annotator in your ML workflow!
With Prodigy, there is no need to specially start the annotator ahead of time like with Label Studio. Instead, just use Prodigy as per the Prodigy docs and then you can use the ZenML wrapper / API to get your labeled data etc using our Python methods.
ZenML supports access to your data and annotations via the zenml annotator ...
CLI command.
You can access information about the datasets you're using with the zenml annotator dataset list
. To work on annotation for a particular dataset, you can run zenml annotator dataset annotate <DATASET_NAME> <CUSTOM_COMMAND>
. This is the equivalent of running prodigy <CUSTOM_COMMAND>
in the terminal. For example, you might run:
This would launch the Prodigy interface for the textcat.manual
recipe with the news_topics
dataset and the labels Technology
, Politics
, Economy
, and Entertainment
. The data would be loaded from the news_headlines.jsonl
file.
A common workflow for Prodigy is to annotate data as you would usually do, and then use the connection into ZenML to import those annotations within a step in your pipeline (if running locally). For example, within a ZenML step:
If you're running in a cloud environment, you can manually export the annotations, store them somewhere in a cloud environment and then reference or use those within ZenML. The precise way you do this will be very case-dependent, however, so it's difficult to provide a one-size-fits-all solution.
Our Prodigy annotator component inherits from the BaseAnnotator
class. There are some methods that are core methods that must be defined, like being able to register or get a dataset. Most annotators handle things like the storage of state and have their own custom features, so there are quite a few extra methods specific to Prodigy.
The core Prodigy functionality that's currently enabled from within the annotator
stack component interface includes a way to register your datasets and export any annotations for use in separate steps.