Commit

Images & apply review comments
nataliaElv committed Nov 21, 2024
1 parent 3f54096 commit b2bf23b
Showing 4 changed files with 19 additions and 17 deletions.
6 changes: 3 additions & 3 deletions chapters/en/chapter10/1.mdx
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# Introduction to Argilla[[introduction-to-argilla]]

In Chapter 5 you learnt how to build a dataset using the 🤗 Datasets library and in Chapter 6 you explored how to fine-tune models for some common NLP tasks. In this chapter, you will learn how to use Argilla to **curate datasets** that you can use to train and evaluate your models.
In Chapter 5 you learnt how to build a dataset using the 🤗 Datasets library and in Chapter 6 you explored how to fine-tune models for some common NLP tasks. In this chapter, you will learn how to use Argilla to **annotate and curate datasets** that you can use to train and evaluate your models.

The key to training models that perform well is to have high-quality data. Although there are some good datasets in the Hub that you could use to train and evaluate your models, these may not be relevant for your specific application or use case. In this scenario, you may want to build and curate a dataset of your own. Argilla will help you to do this efficiently.

With Argilla you can:

Expand All @@ -9,8 +11,6 @@ With Argilla you can:
- gather **human feedback** for LLMs and multi-modal models.
- invite experts to collaborate with you in Argilla, or crowdsource annotations!

The key to training models that perform well is to have high-quality data. Although there are some good datasets in the Hub that you could use to train and evaluate your models, these may not be relevant for your specific application or use case. In this scenario, you may want to build and curate a dataset of your own. Argilla will help you to do this efficiently.

Here are some of the things that you will learn in this chapter:

- How to set up your own Argilla instance.
Expand Down
23 changes: 11 additions & 12 deletions chapters/en/chapter10/4.mdx
Original file line number Diff line number Diff line change
@@ -1,9 +1,5 @@
# Annotate your dataset

🚧 WIP 🚧

##TODO: Add screenshots!

Now it's time to start working in the Argilla UI to annotate our dataset.

## Align your team with annotation guidelines
Expand All @@ -12,6 +8,10 @@ Before you start annotating your dataset, it is always good practice to write so

In Argilla, you can go to your dataset settings page in the UI and modify the guidelines and the descriptions of your questions to help with alignment.

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_dataset_settings.png" alt="Screenshot of the Dataset Settings page in Argilla." width="80%"/>

If you want to dive deeper into the topic of how to write good guidelines, we recommend reading [this blogpost](https://argilla.io/blog/annotation-guidelines-practices) and the bibliographical references mentioned there.

## Distribute the task

In the dataset settings page, you can also change the dataset distribution settings. This will help you annotate more efficiently when you're working as part of a team. The default value for the minimum submitted responses is 1, meaning that as soon as a record has 1 submitted response it will be considered complete and count towards the progress in your dataset.
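The completion rule described above can be sketched in plain Python. This is a conceptual illustration of how task distribution works, not Argilla's internal code; the `is_complete` helper is hypothetical:

```python
def is_complete(submitted_responses: int, min_submitted: int = 1) -> bool:
    """A record counts as complete (and towards dataset progress) once it
    reaches the minimum number of submitted responses configured in the
    dataset's distribution settings."""
    return submitted_responses >= min_submitted

# With the default of 1, a single submitted response completes the record
print(is_complete(1))  # True

# If you raise the minimum to 2 (e.g. to measure inter-annotator agreement),
# one submitted response is no longer enough
print(is_complete(1, min_submitted=2))  # False
```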
Expand All @@ -23,18 +23,17 @@ Sometimes, you want to have more than one submitted response per record, for exa
>[!TIP]
>💡 If you are deploying Argilla in a Hugging Face Space, any team members will be able to log in using the Hugging Face OAuth. Otherwise, you may need to create users for them following [this guide](https://docs.argilla.io/latest/how_to_guides/user/).
When you open your dataset, you will realize that the first question is already filled in with some suggested labels. That's because in the previous section we mapped our question called `label` to the `label_text` column in the dataset, so that we simply need to review and correct the already existing labels. For the token classification, we'll need to add all labels manually, as we didn't include any suggestions.
When you open your dataset, you will realize that the first question is already filled in with some suggested labels. That's because in the previous section we mapped our question called `label` to the `label_text` column in the dataset, so that we simply need to review and correct the already existing labels:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_initial%20dataset.png" alt="Screenshot of the dataset in Argilla." width="80%"/>

For the token classification, we'll need to add all labels manually, as we didn't include any suggestions. This is how it might look after the span annotations:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_dataset_with_spans.png" alt="Screenshot of the dataset in Argilla with spans annotated." width="80%"/>

As you move through the different records, there are different actions you can take:
- submit your responses, once you're done with the record.
- save them as a draft, in case you want to come back to them later.
- discard them, if the record shouldn't be part of the dataset or you won't provide responses for it.

In the next section, you will learn how you can export and use those annotations.

---
Examples of images from other chapters:
<a class="flex justify-center" href="/huggingface-course/bert-finetuned-ner">
<img class="block dark:hidden lg:w-3/5" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/model-eval-bert-finetuned-ner.png" alt="One-hot encoded labels for question answering."/>
<img class="hidden dark:block lg:w-3/5" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/model-eval-bert-finetuned-ner-dark.png" alt="One-hot encoded labels for question answering."/>
</a>
3 changes: 1 addition & 2 deletions chapters/en/chapter10/5.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,7 @@ filtered_records = dataset.records(status_filter)
```

>[!TIP]
>⚠️ Note that the records could have more than one response and that each of them can have any status from `submitted`, `draft` or `discarded`.
>⚠️ Note that records with `completed` status (i.e., records that meet the minimum submitted responses configured in the task distribution settings) could have more than one response, and that each response can have any of the statuses `submitted`, `draft`, or `discarded`.
Learn more about querying and filtering records in the [Argilla docs](https://docs.argilla.io/latest/how_to_guides/query/).
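As a conceptual sketch of the tip above, a single completed record can hold responses with mixed statuses. Counting them per status might look like this (plain Python over hypothetical response data, not the Argilla SDK):

```python
from collections import Counter

# Hypothetical responses attached to one completed record,
# each submitted by a different annotator
responses = [
    {"user": "annotator_1", "status": "submitted"},
    {"user": "annotator_2", "status": "draft"},
    {"user": "annotator_3", "status": "discarded"},
]

status_counts = Counter(response["status"] for response in responses)

# The record is complete (it has a submitted response), yet other
# annotators' responses remain in draft or discarded states
print(status_counts["submitted"])  # 1
print(status_counts["draft"])      # 1
```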

Expand Down
4 changes: 4 additions & 0 deletions chapters/en/chapter10/7.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,10 @@ Let's test what you learned in this chapter!
{
text: "Train your model",
explain: "You cannot train a model directly in Argilla, but you can use the data you curate in Argilla to train your own model",
},
{
text: "Generate synthetic datasets",
explain: "To generate synthetic datasets, you can use the distilabel package and then use Argilla to review and curate the generated data.",
}
]}
/>
Expand Down
