Merge pull request #1193 from kritinv/tutorial-updates
Tutorial Updates
penguine-ip authored Nov 28, 2024
2 parents 29b3c8b + 44ac047 commit 33b5d9d
Showing 6 changed files with 37 additions and 56 deletions.
8 changes: 4 additions & 4 deletions docs/docs/tutorial-dataset-synthesis.mdx
@@ -44,7 +44,7 @@ styling_config = StylingConfig(
In addition to styling, DeepEval lets you **customize** other parts of the generation process, from context construction to data evolutions.
:::

### 3. Goldens generation
### Goldens generation

With our configurations defined, let's finally begin **generating synthetic goldens**. You can set `goldens_per_context` to generate as many goldens per context as you'd like. For this tutorial, we'll set this parameter to 2, since coverage across different contexts is more important than depth within any single one.
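
A minimal sketch of what this step might look like, assuming the `styling_config` defined earlier and a local copy of the knowledge-base PDF (the file path is a placeholder, and `max_goldens_per_context` should be checked against your DeepEval version, since the tutorial refers to it as `goldens_per_context`):

```python
from deepeval.synthesizer import Synthesizer

# Reuse the StylingConfig defined earlier in this tutorial
synthesizer = Synthesizer(styling_config=styling_config)

# Generate 2 goldens per retrieved context from the knowledge base
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./data/gale_encyclopedia_of_medicine.pdf"],  # placeholder path
    max_goldens_per_context=2,
)
print(f"Generated {len(goldens)} goldens")
```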

@@ -85,7 +85,7 @@ You can see that even though the input for this synthetic golden is simple, it r
You can increase the complexity of the generated goldens by configuring the **evolution settings** when initializing the `Synthesizer` object.
:::

### 3. Additional Styling Configurations
### Additional Styling Configurations

It's also important to explore additional styling configurations when generating your datasets. Using multiple styling configurations allows you to generate a truly **diverse dataset** that is not only comprehensive but also captures edge cases.
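
For instance, a second configuration might target deliberately vague, under-specified patient queries, matching the "Ambiguous Synthetic Test" alias used when pushing the dataset below; the field values here are illustrative rather than taken from the tutorial:

```python
from deepeval.synthesizer.config import StylingConfig

# Illustrative styling for ambiguous, under-specified patient queries
ambiguous_styling_config = StylingConfig(
    input_format="Short, vague descriptions of symptoms with key details missing.",
    expected_output_format="Clarifying follow-up questions before any diagnosis is offered.",
    task="Answering medical queries from patients of a healthcare provider.",
    scenario="Patients describing symptoms ambiguously or incompletely.",
)
```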

@@ -143,7 +143,7 @@ dataset.push(alias="Ambiguous Synthetic Test")

Generating synthetic data from documents requires a knowledge base, meaning the generated goldens are designed to test user queries that prompt the LLM to use the RAG engine. However, since our medical chatbot operates as an Agentic RAG, there are cases where the LLM **does not invoke the RAG tool**, necessitating the generation of data from scratch without any context.

### 1. Defining Style Configuration
### Defining Style Configuration

Similar to generating from documents, you'll want to **customize the output style and format** of any `input` and/or `expected_output` when generating synthetic goldens from scratch. Here, your creativity is the only limit: you can test your LLM on any interaction you can foresee. In the example below, we'll define user inputs that attempt to book an appointment by providing a name and an email address.

@@ -169,7 +169,7 @@ styling_config = StylingConfig(
)
```

### 2. Generating the Goldens
### Generating the Goldens

The next step is to simply initialize your synthesizer with the styling configurations and push the dataset to Confident AI for review.
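
A sketch of how that might look, reusing the `styling_config` above (the number of goldens and the dataset alias are arbitrary choices for illustration):

```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

synthesizer = Synthesizer(styling_config=styling_config)

# No documents or retrieval context are needed when generating from scratch
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=25)

# Push to Confident AI so the goldens can be reviewed and edited in the UI
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="Appointment Booking Synthetic Test")
```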

12 changes: 6 additions & 6 deletions docs/docs/tutorial-llm-application-example.mdx
@@ -12,7 +12,7 @@ In this section, we will be developing an **Agentic RAG** medical chatbot to rec
You may use whatever **knowledge base** you have to power your RAG Engine, but for the purposes of this tutorial, we'll be using [The Gale Encyclopedia of Alternative Medicine](https://staibabussalamsula.ac.id/wp-content/uploads/2024/06/The-Gale-Encyclopedia-of-Medicine-3rd-Edition-staibabussalamsula.ac_.id_.pdf).
:::

## 1. Setting Up
## Setting Up

Begin by installing the necessary packages. We'll use `llama-index` as our RAG framework and `chromadb` for vector indexing.

@@ -33,7 +33,7 @@ class MedicalAppointment(BaseModel):
symptoms: Optional[str] = None
diagnosis: Optional[str] = None
```
## 2. Defining the Chatbot
## Defining the Chatbot

Next, we'll create a `MedicalAppointmentSystem` class to represent our agent. This class will store all `MedicalAppointment` instances in an `appointments` dictionary, with each key representing a unique user.
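
As a sketch, and assuming the `MedicalAppointment` model defined above, the class starts out as little more than that dictionary (the full version is built up over the rest of this section):

```python
from typing import Dict

class MedicalAppointmentSystem:
    def __init__(self):
        # Maps a unique user/appointment ID to its MedicalAppointment record
        self.appointments: Dict[str, MedicalAppointment] = {}
```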

@@ -44,7 +44,7 @@ class MedicalAppointmentSystem:
```
As we progress through this tutorial, we'll gradually enhance this class until it evolves into a fully functional medical chatbot agent.

## 3. Indexing the Knowledge Base
## Indexing the Knowledge Base

Let's start by building our **RAG engine**, which will handle all patient diagnoses. The first step is to load the relevant medical information chunks from our knowledge base into the system. We'll use the `SimpleDirectoryReader` from `llama-index` to accomplish this.
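
A sketch of the loading step, assuming the encyclopedia PDF has been downloaded locally (the path is a placeholder):

```python
from llama_index.core import SimpleDirectoryReader

# Load the knowledge base into llama-index Document objects for indexing
documents = SimpleDirectoryReader(
    input_files=["./data/gale_encyclopedia_of_medicine.pdf"]  # placeholder path
).load_data()
print(f"Loaded {len(documents)} documents")
```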

@@ -89,7 +89,7 @@ class MedicalAppointmentSystem:
self.index = VectorStoreIndex.from_documents(self.documents, storage_context=storage_context)
```

## 4. Building the Tools
## Building the Tools

Finally, we'll create the tools for our chatbot: the **RAG engine** and the **function-calling tools** responsible for creating, updating, and managing medical appointments, making the system both dynamic and interactive.
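
A rough sketch of what those tools might look like inside the class; the tool name and description are illustrative, and `self.index` and `self.record_diagnosis` are assumed from the surrounding steps:

```python
from llama_index.core.tools import FunctionTool, QueryEngineTool

class MedicalAppointmentSystem:
    # ...continuing the class from the previous steps...

    def setup_tools(self):
        # RAG tool: wraps the vector index built earlier as a diagnosis engine
        self.diagnosis_tool = QueryEngineTool.from_defaults(
            query_engine=self.index.as_query_engine(),
            name="medical_diagnosis",
            description="Looks up symptoms and conditions in the medical knowledge base.",
        )
        # Function-calling tool: records a diagnosis against an appointment
        self.record_diagnosis_tool = FunctionTool.from_defaults(fn=self.record_diagnosis)
```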

@@ -181,7 +181,7 @@ def record_diagnosis(self, appointment_id: str, diagnosis: str) -> str:
return "Diagnosis cannot be recorded. Please tell me more about your symptoms."
```

## 5. Assembling the Chatbot
## Assembling the Chatbot

Now that we have set up the tools and data systems, it's time to assemble the chatbot agent. We'll use LlamaIndex's `FunctionCallingAgent` to dynamically manage user interactions and choose the appropriate tool based on the input and context. This involves defining the LLM, system prompt, and tool integrations.
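
A sketch of that assembly, assuming the tools created in the previous step and GPT-4o as the underlying LLM (the system prompt wording is illustrative):

```python
from llama_index.core.agent import FunctionCallingAgent
from llama_index.llms.openai import OpenAI

class MedicalAppointmentSystem:
    # ...continuing the class from the previous steps...

    def setup_agent(self):
        self.agent = FunctionCallingAgent.from_tools(
            tools=[self.diagnosis_tool, self.record_diagnosis_tool],
            llm=OpenAI(model="gpt-4o"),
            system_prompt=(
                "You are a medical assistant. Help users book, update, and manage "
                "appointments, and use the diagnosis tool when symptoms are described."
            ),
        )
```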

@@ -221,7 +221,7 @@ class MedicalAppointmentSystem:
)
```

## 6. Setting up the Interactive Session
## Setting up the Interactive Session

Finally, we'll create an interactive environment where users can engage with the chatbot. This involves configuring input/output, managing conversation flow, and processing user queries.
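
A minimal sketch of that loop (the prompts and exit handling are illustrative):

```python
class MedicalAppointmentSystem:
    # ...continuing the class from the previous steps...

    def interactive_session(self):
        print("Welcome to the medical appointment assistant. Type 'exit' to quit.")
        while True:
            user_input = input("You: ")
            if user_input.strip().lower() == "exit":
                break
            # The agent picks the right tool (RAG engine or appointment functions)
            response = self.agent.chat(user_input)
            print(f"Assistant: {response}")
```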

4 changes: 2 additions & 2 deletions docs/docs/tutorial-metrics-confident.mdx
@@ -18,7 +18,7 @@ Log in to Confident AI by heading to the [platform](https://app.confident-ai.com
deepeval login
```

## 1. Creating your Custom Metrics
## Creating your Custom Metrics
To create a complete experiment, you'll first need to define your custom metrics, if applicable. In our medical chatbot use-case, we'll be defining two: **Diagnosis Specificity** and **Overdiagnosis**. Start by navigating to the Metrics page, selecting the Custom Metrics tab, and clicking Create Metric.

<div
@@ -87,7 +87,7 @@ Once you've finished defining all your custom metrics, they'll appear here like
</div>


## 2. Creating an Experiment
## Creating an Experiment

Next, head to the Evaluation & Testing page and click create new experiment, where you'll be presented with all the available metrics on DeepEval as well as the custom ones you've defined.

7 changes: 4 additions & 3 deletions docs/docs/tutorial-metrics-selection.mdx
@@ -16,7 +16,8 @@ In this section, we’ll be selecting the **LLM evaluation metrics** for our med
1. **Directly addressing the user:** The chatbot should directly address users' requests
2. **Providing accurate diagnoses:** Diagnoses must be reliable and based on the provided symptoms
3. **Providing professional responses:** Responses should be clear and respectful
### 1. Answer Relevancy

### Answer Relevancy

Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:
```
@@ -30,7 +31,7 @@ Fortunately, DeepEval provides an out-of-the-box `AnswerRelevancy` metric, which
The `AnswerRelevancyMetric` uses an LLM to extract all statements from the `actual_output` and then classifies each statement's relevance to the `input` using the same LLM.
:::
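
A quick sketch of the metric in code, with made-up test case values for illustration:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(threshold=0.5)

test_case = LLMTestCase(
    input="I've had a dull headache for three days. What should I do?",
    actual_output=(
        "Mild, persistent headaches are often tension-related. "
        "I can book you an appointment with a doctor to be safe."
    ),
)

metric.measure(test_case)
print(metric.score, metric.reason)
```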

### 2. Faithfulness
### Faithfulness

Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our **RAG tool needs improvement**.
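
Because faithfulness checks the `actual_output` against the `retrieval_context`, a sketch of the metric looks like this (values are made up for illustration):

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

metric = FaithfulnessMetric(threshold=0.5)

test_case = LLMTestCase(
    input="What could be causing my persistent headaches?",
    actual_output="Your headaches are most likely caused by a vitamin B12 deficiency.",
    retrieval_context=[
        "Tension headaches are the most common type of headache and are often "
        "triggered by stress, poor posture, or lack of sleep."
    ],
)

metric.measure(test_case)
print(metric.score, metric.reason)  # low scores flag claims unsupported by the context
```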

@@ -47,7 +48,7 @@ DeepEval offers a total of **5 RAG metrics** to evaluate your RAG pipeline. To l
:::


### 3. Custom Metric - Professionalism
### Custom Metric - Professionalism

Our final metric will address Criterion 3, focusing on evaluating our chatbot's **professionalism**.
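
One way to define such a metric in DeepEval is with `GEval`, its criteria-based custom metric; the criteria wording below is an assumption rather than the tutorial's exact definition:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

professionalism_metric = GEval(
    name="Professionalism",
    criteria=(
        "Determine whether the actual output is clear, respectful, and maintains "
        "an appropriately professional tone for a medical assistant."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)
```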

36 changes: 8 additions & 28 deletions docs/docs/tutorial-production-evaluation.mdx
@@ -5,15 +5,10 @@ sidebar_label: Evaluating LLMs in Production
---
## Quick Summary

In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to **evaluating LLMs in production**, there are three key aspects to focus on:
In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to **evaluating LLMs in production**, there are 2 key aspects to focus on:

- [Online Evaluations](confident-ai-llm-monitoring-evaluations)
- [Human-in-the-Loop feedback](confident-ai-human-feedback)
- [Guardrails](confident-ai-guardrails)

:::info
Unless you're planning to leverage user-feedback and set up guardrails, you've already set up **everything you need in code** for evaluations in production!
:::

Before we begin, first make sure you are logged in to Confident AI:

@@ -31,16 +26,9 @@ It's important to note that metrics in production are **reference-less metrics**
### Human-in-the-Loop Feedback
Human feedback goes beyond domain experts or dedicated reviewers—it also includes direct input from your users. This kind of feedback is essential for refining your model's performance. We’ll discuss how to collect and leverage user feedback in greater detail in the following sections.

### Guardrails
Finally, guardrails are a quick and effective way to safeguard your LLM’s responses from producing harmful or inappropriate outputs. While they may not be as accurate as online evaluations (due to the trade-off between speed and precision), they play a critical role in preventing devastating responses that could damage your company’s reputation.

:::info
While online evaluation metrics are **lagging**—occurring after an LLM generates a response—guardrails are **leading**, as they evaluate the response before it is sent to the user.
:::

## Setting up Online Evaluations

### 1. OpenAI API key
### OpenAI API key

It's extremely simple to set up online evaluations on Confident AI. Simply navigate to the settings page and input your `OPENAI_API_KEY`. This allows Confident AI to generate evaluation scores using OpenAI models.

@@ -69,7 +57,7 @@ While Confident AI uses OpenAI models by default, the platform fully supports **
/>
</div>

### 2. Turn on your Metrics
### Turn on your Metrics

Next, navigate to the **Online Evaluations** page and scroll down to view the list of available referenceless metrics. Here, you can toggle metrics on or off, adjust thresholds for each metric, and optionally enable strict mode.

@@ -103,7 +91,7 @@ Once the metrics are enabled, all incoming responses will be evaluated automatic

## Human-in-the-Loop Evaluation

### 1. Metric-based Filtering
### Metric-based Filtering

Notice that in the previous step, we toggled the following metrics: Answer Relevancy, Faithfulness, Bias, and Contextual Relevancy. Let's say we're trying to evaluate how our retriever (RAG engine tool) is performing in production. We'll need to look at all the responses that didn't pass the 0.5 threshold for **Contextual Relevancy**.

@@ -155,7 +143,7 @@ We'll examine this specific response, where our medical chatbot retrieved some i
/>
</div>

### 2. Inspecting Metric Scores
### Inspecting Metric Scores

Navigate to the **Metrics** tab in the side panel to view the metric scores in detail. As with any metric in DeepEval, online evaluation metrics are supported by reasoning and detailed logs, enabling anyone reviewing these responses to easily understand why a specific metric is failing and trace the steps through the score calculations.

@@ -207,7 +195,7 @@ Whether this contextual relevancy failure indicates a need to reduce `chunk_size`
/>
</div>

### 3. Leaving Human Feedback
### Leaving Human Feedback

For each response, you'll find an option to leave feedback above the various sub-tabs. For this particular response, let's assign **a rating of 2 stars**, citing the lack of comprehensive context leading to an unclear diagnosis. However, the answer remains relevant, unbiased, and faithful.

@@ -259,7 +247,7 @@ You can also leave feedback on entire conversations instead of individual respon
/>
</div>

### 4. Inspecting Human Feedback
### Inspecting Human Feedback

All feedback, whether individual or conversational, can be accessed on the **Human Feedback** page. Here, you can filter feedback based on various criteria such as provider, rating, expected response, and more. To add responses to a dataset, simply check the relevant feedback, go to actions, and click add response to dataset.

@@ -316,7 +304,7 @@ It may be helpful to **categorize different types of failing feedback** into sep
</div>


### 5. User Provided Feedback
### User Provided Feedback

In addition to leaving feedback from the developer's side, you can also collect feedback directly from your users with just one line of code. Here's how to set it up:
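
A sketch of how that line might be wired in, assuming `deepeval.monitor()` returns the `response_id` that the feedback is attached to (the rating mapping and helper name are illustrative):

```python
import deepeval

def collect_user_feedback(response_id: str, user_was_satisfied: bool):
    # Map a binary thumbs up/down from the UI onto the 1-5 rating scale
    deepeval.send_feedback(
        response_id=response_id,
        rating=5 if user_was_satisfied else 1,
        explanation="Collected from the in-chat satisfaction prompt.",  # optional
    )
```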

@@ -374,11 +362,3 @@ class MedicalAppointmentSystem():
:::info
**Balancing user satisfaction with the level of detail in feedback** is essential. For instance, while we provide a rating scale from 1 to 5, we simplify it into a binary option: whether the user was satisfied or not.
:::

## Guardrails

Guardrails provide a way to safeguard your LLM against generating harmful or undesirable responses. They work by applying fast evaluation metrics to monitor and prevent the generation of unsafe or inappropriate outputs, especially in response to harmful inputs.

:::info
**Guardrails** is an enterprise feature. For more information, please contact [email protected].
:::
26 changes: 13 additions & 13 deletions docs/docs/tutorial-production-monitoring.mdx
@@ -8,11 +8,11 @@ sidebar_label: Monitoring LLMs in Production

While we've thoroughly tested our medical chatbot, it's absolutely necessary to **continue monitoring and evaluating your LLM applications post-production**. This is crucial for identifying bugs and areas of improvement. Confident AI offers a complete suite of features to help you and your team easily monitor your LLMs in production, including:

- [Response Monitoring](#confident-ai-llm-monitoring)
- [Online Evaluations](#confident-ai-llm-monitoring-evaluations)
- [LLM Tracing](#confident-ai-tracing)
- [Integrating Human Feedback](#confident-ai-human-feedback)
- [Placing Guardrails](#confident-ai-guardrails)
- [Response Monitoring](confident-ai-llm-monitoring)
- [Online Evaluations](confident-ai-llm-monitoring-evaluations)
- [LLM Tracing](confident-ai-tracing)
- [Integrating Human Feedback](confident-ai-human-feedback)
- [Placing Guardrails](confident-ai-guardrails)

In this section, we'll be focusing on setting up our medical chatbot application for response monitoring and tracing.

@@ -28,7 +28,7 @@ deepeval login

## Setting up using DeepEval

### 1. Setting up your Monitor Function
### Setting up your Monitor Function

Let’s remind ourselves of the main interactive chat method within our `MedicalAppointmentSystem`. We’ll be enhancing this function to monitor real-time responses in production.
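
For reference, the call that gets added to this method looks roughly like the sketch below; the argument values are illustrative, and optional fields such as `conversation_id` are discussed next:

```python
import deepeval

user_input = "I've been having mild headaches for a few days."  # example user message
chatbot_reply = "I'm sorry to hear that. Would you like me to book you an appointment?"  # example reply

response_id = deepeval.monitor(
    event_name="medical-chatbot",       # name identifying this type of event
    model="gpt-4o",                     # model that generated the response
    input=user_input,
    response=chatbot_reply,
    conversation_id="conversation111",  # optional: groups responses into one thread
)
```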

@@ -92,7 +92,7 @@ In addition to logging mandatory parameters such as `event_name`, `model`, `inpu
If your use case involves a chatbot and is conversational, logging `conversation_id` allows you to analyze, evaluate, and view entire conversational threads.
:::

### 2. Setting up Tracing
### Setting up Tracing

Next, we'll set up tracing for our medical chatbot. Since the medical diagnosis bot is built using LlamaIndex, integrating tracing is as simple as adding a **single line of code**. The same applies to LangChain-based LLM applications, making it extremely quick and easy to get started.

@@ -111,7 +111,7 @@ Now, when you deploy your LLM application, you’ll gain full visibility into th
No matter how you built your application—whether using another framework or from scratch—you can create **custom (and even hybrid) traces** in DeepEval. Explore this [complete guide to LLM Tracing](confident-ai-tracing) to learn how to set it up.
:::

### 3. Tracing and Monitor
### Tracing and Monitor

DeepEval allows you to control what you monitor while leveraging its tracing integrations. To enable hybrid tracing for production response monitoring, you need to call the `monitor()` method at the end of the root trace block.

Expand Down Expand Up @@ -155,7 +155,7 @@ class MedicalAppointmentSystem:
```
The LLM application is now fully set up for production monitoring and ready for deployment. In the next sections, we’ll explore how to make the most of Confident AI’s features for effective production monitoring.

### 4. Example Interaction
### Example Interaction

Before diving into the platform, let’s first examine a _mock conversation_ between our medical chatbot and a hypothetical user, Jacob. Jacob is experiencing mild headaches and wants to schedule an appointment with a doctor to address his concerns. We’ll use the Observatory to analyze and review this conversation in the upcoming sections.

@@ -227,7 +227,7 @@ In this section, we’ll be specifically focusing on how to **view responses and
/>
</div>

### 1. Filtering
### Filtering

Let's start by filtering for all the responses from the conversation and model we want to analyze. We'll filter for the conversation thread set up earlier, `conversation_id="conversation111"`, and focus specifically on responses generated by the GPT-4o model.

@@ -255,7 +255,7 @@ In production, deploying multiple versions of your chatbot simultaneously enable
:::


### 2. Inspecting Each Response
### Inspecting Each Response

To inspect each monitored response in more detail, simply click on the corresponding row. The fields displayed in the dropdown **align with the parameters we chose to log** in our `monitor` function. Since we did not log token usage or cost, and this specific response did not invoke our chatbot’s RAG engine tool, the fields for token usage, cost, and retrieval are empty.

@@ -282,7 +282,7 @@ To inspect each monitored response in more detail, simply click on the correspon
You're also able to view **hyperparameters and custom data**, should you choose to log them, next to the *All Properties* tab. Click *Inspect* to explore the test data in greater detail, including its traces.
:::

### 3. Detailed Inspection
### Detailed Inspection

Clicking **Inspect** opens a side drawer where you can review your response in greater detail. You’ll find tabs for default parameters, hyperparameters, and custom data, as well as additional tabs for metrics and feedback.

@@ -312,7 +312,7 @@ For now, let’s click on **View Trace** to see the full trace of the path our L
</div>


### 4. Viewing Traces
### Viewing Traces
We’ll examine the trace of this specific interaction between Jacob and our medical chatbot.

<div
