diff --git a/docs/docs/tutorial-dataset-synthesis.mdx b/docs/docs/tutorial-dataset-synthesis.mdx
index ce1452af8..845663e40 100644
--- a/docs/docs/tutorial-dataset-synthesis.mdx
+++ b/docs/docs/tutorial-dataset-synthesis.mdx
@@ -44,7 +44,7 @@ styling_config = StylingConfig(
In addition to styling, DeepEval lets you **customize** other parts of the generation process, from context construction to data evolutions.
:::
-### 3. Goldens generation
+### Goldens generation
With our configurations defined, let’s finally begin **generating synthetic goldens**. You can generate as many `goldens_per_context` as you’d like. For this tutorial, we’ll set this parameter to 2, as coverage across different contexts is more important.
@@ -85,7 +85,7 @@ You can see that even though the input for this synthetic golden is simple, it r
You can increase the complexity of the generated goldens by configuring the **evolution settings** when initializing the `Synthesizer` object.
:::
-### 3. Additional Styling Configurations
+### Additional Styling Configurations
It's also important to explore additional styling configurations when generating your datasets. Using multiple styling configurations allows you to generate a truly **diverse dataset** that is not only comprehensive but also captures edge cases.
@@ -143,7 +143,7 @@ dataset.push(alias="Ambiguous Synthetic Test")
Generating synthetic data from documents requires a knowledge base, meaning the generated goldens are designed to test user queries that prompt the LLM to use the RAG engine. However, since our medical chatbot operates as an Agentic RAG, there are cases where the LLM **does not invoke the RAG tool**, necessitating the generation of data from scratch without any context.
-### 1. Defining Style Configuration
+### Defining Style Configuration
Similar to generating from documents, you'll want to **customize the output style and format** of any `input` and/or `expected_output` when generating synthetic goldens from scratch. When generating from scratch, your creativity is your limit. You can test your LLM for any interaction you can foresee. In the example below, we'll define user inputs that try to book an appointment by providing name and email information.
@@ -169,7 +169,7 @@ styling_config = StylingConfig(
)
```
-### 2. Generating the Goldens
+### Generating the Goldens
The next step is to simply initialize your synthesizer with the styling configurations and push the dataset to Confident AI for review.
diff --git a/docs/docs/tutorial-llm-application-example.mdx b/docs/docs/tutorial-llm-application-example.mdx
index 133f8a2cb..5520782f3 100644
--- a/docs/docs/tutorial-llm-application-example.mdx
+++ b/docs/docs/tutorial-llm-application-example.mdx
@@ -12,7 +12,7 @@ In this section, we will be developing an **Agentic RAG** medical chatbot to rec
You may use whatever **knowledge base** you have to power your RAG Engine, but for the purposes of this tutorial, we'll be using [The Gale Encyclopedia of Alternative Medicine](https://staibabussalamsula.ac.id/wp-content/uploads/2024/06/The-Gale-Encyclopedia-of-Medicine-3rd-Edition-staibabussalamsula.ac_.id_.pdf).
:::
-## 1. Setting Up
+## Setting Up
Begin by installing the necessary packages. We'll use `llama-index` as our RAG framework and `chromadb` for vector indexing.
@@ -33,7 +33,7 @@ class MedicalAppointment(BaseModel):
    symptoms: Optional[str] = None
    diagnosis: Optional[str] = None
```
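For a concrete picture of the state this model tracks, here is a minimal sketch of a partially filled record. Only `symptoms` and `diagnosis` are visible in the excerpt above, so the `name` and `email` fields are assumptions based on the appointment-booking flow described later in the tutorial:

```python
from typing import Optional

from pydantic import BaseModel


class MedicalAppointment(BaseModel):
    # `symptoms` and `diagnosis` come from the snippet above; `name` and `email`
    # are assumed fields for the booking flow and may differ in the real class.
    name: Optional[str] = None
    email: Optional[str] = None
    symptoms: Optional[str] = None
    diagnosis: Optional[str] = None


# A partially filled record: the user has described their symptoms,
# but the RAG engine has not yet produced a diagnosis.
appointment = MedicalAppointment(name="Jacob", symptoms="mild headaches")
print(appointment)
```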
-## 2. Defining the Chatbot
+## Defining the Chatbot
Next, we'll create a `MedicalAppointmentSystem` class to represent our agent. This class will store all `MedicalAppointment` instances in an `appointments` dictionary, with each key representing a unique user.
@@ -44,7 +44,7 @@ class MedicalAppointmentSystem:
```
As we progress through this tutorial, we'll gradually enhance this class until it evolves into a fully functional medical chatbot agent.
-## 3. Indexing the Knowledge Base
+## Indexing the Knowledge Base
Let's start by building our **RAG engine**, which will handle all patient diagnoses. The first step is to load the relevant medical information chunks from our knowledge base into the system. We'll use the `SimpleDirectoryReader` from `llama-index` to accomplish this.
@@ -89,7 +89,7 @@ class MedicalAppointmentSystem:
        self.index = VectorStoreIndex.from_documents(self.documents, storage_context=storage_context)
```
-## 4. Building the Tools
+## Building the Tools
Finally, we'll create the tools for our chatbot, which include our **RAG engine** and **function-calling tools** responsible for creating, updating, and managing medical appointments, ensuring the system is both dynamic and interactive.
@@ -181,7 +181,7 @@ def record_diagnosis(self, appointment_id: str, diagnosis: str) -> str:
        return "Diagnosis cannot be recorded. Please tell me more about your symptoms."
```
-## 5. Assembling the Chatbot
+## Assembling the Chatbot
Now that we have set up the tools and data systems, it's time to assemble the chatbot agent. We'll use LlamaIndex's `FunctionCallingAgent` to dynamically manage user interactions and choose the appropriate tool based on the input and context. This involves defining the LLM, system prompt, and tool integrations.
@@ -221,7 +221,7 @@ class MedicalAppointmentSystem:
        )
```
-## 6. Setting up the Interactive Session
+## Setting up the Interactive Session
Finally, we'll create an interactive environment where users can engage with the chatbot. This involves configuring input/output, managing conversation flow, and processing user queries.
diff --git a/docs/docs/tutorial-metrics-confident.mdx b/docs/docs/tutorial-metrics-confident.mdx
index ad421a3f4..098824eba 100644
--- a/docs/docs/tutorial-metrics-confident.mdx
+++ b/docs/docs/tutorial-metrics-confident.mdx
@@ -18,7 +18,7 @@ Log in to Confident AI by heading to the [platform](https://app.confident-ai.com
deepeval login
```
-## 1. Creating your Custom Metrics
+## Creating your Custom Metrics
To create a complete experiment, you'll first need to define your custom metrics, if applicable. In our medical chatbot use-case, we'll be defining two: **Diagnosis Specificity** and **Overdiagnosis**. Start by navigating to the Metrics page, selecting the Custom Metrics tab, and clicking Create Metric.
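If you want to prototype these two criteria locally before configuring them on the platform, a rough G-Eval-style sketch in DeepEval might look like the following. The criteria wording here is illustrative and not the exact configuration used on Confident AI:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative criteria only; tune the wording to match your platform configuration.
diagnosis_specificity = GEval(
    name="Diagnosis Specificity",
    criteria="Check whether the diagnosis in the actual output is specific to, and supported by, the symptoms described in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

overdiagnosis = GEval(
    name="Overdiagnosis",
    criteria="Check whether the actual output exaggerates severity or introduces conditions that are not supported by the symptoms in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```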
-## 2. Creating an Experiment
+## Creating an Experiment
Next, head to the Evaluation & Testing page and click create new experiment, where you'll be presented with all the available metrics on DeepEval as well as the custom ones you've defined.
diff --git a/docs/docs/tutorial-metrics-selection.mdx b/docs/docs/tutorial-metrics-selection.mdx
index be09e22f9..2df1c18c9 100644
--- a/docs/docs/tutorial-metrics-selection.mdx
+++ b/docs/docs/tutorial-metrics-selection.mdx
@@ -16,7 +16,8 @@ In this section, we’ll be selecting the **LLM evaluation metrics** for our med
1. **Directly addressing the user:** The chatbot should directly address users' requests
2. **Providing accurate diagnoses:** Diagnoses must be reliable and based on the provided symptoms
3. **Providing professional responses:** Responses should be clear and respectful
-### 1. Answer Relevancy
+
+### Answer Relevancy
Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:
```
@@ -30,7 +31,7 @@ Fortunately, DeepEval provides an out-of-the-box `AnswerRelevancy` metric, which
The `AnswerRelevancyMetric` uses an LLM to extract all statements from the `actual_output` and then classifies each statement's relevance to the `input` using the same LLM.
:::
-### 2. Faithfulness
+### Faithfulness
Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our **RAG tool needs improvement**.
@@ -47,7 +48,7 @@ DeepEval offers a total of **5 RAG metrics** to evaluate your RAG pipeline. To l
:::
-### 3. Custom Metric - Professionalism
+### Custom Metric - Professionalism
Our final metric will address Criterion 3, focusing on evaluating our chatbot's **professionalism**.
diff --git a/docs/docs/tutorial-production-evaluation.mdx b/docs/docs/tutorial-production-evaluation.mdx
index f28cc2402..0b9b82001 100644
--- a/docs/docs/tutorial-production-evaluation.mdx
+++ b/docs/docs/tutorial-production-evaluation.mdx
@@ -5,15 +5,10 @@ sidebar_label: Evaluating LLMs in Production
---
## Quick Summary
-In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to **evaluating LLMs in production**, there are three key aspects to focus on:
+In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to **evaluating LLMs in production**, there are two key aspects to focus on:
- [Online Evaluations](confident-ai-llm-monitoring-evaluations)
- [Human-in-the-Loop feedback](confident-ai-human-feedback)
-- [Guardrails](confident-ai-guardrails)
-
-:::info
-Unless you're planning to leverage user-feedback and set up guardrails, you've already set up **everything you need in code** for evaluations in production!
-:::
Before we begin, first make sure you are logged in to Confident AI:
@@ -31,16 +26,9 @@ It's important to note that metrics in production are **reference-less metrics**
### Human-in-the-Loop Feedback
Human feedback goes beyond domain experts or dedicated reviewers—it also includes direct input from your users. This kind of feedback is essential for refining your model's performance. We’ll discuss how to collect and leverage user feedback in greater detail in the following sections.
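To make "reference-less" concrete: metrics such as Answer Relevancy and Faithfulness only need the live input, the chatbot's actual output, and any retrieved context, never a pre-written expected output. A minimal local sketch with DeepEval (an `OPENAI_API_KEY` is assumed, and the example values are made up):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# A reference-less test case: no expected_output is required.
test_case = LLMTestCase(
    input="I've had mild headaches for a week. What could be causing them?",
    actual_output="Mild headaches are often linked to tension or dehydration. Let's book an appointment so a doctor can take a closer look.",
    retrieval_context=["Tension headaches are the most common type of headache in adults."],
)

evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.5), FaithfulnessMetric(threshold=0.5)],
)
```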
-### Guardrails
-Finally, guardrails are a quick and effective way to safeguard your LLM’s responses from producing harmful or inappropriate outputs. While they may not be as accurate as online evaluations (due to the trade-off between speed and precision), they play a critical role in preventing devastating responses that could damage your company’s reputation.
-
-:::info
-While online evaluation metrics are **lagging**—occurring after an LLM generates a response—guardrails are **leading**, as they evaluate the response before it is sent to the user.
-:::
-
## Setting up Online Evaluations
-### 1. OpenAI API key
+### OpenAI API key
It's extremely simple to set up online evaluations on Confident AI. Simply navigate to the settings page and input your `OPENAI_API_KEY`. This allows Confident AI to generate evaluation scores using OpenAI models.
@@ -69,7 +57,7 @@ While Confident AI uses OpenAI models by default, the platform fully supports **
/>
-### 2. Turn on your Metrics
+### Turn on your Metrics
Next, navigate to the **Online Evaluations** page and scroll down to view the list of available referenceless metrics. Here, you can toggle metrics on or off, adjust thresholds for each metric, and optionally enable strict mode.
@@ -103,7 +91,7 @@ Once the metrics are enabled, all incoming responses will be evaluated automatic
## Human-in-the-Loop Evaluation
-### 1. Metric-based Filtering
+### Metric-based Filtering
Notice that in the previous step, we toggled the following metrics: Answer Relevancy, Faithfulness, Bias, and Contextual Relevancy. Let's say we're trying to evaluate how our retriever (RAG engine tool) is performing in production. We'll need to look at all the responses that didn't pass the 0.5 threshold for **Contextual Relevancy**.
@@ -155,7 +143,7 @@ We'll examine this specific response, where our medical chatbot retrieved some i
/>
-### 2. Inspecting Metric Scores
+### Inspecting Metric Scores
Navigate to the **Metrics** tab in the side panel to view the metric scores in detail. As with any metric in DeepEval, online evaluation metrics are supported by reasoning and detailed logs, enabling anyone reviewing these responses to easily understand why a specific metric is failing and trace the steps through the score calculations.
@@ -207,7 +195,7 @@ Whether this contextual relevancy failure indicates a need to reduce `chunk_size
/>
-### 3. Leaving Human Feedback
+### Leaving Human Feedback
For each response, you'll find an option to leave feedback above the various sub-tabs. For this particular response, let's assign **a rating of 2 stars**, citing the lack of comprehensive context leading to an unclear diagnosis. However, the answer remains relevant, unbiased, and faithful.
@@ -259,7 +247,7 @@ You can also leave feedback on entire conversations instead of individual respon
/>
-### 4. Inspecting Human Feedback
+### Inspecting Human Feedback
All feedback, whether individual or conversational, can be accessed on the **Human Feedback** page. Here, you can filter feedback based on various criteria such as provider, rating, expected response, and more. To add responses to a dataset, simply check the relevant feedback, go to actions, and click add response to dataset.
@@ -316,7 +304,7 @@ It may be helpful to **categorize different types of failing feedback** into sep
-### 5. User Provided Feedback
+### User Provided Feedback
In addition to leaving feedback from the developer's side, you can also set up your LLM to receive user feedback with just one line of code. Here's how to set it up:
@@ -374,11 +362,3 @@ class MedicalAppointmentSystem():
:::info
**Balancing user satisfaction with the level of detail in feedback** is essential. For instance, while we provide a rating scale from 1 to 5, we simplify it into a binary option: whether the user was satisfied or not.
:::
-
-## Guardrails
-
-Guardrails provide a way to safeguard your LLM against generating harmful or undesirable responses. They work by applying fast evaluation metrics to monitor and prevent the generation of unsafe or inappropriate outputs, especially in response to harmful inputs.
-
-:::info
-**Guardrails** is an enterprise feature. For more information, please contact support@confident-ai.com.
-:::
\ No newline at end of file
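Relating this back to the User Provided Feedback section above: the one-line call typically pairs the `response_id` returned by `monitor()` with the user's rating. The keyword names in the sketch below are assumptions, so check the human feedback guide for the authoritative signature:

```python
import deepeval


def record_user_feedback(response_id: str, satisfied: bool) -> None:
    # `response_id` comes from the earlier deepeval.monitor(...) call for this turn.
    # Keyword names are assumptions; verify them against the human feedback docs.
    deepeval.send_feedback(
        response_id=response_id,
        rating=5 if satisfied else 1,  # map a binary thumbs-up/down onto the 1-5 scale
        explanation="Collected from the in-chat satisfaction prompt.",
    )
```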
diff --git a/docs/docs/tutorial-production-monitoring.mdx b/docs/docs/tutorial-production-monitoring.mdx
index 09e22acb7..28410460f 100644
--- a/docs/docs/tutorial-production-monitoring.mdx
+++ b/docs/docs/tutorial-production-monitoring.mdx
@@ -8,11 +8,11 @@ sidebar_label: Monitoring LLMs in Production
While we've thoroughly tested our medical chatbot, it's absolutely necessary to **continue monitoring and evaluating your LLM applications post-production**. This is crucial for catching bugs and identifying areas where your LLM application can be improved. Confident AI offers a complete suite of features to help you and your team easily monitor your LLMs in production, including:
-- [Response Monitoring](#confident-ai-llm-monitoring)
-- [Online Evaluations](#confident-ai-llm-monitoring-evaluations)
-- [LLM Tracing](#confident-ai-tracing)
-- [Integrating Human Feedback](#confident-ai-human-feedback)
-- [Placing Guardrails](#confident-ai-guardrails)
+- [Response Monitoring](confident-ai-llm-monitoring)
+- [Online Evaluations](confident-ai-llm-monitoring-evaluations)
+- [LLM Tracing](confident-ai-tracing)
+- [Integrating Human Feedback](confident-ai-human-feedback)
+- [Placing Guardrails](confident-ai-guardrails)
In this section, we'll be focusing on setting up our medical chatbot application for response monitoring and tracing.
@@ -28,7 +28,7 @@ deepeval login
```
## Setting up using DeepEval
-### 1. Setting up your Monitor Function
+### Setting up your Monitor Function
Let’s remind ourselves of the main interactive chat method within our `MedicalAppointmentSystem`. We’ll be enhancing this function to monitor real-time responses in production.
@@ -92,7 +92,7 @@ In addition to logging mandatory parameters such as `event_name`, `model`, `inpu
If your use case involves a chatbot and is conversational, logging `conversation_id` allows you to analyze, evaluate, and view entire conversational threads.
:::
-### 2. Setting up Tracing
+### Setting up Tracing
Next, we’ll set up tracing for our medical chatbot. Since the medical diagnosis bot is built using LlamaIndex, integrating tracing is as simple as adding a **single line of code**. The same applies for LangChain-based LLM applications, making it extremely quick and easy to get started.
@@ -111,7 +111,7 @@ Now, when you deploy your LLM application, you’ll gain full visibility into th
No matter how you built your application—whether using another framework or from scratch—you can create **custom (and even hybrid) traces** in DeepEval. Explore this [complete guide to LLM Tracing](confident-ai-tracing) to learn how to set it up.
:::
-### 3. Tracing and Monitor
+### Tracing and Monitoring
DeepEval allows you to control what you monitor while leveraging its tracing integrations. To enable hybrid tracing for production response monitoring, you need to call the `monitor()` method at the end of the root trace block.
@@ -155,7 +155,7 @@ class MedicalAppointmentSystem:
```
The LLM application is now fully set up for production monitoring and ready for deployment. In the next sections, we’ll explore how to make the most of Confident AI’s features for effective production monitoring.
-### 4. Example Interaction
+### Example Interaction
Before diving into the platform, let’s first examine a _mock conversation_ between our medical chatbot and a hypothetical user, Jacob. Jacob is experiencing mild headaches and wants to schedule an appointment with a doctor to address his concerns.
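Each of Jacob's turns reaches Confident AI through the `monitor()` call set up earlier. Stripped of the surrounding agent logic, that call looks roughly like the sketch below; the event name and model string are illustrative placeholders:

```python
import deepeval


def log_chatbot_turn(user_input: str, chatbot_response: str) -> None:
    # Log a single chatbot turn to Confident AI for production monitoring.
    deepeval.monitor(
        event_name="Medical Chatbot",       # how this event is labelled on the platform
        model="gpt-4o",                     # the model that generated the response
        input=user_input,
        response=chatbot_response,
        conversation_id="conversation111",  # groups turns into one conversational thread
    )


log_chatbot_turn(
    "I've been having mild headaches and would like to see a doctor.",
    "I'm sorry to hear that, Jacob. Let's find a suitable time for an appointment.",
)
```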
We’ll use the Observatory to analyze and review this conversation in the upcoming sections.
@@ -227,7 +227,7 @@ In this section, we’ll be specifically focusing on how to **view responses and
/>
-### 1. Filtering
+### Filtering
Let's start by filtering for all the responses from the conversation and model we want to analyze. We'll filter for the conversation thread set up earlier, `conversation_id="conversation111"`, and focus specifically on responses generated by the GPT-4o model.
@@ -255,7 +255,7 @@ In production, deploying multiple versions of your chatbot simultaneously enable
:::
-### 2. Inspecting Each Response
+### Inspecting Each Response
To inspect each monitored response in more detail, simply click on the corresponding row. The fields displayed in the dropdown **align with the parameters we chose to log** in our `monitor` function. Since we did not log token count or token usage, and this specific response did not prompt our chatbot’s RAG engine tool, the fields for token usage, cost, and retrieval are empty.
@@ -282,7 +282,7 @@ To inspect each monitored response in more detail, simply click on the correspon
You're also able to view **hyperparameters and custom data**, should you choose to log them, next to the *All Properties* tab. Click *Inspect* to explore the test data in greater detail, including its traces.
:::
-### 3. Detailed Inspection
+### Detailed Inspection
Clicking **Inspect** opens a side drawer where you can review your response in greater detail. You’ll find tabs for default parameters, hyperparameters, and custom data, as well as additional tabs for metrics and feedback.
@@ -312,7 +312,7 @@ For now, let’s click on **View Trace** to see the full trace of the path our L
-### 4. Viewing Traces
+### Viewing Traces
We’ll examine the trace of this specific interaction between Jacob and our medical chatbot.