Commit

Add RAG Tools

svilupp authored Dec 23, 2023
2 parents 298fc91 + f937cf6 commit b4502c6
Showing 40 changed files with 2,809 additions and 27 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Experimental sub-module RAGTools providing basic Retrieval-Augmented Generation functionality. See `?RAGTools` for more information. It is nested inside `PromptingTools.Experimental.RAGTools` to signal that it may change in the future. Key functions are `build_index` and `airag`, but it also provides a suite to make evaluation easier (see `?build_qa_evals` and `?run_qa_evals`, or the example in `examples/building_RAG.jl`)

### Fixed
- Stricter code parsing in `AICode` to avoid false positives (code blocks must end with "```\n" to catch comments inside text)
13 changes: 12 additions & 1 deletion Project.toml
@@ -12,21 +12,32 @@ OpenAI = "e9f21f70-7185-4079-aca2-91159181367c"
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
Preferences = "21216c6a-2e73-6563-6e65-726566657250"

[weakdeps]
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[extensions]
RAGToolsExperimentalExt = ["SparseArrays", "LinearAlgebra"]

[compat]
Aqua = "0.7"
Base64 = "<0.0.1, 1"
HTTP = "1"
JSON3 = "1"
LinearAlgebra = "<0.0.1, 1"
Logging = "<0.0.1, 1"
OpenAI = "0.8.7"
PrecompileTools = "1"
Preferences = "1"
SparseArrays = "<0.0.1, 1"
Test = "<0.0.1, 1"
julia = "1.9,1.10"

[extras]
Aqua = "4c88cf16-eb10-579e-8560-4a9242c79595"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Aqua", "Test"]
test = ["Aqua", "Test", "SparseArrays", "LinearAlgebra"]
5 changes: 5 additions & 0 deletions docs/Project.toml
@@ -1,5 +1,10 @@
[deps]
DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
PromptingTools = "670122d1-24a8-4d70-bfce-740807c42192"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
4 changes: 3 additions & 1 deletion docs/generate_examples.jl
@@ -8,4 +8,6 @@ output_dir = joinpath(@__DIR__, "src", "examples")
filter!(endswith(".jl"), example_files)
for fn in example_files
Literate.markdown(fn, output_dir; execute = true)
end
end

# TODO: change meta fields at the top of each file!
13 changes: 11 additions & 2 deletions docs/make.jl
@@ -1,13 +1,17 @@
using PromptingTools
using Documenter
using SparseArrays, LinearAlgebra
using PromptingTools.Experimental.RAGTools
using JSON3, Serialization, DataFramesMeta
using Statistics: mean

DocMeta.setdocmeta!(PromptingTools,
:DocTestSetup,
:(using PromptingTools);
recursive = true)

makedocs(;
modules = [PromptingTools],
modules = [PromptingTools, PromptingTools.Experimental.RAGTools],
authors = "J S <[email protected]> and contributors",
repo = "https://github.com/svilupp/PromptingTools.jl/blob/{commit}{path}#{line}",
sitename = "PromptingTools.jl",
@@ -24,9 +28,14 @@ makedocs(;
"Various examples" => "examples/readme_examples.md",
"Using AITemplates" => "examples/working_with_aitemplates.md",
"Local models with Ollama.ai" => "examples/working_with_ollama.md",
"Building RAG Application" => "examples/building_RAG.md",
],
"F.A.Q." => "frequently_asked_questions.md",
"Reference" => "reference.md",
"Reference" => [
"PromptingTools.jl" => "reference.md",
"Experimental Modules" => "reference_experimental.md",
"RAGTools" => "reference_ragtools.md",
],
])

deploydocs(;
228 changes: 228 additions & 0 deletions docs/src/examples/building_RAG.md

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions docs/src/reference_experimental.md
@@ -0,0 +1,12 @@
# Reference for Experimental Module

Note: This module is experimental and may change in future releases.
The intention is for the functionality to be moved to separate packages over time.

```@index
Modules = [PromptingTools.Experimental]
```

```@autodocs
Modules = [PromptingTools.Experimental]
```
9 changes: 9 additions & 0 deletions docs/src/reference_ragtools.md
@@ -0,0 +1,9 @@
# Reference for RAGTools

```@index
Modules = [PromptingTools.Experimental.RAGTools]
```

```@autodocs
Modules = [PromptingTools.Experimental.RAGTools]
```
147 changes: 147 additions & 0 deletions examples/building_RAG.jl
@@ -0,0 +1,147 @@
# # Building a Simple Retrieval-Augmented Generation (RAG) System with RAGTools

# Let's build a Retrieval-Augmented Generation (RAG) chatbot, tailored to navigate and interact with the DataFrames.jl documentation.
# "RAG" is probably the most common and valuable pattern in Generative AI at the moment.

# If you're not familiar with "RAG", start with this [article](https://towardsdatascience.com/add-your-own-data-to-an-llm-using-retrieval-augmented-generation-rag-b1958bf56a5a).

## Imports
using LinearAlgebra, SparseArrays
using PromptingTools
## Note: RAGTools is still experimental and will change in the future. Ideally, it will be cleaned up and moved to a dedicated package
using PromptingTools.Experimental.RAGTools
using JSON3, Serialization, DataFramesMeta
using Statistics: mean
const PT = PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# ## RAG in Two Lines

# Let's put together a few text pages from DataFrames.jl docs.
# Simply go to [DataFrames.jl docs](https://dataframes.juliadata.org/stable/) and copy&paste a few pages into separate text files. Save them in the `examples/data` folder (see some example pages provided). Ideally, delete all the noise (like headers, footers, etc.) and keep only the text you want to use for the chatbot. Remember, garbage in, garbage out!

files = [
joinpath("examples", "data", "database_style_joins.txt"),
joinpath("examples", "data", "what_is_dataframes.txt"),
]
## Build an index of chunks, embed them, and create a lookup index of metadata/tags for each chunk
index = build_index(files; extract_metadata = false)

# Let's ask a question
## Embeds the question, finds the closest chunks in the index, and generates an answer from the closest chunks
answer = airag(index; question = "I like dplyr, what is the equivalent in Julia?")

# First RAG in two lines? Done!
#
# What does it do?
# - `build_index` will chunk the documents into smaller pieces, embed them into numbers (to be able to judge the similarity of chunks) and, optionally, create a lookup index of metadata/tags for each chunk
# - `index` is the result of this step and it holds your chunks, embeddings, and other metadata! Just show it :)
# - `airag` will (see the sketch right after this list):
#    - embed your question
#    - find the closest chunks in the index (use parameters `top_k` and `minimum_similarity` to tweak the "relevant" chunks)
#    - [OPTIONAL] extract any potential tags/filters from the question and apply them to filter down the candidates (use `extract_metadata=true` in `build_index`; you can also provide some filters explicitly via `tag_filter`)
#    - [OPTIONAL] re-rank the candidate chunks (define and provide your own `rerank_strategy`, eg, Cohere ReRank API)
#    - build a context from the closest chunks (use `chunks_window_margin` to tweak whether we include preceding and succeeding chunks as well; see `?build_context` for more details)
#    - generate an answer from the closest chunks (use `return_context=true` to see under the hood and debug your application)
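#
# To make those knobs concrete, here is a hedged sketch of the same call with the keyword
# arguments mentioned above (only `top_k` and `return_context` are exercised; the values are
# illustrative, so check `?airag` for the authoritative signature):
msg, ctx = airag(index;
    question = "I like dplyr, what is the equivalent in Julia?",
    top_k = 5,              ## retrieve the 5 closest chunks (tune to your data)
    return_context = true); ## also return the intermediate RAG state for inspection
ctx.context                 ## the chunks that were stuffed into the prompt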

# You should save the index for later to avoid re-embedding / re-extracting the document chunks!
serialize("examples/index.jls", index)
index = deserialize("examples/index.jls")

# # Evaluations
# Now we want to evaluate the quality of the system. For that, we need a set of questions and answers.
# Ideally, we would hand-craft a set of high-quality Q&A pairs. However, this is time-consuming and expensive.
# Let's generate them from the chunks in our index!

# ## Generate Q&A pairs

# We need to provide: chunks and sources (filepaths for future reference)
evals = build_qa_evals(RT.chunks(index),
RT.sources(index);
instructions = "None.",
verbose = true);
## Info: Q&A Sets built! (cost: $0.143) -- not bad!

# > [!TIP]
# > In practice, you would review each item in this golden evaluation set (and delete any generic/poor questions).
# > It will determine the future success of your app, so you need to make sure it's good!

## Save the evals for later
JSON3.write("examples/evals.json", evals)
evals = JSON3.read("examples/evals.json", Vector{RT.QAEvalItem});

# ## Explore one Q&A pair
# Let's explore one eval item -- it's not the best, but it gives you the idea!
#
evals[1]

# ## Evaluate this Q&A pair

# Let's evaluate this QA item with a "judge model" (often GPT-4 is used as a judge).

## Note that we used the same question but generated a new context and answer via `airag`
msg, ctx = airag(index; evals[1].question, return_context = true);

## ctx is a RAGContext object that keeps all intermediate states of the RAG pipeline for easy evaluation
judged = aiextract(:RAGJudgeAnswerFromContext;
ctx.context,
ctx.question,
ctx.answer,
return_type = RT.JudgeAllScores)
judged.content
## Dict{Symbol, Any} with 7 entries:
## :final_rating => 4.8
## :clarity => 5
## :completeness => 5
## :relevance => 5
## :consistency => 4
## :helpfulness => 5
## :rationale => "The answer is highly relevant to the user's question, as it provides a comprehensive list of frameworks that are compared with DataFrames.jl. The answer is complete, covering all
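## The scores land in a plain `Dict`, so individual fields are easy to pull out,
## eg, the overall rating (a small illustrative sketch; keys as shown in the output above):
judged.content[:final_rating]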

# We can also run the whole evaluation in a function (a few more metrics are available):
x = run_qa_evals(evals[1], ctx;
parameters_dict = Dict(:top_k => 3), verbose = true, model_judge = "gpt4t")

# Fortunately, we don't have to do this one by one -- let's evaluate all our Q&A pairs at once.

# ## Evaluate the whole set

# Let's run each question & answer through our eval loop asynchronously (we do only the first 10 to save time). See `?airag` for the parameters you can tweak, eg, `top_k`

results = asyncmap(evals[1:10]) do qa_item
## Generate an answer -- often you want the `model_judge` to be the highest quality possible, eg, "GPT-4 Turbo" (alias "gpt4t")
msg, ctx = airag(index; qa_item.question, return_context = true,
top_k = 3, verbose = false, model_judge = "gpt4t")
## Evaluate the response
## Note: you can log key parameters for easier analysis later
run_qa_evals(qa_item, ctx; parameters_dict = Dict(:top_k => 3), verbose = false)
end
## Note that the "failed" evals can show as "nothing", so make sure to handle them.
results = filter(x -> !isnothing(x.answer_score), results);

# Note: You could also use the vectorized version `results = run_qa_evals(evals)` to evaluate all items at once.

## Let's take a simple average to calculate our score
@info "RAG Evals: $(length(results)) results, Avg. score: $(round(mean(x->x.answer_score, results);digits=1)), Retrieval score: $(100*round(Int,mean(x->x.retrieval_score,results)))%"
## [ Info: RAG Evals: 10 results, Avg. score: 4.6, Retrieval score: 100%

# Note: The retrieval score is 100% only because we have two small documents and are evaluating only 10 items. In practice, you would have a much larger document set and eval set, which would give a more representative retrieval score.

# You can also analyze the results in a DataFrame:

df = DataFrame(results)
first(df, 5)
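## An optional sketch of summarizing the scores with DataFrames' `combine`
## (column names taken from the fields used above; purely illustrative):
combine(df, :answer_score => mean => :avg_answer_score,
    :retrieval_score => mean => :avg_retrieval_score)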

# We're done for today!

# # What would we do next?
# - Review your golden evaluation data set and keep only the good items
# - Play with the chunk sizes (`max_length` in `build_index`) and see how it affects the quality (see the sketch after this list)
# - Explore using metadata/key filters (`extract_metadata=true` in `build_index`)
# - Add filtering for semantic similarity (embedding distance) to make sure we don't pick up irrelevant chunks in the context
# - Use multiple indices or a hybrid index (add a simple BM25 lookup from TextAnalysis.jl)
# - Data processing is the most important step -- properly parsed and split text can work wonders
# - Add re-ranking of the context (see the `rerank` function; you can use the Cohere ReRank API)
# - Improve the question embedding (eg, rephrase it, or generate hypothetical answers and use them to find better context)
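#
# As a hedged sketch of the first two ideas above (keyword names are the ones referenced in
# the list; `max_length = 256` is only an illustrative value, so check `?build_index` before
# relying on it):
index_smaller_chunks = build_index(files; max_length = 256, extract_metadata = true)
## With metadata extracted, `airag` can narrow down candidate chunks via `tag_filter`
answer_v2 = airag(index_smaller_chunks;
    question = "I like dplyr, what is the equivalent in Julia?")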
#
# ... and much more! See some ideas in [Anyscale RAG tutorial](https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1)