Add RAG Tools
svilupp authored Dec 23, 2023
2 parents dddb14c + 9a569bc commit 37ca3bf
Showing 40 changed files with 2,809 additions and 27 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Experimental sub-module RAGTools providing basic Retrieval-Augmented Generation functionality. See `?RAGTools` for more information. It's all nested inside of `PromptingTools.Experimental.RAGTools` to signify that it might change in the future. Key functions are `build_index` and `airag`, but it also provides a suite to make evaluation easier (see `?build_qa_evals` and `?run_qa_evals` or just see the example `examples/building_RAG.jl`)

### Fixed
- Stricter code parsing in `AICode` to avoid false positives (code blocks must end with "```\n" to catch comments inside text)
13 changes: 12 additions & 1 deletion Project.toml
@@ -12,21 +12,32 @@ OpenAI = "e9f21f70-7185-4079-aca2-91159181367c"
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
Preferences = "21216c6a-2e73-6563-6e65-726566657250"

[weakdeps]
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[extensions]
RAGToolsExperimentalExt = ["SparseArrays", "LinearAlgebra"]

[compat]
Aqua = "0.7"
Base64 = "<0.0.1, 1"
HTTP = "1"
JSON3 = "1"
LinearAlgebra = "<0.0.1, 1"
Logging = "<0.0.1, 1"
OpenAI = "0.8.7"
PrecompileTools = "1"
Preferences = "1"
SparseArrays = "<0.0.1, 1"
Test = "<0.0.1, 1"
julia = "1.9,1.10"

[extras]
Aqua = "4c88cf16-eb10-579e-8560-4a9242c79595"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
-test = ["Aqua", "Test"]
+test = ["Aqua", "Test", "SparseArrays", "LinearAlgebra"]
5 changes: 5 additions & 0 deletions docs/Project.toml
@@ -1,5 +1,10 @@
[deps]
DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
PromptingTools = "670122d1-24a8-4d70-bfce-740807c42192"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
4 changes: 3 additions & 1 deletion docs/generate_examples.jl
@@ -8,4 +8,6 @@ output_dir = joinpath(@__DIR__, "src", "examples")
filter!(endswith(".jl"), example_files)
for fn in example_files
Literate.markdown(fn, output_dir; execute = true)
-end
+end

# TODO: change meta fields at the top of each file!
13 changes: 11 additions & 2 deletions docs/make.jl
@@ -1,13 +1,17 @@
using PromptingTools
using Documenter
using SparseArrays, LinearAlgebra
using PromptingTools.Experimental.RAGTools
using JSON3, Serialization, DataFramesMeta
using Statistics: mean

DocMeta.setdocmeta!(PromptingTools,
:DocTestSetup,
:(using PromptingTools);
recursive = true)

makedocs(;
-modules = [PromptingTools],
+modules = [PromptingTools, PromptingTools.Experimental.RAGTools],
authors = "J S <[email protected]> and contributors",
repo = "https://github.com/svilupp/PromptingTools.jl/blob/{commit}{path}#{line}",
sitename = "PromptingTools.jl",
@@ -24,9 +28,14 @@ makedocs(;
"Various examples" => "examples/readme_examples.md",
"Using AITemplates" => "examples/working_with_aitemplates.md",
"Local models with Ollama.ai" => "examples/working_with_ollama.md",
"Building RAG Application" => "examples/building_RAG.md",
],
"F.A.Q." => "frequently_asked_questions.md",
-"Reference" => "reference.md",
+"Reference" => [
+"PromptingTools.jl" => "reference.md",
+"Experimental Modules" => "reference_experimental.md",
+"RAGTools" => "reference_ragtools.md",
+],
])

deploydocs(;
228 changes: 228 additions & 0 deletions docs/src/examples/building_RAG.md

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions docs/src/reference_experimental.md
@@ -0,0 +1,12 @@
# Reference for Experimental Module

Note: This module is experimental and may change in future releases.
The intention is for the functionality to be moved to separate packages over time.

```@index
Modules = [PromptingTools.Experimental]
```

```@autodocs
Modules = [PromptingTools.Experimental]
```
9 changes: 9 additions & 0 deletions docs/src/reference_ragtools.md
@@ -0,0 +1,9 @@
# Reference for RAGTools

```@index
Modules = [PromptingTools.Experimental.RAGTools]
```

```@autodocs
Modules = [PromptingTools.Experimental.RAGTools]
```
147 changes: 147 additions & 0 deletions examples/building_RAG.jl
@@ -0,0 +1,147 @@
# # Building a Simple Retrieval-Augmented Generation (RAG) System with RAGTools

# Let's build a Retrieval-Augmented Generation (RAG) chatbot, tailored to navigate and interact with the DataFrames.jl documentation.
# "RAG" is probably the most common and valuable pattern in Generative AI at the moment.

# If you're not familiar with "RAG", start with this [article](https://towardsdatascience.com/add-your-own-data-to-an-llm-using-retrieval-augmented-generation-rag-b1958bf56a5a).

## Imports
using LinearAlgebra, SparseArrays
using PromptingTools
## Note: RAGTools is still experimental and will change in the future. Ideally, it will be cleaned up and moved to a dedicated package
using PromptingTools.Experimental.RAGTools
using JSON3, Serialization, DataFramesMeta
using Statistics: mean
const PT = PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# ## RAG in Two Lines

# Let's put together a few text pages from DataFrames.jl docs.
# Simply go to [DataFrames.jl docs](https://dataframes.juliadata.org/stable/) and copy&paste a few pages into separate text files. Save them in the `examples/data` folder (see some example pages provided). Ideally, delete all the noise (like headers, footers, etc.) and keep only the text you want to use for the chatbot. Remember, garbage in, garbage out!

files = [
joinpath("examples", "data", "database_style_joins.txt"),
joinpath("examples", "data", "what_is_dataframes.txt"),
]
## Build an index of chunks, embed them, and create a lookup index of metadata/tags for each chunk
index = build_index(files; extract_metadata = false)

# Let's ask a question
## Embeds the question, finds the closest chunks in the index, and generates an answer from the closest chunks
answer = airag(index; question = "I like dplyr, what is the equivalent in Julia?")

# First RAG in two lines? Done!
#
# What does it do?
# - `build_index` will chunk the documents into smaller pieces, embed them into numbers (to be able to judge the similarity of chunks) and, optionally, create a lookup index of metadata/tags for each chunk
# - `index` is the result of this step and it holds your chunks, embeddings, and other metadata! Just show it :)
# - `airag` will
#   - embed your question
#   - find the closest chunks in the index (use parameters `top_k` and `minimum_similarity` to tweak the "relevant" chunks)
#   - [OPTIONAL] extract any potential tags/filters from the question and apply them to filter down the potential candidates (use `extract_metadata=true` in `build_index`, you can also provide some filters explicitly via `tag_filter`)
#   - [OPTIONAL] re-rank the candidate chunks (define and provide your own `rerank_strategy`, eg Cohere ReRank API)
#   - build a context from the closest chunks (use `chunks_window_margin` to tweak if we include preceding and succeeding chunks as well, see `?build_context` for more details)
#   - generate an answer from the closest chunks (use `return_context=true` to see under the hood and debug your application)
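#
# To build some intuition for the retrieval steps above, here is a minimal sketch in plain Julia. It is an illustration under assumptions, not the RAGTools implementation: `toy_top_k`, `toy_expand_window`, and the toy embeddings are hypothetical names and values invented for this example, and the window logic only assumes that a margin means "also take this many neighbouring chunks before and after".

using LinearAlgebra: dot, norm

## Hypothetical helper: rank chunks (columns) by cosine similarity to the question embedding
function toy_top_k(chunk_embeddings, question_embedding; top_k = 2, minimum_similarity = -1.0)
    q = question_embedding ./ norm(question_embedding)
    similarities = [dot(col, q) / norm(col) for col in eachcol(chunk_embeddings)]
    ## Best match first, then drop anything below the similarity threshold
    ranked = sortperm(similarities; rev = true)
    kept = filter(i -> similarities[i] >= minimum_similarity, ranked)
    return first(kept, top_k)
end

## Hypothetical helper: add neighbouring chunk positions around each hit
function toy_expand_window(positions, n_chunks; margin = (1, 1))
    expanded = Int[]
    for p in positions
        append!(expanded, max(1, p - margin[1]):min(n_chunks, p + margin[2]))
    end
    return unique(expanded)
end

## Three made-up 2-dimensional "embeddings" (one per column) and a made-up question embedding
toy_embeddings = [1.0 0.0 0.7;
                  0.0 1.0 0.7]
toy_question = [1.0, 0.1]
toy_hits = toy_top_k(toy_embeddings, toy_question; top_k = 2)
toy_expand_window(toy_hits, size(toy_embeddings, 2))

# Raising `minimum_similarity` in this sketch drops weakly related chunks before the `top_k` cut, which is the same trade-off you tune with the real `airag` parameters.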

# You should save the index for later to avoid re-embedding / re-extracting the document chunks!
serialize("examples/index.jls", index)
index = deserialize("examples/index.jls")

# # Evaluations
# However, we want to evaluate the quality of the system. For that, we need a set of questions and answers.
# Ideally, we would hand-craft a set of high quality Q&A pairs. However, this is time consuming and expensive.
# Let's generate them from the chunks in our index!

# ## Generate Q&A pairs

# We need to provide: chunks and sources (filepaths for future reference)
evals = build_qa_evals(RT.chunks(index),
RT.sources(index);
instructions = "None.",
verbose = true);
## Info: Q&A Sets built! (cost: $0.143) -- not bad!

# > [!TIP]
# > In practice, you would review each item in this golden evaluation set (and delete any generic/poor questions).
# > It will determine the future success of your app, so you need to make sure it's good!

## Save the evals for later
JSON3.write("examples/evals.json", evals)
evals = JSON3.read("examples/evals.json", Vector{RT.QAEvalItem});

# ## Explore one Q&A pair
# Let's explore one evals item -- it's not the best but gives you the idea!
#
evals[1]

# ## Evaluate this Q&A pair

# Let's evaluate this QA item with a "judge model" (often GPT-4 is used as a judge).

## Note that we use the same question, but generate a different context and answer via `airag`
msg, ctx = airag(index; evals[1].question, return_context = true);

## ctx is a RAGContext object that keeps all intermediate states of the RAG pipeline for easy evaluation
judged = aiextract(:RAGJudgeAnswerFromContext;
ctx.context,
ctx.question,
ctx.answer,
return_type = RT.JudgeAllScores)
judged.content
## Dict{Symbol, Any} with 7 entries:
## :final_rating => 4.8
## :clarity => 5
## :completeness => 5
## :relevance => 5
## :consistency => 4
## :helpfulness => 5
## :rationale => "The answer is highly relevant to the user's question, as it provides a comprehensive list of frameworks that are compared with DataFrames.jl. The answer is complete, covering all

# We can also run the whole evaluation in a function (a few more metrics are available):
## Note: `ctx` was generated from `evals[1].question`, so we evaluate it against the same item
x = run_qa_evals(evals[1], ctx;
    parameters_dict = Dict(:top_k => 3), verbose = true, model_judge = "gpt4t")

# Fortunately, we don't have to do this one by one -- let's evaluate all our Q&A pairs at once.

# ## Evaluate the whole set

# Let's run each question & answer through our eval loop asynchronously (we do it only for the first 10 to save time). See `?airag` for which parameters you can tweak, eg, `top_k`

results = asyncmap(evals[1:10]) do qa_item
    ## Generate an answer
    msg, ctx = airag(index; qa_item.question, return_context = true,
        top_k = 3, verbose = false)
    ## Evaluate the response -- often you want the `model_judge` to be the highest quality possible, eg, "GPT-4 Turbo" (alias "gpt4t")
    ## Note: you can log key parameters for easier analysis later
    run_qa_evals(qa_item, ctx; parameters_dict = Dict(:top_k => 3), verbose = false,
        model_judge = "gpt4t")
end
## Note that the "failed" evals can show as "nothing", so make sure to handle them.
results = filter(x -> !isnothing(x.answer_score), results);

# Note: You could also use the vectorized version `results = run_qa_evals(evals)` to evaluate all items at once.

## Let's take a simple average to calculate our score
@info "RAG Evals: $(length(results)) results, Avg. score: $(round(mean(x->x.answer_score, results);digits=1)), Retrieval score: $(round(Int,100*mean(x->x.retrieval_score,results)))%"
## [ Info: RAG Evals: 10 results, Avg. score: 4.6, Retrieval score: 100%

# Note: The retrieval score is 100% only because we have two small documents and are running on only 10 items. In practice, you would have a much larger document set and a much larger eval set, which would result in a more representative retrieval score.

# You can also analyze the results in a DataFrame:

df = DataFrame(results)
first(df, 5)

# We're done for today!

# # What would we do next?
# - Review your evaluation golden data set and keep only the good items
# - Play with the chunk sizes (`max_length` in `build_index`) and see how it affects the quality
# - Explore using metadata/key filters (`extract_metadata=true` in `build_index`)
# - Add filtering for semantic similarity (embedding distance) to make sure we don't pick up irrelevant chunks in the context
# - Use multiple indices or a hybrid index (add a simple BM25 lookup from TextAnalysis.jl)
# - Data processing is the most important step - properly parsed and split text can work wonders
# - Add re-ranking of context (see the `rerank` function, you can use Cohere ReRank API)
# - Improve the question embedding (eg, rephrase it, generate hypothetical answers and use them to find better context)
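#
# As a starting point for the chunk-size experiments in the list above, here is a minimal sketch of a word-based splitter. It is an illustration under assumptions, not how `build_index` splits text: `toy_chunks` is a hypothetical helper, and its `max_length` (counted in characters) only mimics the idea behind the `max_length` keyword.

## Hypothetical helper: greedily pack whole words into chunks of at most `max_length` characters
function toy_chunks(text::AbstractString; max_length::Int = 40)
    chunks = String[]
    current = ""
    for word in split(text)
        candidate = isempty(current) ? String(word) : current * " " * word
        if length(candidate) > max_length && !isempty(current)
            ## The next word would overflow the chunk, so close it and start a new one
            push!(chunks, current)
            current = String(word)
        else
            current = candidate
        end
    end
    isempty(current) || push!(chunks, current)
    return chunks
end

## Smaller `max_length` means more, shorter chunks
toy_small = toy_chunks("a bb ccc dddd"; max_length = 5)
toy_large = toy_chunks("a bb ccc dddd"; max_length = 40)

# A single word longer than `max_length` is kept whole in this sketch, so treat the limit as a soft target.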
#
# ... and much more! See some ideas in [Anyscale RAG tutorial](https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1)