Template for a new AutoRAG project
pip install -r requirements.txt
Running the above command will automatically install AutoRAG.
First, set the OPENAI_API_KEY
environment variable directly or create a .env
file and input it there.
To use AutoRAG, you first need to create a RAG evaluation dataset. Follow the steps below to create and use the dataset yourself.
- Check the original documents in
raw_docs
. In this tutorial, we will use three PDF documents. - Run
run_parse.py
. This file allows you to execute parsing methods specified inconfig/parse.yaml
and compare the results.
python make_parse.py
- In the
parsed_raw
folder, you will find several parquet files within the trial folder (numbered folders). These are the parsed results. Load them usingpandas
to inspect them directly. - Execute
run_chunk.py
to perform chunking using various methods. You can check the chunking methods inconfig/chunk.yaml
. You need to set the raw file at this point.
python run_chunk.py --raw_path ./parsed_raw/0/2.parquet
- After execution, check the
chunked_corpus
folder for the various chunked files created using different chunking methods. - Now, run the
make_qa.py
file. You need to set the raw file used for chunk creation and the chunk file to be used. Choose an appropriate chunk file, and you can generate a QA dataset using other chunk files later. You don't need to generate questions again. Refer to the update_corpus feature explained later.
python make_qa.py --raw_path ./parsed_raw/0/5.parquet --chunk_path ./chunked_corpus/0/3.parquet --qa_size 5
- Check the
generated_qa.parquet
andgenerated_corpus.parquet
files created in thedata
folder.
- Copy the
.env.template
file to create a.env
file and save it. Be sure to input your OpenAI API key in this file. - Run main.py as shown below to start AutoRAG.
python3 main.py --config ./config/tutorial.yaml
- Once the benchmark folder is created, you can check the results there.
- Create a
benchmark
folder. - Set the
OPENAI_API_KEY
as an environment variable.export OPENAI_API_KEY=sk-xxxx
- Execute the CLI command below to start AutoRAG optimization.
autorag evaluate --qa_data_path ./data/qa.parquet --corpus_data_path ./data/corpus.parquet \
--config ./config/tutorial.yaml --project_dir ./benchmark
To run with the dataset created in the dataset tutorial, replace corpus.parquet
with corpus_new.parquet
and qa.parquet
with qa_new.parquet
.
4. Once the benchmark folder is created, you can check the results there.
Run the command below to load the dashboard. You can easily review the results through the dashboard.
autorag dashboard --trial_dir ./benchmark/0
Run Streamlit to directly use the optimized RAG. Execute the command below.
autorag run_web --trial_path ./benchmark/0
This feature allows you to generate new QA files based on chunk corpora using the same raw file.
Try it as shown below.
from autorag.data.qa.schema import Raw, Corpus, QA
raw = Raw(initial_raw_df)
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)
new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))