Start by filtering the Wikipedia dataset to match specific date and length criteria.
./dataset/filter_wikipedia_dataset.py --output_dir data
Process the filtered dataset to prepare metadata.
./dataset/process_wikipedia_parser.py --data_path data/Wikipedia2023-len-1k-to-3k/train --output_dir data/metadata
To enrich your dataset with high-quality references, collect links either from the reference section of Wikipedia articles or by using Google's API. The collected links should be formatted as follows:
{
"title": "The Title of the Article",
"url": "https://example.com/link-to-the-article",
"source": "Wikipedia/Google"
}
This step is crucial for gathering comprehensive background information and supporting materials for the dataset. Store these links in a structured format, as they will be used in subsequent steps for scraping and analysis.
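The link records above can be validated and stored with a small helper before scraping. This is an illustrative sketch, not part of the repo's scripts; the function names and the output path are assumptions, and only the three fields shown above are checked.

```python
import json

# Fields every collected link record must carry (per the format above).
REQUIRED_FIELDS = {"title", "url", "source"}

def validate_link(record):
    """Return True if the record has the expected fields, a known source,
    and an http(s) URL."""
    return (REQUIRED_FIELDS <= record.keys()
            and record["source"] in ("Wikipedia", "Google")
            and record["url"].startswith("http"))

def save_links(records, path):
    """Write only the valid link records to a JSON file; return the count."""
    valid = [r for r in records if validate_link(r)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(valid, f, ensure_ascii=False, indent=2)
    return len(valid)
```

Dropping malformed records here, before scraping, avoids wasted requests in the next step.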
Scrape the collected links to gather the data.
./dataset/scrape_links.py --input_dir data/search_link --output_dir data/scraped_data
After initial filtering and data cleaning, it's essential to organize the dataset for further processing and analysis. The data should be managed in a structured format as follows:
{
"doc_id": "Unique Document Identifier",
"content": "The full text content of the document"
}
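One way to produce records in this shape is to derive the identifier from a content hash, so re-running the pipeline assigns the same ID to the same document. This is a sketch under that assumption; the actual scripts may assign doc_ids differently.

```python
import hashlib

def make_doc(content):
    """Wrap raw text in the {"doc_id", "content"} record format.

    The doc_id here is a truncated SHA-1 of the content, which makes IDs
    stable across runs (an illustrative scheme, not the repo's)."""
    doc_id = hashlib.sha1(content.encode("utf-8")).hexdigest()[:12]
    return {"doc_id": doc_id, "content": content}
```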
To facilitate more efficient processing and retrieval, large documents should be chunked into segments that can be processed individually by the system:
./dataset/chunk_docs.py --input_dir data/doc --output_dir data/doc/chunked
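The chunking step can be sketched as a sliding word window with overlap, so sentences cut at a boundary still appear whole in the next chunk. The window and overlap sizes below are placeholders; `chunk_docs.py` may use different units (e.g. tokens) and sizes.

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break  # the window already reached the end of the document
    return chunks
```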
Our framework leverages FastChat in conjunction with open-source large language models (LLMs) for generating text. Additionally, for testing and comparison purposes, we utilize OpenAI's API to generate text using GPT-3.5.
The first step in the text generation process is to create prompts that will guide the model in producing the desired content. These prompts are crafted to encapsulate the context and specify the information or narrative style we aim to generate.
To generate prompts, use the following script:
./generation/generate_prompts.py
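A prompt of the kind described above might be assembled as follows. The template is purely illustrative; the actual prompt wording lives in `generation/generate_prompts.py`, and the citation convention shown ([n] markers) mirrors the evaluation format described later.

```python
def build_prompt(title, outline_sections, context_chunks):
    """Assemble a generation prompt from an article title, its outline
    sections, and retrieved reference chunks (illustrative template only)."""
    refs = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    sections = ", ".join(outline_sections)
    return (f"Write a Wikipedia-style article titled '{title}' with "
            f"sections: {sections}.\n"
            f"Cite the references below using [n] after supported sentences.\n"
            f"References:\n{refs}")
```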
FastChat is employed to generate text responses based on prompts derived from the dataset. Before generating responses, it's crucial to prepare outlines for the RRPR (Rapid Response Preparation Routine) process. These outlines help structure the generation process and ensure that the responses are organized and relevant.
Outlines are generated in the following format, capturing the structure of the content to be generated:
{
"pageid1": ["section_name1", "section_name2", "..."],
"...":"..."
}
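An outline file in this format can be consumed section by section, e.g. when issuing one retrieval query per section. The helper below is a sketch, assuming the exact pageid-to-section-list mapping shown above.

```python
import json

def iter_outline_sections(outline_path):
    """Yield (pageid, section_name) pairs from an outline JSON file."""
    with open(outline_path, encoding="utf-8") as f:
        outline = json.load(f)
    for pageid, sections in outline.items():
        for section in sections:
            yield pageid, section
```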
For efficient and accurate retrieval of relevant documents, our framework adopts the Dense Passage Retrieval (DPR) methodology.
The initial step in the DPR process involves generating embeddings for the documents. These embeddings represent documents in a high-dimensional vector space, enabling the calculation of relevance scores between documents and queries.
To generate context embeddings, run the following command:
./retrieval/generate_context_embedding.py --metadata_dir data/metadata --docs_dir data/doc/chunked --embeddings_dir dpr_context_embeddings
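Conceptually, the embedding step maps each chunk to a fixed-size vector and persists the resulting matrix for retrieval. In the sketch below, `encode_fn` is a hypothetical stand-in for a real DPR context encoder; wiring in the actual model is what `generate_context_embedding.py` does.

```python
import numpy as np

def embed_and_save(chunks, encode_fn, out_path):
    """Encode document chunks into a (num_chunks, dim) matrix and save it.

    encode_fn maps one chunk string to a 1-D embedding vector; it is a
    placeholder for a DPR context encoder."""
    embs = np.stack([encode_fn(c) for c in chunks])
    np.save(out_path, embs)
    return embs.shape
```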
With the context embeddings generated, the next step is to retrieve the top-k documents related to a given query or set of queries. This is achieved by calculating similarity scores between the query embeddings and document embeddings, typically using the dot product as a measure of similarity.
To retrieve documents using DPR, execute the following command:
./retrieve_with_dpr.py --metadata_dir data/metadata --docs_dir data/doc/chunked --embeddings_dir dpr_context_embeddings --outline_file vicuna-7b_outline.json --docs_num 50 --output_file top-50-dpr-vicuna-7b.json
The result provides a ranked list of documents based on their relevance to the query.
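The dot-product ranking described above reduces to a few lines of NumPy. This is a minimal sketch of the scoring step only, assuming query and document embeddings of matching dimension; `retrieve_with_dpr.py` additionally handles batching and output formatting.

```python
import numpy as np

def top_k_docs(query_emb, doc_embs, k=50):
    """Rank document chunks by dot-product similarity to the query
    embedding; return the indices and scores of the top k."""
    scores = doc_embs @ query_emb       # one similarity score per document
    order = np.argsort(-scores)[:k]     # indices sorted by descending score
    return order.tolist(), scores[order].tolist()
```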
For evaluation, make sure your data is organized in the following form:
{
"text": "the generated Wikipedia article; use ==section== to mark section names, and [] after a sentence to indicate the cited chunks.",
"retrieve": "a list of chunks to be cited"
}
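The ==section== and [n] conventions in the "text" field can be parsed back out for scoring. The helper below is an illustrative sketch of that parsing, not the repo's metric code.

```python
import re

def parse_generation(text):
    """Extract section names and cited chunk indices from generated text
    that follows the ==section== and [n] conventions."""
    sections = re.findall(r"==\s*(.*?)\s*==", text)
    citations = [int(m) for m in re.findall(r"\[(\d+)\]", text)]
    return sections, citations
```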
Each JSON file has the same name as the corresponding file in data. Enter the metrics folder and execute the following commands:
./metrics.py --path /path_to_your_generation
./nli.py --path /path_to_your_generation
./scores.py --path /path_to_your_generation