Document Q&A answers questions about the provided documents, regardless of which section of a document the question relates to.
- To create a Hugging Face user access token, or to look up an existing one, visit: https://huggingface.co/settings/tokens.
- Create a new environment:
conda create -p genai python==3.9 -y
- Activate the environment (it was created with -p, so it is activated by path):
conda activate ./genai
- Install the requirements:
pip install -r requirements.txt
- Run the Streamlit application:
streamlit run app.py
- Upload one or more PDF files. Loading takes a little time, because the backend reads, processes, chunks, and indexes the PDF files.
- A preview of the content is shown. Expand it to look at the content.
- Ask the question you want answered from the documents.
Quick start: https://huggingface.co/spaces/susheel-1999/documentQA
LangChain is a framework for developing applications powered by language models. It enables applications that are context-aware and that can reason.
- Chunking process - The recursive character text splitter is parameterized by a list of separator characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This keeps paragraphs (and then sentences, and then words) together as long as possible, since those tend to be the most strongly semantically related pieces of text.
from langchain.text_splitter import RecursiveCharacterTextSplitter
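The splitting strategy described above can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function name `recursive_split` and the `chunk_size` parameter are assumptions made for this example, and features such as chunk overlap are left out.

```python
from typing import List, Optional

DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text: str, chunk_size: int,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order: paragraphs first, then lines, then
    words, and finally a hard cut every chunk_size characters.
    """
    if separators is None:
        separators = DEFAULT_SEPARATORS
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut the text at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep it together with its neighbours.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks
```

With a small `chunk_size`, a short paragraph stays whole, while a paragraph that is too long is broken at word boundaries instead of mid-word.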
Types of chunking:
i) Character Text Splitter - Text is split on a single character separator.
ii) Recursive Character Text Splitter - Text is split on a sequence of separator characters, tried in order. This method is particularly effective for retaining the structure of paragraphs and sentences.
iii) Document Based Splitter - Text is split based on the structure of the document. This approach caters to specific document formats, such as Python code, HTML, Markdown, and more.
iv) Semantic Chunking - Aims to identify points in the text where sentence similarity changes significantly (potentially using a threshold when comparing with the following sentence). These points serve as separators for creating meaningful chunks.
- Integration of Hugging Face Models and Embeddings - LangChain provides access to Hugging Face models and embeddings through the following imports.
Embeddings: from langchain_community.embeddings import HuggingFaceEmbeddings
LLMs: from langchain_community.llms import HuggingFaceHub
- Integration of VectorDB - LangChain integrates with and supports many vector databases.
FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is particularly useful for large-scale vector search, where you need to find the nearest neighbors of a given vector in a massive dataset. It supports various index types, is optimized for both CPU and GPU, and is designed to handle billions of vectors efficiently, which makes it highly scalable. Example use cases: image search, document search, and recommendation engines.
from langchain_community.vectorstores import FAISS
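To make "nearest neighbors of a given vector" concrete, here is a minimal exhaustive search in plain Python. It does not use the FAISS API. The function names are hypothetical, and a library such as FAISS replaces this linear scan with optimized index structures.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query: Sequence[float],
                      vectors: List[Sequence[float]],
                      k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to query."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]
```

Every query compares against every stored vector, so the cost grows linearly with the dataset. That cost is what specialized similarity-search indexes are meant to reduce.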
- Schema - Class for storing a piece of text and associated metadata. TO conver
from langchain.schema import Document
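As a rough picture of what such a class holds, the sketch below defines a stand-in with the same two fields as LangChain's Document (`page_content` and `metadata`). `SimpleDocument` and `wrap_chunks` are hypothetical names for illustration, and the metadata keys are an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in: a piece of text plus its metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def wrap_chunks(chunks: List[str], source: str) -> List[SimpleDocument]:
    """Attach the source file name and chunk position to each text chunk."""
    return [
        SimpleDocument(page_content=chunk, metadata={"source": source, "chunk": i})
        for i, chunk in enumerate(chunks)
    ]
```

Keeping the source and position next to the text makes it possible to show where an answer came from after retrieval.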
- Prompt Template - A template of a prompt can be easily designed with the help of the PromptTemplate class.
from langchain.prompts import PromptTemplate
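The idea behind a prompt template can be shown without the library: a string with named placeholders that is filled in at question time. `SimplePromptTemplate` below is a hypothetical, much-reduced stand-in and not the PromptTemplate class itself. The template wording is also only an example.

```python
from string import Formatter
from typing import List


class SimplePromptTemplate:
    """Hypothetical stand-in: a template string with named placeholders."""

    def __init__(self, template: str) -> None:
        self.template = template
        # Collect the placeholder names, e.g. {context} and {question}.
        self.input_variables: List[str] = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **values: str) -> str:
        missing = [v for v in self.input_variables if v not in values]
        if missing:
            raise KeyError(f"missing prompt variables: {missing}")
        return self.template.format(**values)


QA_TEMPLATE = SimplePromptTemplate(
    "Use the context to answer the question.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)
```

Checking for missing variables before formatting gives a clearer error than a partially filled prompt reaching the model.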
- LLM chain - The LLMChain class is used to execute the PromptTemplate.
from langchain.chains import LLMChain
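Conceptually, such a chain does two things in sequence: it fills the prompt template, then it sends the result to the model. The sketch below shows that flow with a hypothetical `SimpleLLMChain` class and a stub model. It is not the LLMChain API, and `fake_llm` only stands in for a real model call.

```python
from typing import Callable


class SimpleLLMChain:
    """Hypothetical stand-in: format a prompt, then call the model with it."""

    def __init__(self, template: str, llm: Callable[[str], str]) -> None:
        self.template = template
        self.llm = llm

    def run(self, **values: str) -> str:
        prompt = self.template.format(**values)  # step 1: fill the template
        return self.llm(prompt)                  # step 2: query the model


def fake_llm(prompt: str) -> str:
    """Stub model used for illustration; a real LLM call would go here."""
    return f"[stub answer for a prompt of {len(prompt)} characters]"


chain = SimpleLLMChain("Context: {context}\nQuestion: {question}", fake_llm)
```

Because the model is passed in as a plain callable, the same chain can be exercised with a stub in tests and with a real model in the application.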
Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes we can build and deploy powerful data apps.
- Session State - Session State is a way to share variables between reruns, for each user session.
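Streamlit reruns the script from top to bottom on every interaction, so ordinary variables are reset each time. The simulation below imitates that behaviour without Streamlit: a plain dict stands in for `st.session_state`, and `app_script` is a hypothetical script body that is called once per rerun.

```python
from typing import Any, Dict, List


def build_index() -> List[str]:
    """Placeholder for an expensive step such as indexing the PDFs."""
    return ["chunk-1", "chunk-2"]


def app_script(session_state: Dict[str, Any], question: str) -> int:
    """One simulated rerun; returns how many questions were asked so far."""
    if "index" not in session_state:
        # Runs only on the first rerun of this session.
        session_state["index"] = build_index()
        session_state["builds"] = session_state.get("builds", 0) + 1
    # The history persists across reruns because it lives in the session state.
    session_state.setdefault("history", []).append(question)
    return len(session_state["history"])
```

In the real application the same pattern would use `st.session_state` in place of the dict, so the index is built once per user session and not once per question.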
- Technique 1: Stuff
Uses all of the text from the documents in the prompt. It does not work when the data exceeds the token limit, and it can cause rate-limiting errors.
- Technique 2: map_reduce
It separates the text into batches, feeds each batch together with the question to the LLM separately, and produces the final answer from the answers to each batch.
- Technique 3: refine
It separates the text into batches, feeds the first batch to the LLM, and then feeds the answer and the second batch to the LLM. It refines the answer by going through all the batches.
- Technique 4: map_rerank
It separates the text into batches, feeds each batch to the LLM, which returns an answer together with a score of how fully it answers the question, and picks the final answer from the highest-scoring answers.
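The four strategies above differ mainly in how many model calls they make and what each call sees. The sketch below expresses them as plain functions over a list of text batches. The prompt wording, the function names, and the `CountingLLM` stub are all assumptions for illustration, not LangChain's chain implementations.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]
ScoringLLM = Callable[[str], Tuple[str, float]]  # returns (answer, score)


def ask(context: str, question: str) -> str:
    return f"Context:\n{context}\n\nQuestion: {question}"


def stuff(llm: LLM, chunks: List[str], question: str) -> str:
    """One call that contains all of the text."""
    return llm(ask("\n".join(chunks), question))


def map_reduce(llm: LLM, chunks: List[str], question: str) -> str:
    """One call per batch, then one call to combine the partial answers."""
    partial = [llm(ask(chunk, question)) for chunk in chunks]
    return llm(ask("Partial answers:\n" + "\n".join(partial), question))


def refine(llm: LLM, chunks: List[str], question: str) -> str:
    """Answer from the first batch, then revise it with each later batch."""
    if not chunks:
        return ""
    answer = llm(ask(chunks[0], question))
    for chunk in chunks[1:]:
        answer = llm(ask(f"Existing answer: {answer}\nNew text: {chunk}", question))
    return answer


def map_rerank(llm: ScoringLLM, chunks: List[str], question: str) -> str:
    """One scored call per batch; keep the answer with the highest score."""
    if not chunks:
        return ""
    scored = [llm(ask(chunk, question)) for chunk in chunks]
    return max(scored, key=lambda pair: pair[1])[0]


class CountingLLM:
    """Stub model that counts calls; stands in for a real LLM."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer {self.calls}"
```

For n batches, this sketch makes 1 call for stuff, n + 1 for map_reduce, and n for both refine and map_rerank.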
One issue with Techniques 1, 2, 3, and 4 is that they can be very costly, because they send more text and make multiple calls to the OpenAI API, which charges by the number of tokens. A better solution is RAG (Retrieval-Augmented Generation), which retrieves the relevant text chunks first and passes only those chunks to the language model.
- Technique 5: RAG
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response.
Steps involved:
i. Document Indexing into VectorDB
ii. Data Retriever
iii. Data Augmentation and Prompt Engineering
iv. Querying
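The four steps above can be sketched end to end in plain Python. This is a toy version under stated assumptions: word counts stand in for real embeddings, a list stands in for the vector database, and the `llm` argument is any callable that takes a prompt. None of the names below come from the application's code.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

Index = List[Tuple[str, Counter]]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real app uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(chunks: List[str]) -> Index:
    """Step i: index every chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: Index, question: str, k: int = 2) -> List[str]:
    """Step ii: return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(context_chunks: List[str], question: str) -> str:
    """Step iii: augment the question with the retrieved context only."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def answer(index: Index, question: str, llm: Callable[[str], str], k: int = 2) -> str:
    """Step iv: query the model with the augmented prompt."""
    return llm(build_prompt(retrieve(index, question, k), question))
```

Only the top k chunks reach the model, so the prompt size stays bounded no matter how many documents are indexed.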
Langchain - https://python.langchain.com/docs/get_started/introduction
OpenAI - https://platform.openai.com/docs/introduction
Streamlit - https://docs.streamlit.io/library/api-reference/session-state