diff --git a/.env b/.env
new file mode 100644
index 0000000..31b87d1
--- /dev/null
+++ b/.env
@@ -0,0 +1,2 @@
+VOLUME="/source/local-machine/dir:target/multi-container/app/dir"
+# VOLUME="c:/Users/User/:/User" e.g.
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 9888957..2059c7a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 flagged/
 scripts/__pycache__
-docker/build_command.sh
\ No newline at end of file
+docker/__pycache__
+docker/flagged
\ No newline at end of file
diff --git a/.v0_1_1/README.md b/.v0_1_1/README.md
new file mode 100644
index 0000000..bcae638
--- /dev/null
+++ b/.v0_1_1/README.md
@@ -0,0 +1,149 @@
+# everything-rag
+
+>_How was this README generated? Leveraging the power of AI with **reAIdme**, a HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf._
+_Go and give it a try [here](https://hf.co/chat/assistant/660d9a4f590a7924eed02a32)!_ 🤖
+
+
+ GitHub top language + GitHub commit activity + Static Badge + Static Badge + Static Badge + Static Badge +
+ Example chat +

Example chat with everything-rag, mediated by google/flan-t5-base

+
+
+
+
+### Table of Contents
+
+0. [TL;DR](#tldr)
+1. [Introduction](#introduction)
+2. [Inspiration](#inspiration)
+3. [Getting Started](#getting-started)
+4. [Using the Chatbot](#using-the-chatbot)
+5. [Troubleshooting](#troubleshooting)
+6. [Contributing](#contributing)
+7. [Upcoming features](#upcoming-features)
+8. [References](#reference)
+
+## TL;DR
+
+* This documentation is soooooo long, I want to get my hands dirty!!!
+   >You can try out everything-rag in the [dedicated HuggingFace space](https://huggingface.co/spaces/as-cle-bert/everything-rag), based on google/flan-t5-large.
+
+
+ +
+
+
+## Introduction
+
+Introducing **everything-rag**, your fully customizable and local chatbot assistant! 🤖
+
+With everything-rag, you can:
+
+1. Use virtually any LLM you want: Switch between different LLMs like _gemma-7b_ or _llama-7b_ to suit your needs.
+2. Use your own data: everything-rag can work with any data you provide, whether it's a PDF about data sciences or a document about pallas' cats!🐈
+3. Enjoy 100% local and 100% free functionality: No need for hosted APIs or pay-as-you-go services. everything-rag is completely free to use and runs on your desktop. Plus, with the chat_history functionality in ConversationalRetrievalChain, you can easily retrieve and review previous conversations with your chatbot, making it even more convenient to use.
+
+While everything-rag offers many benefits, there are a couple of limitations to keep in mind:
+
+1. Performance-critical tasks: Loading large models (>1~2 GB) and generating text can be resource-intensive, so it's recommended to have at least 16GB RAM and 4 CPU cores for optimal performance.
+2. Small LLMs can still hallucinate: While large LLMs like _gemma-7b_ and _llama-7b_ tend to produce better results, smaller models like _openai-community/gpt2_ can still produce suboptimal responses in certain situations.
+
+In summary, everything-rag is a simple, customizable, and local chatbot assistant that offers a wide range of features and capabilities. By leveraging the power of RAG, everything-rag offers a unique and flexible chatbot experience that can be tailored to your specific needs and preferences. Whether you're looking for a simple chatbot to answer basic questions or a more advanced conversational AI to engage with your users, everything-rag has got you covered.😊
+
+## Inspiration
+
+This project is a humble and modest carbon-copy of its main and true inspirations, i.e. [Jan.ai](https://jan.ai/), [Cheshire Cat AI](https://cheshirecat.ai/), [privateGPT](https://privategpt.io/) and many other projects that focus on making LLMs (and AI in general) open-source and easily accessible to everyone.
+
+## Getting Started
+
+You can do three things:
+
+- Play with generation on [Kaggle](https://www.kaggle.com/code/astrabertelli/gemma-for-datasciences)
+- Clone this repository, head over to [the python script](./scripts/gemma_for_datasciences.py) and modify everything to your needs!
+- Docker installation (🥳**FULLY IMPLEMENTED**): you can install everything-rag by pulling its Docker image and running it with Docker, following these really simple commands:
+
+```bash
+docker pull ghcr.io/astrabert/everything-rag:latest
+docker run -p 7860:7860 everything-rag:latest -m microsoft/phi-2 -t text-generation
+```
+- **IMPORTANT NOTE**: running the script within `docker run` does not log the port on which the app is running until you press `Ctrl+C`, but at that moment it also interrupts the execution! The app will run on port `0.0.0.0:7860` (or `localhost:7860` if you are on Windows), so just make sure to open your browser on that port and to refresh it after 30s to 1-2 mins, when the model and the tokenizer should be loaded and the app should be ready to work!
+
+- As you can see, you just need to specify the LLM model and its task (this is mandatory). Keep in mind that, for what concerns v0.1.1, everything-rag supports only text-generation and text2text-generation.
For these two tasks, you can use virtually *any* model from HuggingFace Hub: the sole recommendation is to watch out for your disk space, RAM and CPU power, since LLMs can be quite resource-consuming!
+
+## Using the Chatbot
+
+### GUI
+
+The chatbot has a brand-new Gradio-based interface that runs on a local server. You can interact by directly uploading your PDF files and/or sending messages, all by running:
+
+```bash
+python3 scripts/chat.py -m provider/modelname -t task
+```
+
+The suggested workflow is, nevertheless, the one that exploits Docker.
+
+### Code breakdown - notebook
+
+Everything is explained in [the dedicated notebook](./scripts/gemma-for-datasciences.ipynb), but here's a brief breakdown of the code:
+
+1. The first section imports the necessary libraries, including Hugging Face Transformers, langchain-community, and tkinter.
+2. The next section installs the necessary dependencies, including the gemma-2b model, and defines some useful functions for making the LLM-based data science assistant work.
+3. The create_a_persistent_db function creates a persistent database from a PDF file, using the PyPDFLoader to split the PDF into smaller chunks and the Hugging Face embeddings to transform the text into numerical vectors. The resulting database is stored in a LocalFileStore.
+4. The just_chatting function implements a chat system using the Hugging Face model and the persistent database. It takes a query, tokenizes it, and passes it to the model to generate a response. The response is then returned as a dictionary of strings.
+5. The chat_gui class defines a simple chat GUI that displays the chat history and allows the user to input queries. The send_message function is called when the user presses the "Send" button, and it sends the user's message to the just_chatting function to get a response.
+6. The script then creates a root Tk object and instantiates a ChatGUI object, which starts the main loop.
+
+Et voilà, your chatbot is up and running!🦿
+
+## Troubleshooting
+
+### Common Issues Q&A
+
+* Q: The chatbot is not responding😭
+    > A: Make sure that the PDF document is in the specified path and that the database has been created successfully.
+* Q: The chatbot is taking soooo long🫠
+    > A: This is quite common with resource-limited environments that deal with too large or too small models: large models require **at least** 32 GB RAM and a >8 core CPU, whereas small models can easily hallucinate, producing responses that are endless repetitions of the same thing! Check the *penalty_score* parameter to avoid this, and **try rephrasing the query and be as specific as possible**
+* Q: My model is hallucinating and/or repeating the same sentence over and over again😵‍💫
+    > A: This is quite common with small or old models: check the *penalty_score* and *temperature* parameters to avoid this.
+* Q: The chatbot is giving incorrect/non-meaningful answers🤥
+    >A: Check that the PDF document is relevant and up-to-date. Also, **try rephrasing the query and be as specific as possible**
+* Q: An error occurred while generating the answer💔
+    >A: This frequently occurs when your (small) LLM has a limited maximum hidden size (generally 512 or 1024) and the context that the retrieval-augmented chain produces goes beyond that maximum. You could, potentially, modify the configuration of the model, but this would mean dramatically increasing its resource consumption, and your small laptop is not prepared to take it, trust me!!!
A solution, if you have enough RAM and CPU power, is to switch to larger LLMs: they do not have problems in this sense. + +## Upcoming features🚀 + +- [ ] Multi-lingual support (expected for **version 0.2.0**) + +- [ ] More text-based tasks: question answering, summarisation (expected for **version 0.3.0**) + +- [ ] Computer vision: Image-to-text, image generation, image segmentation... (expected for **version 1.0.0**) + +## Contributing + + +Contributions are welcome! If you would like to improve the chatbot's functionality or add new features, please fork the repository and submit a pull request. + +## Reference + + +* [Hugging Face Transformers](https://github.com/huggingface/transformers) +* [Langchain-community](https://github.com/langchain-community/langchain-community) +* [Tkinter](https://docs.python.org/3/library/tkinter.html) +* [PDF document about data science](https://www.kaggle.com/datasets/astrabertelli/what-is-datascience-docs) +* [GradIO](https://www.gradio.app/) + +## License + +This project is licensed under the Apache 2.0 License. + +If you use this work for your projects, please consider citing the author [Astra Bertelli](http://astrabert.vercel.app). diff --git a/data/WhatisDataScienceFinalMay162018.pdf b/.v0_1_1/data/WhatisDataScienceFinalMay162018.pdf similarity index 100% rename from data/WhatisDataScienceFinalMay162018.pdf rename to .v0_1_1/data/WhatisDataScienceFinalMay162018.pdf diff --git a/data/example_chat.png b/.v0_1_1/data/example_chat.png similarity index 100% rename from data/example_chat.png rename to .v0_1_1/data/example_chat.png diff --git a/.v0_1_1/docker/Dockerfile b/.v0_1_1/docker/Dockerfile new file mode 100644 index 0000000..26b8040 --- /dev/null +++ b/.v0_1_1/docker/Dockerfile @@ -0,0 +1,31 @@ +# Use an official Python runtime as a parent image +FROM python:3.10-slim-bookworm + +# Set the working directory in the container to /app +WORKDIR /app + +# Add the current directory contents into the container at /app +ADD . /app + +# Update and install system dependencies +RUN apt-get update && apt-get install -y \ + build-essential \ + libpq-dev \ + libffi-dev \ + libssl-dev \ + musl-dev \ + libxml2-dev \ + libxslt1-dev \ + zlib1g-dev \ + && rm -rf /var/lib/apt/lists/* + +# Install Python dependencies +RUN python3 -m pip cache purge +RUN python3 -m pip install --no-cache-dir -r requirements.txt + + +# Expose the port that the application will run on +EXPOSE 7860 + +# Set the entrypoint with a default command and allow the user to override it +ENTRYPOINT ["python3", "chat.py"] \ No newline at end of file diff --git a/.v0_1_1/docker/__pycache__/utils.cpython-310.pyc b/.v0_1_1/docker/__pycache__/utils.cpython-310.pyc new file mode 100644 index 0000000..a48daad Binary files /dev/null and b/.v0_1_1/docker/__pycache__/utils.cpython-310.pyc differ diff --git a/.v0_1_1/docker/build_command.sh b/.v0_1_1/docker/build_command.sh new file mode 100644 index 0000000..348792f --- /dev/null +++ b/.v0_1_1/docker/build_command.sh @@ -0,0 +1,11 @@ +docker buildx build \ +--label org.opencontainers.image.title=everything-rag \ +--label org.opencontainers.image.description='Introducing everything-rag, your fully customizable and local chatbot assistant!' 
\ +--label org.opencontainers.image.url=https://github.com/AstraBert/everything-rag \ +--label org.opencontainers.image.source=https://github.com/AstraBert/everything-rag --label org.opencontainers.image.version=0.1.7 \ +--label org.opencontainers.image.created=2024-04-07T12:39:11.393Z \ +--label org.opencontainers.image.licenses=Apache-2.0 \ +--platform linux/amd64 \ +--tag ghcr.io/astrabert/everything-rag:latest \ +--tag ghcr.io/astrabert/everything-rag:0.1.1 \ +--push . \ No newline at end of file diff --git a/docker/chat.py b/.v0_1_1/docker/chat.py similarity index 100% rename from docker/chat.py rename to .v0_1_1/docker/chat.py diff --git a/.v0_1_1/docker/requirements.txt b/.v0_1_1/docker/requirements.txt new file mode 100644 index 0000000..b022f1f --- /dev/null +++ b/.v0_1_1/docker/requirements.txt @@ -0,0 +1,10 @@ +langchain-community==0.0.13 +langchain==0.1.1 +pypdf==3.17.4 +sentence_transformers==2.2.2 +chromadb==0.4.22 +cryptography>=3.1 +gradio +transformers +trl +peft \ No newline at end of file diff --git a/.v0_1_1/docker/utils.py b/.v0_1_1/docker/utils.py new file mode 100644 index 0000000..d493764 --- /dev/null +++ b/.v0_1_1/docker/utils.py @@ -0,0 +1,172 @@ +from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline +import time +from langchain_community.llms import HuggingFacePipeline +from langchain.storage import LocalFileStore +from langchain.embeddings import CacheBackedEmbeddings +from langchain_community.vectorstores import Chroma +from langchain.text_splitter import CharacterTextSplitter +from langchain_community.document_loaders import PyPDFLoader +from langchain_community.embeddings import HuggingFaceEmbeddings +from langchain.chains import ConversationalRetrievalChain +import os +from pypdf import PdfMerger +from argparse import ArgumentParser + + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +argparse.add_argument( + "-t", + "--task", + help="Task for the model: for now supported task are ['text-generation', 'text2text-generation']", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +tsk = args.task + +mod = mod.replace("\"", "").replace("'", "") +tsk = tsk.replace("\"", "").replace("'", "") + +TASK_TO_MODEL = {"text-generation": AutoModelForCausalLM, "text2text-generation": AutoModelForSeq2SeqLM} + +if tsk not in TASK_TO_MODEL: + raise Exception("Unsopported task! Supported task are ['text-generation', 'text2text-generation']") + +def merge_pdfs(pdfs: list): + merger = PdfMerger() + for pdf in pdfs: + merger.append(pdf) + merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf") + merger.close() + return f"{pdfs[-1].split('.')[0]}_results.pdf" + +def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None: + """ + Creates a persistent database from a PDF file. + + Args: + pdfpath (str): The path to the PDF file. + dbpath (str): The path to the storage folder for the persistent LocalDB. + cachepath (str): The path to the storage folder for the embeddings cache. 
+ """ + print("Started the operation...") + a = time.time() + loader = PyPDFLoader(pdfpath) + documents = loader.load() + + ### Split the documents into smaller chunks for processing + text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) + texts = text_splitter.split_documents(documents) + + ### Use HuggingFace embeddings for transforming text into numerical vectors + ### This operation can take a while the first time but, once you created your local database with + ### cached embeddings, it should be a matter of seconds to load them! + embeddings = HuggingFaceEmbeddings() + store = LocalFileStore( + os.path.join( + cachepath, os.path.basename(pdfpath).split(".")[0] + "_cache" + ) + ) + cached_embeddings = CacheBackedEmbeddings.from_bytes_store( + underlying_embeddings=embeddings, + document_embedding_cache=store, + namespace=os.path.basename(pdfpath).split(".")[0], + ) + + b = time.time() + print( + f"Embeddings successfully created and stored at {os.path.join(cachepath, os.path.basename(pdfpath).split('.')[0]+'_cache')} under namespace: {os.path.basename(pdfpath).split('.')[0]}" + ) + print(f"To load and embed, it took: {b - a}") + + persist_directory = os.path.join( + dbpath, os.path.basename(pdfpath).split(".")[0] + "_localDB" + ) + vectordb = Chroma.from_documents( + documents=texts, + embedding=cached_embeddings, + persist_directory=persist_directory, + ) + c = time.time() + print( + f"Persistent database successfully created and stored at {os.path.join(dbpath, os.path.basename(pdfpath).split('.')[0] + '_localDB')}" + ) + print(f"To create a persistent database, it took: {c - b}") + return vectordb + +def convert_none_to_str(l: list): + newlist = [] + for i in range(len(l)): + if l[i] is None or type(l[i])==tuple: + newlist.append("") + else: + newlist.append(l[i]) + return tuple(newlist) + +def just_chatting( + task, + model, + tokenizer, + query, + vectordb, + chat_history=[] +): + """ + Implements a chat system using Hugging Face models and a persistent database. + + Args: + task (str): Task for the pipeline; for now supported task are ['text-generation', 'text2text-generation'] + model (AutoModelForCausalLM): Hugging Face model, already loaded and prepared. + tokenizer (AutoTokenizer): Hugging Face tokenizer, already loaded and prepared. + model_task (str): Task for the Hugging Face model. + persistent_db_dir (str): Directory for the persistent database. + embeddings_cache (str): Path to cache Hugging Face embeddings. + pdfpath (str): Path to the PDF file. + query (str): Question by the user + vectordb (ChromaDB): vectorstorer variable for retrieval. 
+ chat_history (list): A list with previous questions and answers, serves as context; by default it is empty (it may make the model allucinate) + """ + ### Create a text-generation pipeline and connect it to a ConversationalRetrievalChain + pipe = pipeline(task, + model=model, + tokenizer=tokenizer, + max_new_tokens = 2048, + repetition_penalty = float(1.2), + ) + + local_llm = HuggingFacePipeline(pipeline=pipe) + llm_chain = ConversationalRetrievalChain.from_llm( + llm=local_llm, + chain_type="stuff", + retriever=vectordb.as_retriever(search_kwargs={"k": 1}), + return_source_documents=False, + ) + rst = llm_chain({"question": query, "chat_history": chat_history}) + return rst + + +try: + tokenizer = AutoTokenizer.from_pretrained( + mod, + ) + + + model = TASK_TO_MODEL[tsk].from_pretrained( + mod, + ) +except Exception as e: + import sys + print(f"The error {e} occured while handling model and tokenizer loading: please ensure that the model you provided was correct and suitable for the specified task. Be also sure that the HF repository for the loaded model contains all the necessary files.", file=sys.stderr) + sys.exit(1) + + diff --git a/.v0_1_1/scripts/__pycache__/utils.cpython-310.pyc b/.v0_1_1/scripts/__pycache__/utils.cpython-310.pyc new file mode 100644 index 0000000..01d7099 Binary files /dev/null and b/.v0_1_1/scripts/__pycache__/utils.cpython-310.pyc differ diff --git a/scripts/gemma-for-datasciences.ipynb b/.v0_1_1/scripts/gemma-for-datasciences.ipynb similarity index 100% rename from scripts/gemma-for-datasciences.ipynb rename to .v0_1_1/scripts/gemma-for-datasciences.ipynb diff --git a/scripts/gemma_for_datasciences.py b/.v0_1_1/scripts/gemma_for_datasciences.py similarity index 100% rename from scripts/gemma_for_datasciences.py rename to .v0_1_1/scripts/gemma_for_datasciences.py diff --git a/README.md b/README.md index bcae638..2c8315f 100644 --- a/README.md +++ b/README.md @@ -1,149 +1,70 @@ -# everything-rag +

everything-ai

+

Your fully proficient, AI-powered and local chatbot assistant🤖

->_How was this README generated? Levearaging the power of AI with **reAIdme**, an HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf._ -_Go and give it a try [here](https://hf.co/chat/assistant/660d9a4f590a7924eed02a32)!_ 🤖
- GitHub top language - GitHub commit activity - Static Badge - Static Badge - Static Badge - Static Badge + GitHub top language + GitHub commit activity + Static Badge + Static Badge + Docker image size + Static Badge
- Example chat -

Example chat with everything-rag, mediated by google/flan-t5-base

+ Flowchart +

Flowchart for everything-ai

- -### Table of Contents - -0. [TL;DR](#tldr) -1. [Introduction](#introduction) -2. [Inspiration](#inspiration) -2. [Getting Started](#getting-started) -3. [Using the Chatbot](#using-the-chatbot) -4. [Troubleshooting](#troubleshooting) -5. [Contributing](#contributing) -6. [Upcoming features](#upcoming-features) -7. [References](#reference) - -## TL;DR - -* This documentation is soooooo long, I want to get my hands dirty!!! - >You can try out everything-rag the [dedicated HuggingFace space](https://huggingface.co/spaces/as-cle-bert/everything-rag), based on google/flan-t5-large. - -
- -
- -## Introduction - -Introducing **everything-rag**, your fully customizable and local chatbot assistant! 🤖 - -With everything-rag, you can: - -1. Use virtually any LLM you want: Switch between different LLMs like _gemma-7b_ or _llama-7b_ to suit your needs. -2. Use your own data: everything-rag can work with any data you provide, whether it's a PDF about data sciences or a document about pallas' cats!🐈 -3. Enjoy 100% local and 100% free functionality: No need for hosted APIs or pay-as-you-go services. everything-rag is completely free to use and runs on your desktop. Plus, with the chat_history functionality in ConversationalRetrievalChain, you can easily retrieve and review previous conversations with your chatbot, making it even more convenient to use. - -While everything-rag offers many benefits, there are a couple of limitations to keep in mind: - -1. Performance-critical tasks: Loading large models (>1~2 GB) and generating text can be resource-intensive, so it's recommended to have at least 16GB RAM and 4 CPU cores for optimal performance. -2. Small LLMs can still allucinate: While large LLMs like _gemma-7b_ and _llama-7b_ tend to produce better results, smaller models like _openai-community/gpt2_ can still produce suboptimal responses in certain situations. - -In summary, everything-rag is a simple, customizable, and local chatbot assistant that offers a wide range of features and capabilities. By leveraging the power of RAG, everything-rag offers a unique and flexible chatbot experience that can be tailored to your specific needs and preferences. Whether you're looking for a simple chatbot to answer basic questions or a more advanced conversational AI to engage with your users, everything-rag has got you covered.😊 - -## Inspiration - -This project is a humble and modest carbon-copy of its main and true inspirations, i.e. [Jan.ai](https://jan.ai/), [Cheshire Cat AI](https://cheshirecat.ai/), [privateGPT](https://privategpt.io/) and many other projects that focus on making LLMs (and AI in general) open-source and easily accessible to everyone. - -## Getting Started - -You can do two things: - -- Play with generation on [Kaggle](https://www.kaggle.com/code/astrabertelli/gemma-for-datasciences) -- Clone this repository, head over to [the python script](./scripts/gemma_for_datasciences.py) and modify everything to your needs! -- Docker installation (🥳**FULLY IMPLEMENTED**): you can install everything-rag through docker image and running it thanks do Docker by following these really simple commands: - +## Quickstart +### 1. Clone this repository ```bash -docker pull ghcr.io/astrabert/everything-rag:latest -docker run -p 7860:7860 everything-rag:latest -m microsoft/phi-2 -t text-generation +git clone https://github.com/AstraBert/everything-ai.git +cd everything-ai ``` -- **IMPORTANT NOTE**: running the script within `docker run` does not log the port on which the app is running until you press `Ctrl+C`, but in that moment it also interrupt the execution! The app will run on port `0.0.0.0:7860` (or `localhost:7860` if your browser is Windows-based), so just make sure to open your browser on that port and to refresh it after 30s to 1 or 2 mins, when the model and the tokenizer should be loaded and the app should be ready to work! - -- As you can see, you just need to specify the LLM model and its task (this is mandatory). Keep in mind that, for what concerns v0.1.1, everything-rag supports only text-generation and text2text-generation. 
For these two tasks, you can use virtually *any* model from HuggingFace Hub: the sole recommendation is to watch out for your disk space, RAM and CPU power, LLMs can be quite resource-consuming! - -## Using the Chatbot - -### GUI - -The chatbot has a brand-new GradIO-based interface that runs on local server. You can interact by uploading directly your pdf files and/or sending messages, all by running: +### 2. Set your `.env` file +Modify the `VOLUME` variable in the .env file so that you can mount your local file system into Docker container. +An example could be: ```bash -python3 scripts/chat.py -m provider/modelname -t task +VOLUME="c:/Users/User/:/User/" ``` +This means that now everything that is under "c:/Users/User/" on your local machine is under "/User/" in your Docker container. -The suggested workflow is, nevertheless, the one that exploits Docker. - -### Code breakdown - notebook - -Everything is explained in [the dedicated notebook](./scripts/gemma-for-datasciences.ipynb), but here's a brief breakdown of the code: - -1. The first section imports the necessary libraries, including Hugging Face Transformers, langchain-community, and tkinter. -2. The next section installs the necessary dependencies, including the gemma-2b model, and defines some useful functions for making the LLM-based data science assistant work. -3. The create_a_persistent_db function creates a persistent database from a PDF file, using the PyPDFLoader to split the PDF into smaller chunks and the Hugging Face embeddings to transform the text into numerical vectors. The resulting database is stored in a LocalFileStore. -4. The just_chatting function implements a chat system using the Hugging Face model and the persistent database. It takes a query, tokenizes it, and passes it to the model to generate a response. The response is then returned as a dictionary of strings. -5. The chat_gui class defines a simple chat GUI that displays the chat history and allows the user to input queries. The send_message function is called when the user presses the "Send" button, and it sends the user's message to the just_chatting function to get a response. -6. The script then creates a root Tk object and instantiates a ChatGUI object, which starts the main loop. - -Et voilà, your chatbot is up and running!🦿 - -## Troubleshooting - -### Common Issues Q&A - -* Q: The chatbot is not responding😭 - > A: Make sure that the PDF document is in the specified path and that the database has been created successfully. -* Q: The chatbot is taking soooo long🫠 - > A: This is quite common with resource-limited environments that deal with too large or too small models: large models require **at least** 32 GB RAM and >8 core CPU, whereas small model can easily be allucinating and producing responses that are endless repetitions of the same thing! Check *penalty_score* parameter to avoid this. **try rephrasing the query and be as specific as possible** -* Q: My model is allucinating and/or repeating the same sentence over and over again😵‍💫 - > A: This is quite common with small or old models: check *penalty_score* and *temperature* parameter to avoid this. -* Q: The chatbot is giving incorrect/non-meaningful answers🤥 - >A: Check that the PDF document is relevant and up-to-date. 
Also, **try rephrasing the query and be as specific as possible**
-* Q: An error occurred while generating the answer💔
-    >A: This frequently occurs when your (small) LLM has a limited maximum hidden size (generally 512 or 1024) and the context that the retrieval-augmented chain produces goes beyond that maximum. You could, potentially, modify the configuration of the model, but this would mean dramatically increase its resource consumption, and your small laptop is not prepared to take it, trust me!!! A solution, if you have enough RAM and CPU power, is to switch to larger LLMs: they do not have problems in this sense.
-
-## Upcoming features🚀
-
-- [ ] Multi-lingual support (expected for **version 0.2.0**)
-
-- [ ] More text-based tasks: question answering, summarisation (expected for **version 0.3.0**)
-
-- [ ] Computer vision: Image-to-text, image generation, image segmentation... (expected for **version 1.0.0**)
+### 3. Pull the necessary images
+```bash
+docker pull astrabert/everything-ai
+docker pull qdrant/qdrant
+```
+### 4. Run the multi-container app
+```bash
+docker compose up
+```
+### 5. Go to `localhost:8760` and choose your assistant
-## Contributing
+You will see something like this:
+
+ Task choice interface +
-Contributions are welcome! If you would like to improve the chatbot's functionality or add new features, please fork the repository and submit a pull request. +Choose the task among: -## Reference +- *retrieval-text-generation*: use `qdrant` backend to build a retrieval-friendly knowledge base, which you can query and tune the response of your model on. You have to pass either a pdf/a bunch of pdfs specified as comma-separated paths or a directory where all the pdfs of interest are stored (**DO NOT** provide both); you can also specify the language in which the PDF is written, using [ISO nomenclature](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) - **MULTILINGUAL** +- *agnostic-text-generation*: ChatGPT-like text generation (no retrieval architecture), but supports every text-generation model on HF Hub (as long as your hardware supports it!) - **MULTILINGUAL** +- *text-summarization*: summarize text and pdfs, supports every text-summarization model on HF Hub - **ENGLISH ONLY** +- *image-generation*: stable diffusion, supports every text-to-image model on HF Hub - **MULTILINGUAL** +- *image-generation-pollinations*: stable diffusion, use Pollinations AI API; if you choose 'image-generation-pollinations', you do not need to specify anything else apart from the task - **MULTILINGUAL** +- *image-classification*: classify an image, supports every image-classification model on HF Hub - **ENGLISH ONLY** +- *image-to-text*: describe an image, supports every image-to-text model on HF Hub - **ENGLISH ONLY** +### 6. Go to `localhost:7860` and start using your assistant -* [Hugging Face Transformers](https://github.com/huggingface/transformers) -* [Langchain-community](https://github.com/langchain-community/langchain-community) -* [Tkinter](https://docs.python.org/3/library/tkinter.html) -* [PDF document about data science](https://www.kaggle.com/datasets/astrabertelli/what-is-datascience-docs) -* [GradIO](https://www.gradio.app/) +Once everything is ready, you can head over to `localhost:7860` and start using your assistant: -## License +
+ Chat interface +
-This project is licensed under the Apache 2.0 License. -If you use this work for your projects, please consider citing the author [Astra Bertelli](http://astrabert.vercel.app). +## Complete documentation is coming soon...🚀 \ No newline at end of file diff --git a/compose.yaml b/compose.yaml new file mode 100644 index 0000000..7c81222 --- /dev/null +++ b/compose.yaml @@ -0,0 +1,22 @@ +networks: + mynet: + driver: bridge + +services: + everything-ai: + image: astrabert/everything-ai + volumes: + - ${VOLUME} + networks: + - mynet + ports: + - "7860:7860" + - "8760:8760" + qdrant: + image: qdrant/qdrant + ports: + - "6333:6333" + volumes: + - "./qdrant_storage:/qdrant/storage" + networks: + - mynet \ No newline at end of file diff --git a/docker/Dockerfile b/docker/Dockerfile index 26b8040..a111e60 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -1,5 +1,5 @@ # Use an official Python runtime as a parent image -FROM python:3.10-slim-bookworm +FROM astrabert/everything-ai # Set the working directory in the container to /app WORKDIR /app @@ -7,25 +7,7 @@ WORKDIR /app # Add the current directory contents into the container at /app ADD . /app -# Update and install system dependencies -RUN apt-get update && apt-get install -y \ - build-essential \ - libpq-dev \ - libffi-dev \ - libssl-dev \ - musl-dev \ - libxml2-dev \ - libxslt1-dev \ - zlib1g-dev \ - && rm -rf /var/lib/apt/lists/* - -# Install Python dependencies -RUN python3 -m pip cache purge -RUN python3 -m pip install --no-cache-dir -r requirements.txt - - # Expose the port that the application will run on EXPOSE 7860 -# Set the entrypoint with a default command and allow the user to override it -ENTRYPOINT ["python3", "chat.py"] \ No newline at end of file +ENTRYPOINT [ "python3", "select_and_run.py" ] diff --git a/docker/__pycache__/utils.cpython-310.pyc b/docker/__pycache__/utils.cpython-310.pyc index a48daad..89576f1 100644 Binary files a/docker/__pycache__/utils.cpython-310.pyc and b/docker/__pycache__/utils.cpython-310.pyc differ diff --git a/docker/agnostic_text_generation.py b/docker/agnostic_text_generation.py new file mode 100644 index 0000000..889a6ab --- /dev/null +++ b/docker/agnostic_text_generation.py @@ -0,0 +1,42 @@ +import gradio as gr +from utils import Translation +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline +from argparse import ArgumentParser + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +mod = mod.replace("\"", "").replace("'", "") + +model_checkpoint = mod + +model = AutoModelForCausalLM.from_pretrained(model_checkpoint) +tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) + +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, repetition_penalty=1.2, temperature=0.4) + + +def reply(message, history): + txt = Translation(message, "en") + if txt.original == "en": + response = pipe(message) + return response[0]["generated_text"] + else: + translation = txt.translatef() + response = pipe(translation) + t = Translation(response[0]["generated_text"], txt.original) + res = t.translatef() + return res + + +demo = gr.ChatInterface(fn=reply, title="Multilingual-Bloom Bot") +demo.launch(server_name="0.0.0.0", share=False) \ No newline at end of file diff --git a/docker/image_classification.py b/docker/image_classification.py new file mode 100644 index 0000000..ca4cd6b --- 
/dev/null +++ b/docker/image_classification.py @@ -0,0 +1,41 @@ +from transformers import AutoModelForImageClassification, AutoImageProcessor, pipeline +from PIL import Image +from argparse import ArgumentParser + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +mod = mod.replace("\"", "").replace("'", "") + +model_checkpoint = mod + +model = AutoModelForImageClassification.from_pretrained(model_checkpoint) +processor = AutoImageProcessor.from_pretrained(model_checkpoint) + +pipe = pipeline("image-classification", model=model, image_processor=processor) + +def get_results(image, ppln=pipe): + img = Image.fromarray(image) + result = ppln(img) + scores = [] + labels = [] + for el in result: + scores.append(el["score"]) + labels.append(el["label"]) + return labels[scores.index(max(scores))] + +import gradio as gr +## Build interface with loaded image + ouput from the model +demo = gr.Interface(get_results, gr.Image(), "text") + +if __name__ == "__main__": + demo.launch(server_name="0.0.0.0", share=False) \ No newline at end of file diff --git a/docker/image_generation.py b/docker/image_generation.py new file mode 100644 index 0000000..7885816 --- /dev/null +++ b/docker/image_generation.py @@ -0,0 +1,44 @@ +from diffusers import DiffusionPipeline +import torch +from argparse import ArgumentParser + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +mod = mod.replace("\"", "").replace("'", "") + +model_checkpoint = mod + +pipe = DiffusionPipeline.from_pretrained(model_checkpoint, torch_dtype=torch.float32) + +import gradio as gr +from utils import Translation + + + +def reply(message, history): + txt = Translation(message, "en") + if txt.original == "en": + image = pipe(message).images[0] + image.save("generated_image.png") + return "Here's your image:\n![generated_image](generated_image.png)" + else: + translation = txt.translatef() + image = pipe(translation).images[0] + image.save("generated_image.png") + t = Translation("Here's your image:", txt.original) + res = t.translatef() + return f"{res}:\n![generated_image](generated_image.png)" + + +demo = gr.ChatInterface(fn=reply, title="everything-ai-sd-imgs") +demo.launch(server_name="0.0.0.0", share=False) \ No newline at end of file diff --git a/docker/image_generation_pollinations.py b/docker/image_generation_pollinations.py new file mode 100644 index 0000000..15cf488 --- /dev/null +++ b/docker/image_generation_pollinations.py @@ -0,0 +1,20 @@ +import gradio as gr +from utils import Translation + + + +def reply(message, history): + txt = Translation(message, "en") + if txt.original == "en": + image = f"https://pollinations.ai/p/{message.replace(' ', '_')}" + return f"Here's your image:\n![generated_image]({image})" + else: + translation = txt.translatef() + image = f"https://pollinations.ai/p/{translation.replace(' ', '_')}" + t = Translation("Here's your image:", txt.original) + res = t.translatef() + return f"{res}:\n![generated_image]({image})" + + +demo = gr.ChatInterface(fn=reply, title="everything-ai-pollinations-imgs") +demo.launch(server_name="0.0.0.0", share=False) \ No newline at end of file diff --git a/docker/image_to_text.py b/docker/image_to_text.py new file mode 100644 index 0000000..97f6554 --- 
/dev/null +++ b/docker/image_to_text.py @@ -0,0 +1,35 @@ +import torch +from transformers import pipeline +from PIL import Image +from argparse import ArgumentParser + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +mod = mod.replace("\"", "").replace("'", "") + +model_checkpoint = mod + +pipe = pipeline("image-to-text", model=model_checkpoint) + +def get_results(image, ppln=pipe): + img = Image.fromarray(image) + result = ppln(img, prompt="", generate_kwargs={"max_new_tokens": 1024}) + return result[0]["generated_text"].capitalize() + +import gradio as gr +## Build interface with loaded image + ouput from the model +demo = gr.Interface(get_results, gr.Image(), "text") + +if __name__ == "__main__": + demo.launch(server_name="0.0.0.0", share=False) + diff --git a/docker/requirements.txt b/docker/requirements.txt index b022f1f..486e55c 100644 --- a/docker/requirements.txt +++ b/docker/requirements.txt @@ -1,10 +1,14 @@ -langchain-community==0.0.13 -langchain==0.1.1 -pypdf==3.17.4 -sentence_transformers==2.2.2 -chromadb==0.4.22 -cryptography>=3.1 -gradio -transformers -trl -peft \ No newline at end of file +langchain-community==0.0.13 +langchain==0.1.1 +pypdf==3.17.4 +sentence_transformers==2.2.2 +transformers==4.39.3 +langdetect==1.0.9 +deep-translator==1.11.4 +torch==2.1.2 +gradio==4.25.0 +diffusers==0.27.2 +pydantic==2.6.4 +qdrant_client==1.9.0 +pillow==10.2.0 +accelerate \ No newline at end of file diff --git a/docker/retrieval_text_generation.py b/docker/retrieval_text_generation.py new file mode 100644 index 0000000..1089df4 --- /dev/null +++ b/docker/retrieval_text_generation.py @@ -0,0 +1,108 @@ +from utils import Translation, PDFdatabase, NeuralSearcher +import gradio as gr +from qdrant_client import QdrantClient +from sentence_transformers import SentenceTransformer +from argparse import ArgumentParser +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline +import sys +import os + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +argparse.add_argument( + "-pf", + "--pdf_file", + help="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated)", + required=False, + default="No file" +) + +argparse.add_argument( + "-d", + "--directory", + help="Directory where all your pdfs of interest are stored", + required=False, + default="No directory" +) + +argparse.add_argument( + "-l", + "--language", + help="Language of the written content contained in the pdfs", + required=False, + default="Same as query" +) + +args = argparse.parse_args() + + +mod = args.model +pdff = args.pdf_file +dirs = args.directory +lan = args.language + + +if pdff.replace("\\","").replace("'","") != "None" and dirs.replace("\\","").replace("'","") == "None": + pdfs = pdff.replace("\\","").replace("'","").split(",") +else: + pdfs = [os.path.join(dirs.replace("\\","").replace("'",""), f) for f in os.listdir(dirs.replace("\\","").replace("'","")) if f.endswith(".pdf")] + +client = QdrantClient(host="host.docker.internal", port="6333") +encoder = SentenceTransformer("all-MiniLM-L6-v2") + +pdfdb = PDFdatabase(pdfs, encoder, client) +pdfdb.preprocess() +pdfdb.collect_data() 
+pdfdb.qdrant_collection_and_upload() + +model = AutoModelForCausalLM.from_pretrained(mod) +tokenizer = AutoTokenizer.from_pretrained(mod) + +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, repetition_penalty=1.2, temperature=0.4) + +def reply(message, history): + global pdfdb + txt = Translation(message, "en") + if txt.original == "en" and lan.replace("\\","").replace("'","") == "None": + txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder) + results = txt2txt.search(message) + response = pipe(results[0]["text"]) + return response[0]["generated_text"] + elif txt.original == "en" and lan.replace("\\","").replace("'","") != "None": + txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder) + transl = Translation(message, lan.replace("\\","").replace("'","")) + message = transl.translatef() + results = txt2txt.search(message) + t = Translation(results[0]["text"], txt.original) + res = t.translatef() + response = pipe(res) + return response[0]["generated_text"] + elif txt.original != "en" and lan.replace("\\","").replace("'","") == "None": + txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder) + results = txt2txt.search(message) + transl = Translation(results[0]["text"], "en") + translation = transl.translatef() + response = pipe(translation) + t = Translation(response[0]["generated_text"], txt.original) + res = t.translatef() + return res + else: + txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder) + transl = Translation(message, lan.replace("\\","").replace("'","")) + message = transl.translatef() + results = txt2txt.search(message) + t = Translation(results[0]["text"], txt.original) + res = t.translatef() + response = pipe(res) + tr = Translation(response[0]["generated_text"], txt.original) + ress = tr.translatef() + return ress + +demo = gr.ChatInterface(fn=reply, title="everything-ai-retrievaltext") +demo.launch(server_name="0.0.0.0", share=False) \ No newline at end of file diff --git a/docker/select_and_run.py b/docker/select_and_run.py new file mode 100644 index 0000000..a399530 --- /dev/null +++ b/docker/select_and_run.py @@ -0,0 +1,58 @@ +import subprocess as sp +import gradio as gr + +TASK_TO_SCRIPT = {"retrieval-text-generation": "retrieval_text_generation.py", "agnostic-text-generation": "agnostic_text_generation.py", "text-summarization": "text_summarization.py", "image-generation": "image_generation.py", "image-generation-pollinations": "image_generation_pollinations.py", "image-classification": "image_classification.py", "image-to-text": "image_to_text.py"} + + +def build_command(tsk, mod="None", pdff="None", dirs="None", lan="None"): + if tsk != "retrieval-text-generation" and tsk != "image-generation-pollinations": + sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod}", shell=True) + return f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod}" + elif tsk == "retrieval-text-generation": + sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod} -pf '{pdff}' -d '{dirs}' -l '{lan}'", shell=True) + return f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod} -pf '{pdff}' -d '{dirs}' -l '{lan}'" + else: + sp.run(f"python3 {TASK_TO_SCRIPT[tsk]}", shell=True) + return f"python3 {TASK_TO_SCRIPT[tsk]}" + +demo = gr.Interface( + build_command, + [ + gr.Textbox( + label="Task", + info="Task you want your assistant to help you with", + lines=3, + value=f"Choose one of the following: {','.join(list(TASK_TO_SCRIPT.keys()))}; if you choose 'image-generation-pollinations', you do not need to 
specify anything else", + ), + gr.Textbox( + label="Model", + info="AI model you want your assistant to run with", + lines=3, + value="None", + ), + gr.Textbox( + label="PDF file(s)", + info="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated): only available with 'retrieval-text-generation'", + lines=3, + value="None", + ), + gr.Textbox( + label="Directory", + info="Directory where all your pdfs of interest are stored (only available with 'retrieval-text-generation')", + lines=3, + value="None", + ), + gr.Textbox( + label="Language", + info="Language of the written content contained in the pdfs", + lines=3, + value="None", + ), + ], + outputs="textbox", + theme=gr.themes.Base() +) +if __name__ == "__main__": + demo.launch(server_name="0.0.0.0", server_port=8760, share=False) + + \ No newline at end of file diff --git a/docker/text_summarization.py b/docker/text_summarization.py new file mode 100644 index 0000000..3124ecc --- /dev/null +++ b/docker/text_summarization.py @@ -0,0 +1,116 @@ +from transformers import pipeline +from argparse import ArgumentParser +from langchain.text_splitter import CharacterTextSplitter +from langchain_community.document_loaders import PyPDFLoader +from utils import merge_pdfs +import gradio as gr +import time + +histr = [[None, "Hi, I'm **everything-ai-summarization**🤖.\nI'm here to assist you and let you summarize _your_ texts and _your_ pdfs!\nCheck [my website](https://astrabert.github.io/everything-ai/) for troubleshooting and documentation reference\nHave fun!😊"]] + +argparse = ArgumentParser() +argparse.add_argument( + "-m", + "--model", + help="HuggingFace Model identifier, such as 'google/flan-t5-base'", + required=True, +) + +args = argparse.parse_args() + + +mod = args.model +mod = mod.replace("\"", "").replace("'", "") + +model_checkpoint = mod + +summarizer = pipeline("summarization", model=model_checkpoint) + +def convert_none_to_str(l: list): + newlist = [] + for i in range(len(l)): + if l[i] is None or type(l[i])==tuple: + newlist.append("") + else: + newlist.append(l[i]) + return tuple(newlist) + +def pdf2string(pdfpath): + loader = PyPDFLoader(pdfpath) + documents = loader.load() + + ### Split the documents into smaller chunks for processing + text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) + texts = text_splitter.split_documents(documents) + fulltext = "" + for text in texts: + fulltext += text.page_content+"\n\n\n" + return fulltext + +def add_message(history, message): + global histr + if history is not None: + if len(message["files"]) > 0: + history.append((message["files"], None)) + histr.append([message["files"], None]) + if message["text"] is not None and message["text"] != "": + history.append((message["text"], None)) + histr.append([message["text"], None]) + else: + history = histr + add_message(history, message) + return history, gr.MultimodalTextbox(value=None, interactive=False) + + +def bot(history): + global histr + if not history is None: + if type(history[-1][0]) != tuple: + text = history[-1][0] + response = summarizer(text, max_length=int(len(text.split(" "))*0.5), min_length=int(len(text.split(" "))*0.05), do_sample=False)[0] + response = response["summary_text"] + histr[-1][1] = response + history[-1][1] = "" + for character in response: + history[-1][1] += character + time.sleep(0.05) + yield history + if type(history[-1][0]) == tuple: + filelist = [] + for i in 
history[-1][0]: + filelist.append(i) + finalpdf = merge_pdfs(filelist) + text = pdf2string(finalpdf) + response = summarizer(text, max_length=int(len(text.split(" "))*0.5), min_length=int(len(text.split(" "))*0.05), do_sample=False)[0] + response = response["summary_text"] + histr[-1][1] = response + history[-1][1] = "" + for character in response: + history[-1][1] += character + time.sleep(0.05) + yield history + else: + history = histr + bot(history) + +with gr.Blocks() as demo: + chatbot = gr.Chatbot( + [[None, "Hi, I'm **everything-ai-summarization**🤖.\nI'm here to assist you and let you summarize _your_ texts and _your_ pdfs!\nCheck [my website](https://astrabert.github.io/everything-ai/) for troubleshooting and documentation reference\nHave fun!😊"]], + label="everything-rag", + elem_id="chatbot", + bubble_full_width=False, + ) + + chat_input = gr.MultimodalTextbox(interactive=True, file_types=["pdf"], placeholder="Enter message or upload file...", show_label=False) + + chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input]) + bot_msg = chat_msg.then(bot, chatbot, chatbot, api_name="bot_response") + bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input]) + + +demo.queue() + +if __name__ == "__main__": + demo.launch(server_name="0.0.0.0", share=False) + + \ No newline at end of file diff --git a/docker/utils.py b/docker/utils.py index add4351..29989ec 100644 --- a/docker/utils.py +++ b/docker/utils.py @@ -1,171 +1,94 @@ -from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline -import time -from langchain_community.llms import HuggingFacePipeline -from langchain.storage import LocalFileStore -from langchain.embeddings import CacheBackedEmbeddings -from langchain_community.vectorstores import Chroma -from langchain.text_splitter import CharacterTextSplitter -from langchain_community.document_loaders import PyPDFLoader -from langchain_community.embeddings import HuggingFaceEmbeddings -from langchain.chains import ConversationalRetrievalChain -import os -from pypdf import PdfMerger -from argparse import ArgumentParser - -argparse = ArgumentParser() -argparse.add_argument( - "-m", - "--model", - help="HuggingFace Model identifier, such as 'google/flan-t5-base'", - required=True, -) - -argparse.add_argument( - "-t", - "--task", - help="Task for the model: for now supported task are ['text-generation', 'text2text-generation']", - required=True, -) - -args = argparse.parse_args() - - -mod = args.model -tsk = args.task - -mod = mod.replace("\"", "").replace("'", "") -tsk = tsk.replace("\"", "").replace("'", "") - -TASK_TO_MODEL = {"text-generation": AutoModelForCausalLM, "text2text-generation": AutoModelForSeq2SeqLM} - -if tsk not in TASK_TO_MODEL: - raise Exception("Unsopported task! Supported task are ['text-generation', 'text2text-generation']") - -def merge_pdfs(pdfs: list): - merger = PdfMerger() - for pdf in pdfs: - merger.append(pdf) - merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf") - merger.close() - return f"{pdfs[-1].split('.')[0]}_results.pdf" - -def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None: - """ - Creates a persistent database from a PDF file. - - Args: - pdfpath (str): The path to the PDF file. - dbpath (str): The path to the storage folder for the persistent LocalDB. - cachepath (str): The path to the storage folder for the embeddings cache. 
- """ - print("Started the operation...") - a = time.time() - loader = PyPDFLoader(pdfpath) - documents = loader.load() - - ### Split the documents into smaller chunks for processing - text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) - texts = text_splitter.split_documents(documents) - - ### Use HuggingFace embeddings for transforming text into numerical vectors - ### This operation can take a while the first time but, once you created your local database with - ### cached embeddings, it should be a matter of seconds to load them! - embeddings = HuggingFaceEmbeddings() - store = LocalFileStore( - os.path.join( - cachepath, os.path.basename(pdfpath).split(".")[0] + "_cache" - ) - ) - cached_embeddings = CacheBackedEmbeddings.from_bytes_store( - underlying_embeddings=embeddings, - document_embedding_cache=store, - namespace=os.path.basename(pdfpath).split(".")[0], - ) - - b = time.time() - print( - f"Embeddings successfully created and stored at {os.path.join(cachepath, os.path.basename(pdfpath).split('.')[0]+'_cache')} under namespace: {os.path.basename(pdfpath).split('.')[0]}" - ) - print(f"To load and embed, it took: {b - a}") - - persist_directory = os.path.join( - dbpath, os.path.basename(pdfpath).split(".")[0] + "_localDB" - ) - vectordb = Chroma.from_documents( - documents=texts, - embedding=cached_embeddings, - persist_directory=persist_directory, - ) - c = time.time() - print( - f"Persistent database successfully created and stored at {os.path.join(dbpath, os.path.basename(pdfpath).split('.')[0] + '_localDB')}" - ) - print(f"To create a persistent database, it took: {c - b}") - return vectordb - -def convert_none_to_str(l: list): - newlist = [] - for i in range(len(l)): - if l[i] is None or type(l[i])==tuple: - newlist.append("") - else: - newlist.append(l[i]) - return tuple(newlist) - -def just_chatting( - task, - model, - tokenizer, - query, - vectordb, - chat_history=[] -): - """ - Implements a chat system using Hugging Face models and a persistent database. - - Args: - task (str): Task for the pipeline; for now supported task are ['text-generation', 'text2text-generation'] - model (AutoModelForCausalLM): Hugging Face model, already loaded and prepared. - tokenizer (AutoTokenizer): Hugging Face tokenizer, already loaded and prepared. - model_task (str): Task for the Hugging Face model. - persistent_db_dir (str): Directory for the persistent database. - embeddings_cache (str): Path to cache Hugging Face embeddings. - pdfpath (str): Path to the PDF file. - query (str): Question by the user - vectordb (ChromaDB): vectorstorer variable for retrieval. 
- chat_history (list): A list with previous questions and answers, serves as context; by default it is empty (it may make the model allucinate) - """ - ### Create a text-generation pipeline and connect it to a ConversationalRetrievalChain - pipe = pipeline(task, - model=model, - tokenizer=tokenizer, - max_new_tokens = 2048, - repetition_penalty = float(1.2), - ) - - local_llm = HuggingFacePipeline(pipeline=pipe) - llm_chain = ConversationalRetrievalChain.from_llm( - llm=local_llm, - chain_type="stuff", - retriever=vectordb.as_retriever(search_kwargs={"k": 1}), - return_source_documents=False, - ) - rst = llm_chain({"question": query, "chat_history": chat_history}) - return rst - - -try: - tokenizer = AutoTokenizer.from_pretrained( - mod, - ) - - - model = TASK_TO_MODEL[tsk].from_pretrained( - mod, - ) -except Exception as e: - import sys - print(f"The error {e} occured while handling model and tokenizer loading: please ensure that the model you provided was correct and suitable for the specified task. Be also sure that the HF repository for the loaded model contains all the necessary files.", file=sys.stderr) - sys.exit(1) - - +# f(x)s that now are useful for all the tasks +from langdetect import detect +from deep_translator import GoogleTranslator +from pypdf import PdfMerger +from qdrant_client import models +from langchain.text_splitter import CharacterTextSplitter +from langchain_community.document_loaders import PyPDFLoader +import os + +def remove_items(test_list, item): + res = [i for i in test_list if i != item] + return res + +def merge_pdfs(pdfs: list): + merger = PdfMerger() + for pdf in pdfs: + merger.append(pdf) + merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf") + merger.close() + return f"{pdfs[-1].split('.')[0]}_results.pdf" + +class NeuralSearcher: + def __init__(self, collection_name, client, model): + self.collection_name = collection_name + # Initialize encoder model + self.model = model + # initialize Qdrant client + self.qdrant_client = client + def search(self, text: str): + # Convert text query into vector + vector = self.model.encode(text).tolist() + + # Use `vector` for search for closest vectors in the collection + search_result = self.qdrant_client.search( + collection_name=self.collection_name, + query_vector=vector, + query_filter=None, # If you don't want any filters for now + limit=1, # 5 the most closest results is enough + ) + # `search_result` contains found vector ids with similarity scores along with the stored payload + # In this function you are interested in payload only + payloads = [hit.payload for hit in search_result] + return payloads + +class PDFdatabase: + def __init__(self, pdfs, encoder, client): + self.finalpdf = merge_pdfs(pdfs) + self.collection_name = os.path.basename(self.finalpdf).split(".")[0].lower() + self.encoder = encoder + self.client = client + def preprocess(self): + loader = PyPDFLoader(self.finalpdf) + documents = loader.load() + ### Split the documents into smaller chunks for processing + text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) + self.pages = text_splitter.split_documents(documents) + def collect_data(self): + self.documents = [] + for text in self.pages: + contents = text.page_content.split("\n") + contents = remove_items(contents, "") + for content in contents: + self.documents.append({"text": content, "source": text.metadata["source"], "page": str(text.metadata["page"])}) + def qdrant_collection_and_upload(self): + self.client.recreate_collection( + collection_name=self.collection_name, 
+ vectors_config=models.VectorParams( + size=self.encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model + distance=models.Distance.COSINE, + ), + ) + self.client.upload_points( + collection_name=self.collection_name, + points=[ + models.PointStruct( + id=idx, vector=self.encoder.encode(doc["text"]).tolist(), payload=doc + ) + for idx, doc in enumerate(self.documents) + ], + ) + +class Translation: + def __init__(self, text, destination): + self.text = text + self.destination = destination + try: + self.original = detect(self.text) + except Exception as e: + self.original = "auto" + def translatef(self): + translator = GoogleTranslator(source=self.original, target=self.destination) + translation = translator.translate(self.text) + return translation + diff --git a/imgs/chatbot.png b/imgs/chatbot.png new file mode 100644 index 0000000..e3cbcfb Binary files /dev/null and b/imgs/chatbot.png differ diff --git a/imgs/everything-ai.drawio.png b/imgs/everything-ai.drawio.png new file mode 100644 index 0000000..5282222 Binary files /dev/null and b/imgs/everything-ai.drawio.png differ diff --git a/imgs/select_and_run.png b/imgs/select_and_run.png new file mode 100644 index 0000000..7d9c71f Binary files /dev/null and b/imgs/select_and_run.png differ
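
For readers following the new multi-container flow end-to-end, here is a minimal sketch of how the pieces introduced in `docker/utils.py` and wired together in `docker/retrieval_text_generation.py` interact outside of the Gradio UI. It assumes a Qdrant instance reachable on `localhost:6333`; the PDF path and the model checkpoint are placeholders, not values taken from the repository.

```python
# Minimal sketch (not part of the diff): build the knowledge base and query it,
# mirroring docker/retrieval_text_generation.py. Assumptions: a Qdrant server is
# reachable on localhost:6333, "paper.pdf" is a placeholder path, and the model
# checkpoint is illustrative; swap in whatever your hardware supports.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from utils import PDFdatabase, NeuralSearcher  # classes added in docker/utils.py

client = QdrantClient(host="localhost", port=6333)  # the container uses host.docker.internal
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # same encoder as the app

# 1. Merge the PDFs, chunk them and upload the embedded chunks to a Qdrant collection
pdfdb = PDFdatabase(["paper.pdf"], encoder, client)
pdfdb.preprocess()                    # load the merged PDF and split it into ~1000-char chunks
pdfdb.collect_data()                  # flatten chunks into {"text", "source", "page"} payloads
pdfdb.qdrant_collection_and_upload()  # (re)create the collection and upload the points

# 2. Retrieve the closest chunk for a query (the app keeps only the top hit)
searcher = NeuralSearcher(pdfdb.collection_name, client, encoder)
hits = searcher.search("What is data science?")

# 3. Feed the retrieved text to the same kind of text-generation pipeline the app builds
model_id = "microsoft/phi-2"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=2048, repetition_penalty=1.2)

print(pipe(hits[0]["text"])[0]["generated_text"])
```

Note that, as in the script, the retrieved chunk itself is what gets passed to the generator; the user's query only drives the vector search.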
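The multilingual behaviour of the chat scripts comes from the small `Translation` wrapper in `docker/utils.py` (langdetect plus deep-translator). Below is a minimal sketch of the round-trip the `reply` functions perform; the French query and the English answer are illustrative strings only.

```python
# Minimal sketch (not part of the diff) of the language round-trip used by the chat UIs.
from utils import Translation  # wrapper added in docker/utils.py

user_message = "Qu'est-ce que la science des données ?"  # illustrative non-English query

to_en = Translation(user_message, "en")
print(to_en.original)               # detected source language, e.g. "fr"
english_query = to_en.translatef()  # GoogleTranslator(source="fr", target="en")

# ...the English query goes through the model; the answer is then translated back:
english_answer = "Data science is the practice of extracting insight from data."
back_to_user = Translation(english_answer, to_en.original).translatef()
print(back_to_user)
```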