From 67aceb2a384a4bb395311de62236db75a0332c38 Mon Sep 17 00:00:00 2001
From: Michael Hunger
Date: Fri, 12 Jul 2024 22:47:28 +0200
Subject: [PATCH] llm-builder-update v0.3 (#93)

Covering New features

* different RAG modes
* graph quality configuration
* webpages as sources
* more LLM models supported (not in public deployment, but for customers)
* better graph visualization (lexical + entity graph) + local filtering
* lots of more details
---
 .../pages/llm-graph-builder-deployment.adoc | 123 ++++++++++++++----
 .../pages/llm-graph-builder-features.adoc | 92 +++++++++++--
 .../pages/llm-graph-builder.adoc | 34 +++--
 3 files changed, 201 insertions(+), 48 deletions(-)

diff --git a/modules/genai-ecosystem/pages/llm-graph-builder-deployment.adoc b/modules/genai-ecosystem/pages/llm-graph-builder-deployment.adoc
index e55a0b5..dea4a5e 100644
--- a/modules/genai-ecosystem/pages/llm-graph-builder-deployment.adoc
+++ b/modules/genai-ecosystem/pages/llm-graph-builder-deployment.adoc
@@ -11,13 +11,11 @@ include::_graphacademy_llm.adoc[]
 
 == Prerequisites
 
-You will need to have a Neo4j Database V5.15 or later with https://neo4j.com/docs/apoc/current/installation/[APOC installed^] to use this Knowledge Graph Builder.
-You can use any https://neo4j.com/aura/[Neo4j Aura database^] (including the free tier database). Neo4j Aura automatically includes APOC and run on the latest Neo4j version, making it a great choice to get started quickly. You can also use the free trial in https://sandbox.neo4j.com[Neo4j Sandbox^], which also includes Graph Data Science.
+You will need to have a Neo4j Database V5.18 or later with https://neo4j.com/docs/apoc/current/installation/[APOC installed^] to use this Knowledge Graph Builder.
+You can use any https://neo4j.com/aura/[Neo4j Aura database^] (including the free tier database). Neo4j Aura automatically includes APOC and runs on the latest Neo4j version, making it a great choice to get started quickly.
 
-[CAUTION]
-====
-If want to use https://neo4j.com/download/[Neo4j Desktop^] instead, you will not be able to use the docker-compose deployment method. You will have to follow the link:/labs/genai-ecosystem/llm-graph-builder-deployment/#dev-deployment[separate deployment of backend and frontend section].
-====
+You can also use the free trial in https://sandbox.neo4j.com[Neo4j Sandbox^], which also includes Graph Data Science.
+If you want to use https://neo4j.com/product/developer-tools/#desktop[Neo4j Desktop^] instead, you need to configure your `NEO4J_URI=bolt://host.docker.internal` to allow the Docker container to access the network running on your computer.
 
 == Docker-compose
 
@@ -46,6 +44,49 @@ You can then run Docker Compose to build and start all components:
 docker-compose up --build
 ```
 
+=== Configuring LLM Models
+
+You can configure the following LLM models besides the ones supported out of the box:
+
+* OpenAI GPT 3.5 and 4o (default)
+* VertexAI (Gemini 1.0) (default)
+* VertexAI (Gemini 1.5)
+* Diffbot
+* Bedrock models
+* Anthropic
+* OpenAI API compatible models like Ollama, Groq, Fireworks
+
+To achieve that you need to set a number of environment variables:
+
+In your `.env` file, add the following lines. You can of course also add other model configurations from these providers or any OpenAI API compatible provider.
+
+[source,env]
+====
+LLM_MODEL_CONFIG_azure_ai_gpt_35="gpt-35,https://.openai.azure.com/,,"
+LLM_MODEL_CONFIG_anthropic_claude_35_sonnet="claude-3-5-sonnet-20240620,"
+LLM_MODEL_CONFIG_fireworks_llama_v3_70b="accounts/fireworks/models/llama-v3-70b-instruct,"
+LLM_MODEL_CONFIG_bedrock_claude_35_sonnet="anthropic.claude-3-sonnet-20240229-v1:0,,"
+LLM_MODEL_CONFIG_ollama_llama3="llama3,http://host.docker.internal:11434"
+LLM_MODEL_CONFIG_fireworks_qwen_72b="accounts/fireworks/models/qwen2-72b-instruct,"
+
+# Optional Frontend config
+LLM_MODELS="diffbot,gpt-3.5,gpt-4o,azure_ai_gpt_35,azure_ai_gpt_4o,groq_llama3_70b,anthropic_claude_35_sonnet,fireworks_llama_v3_70b,bedrock_claude_35_sonnet,ollama_llama3,fireworks_qwen_72b"
+====
+
+In your `docker-compose.yml` you need to pass the variables through:
+
+[source,yaml]
+====
+- LLM_MODEL_CONFIG_anthropic_claude_35_sonnet=${LLM_MODEL_CONFIG_anthropic_claude_35_sonnet-}
+- LLM_MODEL_CONFIG_fireworks_llama_v3_70b=${LLM_MODEL_CONFIG_fireworks_llama_v3_70b-}
+- LLM_MODEL_CONFIG_azure_ai_gpt_4o=${LLM_MODEL_CONFIG_azure_ai_gpt_4o-}
+- LLM_MODEL_CONFIG_azure_ai_gpt_35=${LLM_MODEL_CONFIG_azure_ai_gpt_35-}
+- LLM_MODEL_CONFIG_groq_llama3_70b=${LLM_MODEL_CONFIG_groq_llama3_70b-}
+- LLM_MODEL_CONFIG_bedrock_claude_35_sonnet=${LLM_MODEL_CONFIG_bedrock_claude_35_sonnet-}
+- LLM_MODEL_CONFIG_fireworks_qwen_72b=${LLM_MODEL_CONFIG_fireworks_qwen_72b-}
+- LLM_MODEL_CONFIG_ollama_llama3=${LLM_MODEL_CONFIG_ollama_llama3-}
+====
+
 === Additional configs
 
 By default, the input sources will be: Local files, Youtube, Wikipedia and AWS S3.
@@ -92,32 +133,68 @@ uvicorn score:app --reload
 
 == ENV
 
+=== Processing Configuration
+
 [options="header",cols="m,a,m,a"]
 |===
 | Env Variable Name | Mandatory/Optional | Default Value | Description
-| OPENAI_API_KEY | Optional | sk-... | API key for OpenAI (if enabled)
-| DIFFBOT_API_KEY | Optional | | API key for Diffbot (if enabled)
-| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2 , openai , vertexai)
-| IS_EMBEDDING | Optional | true | Flag to enable text embedding
+| IS_EMBEDDING | Optional | true | Flag to enable text embedding for chunks
+| ENTITY_EMBEDDING | Optional | false | Flag to enable entity embedding (id and description)
 | KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm for connecting similar Chunks
-| GEMINI_ENABLED | Optional | False | Flag to enable Gemini
-| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs
 | NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 6 | Number of chunks to combine when extracting entities
 | UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before writing to the database and updating progress
-| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database
-| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database
-| NEO4J_PASSWORD | Optional | password | Password for Neo4j database
-| LANGCHAIN_API_KEY | Optional | | API key for LangSmith
-| LANGCHAIN_PROJECT | Optional | | Project for LangSmith
-| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable LangSmith tracing
-| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for LangSmith API
-| BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API
-| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph | URL for Bloom visualization
-| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available
-| LLM_MODELS | Optional | diffbot,gpt-3.5,gpt-4o | Models available for selection on the frontend, used for entities extraction and Q&A Chatbot (other models: `gemini-1.0-pro, gemini-1.5-pro`)
 | ENV | Optional | DEV | Environment variable for the app
 | TIME_PER_CHUNK | Optional | 4 | Time per chunk for processing
 | CHUNK_SIZE | Optional | 5242880 | Size of each chunk for processing
+|===
+
+=== Front-End Configuration
+
+[options="header",cols="m,a,m,a"]
+|===
+| Env Variable Name | Mandatory/Optional | Default Value | Description
+| BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API
+| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available
+| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph | URL for Bloom visualization
+|===
+
+
+=== GCP Cloud Integration
+
+[options="header",cols="m,a,m,a"]
+|===
+| Env Variable Name | Mandatory/Optional | Default Value | Description
+| GEMINI_ENABLED | Optional | False | Flag to enable Gemini
+| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs
 | GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication for GCS upload
 | GCS_FILE_CACHE | Optional | False | If set to True, will save the files to process into GCS. If set to False, will save the files locally
 |===
+
+=== LLM Model Configuration
+
+[options="header",cols="m,a,m,a"]
+|===
+| Env Variable Name | Mandatory/Optional | Default Value | Description
+| LLM_MODELS | Optional | diffbot,gpt-3.5,gpt-4o | Models available for selection on the frontend, used for entity extraction and Q&A Chatbot (other models: `gemini-1.0-pro, gemini-1.5-pro`)
+| OPENAI_API_KEY | Optional | sk-... | API key for OpenAI (if enabled)
+| DIFFBOT_API_KEY | Optional | | API key for Diffbot (if enabled)
+| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2, openai, vertexai)
+| GROQ_API_KEY | Optional | | API key for Groq
+| GEMINI_ENABLED | Optional | False | Flag to enable Gemini
+| LLM_MODEL_CONFIG_=",,," | Optional | | Configuration for additional LLM models
+|===
+
+
+=== LangChain and Neo4j Configuration
+
+[options="header",cols="m,a,m,a"]
+|===
+| Env Variable Name | Mandatory/Optional | Default Value | Description
+| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database for the backend to connect to
+| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database for the backend to connect to
+| NEO4J_PASSWORD | Optional | password | Password for Neo4j database for the backend to connect to
+| LANGCHAIN_API_KEY | Optional | | API key for LangSmith
+| LANGCHAIN_PROJECT | Optional | | Project for LangSmith
+| LANGCHAIN_TRACING_V2 | Optional | false | Flag to enable LangSmith tracing
+| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for LangSmith API
+|===
diff --git a/modules/genai-ecosystem/pages/llm-graph-builder-features.adoc b/modules/genai-ecosystem/pages/llm-graph-builder-features.adoc
index 29b9ed9..644bebc 100644
--- a/modules/genai-ecosystem/pages/llm-graph-builder-features.adoc
+++ b/modules/genai-ecosystem/pages/llm-graph-builder-features.adoc
@@ -12,7 +12,9 @@ include::_graphacademy_llm.adoc[]
 == Sources
 
 === Local file upload
-You can drag & drop files into the first input zone on the left. The application will store the uploaded sources as Document nodes in the graph using LangChain Loaders.
+
+You can drag & drop files into the first input zone on the left. The application will store the uploaded sources as Document nodes in the graph using LangChain Loaders (PDFLoader and Unstructured Loader).
+
 |===
 | File Type | Supported Extensions
 
@@ -22,35 +24,103 @@ You can drag & drop files into the first input zone on the left. The application
 | Text | .html, .txt, .md
 |===
 
-=== Youtube
-The second input will let you copy/paste the link of a YouTube video you want to use. The application will parse and store the uploaded YouTube videos (transcript) as a Document nodes in the graph using YouTube parsers.
+=== Web Links
+
+The second input zone handles web links.
+
+* YouTube transcripts
+* Wikipedia pages
+* Web Pages
+
+The application will parse and store the uploaded YouTube videos (transcript) as Document nodes in the graph using YouTube parsers.
 
-=== Wikipedia
-The third input takes a Wikipedia page URL as input. For example, you can provide `https://en.wikipedia.org/wiki/Neo4j` and it will load the Neo4j Wikipedia page.
+For Wikipedia links we use the Wikipedia Loader. For example, you can provide `https://en.wikipedia.org/wiki/Neo4j` and it will load the Neo4j Wikipedia page.
+
+For web pages, we use the Unstructured Loader. For example, you can provide articles from `https://theguardian.com/` and it will load the article content.
+
+== Cloud Storage
 
 === AWS S3
+
 This AWS S3 integration allows you to connect to an S3 bucket and load the files from there. You will need to provide your AWS credentials and the bucket name.
 
 === Google Cloud Storage
-This Google Cloud Storage integration allows you to connect to a GCS bucket and load the files from there. You will need to provide your Google Cloud Project ID and the bucket name.
+
+This Google Cloud Storage integration allows you to connect to a GCS bucket and load the files from there. You will have to provide your GCS bucket name and optionally a folder, and follow an auth flow to give the application access to the bucket.
 
 == LLM Models
-The application uses ML models (LLMs: OpenAI, Gemini, Diffbot) to transform PDFs, web pages, and YouTube videos into a knowledge graph of entities and their relationships. ENV variables can be set to enable/disable specific models.
-== Graph Schema
+The application uses ML models to transform PDFs, web pages, and YouTube video transcripts into a knowledge graph of entities and their relationships. ENV variables can be set to configure/enable/disable specific models.
+
+The following models are configured (but only the first 3 are available in the publicly hosted version):
+
+* OpenAI GPT 3.5 and 4o
+* VertexAI (Gemini 1.0)
+* Diffbot
+* Bedrock
+* Anthropic
+* OpenAI API compatible models like Ollama, Groq, Fireworks
+
+The selected LLM model will be used both for processing the newly uploaded files and for powering the chatbot. Please note that the models have different capabilities, so they will not work equally well, especially for extraction.
+
+== Graph Enhancements
+
+=== Graph Schema
+
 image::llm-graph-builder-taxonomy.png[width=600, align=center]
-If you want to use a pre-defined or your own graph schema, you can click on the setting icon in the top right corner and either select a pre-defined schema from the dropdown, use your own by writing down the node labels and relationships, pull the existing schema from an existing Neo4j database (`Use Existing Schema`), or copy/paste a text and ask the LLM to analyze it and come up with a suggested schema (`Get Schema From Text`).
+
+If you want to use a pre-defined or your own graph schema, you can do so in the Graph Enhancements popup. This popup is also shown the first time you construct a graph, and the state of the model configuration is listed below the connection information.
+
+ You can either:
+ * select a pre-defined schema from the dropdown on top,
+ * use your own by entering the node labels and relationships,
+ * fetch the existing schema from an existing Neo4j database (`Use Existing Schema`),
+ * or copy/paste a text or schema description (also works with RDF ontologies or Cypher/GQL schema) and ask the LLM to analyze it and come up with a suggested schema (`Get Schema From Text`).
+
+=== Delete Disconnected Nodes
+
+When extracting entities, it can happen that after the extraction a number of nodes are only connected to text chunks but not to other entities.
+This results in disconnected entities in the entity graph.
+
+While they can hold relevant information for question answering, they might affect your downstream usage. So in this view you can select which of the entities that are only connected to text chunks should be deleted.
+
+////
+=== Merging Duplicate Entities
+
+While the prompt instructs the LLM to extract unique identifiers for entities, the same entity can end up with different spellings across chunks and documents, creating duplicates in the graph.
+
+Here we use a mixture of entity embedding, edit distance and substring containment to generate a list of potentially duplicate entities that can be merged.
+
+You can select which sets of entities should be merged and exclude certain entities from the merge.
+////
 
 == Chatbot
 === How it works
-When the user asks a question, we use the Neo4j Vector Index with a Retrieval Query to find the related chunks and entities connected together, up to a depth of 2 hops. We also summarize the chat history and use it as an element to enrich the context.
-The various inputs and sources (the question, vector results, chat history) are all sent to the selected LLM model in a custom prompt, asking to provide and format a response to the question asked based on the elements and context provided. Of course, there is more magic to the prompt such as formatting, asking to cite sources, not speculating if the answer is not known, etc. The full prompt and instructions can be found as FINAL_PROMPT in QA_integration.py.
+When the user asks a question, we use the configured RAG mode to answer it with the data from the graph of extracted documents. That can mean the question is turned into an embedding or a graph query or a more advanced RAG approach.
+
+We also summarize the chat history and use it as an element to enrich the context.
 
 === Features
+
+- *Select RAG Mode:* You can select vector-only or GraphRAG (vector+graph) modes.
 - *Clear chat:* Will delete the current session's chat history.
 - *Expand view:* Will open the chatbot interface in a fullscreen mode.
 - *Details:* Will open a Retrieval information pop-up showing details on how the RAG agent collected and used sources (documents), chunks, and entities. Also provides information on the model used and the token consumption.
 - *Copy:* Will copy the content of the response to the clipboard.
 - *Text-To-Speech:* Will read out loud the content of the response.
+
+=== GraphRAG
+
+For GraphRAG we use the Neo4j Vector Index (and a fulltext index for hybrid search) with a Retrieval Query to find the most relevant chunks and the entities connected to them, and then follow the entity relationships up to a depth of 2 hops.
+
+=== Vector Only RAG
+
+For Vector only RAG we only use the vector and fulltext index (hybrid) search results and don't include additional information from the entity graph.
+
+=== Answer Generation
+
+The various inputs and determined sources (the question, vector results, entities (name + description), relationship pairs, chat history) are all sent to the selected LLM model as context information in a custom prompt, asking it to provide and format a response to the question asked based on the elements and context provided.
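As an illustration of the answer-generation step, here is a minimal sketch of how the question, retrieved chunks, entities, relationship pairs, and chat summary could be combined into one prompt. It is not the project's actual prompt or code: the names `RetrievedContext` and `build_prompt`, the section labels, and the instruction wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RetrievedContext:
    """Hypothetical container for what a retriever returned."""
    chunks: List[str] = field(default_factory=list)
    # (name, description) pairs
    entities: List[Tuple[str, str]] = field(default_factory=list)
    # (source, relationship type, target) triples
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)


def build_prompt(question: str, context: RetrievedContext, chat_summary: str = "") -> str:
    """Join instructions, optional context sections, and the question into one string."""
    sections = [
        "Answer the question using only the context below. "
        "Cite the sources you used and say so if the answer is not known."
    ]
    # Each section is added only when there is something to show.
    if chat_summary:
        sections.append("Chat history summary:\n" + chat_summary)
    if context.chunks:
        sections.append("Chunks:\n" + "\n".join(f"- {c}" for c in context.chunks))
    if context.entities:
        sections.append(
            "Entities:\n" + "\n".join(f"- {name}: {desc}" for name, desc in context.entities)
        )
    if context.relationships:
        sections.append(
            "Relationships:\n"
            + "\n".join(f"- ({s})-[:{t}]->({o})" for s, t, o in context.relationships)
        )
    sections.append("Question: " + question)
    return "\n\n".join(sections)
```

In vector-only mode the same function would simply receive a `RetrievedContext` with empty `entities` and `relationships`, so those sections are omitted from the prompt.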
+
+Of course, there is more magic to the prompt such as formatting, asking to cite sources, not speculating if the answer is not known, etc. The full prompt and instructions can be found in the https://github.com/neo4j-labs/llm-graph-builder[GitHub repository^].
diff --git a/modules/genai-ecosystem/pages/llm-graph-builder.adoc b/modules/genai-ecosystem/pages/llm-graph-builder.adoc
index 02806b7..e1079be 100644
--- a/modules/genai-ecosystem/pages/llm-graph-builder.adoc
+++ b/modules/genai-ecosystem/pages/llm-graph-builder.adoc
@@ -1,4 +1,5 @@
-= Neo4j LLM Knowledge Graph Builder - Extract Nodes and Relationships from Unstructured Text (PDF, YouTube, Webpages)
+= Neo4j LLM Knowledge Graph Builder - Extract Nodes and Relationships from Unstructured Text +
+(PDF, Documents, YouTube, Webpages)
 include::_graphacademy_llm.adoc[]
 :slug: llm-graph-builder
 :author: Michael Hunger, Tomaz Bratanic, Persistent
@@ -9,36 +10,41 @@ include::_graphacademy_llm.adoc[]
 :page-product: llm-graph-builder
 :imagesdir: https://dev.assets.neo4j.com/wp-content/uploads/2024/
 
-image::llm-graph-builder.png[width=600, align=center]
+// image::llm-graph-builder.png[width=600, align=center]
+image::https://dist.neo4j.com/wp-content/uploads/20240618104511/build-kg-genai-e1718732751482.png[width=800, align=center]
 
 The Neo4j LLM Knowledge Graph Builder is an https://llm-graph-builder.neo4jlabs.com/[online application^] for turning unstructured text into a knowledge graph, it provides a magical text to graph experience.
 
-It uses ML models (LLM - OpenAI, Gemini, Llama3, Diffbot) to transform PDFs, web pages, and YouTube videos into a graph of entities and their relationships, which it stores in your Neo4j database.
+It uses ML models (LLM - OpenAI, Gemini, Llama3, Diffbot, Claude, Qwen) to transform PDFs, documents, images, web pages, and YouTube video transcripts.
+The extraction turns them into a lexical graph of documents and chunks (with embeddings) and an entity graph with nodes and their relationships, which are both stored in your Neo4j database.
+You can configure the extraction schema and apply clean-up operations after the extraction.
+
+Afterwards you can use different RAG approaches (GraphRAG, Vector, Text2Cypher) to ask questions of your data and see how the extracted data is used to construct the answers.
 
 [NOTE]
 ====
 * best results for files with long-form text in English
-* not suited for tabular data like Excel or CSV or images/diagrams/slides
-* higher quality data extraction if you configure types for nodes and relationships in the settings (icon:gear[])
+* less well suited for tabular data like Excel or CSV or images/diagrams/slides
+* higher quality data extraction if you configure the graph schema for nodes and relationship types
 ====
 
-The front-end is a React Application and the back-end a Python FastAPI application.
-It uses the https://python.langchain.com/docs/use_cases/graph/constructing[llm-graph-transformer module^] that Neo4j contributed to LangChain.
+The front-end is a React Application and the back-end a Python FastAPI application running on Google Cloud Run, but you can deploy it locally using Docker Compose.
+It uses the https://python.langchain.com/docs/use_cases/graph/constructing[llm-graph-transformer module^] that Neo4j contributed to LangChain and other LangChain integrations (e.g. for GraphRAG search).
 
 Here is a quick demo:
 
 ++++
-
+
 ++++
 
-== Functionality Includes
+== Step by Step Instructions
 
 1. Open the https://llm-graph-builder.neo4jlabs.com/[LLM-Knowledge Graph Builder^]
 2. Connect to a https://console.neo4j.io[Neo4j (Aura)^] instance
-3. Provide your PDF files, Youtube URLs, Wikipedia Keywords or S3/GCS buckets
-4. Construct Graph with LLM / Diffbot
+3. Provide your PDFs, Documents, URLs or S3/GCS buckets
+4. Construct Graph with the selected LLM
 5. Visualize Knowledge Graph in App
-6. Chat with your data with GraphRAG
+6. Chat with your data using GraphRAG
 7. Open Neo4j Bloom for further visual exploration
 8. Use the constructed knowledge graph in your applications
 
@@ -62,10 +68,10 @@ image::llm-graph-builder-viz.png[width=600, align=center]
 5. Highly similar Chunks are connected with a `SIMILAR` relationship to form a kNN Graph
 6. Embeddings are computed and stored in the Chunks and Vector index
 7. Using the llm-graph-transformer or diffbot-graph-transformer entities and relationships are extracted from the text
-8. Entities are stored in the graph and connected to the originating Chunks
-
+8. Entities and relationships are stored in the graph and connected to the originating Chunks
 
 // TODO architecture diagram
+image::https://dist.neo4j.com/wp-content/uploads/20240618104514/retrieval-information-e1718732797663.png[width=800, align=center]
 
 == Relevant Links
 [cols="1,4"]