
Commit

Updated Azure doc ingestion example to include RAG chatbot (#712)
devinbost authored Nov 13, 2023
1 parent 09b8ef3 commit 63b3c05
Showing 20 changed files with 469 additions and 57 deletions.
149 changes: 149 additions & 0 deletions examples/applications/azure-document-ingestion/.langstreamignore
@@ -0,0 +1,149 @@
# .langstreamignore file inspired by https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# These folders hold the libs built for the target
# and we need them in the package
!python/lib/
!java/lib/

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
Pipfile.lock

# poetry
poetry.lock

# pdm
pdm.lock
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
.idea/
95 changes: 69 additions & 26 deletions examples/applications/azure-document-ingestion/README.md
@@ -1,7 +1,13 @@
# Real time RAG with LangChain, LangStream, AstraDB, and Azure Blob Storage Ingestion

This sample application demonstrates how to:
- create a continuous PDF document ingestion pipeline using LangStream
- use Retrieval Augmented Generation (RAG) with LangChain to ask questions about the contents of the PDFs as they are ingested in real time.

It uses our full stack (LangStream, AstraDB, and LangChain) with Azure Blob Storage for PDF ingestion. (Although it uses Azure, you could easily swap out Azure for AWS S3 or GCP Cloud Storage.)

Here is what you will build:
![Langstream UI in action](images/langstream_ui_intro.png)
### Features:
- Create a Cassandra keyspace and table in AstraDB (if not already created).
- Detect new PDF files uploaded to the specified Azure blob storage container.
@@ -10,6 +16,7 @@ This sample application demonstrates how to create a continuous PDF document ing
- Extract information from the filename.
- Generate a vector embedding for the text contents.
- Write the results to the AstraDB table.
- Perform RAG queries in the UI.
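
To make the flow concrete, here is a rough Python sketch of what the ingestion half of that list does. It is only an illustration of the steps LangStream wires together from YAML, not the application's actual code: the `openai` calls use the 0.x-style SDK against Azure OpenAI, and `write_row` is a hypothetical stand-in for the INSERT into the AstraDB table.

```python
import io

import openai
from azure.storage.blob import ContainerClient
from pypdf import PdfReader

# Hypothetical stand-ins for the values kept in secrets.yaml.
openai.api_type = "azure"
openai.api_base = "https://mydomain.openai.azure.com"
openai.api_version = "2023-05-15"  # assumed; use the version your Azure OpenAI resource supports
openai.api_key = "mykey"

container = ContainerClient.from_connection_string("<connection-string>", "examplecontainer")


def write_row(filename: str, chunk: str, vector: list) -> None:
    """Placeholder for the INSERT into the AstraDB vector table."""
    print(f"{filename}: {len(vector)}-dim vector for a {len(chunk)}-char chunk")


for blob in container.list_blobs():
    if not blob.name.endswith(".pdf"):
        continue
    pdf_bytes = container.download_blob(blob.name).readall()
    text = "".join(page.extract_text() or "" for page in PdfReader(io.BytesIO(pdf_bytes)).pages)
    # Naive fixed-size chunking; LangStream's text splitter is more sophisticated.
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    for chunk in chunks:
        # text-embedding-ada-002 returns a 1536-dimensional vector.
        resp = openai.Embedding.create(engine="text-embedding-ada-002", input=chunk)
        write_row(blob.name, chunk, resp["data"][0]["embedding"])
    container.delete_blob(blob.name)  # files are deleted once successfully ingested
```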

## Setting Up AstraDB

@@ -21,20 +28,23 @@ This sample application demonstrates how to create a continuous PDF document ing
1. **Login to DataStax Astra:** Navigate to [DataStax Astra](https://astra.datastax.com/) and log in.

2. **Create a New Database:** Go to the `Databases` tab and click `Create Database`.

![Creating a database](images/create_db.png)
3. **Configure Your Database:** Provide the required information:
- Database Name
- Keyspace Name
- Cloud Provider (Choose **Azure** for this guide)
- Region

Be sure to select Vector Database as the type at the top.
![Creating a database part 2](images/create_db2.png)
4. **Database Initialization:** Wait for your database to be ready. Check status on the dashboard.

5. **Connection:** Once ready, click `Connect` for connection details.
![Connecting to the database](images/connecting_to_db.png)

6. **Token Generation:** Click "create a custom token" (or navigate to `Settings` -> `Tokens`) in the Astra DB console. Select the required permissions; for this tutorial, we will use the Database Administrator role.
Click `Generate Token` and save the generated credentials securely. You will not be able to retrieve them again.

Note: For a more detailed guide on creating your DB, check the [AstraDB setup guide](https://docs.datastax.com/en/astra-serverless/docs/manage/db/manage-create.html).

## Creating an Azure Blob Storage Container

@@ -67,12 +77,9 @@ az sshkey create --name azure --output-folder ~/.ssh
```
Replace `azure` with your desired key name. This command creates `azure` (private key) and `azure.pub` (public key) in the `~/.ssh` directory.

## Setting Up Secrets

### Update the secrets file:
Ensure you protect your secrets and never upload them to source control. Update the [secrets file](../../secrets/secrets.yaml) for this tutorial or the secrets section of the [Terraform script](deployment.tf) based on your deployment method.

### Sample secrets.yaml:
@@ -84,8 +91,7 @@ secrets:
      access-key: "mykey"
      url: "https://mydomain.openai.azure.com"
      embeddings-model: "text-embedding-ada-002"
      provider: "azure"
  - id: astra
    data:
      token: "AstraCS:mytoken"
@@ -98,6 +104,15 @@ secrets:
      storage-access-key: "examplekey"
      storage-account-name: "exampleaccount"
      container: "examplecontainer"
  - id: astra-langchain
    data:
      token: "AstraCS:mytoken"
      database-id: "astra-db-uuid"
      database: "exampledb"
      keyspace: "examplekeyspace"
      table: "exampletable"
      clientId: "exampleClientId"
      secret: "exampleSecret"
```

If using Terraform, update the [Terraform deployment script](deployment.tf) as needed.
@@ -106,14 +121,51 @@ If using Terraform, update the [Terraform deployment script](deployment.tf) as n

1. **Install LangStream:** Follow the [installation guide](https://github.com/LangStream/langstream#installation).

2. **Start the Pipeline:** Ensure your secrets file is ready.

Warning: LangStream performs a destructive operation when consuming files from your blob storage container. Make sure it is safe for the files in the storage container to be deleted after they are successfully ingested!

Then run:

```bash
cd examples/applications/azure-document-ingestion
langstream docker run test -app . -s ../../secrets/secrets.yaml
```
Wait for the UI to start and for the logs to reach the "Nothing found" state.
(Alternatively, if you have already put documents in your blob storage container, then you will see logs as LangStream processes the files.)
![LangStream UI](images/langstream_ui.png)
Any new PDFs added to the Azure blob storage container will be processed and saved in AstraDB, then deleted from the blob container.

### Deploy PDFs for LangStream
**Upload PDFs to Azure container:** For this tutorial, files are mapped to the target table using the convention `"${productName} ${productVersion}.pdf"` (e.g., "appleWatch 7.12-v4.pdf" sets productName to appleWatch and productVersion to 7.12-v4); a sketch of this mapping follows.
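
If you're curious how that mapping could be implemented, here is a minimal Python sketch. `parse_pdf_name` is a hypothetical helper for illustration, not the pipeline's actual code:

```python
import os

def parse_pdf_name(filename: str) -> tuple:
    """Split "<productName> <productVersion>[ extra details].pdf" into its two fields."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split(" ")
    return parts[0], parts[1]  # any further tokens (e.g., "public report") are ignored

assert parse_pdf_name("appleWatch 7.12-v4.pdf") == ("appleWatch", "7.12-v4")
assert parse_pdf_name("MyProduct DocVersion2 public report.pdf") == ("MyProduct", "DocVersion2")
```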

**Upload PDFs**
You can now upload PDFs to your Azure storage container to have them processed. As a reminder, this demo expects filenames that start with the product name and version, optionally followed by other details, like "MyProduct DocVersion2 public report.pdf".
![Upload PDFs](images/upload_pdfs.png)
Then, click Upload:
![Upload PDFs part 2](images/upload_pdfs2.png)
**Check results**
To inspect the results, navigate to the CQL Console in the Astra UI and run:
```sql
SELECT * FROM mykeyspace.mytable LIMIT 2;
```
where `mykeyspace` and `mytable` are the keyspace and table used in your secrets.yaml file.
![CQL Console image](images/cql_console.png)
You should now see vectorized data in your table!

## Test RAG
The moment of truth: executing a Retrieval Augmented Generation (RAG) query that leverages your PDF data.
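For context, the chatbot follows the standard RAG pattern: embed the question, retrieve the nearest chunks from AstraDB, and pass them to the LLM as grounding context. Below is a minimal LangChain sketch of that pattern using the `astra-langchain` values from secrets.yaml. It is illustrative only: the LangStream pipeline does the equivalent for you, and LangChain's `Cassandra` vector store manages its own table layout rather than the one created by assets.yaml.

```python
import cassio
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

# Connect cassio to Astra using the astra-langchain secrets.
cassio.init(token="AstraCS:mytoken", database_id="astra-db-uuid")

store = Cassandra(
    embedding=OpenAIEmbeddings(),  # must match the model used at ingestion time
    session=None,                  # cassio.init() supplies the session
    keyspace="examplekeyspace",
    table_name="exampletable",
)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",                                    # stuff retrieved chunks into the prompt
    retriever=store.as_retriever(search_kwargs={"k": 4}),  # top-4 nearest chunks
)
print(qa.run("What battery life does appleWatch 7.12-v4 claim?"))
```
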
### Connect via LangStream UI
Return to the LangStream UI and click Connect.
![Connecting to Langstream](images/connecting_to_langstream.png)
### Ask a question
Ask a question that can only be answered by the PDF(s) you uploaded, then press Enter.
![Question and answer](images/question_answer.png)

## View flow (for fun)
If you click the App tab, you can see a visual representation of your pipeline. Pretty cool, huh?
![Pipeline flow](images/pipeline_flow.png)

### Run Terraform to deploy dependencies
1. Review [this Terraform deployment script](deployment.tf).
@@ -144,7 +196,7 @@ terraform apply -auto-approve
6. **Connect to the VM as needed to check that the service is running as expected**
```bash
chmod 600 /Users/`whoami`/.ssh/azure.pem
ssh -i ~/.ssh/azure.pem adminuser@40.83.57.12 # substitute with the actual VM IP
```
7. **Verify the service is running**
@@ -158,13 +210,4 @@
```bash
runuser -l adminuser -c '/home/adminuser/.langstream/candidates/current/bin/lang
```
Additionally, you can check the cloud init output log on the VM to ensure the Terraform deployment ran correctly
```bash
cat /var/log/cloud-init-output.log
```
47 changes: 47 additions & 0 deletions examples/applications/azure-document-ingestion/assets.yaml
@@ -0,0 +1,47 @@
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

assets:
  - name: "langstream-keyspace"
    asset-type: "astra-keyspace"
    creation-mode: create-if-not-exists
    config:
      keyspace: "{{secrets.astra.keyspace}}"
      datasource: "AstraDatasource"
  - name: "langstream-docs-table"
    asset-type: "cassandra-table"
    creation-mode: create-if-not-exists
    config:
      table-name: "{{secrets.astra.table}}"
      keyspace: "{{secrets.astra.keyspace}}"
      datasource: "AstraDatasource"
      create-statements:
        - |
          CREATE TABLE IF NOT EXISTS "{{secrets.astra.keyspace}}"."{{secrets.astra.table}}" (
            row_id TEXT PRIMARY KEY,
            filename TEXT,
            chunk_text_length TEXT,
            chunk_num_tokens TEXT,
            chunk_id TEXT,
            attributes_blob TEXT,
            body_blob TEXT,
            metadata_s MAP<TEXT, TEXT>,
            name TEXT,
            product_name TEXT,
            product_version TEXT,
            vector VECTOR<FLOAT, 1536>);
        - |
          CREATE CUSTOM INDEX IF NOT EXISTS {{secrets.astra.table}}_ann_index ON {{secrets.astra.keyspace}}.{{secrets.astra.table}}(vector) USING 'StorageAttachedIndex';
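
The `StorageAttachedIndex` on the `vector` column is what enables approximate-nearest-neighbor retrieval over the embeddings. As a rough illustration of the kind of query this index supports, here is a Python sketch using the Cassandra driver; the connection setup and query vector are placeholders, and on Astra you would connect with a secure connect bundle and credentials rather than a bare `Cluster()`:

```python
from cassandra.cluster import Cluster

# A hypothetical query vector; in practice this comes from the same
# text-embedding-ada-002 model used at ingestion time (1536 floats).
query_vector = [0.0] * 1536

# Placeholder connection; Astra requires cloud={"secure_connect_bundle": ...}
# plus an auth provider instead of a bare Cluster().
session = Cluster().connect()

# ORDER BY ... ANN OF is the CQL vector-search syntax the SAI index enables.
rows = session.execute(
    "SELECT filename, body_blob FROM examplekeyspace.exampletable "
    "ORDER BY vector ANN OF %s LIMIT 3",
    [query_vector],
)
for row in rows:
    print(row.filename, row.body_blob[:80])
```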
