Updated Azure doc ingestion example to include RAG chatbot #712

Merged 13 commits on Nov 13, 2023
149 changes: 149 additions & 0 deletions examples/applications/azure-document-ingestion/.langstreamignore
@@ -0,0 +1,149 @@
# .langstreamignore file inspired by https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# These folders hold the libs built for the target
# and we need them in the package
!python/lib/
!java/lib/

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
Pipfile.lock

# poetry
poetry.lock

# pdm
pdm.lock
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
.idea/
95 changes: 69 additions & 26 deletions examples/applications/azure-document-ingestion/README.md
@@ -1,7 +1,13 @@
## Azure Blob Storage Ingestion
# Real-time RAG with LangChain, LangStream, AstraDB, and Azure Blob Storage Ingestion

This sample application demonstrates how to create a continuous PDF document ingestion pipeline using an Azure blob storage container.
This sample application demonstrates how to:
- create a continuous PDF document ingestion pipeline using LangStream
- use Retrieval Augmented Generation (RAG) with LangChain to query the contents of the PDFs as they are ingested in real time.

It uses our full stack (LangStream, AstraDB, and LangChain) with Azure Blob Storage for PDF ingestion. (Although it uses Azure, you could easily swap out Azure for AWS S3 or GCP Cloud Storage.)

Here is what you will build:
![Langstream UI in action](images/langstream_ui_intro.png)
### Features:
- Create a Cassandra keyspace and table in AstraDB (if not already created).
- Detect new PDF files uploaded to the specified Azure blob storage container.
@@ -10,6 +16,7 @@ This sample application demonstrates how to create a continuous PDF document ing
- Extract information from the filename.
- Generate a vector embedding for the text contents.
- Write the results to the AstraDB table.
- Perform RAG queries in the UI

## Setting Up AstraDB

@@ -21,20 +28,23 @@ This sample application demonstrates how to create a continuous PDF document ing
1. **Login to DataStax Astra:** Navigate to [DataStax Astra](https://astra.datastax.com/) and log in.

2. **Create a New Database:** Go to the `Databases` tab and click `Create Database`.

3. **Configure Your Database:** Provide details like:
![Creating a database](images/create_db.png)
3. **Configure Your Database:** Provide the required information:
- Database Name
- Keyspace Name
- Cloud Provider (Choose **Azure** for this guide)
- Region

Be sure to select Vector Database as the type at the top.
![Creating a database part 2](images/create_db2.png)
4. **Database Initialization:** Wait for your database to be ready. Check status on the dashboard.

5. **Connection:** Once ready, click `Connect` for connection details.
![Connecting to the database](images/connecting_to_db.png)

6. **Token Generation:** Navigate to `Settings` -> `Tokens` in the Astra DB console. Click `Generate Token` and save the generated token securely.
6. **Token Generation:** Click "create a custom token" (or navigate to `Settings` -> `Tokens`) in the Astra DB console. Select the required permissions. For this tutorial, we will use Database Administrator as the role.
Click `Generate Token`. Save the generated credentials securely. You will not be able to retrieve them again.

For a more detailed guide, check the [AstraDB setup guide](https://docs.datastax.com/en/astra-serverless/docs/manage/db/manage-create.html).
Note: For a more detailed guide on creating your DB, check the [AstraDB setup guide](https://docs.datastax.com/en/astra-serverless/docs/manage/db/manage-create.html).

## Creating an Azure Blob Storage Container

@@ -67,12 +77,9 @@ az sshkey create --name azure --output-folder ~/.ssh
```
Replace `azure` with your desired key name. This command creates `azure` (private key) and `azure.pub` (public key) in the `~/.ssh` directory.

### Deploy PDFs for LangStream
**Upload PDFs to Azure container:** Ensure files follow the convention `"${productName} ${productVersion}.pdf"` (e.g., "appleWatch 7.12-v4.pdf").

## Setting Up Secrets

### Protect Your Secrets:
### Update Secrets file:
Ensure you protect your secrets and never upload them to source control. Update the [secrets file](../../secrets/secrets.yaml) for this tutorial or the secrets section of the [Terraform script](deployment.tf) based on your deployment method.

### Sample secrets.yaml:
@@ -84,8 +91,7 @@ secrets:
      access-key: "mykey"
      url: "https://mydomain.openai.azure.com"
      embeddings-model: "text-embedding-ada-002"
      version: "2023-03-15-preview"
      llm: "gpt-4"
      provider: "azure"
  - id: astra
    data:
      token: "AstraCS:mytoken"
@@ -98,6 +104,15 @@ secrets:
      storage-access-key: "examplekey"
      storage-account-name: "exampleaccount"
      container: "examplecontainer"
  - id: astra-langchain
    data:
      token: "AstraCS:mytoken"
      database-id: "astra-db-uuid"
      database: "exampledb"
      keyspace: "examplekeyspace"
      table: "exampletable"
      clientId: "exampleClientId"
      secret: "exampleSecret"
```
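
Before starting the pipeline, it can help to confirm that a secret entry carries the keys this sample expects. The sketch below is a standalone check written for this guide and is not part of LangStream. It assumes the secrets file has already been parsed into a Python dictionary (for example with a YAML library), and the names `REQUIRED_KEYS` and `missing_secret_keys` are hypothetical. Only the `astra-langchain` entry is covered, because its keys are shown in full in the sample above.

```python
# Hypothetical helper for this guide; not a LangStream API.
# Keys are taken from the astra-langchain entry in the sample secrets.yaml.
REQUIRED_KEYS = {
    "astra-langchain": {
        "token", "database-id", "database", "keyspace",
        "table", "clientId", "secret",
    },
}


def missing_secret_keys(secrets_doc: dict) -> dict:
    """Return {secret id: sorted missing keys} for the entries we know about."""
    # Index the parsed "secrets" list by id so each entry is easy to look up.
    by_id = {
        entry.get("id"): entry.get("data") or {}
        for entry in secrets_doc.get("secrets", [])
    }
    problems = {}
    for secret_id, required in REQUIRED_KEYS.items():
        present = set(by_id.get(secret_id, {}))
        missing = required - present
        if missing:
            problems[secret_id] = sorted(missing)
    return problems


if __name__ == "__main__":
    parsed = {
        "secrets": [
            {"id": "astra-langchain", "data": {"token": "AstraCS:mytoken"}},
        ]
    }
    # Prints the keys that still need to be filled in.
    print(missing_secret_keys(parsed))
```

An empty result means every listed key is present; the check says nothing about whether the values themselves are valid credentials.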

If using Terraform, update the [Terraform deployment script](deployment.tf) as needed.
@@ -106,14 +121,51 @@ If using Terraform, update the [Terraform deployment script](deployment.tf) as n

1. **Install LangStream:** Follow the [installation guide](https://github.com/LangStream/langstream#installation).

2. **Start the Pipeline:** Ensure your secrets file is ready, and you have a PDF in the container. Then run:
2. **Start the Pipeline:** Ensure your secrets file is ready.

Warning: LangStream performs a destructive operation when consuming files from your blob storage container. Ensure that it's safe for the files in the storage container to be deleted after being successfully ingested!

Then run:

```
cd examples/applications/azure-document-ingestion
langstream docker run test -app . -s ../../secrets/secrets.yaml
```
Wait for the UI to start and the logs to reach the state of "Nothing found".
(Alternatively, if you have already put documents in your blob storage container, then you will see logs as LangStream processes the files.)
![LangStream UI](images/langstream_ui.png)
Any new PDFs added to the Azure blob storage container will be processed and saved in AstraDB, then deleted from the blob container.

New PDFs added to the Azure blob storage container will be processed and saved in AstraDB, then deleted from the blob container.
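
Because ingested files are deleted from the container, one simple precaution is to keep a local copy of every PDF before uploading it. The sketch below shows that idea using only the Python standard library. It is an illustration for this guide, not part of LangStream, and the function name `backup_pdfs` and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def backup_pdfs(source_dir: str, backup_dir: str) -> list:
    """Copy every *.pdf in source_dir into backup_dir; return the copied names."""
    src = Path(source_dir)
    dst = Path(backup_dir)
    # Create the backup folder (and any parents) if it does not exist yet.
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(src.glob("*.pdf")):
        # copy2 keeps file metadata such as the modification time.
        shutil.copy2(pdf, dst / pdf.name)
        copied.append(pdf.name)
    return copied


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever your PDFs live.
    print(backup_pdfs("pdfs-to-upload", "pdfs-backup"))
```

Run it against the folder you plan to upload from, then upload the originals to the container as described in the next section.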
### Deploy PDFs for LangStream
**Upload PDFs to Azure container:** For this tutorial, files will be mapped to our target table with the convention: `"${productName} ${productVersion}.pdf"` (e.g., "appleWatch 7.12-v4.pdf" will set productName as appleWatch and productVersion as 7.12-v4).
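
To make the naming convention concrete, here is a minimal sketch of how a filename could be split into the two fields. It is an illustration only: the pipeline performs its own extraction, and the function name `parse_product_filename` and the returned dictionary keys are assumptions made for this example.

```python
from pathlib import PurePosixPath


def parse_product_filename(filename: str) -> dict:
    """Split '<productName> <productVersion>[ extra words].pdf' into its parts."""
    # Drop only the final extension, so '7.12-v4' keeps its inner dot.
    stem = PurePosixPath(filename).stem
    # The first two whitespace-separated tokens are the name and the version;
    # anything after them is ignored.
    parts = stem.split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"expected '<name> <version>.pdf', got {filename!r}")
    return {"product_name": parts[0], "product_version": parts[1]}


if __name__ == "__main__":
    print(parse_product_filename("appleWatch 7.12-v4.pdf"))
    # {'product_name': 'appleWatch', 'product_version': '7.12-v4'}
```

A filename without a space, such as `manual.pdf`, raises `ValueError` in this sketch, which is a quick way to catch files that would not map cleanly to the target table.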

**Upload PDFs**
You can now upload PDFs to your Azure storage container to get them processed. As a reminder, the PDFs we used for this demo assume a naming convention starting with product-name version-name and then any other details, like "MyProduct DocVersion2 public report.pdf".
![Upload PDFs](images/upload_pdfs.png)
Then, click Upload:
![Upload PDFs part 2](images/upload_pdfs2.png)
**Check results**
To inspect the results, navigate to the CQL Console in the Astra UI and run:
```SQL
SELECT * FROM mykeyspace.mytable limit 2;
```
where mykeyspace and mytable are the keyspace and table used in your secrets.yaml file.
![CQL Console image](images/cql_console.png)
You should now see vectorized data in your table!
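
If you prefer to generate the preview query from the same keyspace and table names you put in your secrets file, a small helper like the one below can build it. This is a sketch for this guide: `preview_query` is a hypothetical name, and it only accepts unquoted CQL identifiers (a letter followed by letters, digits, or underscores) so that nothing unexpected ends up in the statement.

```python
import re

# Unquoted CQL identifiers: a letter, then letters, digits, or underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def preview_query(keyspace: str, table: str, limit: int = 2) -> str:
    """Build a 'SELECT * ... LIMIT n' statement for the CQL Console."""
    for name in (keyspace, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain CQL identifier: {name!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM {keyspace}.{table} LIMIT {limit};"


if __name__ == "__main__":
    # Values taken from the sample secrets file in this guide.
    print(preview_query("examplekeyspace", "exampletable"))
```

Paste the printed statement into the CQL Console to run it.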

## Test RAG
The moment of truth: run a Retrieval Augmented Generation (RAG) query that draws on your PDF data.
### Connect via LangStream UI
Return to the LangStream UI and click Connect.
![Connecting to Langstream](images/connecting_to_langstream.png)
### Ask a question
Ask a question that can only be answered by the PDF(s) you uploaded.
After you have asked your question, press enter.
![Question and answer](images/question_answer.png)
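
For readers new to RAG, the toy sketch below illustrates the general idea behind the query you just ran: rank stored chunks by how similar their vectors are to the question's vector, then place the best chunks into the prompt. It is not the code this application runs. The vectors are made-up three-dimensional values rather than real embeddings, the similarity search is a plain Python loop rather than an index lookup in AstraDB, and the function names are hypothetical. The row keys `body_blob` and `vector` mirror column names in the sample table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, rows, k=2):
    """Return the k rows whose vectors are most similar to the query vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Assemble a prompt that asks the model to answer from the retrieved text."""
    context = "\n".join(f"- {chunk['body_blob']}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Made-up rows; real rows hold embedding vectors produced by the pipeline.
    rows = [
        {"body_blob": "Battery lasts 18 hours.", "vector": [1.0, 0.0, 0.0]},
        {"body_blob": "The strap is replaceable.", "vector": [0.0, 1.0, 0.0]},
        {"body_blob": "Charging takes 90 minutes.", "vector": [0.9, 0.1, 0.0]},
    ]
    top = retrieve([1.0, 0.0, 0.0], rows, k=2)
    print(build_prompt("How long does the battery last?", top))
```

In the real application the question is embedded with the same model used during ingestion, which is why a question that only your PDFs can answer makes a good test.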

## View flow (for fun)
If you click the App tab, you can see a visual representation of your pipeline. Pretty cool, huh!
![Pipeline flow](images/pipeline_flow.png)

### Run Terraform to deploy dependencies
1. Review [this Terraform deployment script](deployment.tf).
@@ -144,7 +196,7 @@ terraform apply -auto-approve
6. **Connect to the VM as needed to check that the service is running as expected**
```bash
chmod 600 ~/.ssh/azure.pem
ssh -i ~/.ssh/azure.pem adminuser@40.78.159.250 # substitute with the actual VM IP
ssh -i ~/.ssh/azure.pem adminuser@40.83.57.12 # substitute with the actual VM IP
```
7. **Verify service is running**
```bash
@@ -158,13 +210,4 @@ runuser -l adminuser -c '/home/adminuser/.langstream/candidates/current/bin/lang
Additionally, you can check the cloud-init output log on the VM to confirm that the Terraform deployment ran correctly:
```bash
cat /var/log/cloud-init-output.log
```
9. **Upload PDFs**
You can now upload PDFs to your Azure storage container to get them processed. As a reminder, the PDFs we used for this demo assume a naming convention starting with product-name version-name and then any other details, like "MyProduct DocVersion2 public report.pdf".
10. **Check results**
To inspect the results, navigate to the CQL Console in the Astra UI and run:
```SQL
SELECT * FROM mykeyspace.mytable limit 2;
```
where mykeyspace and mytable are the keyspace and table used in your secrets.yaml file.
You should now see vectorized data in your table!
```
47 changes: 47 additions & 0 deletions examples/applications/azure-document-ingestion/assets.yaml
@@ -0,0 +1,47 @@
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

assets:
  - name: "langstream-keyspace"
    asset-type: "astra-keyspace"
    creation-mode: create-if-not-exists
    config:
      keyspace: "{{secrets.astra.keyspace}}"
      datasource: "AstraDatasource"
  - name: "langstream-docs-table"
    asset-type: "cassandra-table"
    creation-mode: create-if-not-exists
    config:
      table-name: "{{secrets.astra.table}}"
      keyspace: "{{secrets.astra.keyspace}}"
      datasource: "AstraDatasource"
      create-statements:
        - |
          CREATE TABLE IF NOT EXISTS "{{secrets.astra.keyspace}}"."{{secrets.astra.table}}" (
          row_id TEXT PRIMARY KEY,
          filename TEXT,
          chunk_text_length TEXT,
          chunk_num_tokens TEXT,
          chunk_id TEXT,
          attributes_blob TEXT,
          body_blob TEXT,
          metadata_s map<text, text>,
          name TEXT,
          product_name TEXT,
          product_version TEXT,
          vector VECTOR<FLOAT, 1536>);
        - |
          CREATE CUSTOM INDEX IF NOT EXISTS {{secrets.astra.table}}_ann_index ON {{secrets.astra.keyspace}}.{{secrets.astra.table}}(vector) USING 'StorageAttachedIndex';
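
To preview what the statements above look like once the `{{secrets.<id>.<key>}}` placeholders are filled in, a small substitution sketch can help. This is not LangStream's own resolver; it is a stand-in that handles only the placeholder form used in this file, and `render_secrets` is a hypothetical name.

```python
import re

# Matches placeholders of the form {{secrets.<id>.<key>}} as used in this file.
_PLACEHOLDER = re.compile(r"\{\{\s*secrets\.([\w-]+)\.([\w-]+)\s*\}\}")


def render_secrets(template: str, secrets: dict) -> str:
    """Replace each {{secrets.<id>.<key>}} with secrets[<id>][<key>]."""

    def substitute(match):
        secret_id, key = match.group(1), match.group(2)
        try:
            return str(secrets[secret_id][key])
        except KeyError:
            raise KeyError(f"no value for secrets.{secret_id}.{key}") from None

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Values taken from the sample secrets file in the README.
    values = {"astra": {"keyspace": "examplekeyspace", "table": "exampletable"}}
    statement = (
        "CREATE CUSTOM INDEX IF NOT EXISTS {{secrets.astra.table}}_ann_index "
        "ON {{secrets.astra.keyspace}}.{{secrets.astra.table}}(vector) "
        "USING 'StorageAttachedIndex';"
    )
    print(render_secrets(statement, values))
```

Reading the rendered statement is a quick way to spot a keyspace or table name that does not match what you created in AstraDB.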