Replace Langchain's Azure-Document-Intelligence Loader with a custom loader #410

yuvalyaron · 2024-03-18T17:58:18Z

Resolves #321

Prior to these changes, the accelerator used LangChain's Azure Document Intelligence loader, which is fairly basic and returns only the raw response from the Document Intelligence API.
In addition, the code used the weaker "prebuilt-read" model, which performs simple OCR, instead of the more effective "prebuilt-layout" model, which analyzes the documents with AI.

These changes allow for more flexibility in loading the documents and include:

Formatting of tables (to make tables more readable for the LLM)
Support for excluding parts of the file that are not relevant for the LLM (such as page numbers and footers)
Removal of recurring patterns in the file using regular expressions (for example, Document Intelligence sometimes adds ":selected:" to the content)
Ability to load the file into a single LangChain document (as LangChain's Azure Document Intelligence implementation does) or into one document per page (as other LangChain loaders do); this is important for use cases where we want to get the relevant page number
Support for parallel processing of files
Fallback to the simpler "prebuilt-read" model, because the more complex model sometimes fails to load files
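
To make the list above more concrete, here is a minimal sketch of how such post-processing and fallback logic could look. It is not the code merged in this PR. All names (`clean_text`, `keep_paragraphs`, `format_table`, `load_with_fallback`, `load_many`), the default role and pattern lists, and the pipe-delimited table layout are assumptions for illustration. The `analyze` argument stands in for whatever function actually calls Document Intelligence with a given model name.

```python
from __future__ import annotations

import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Illustrative defaults only; the PR description does not list the real values.
EXCLUDED_ROLES = {"pageNumber", "pageFooter"}
NOISE_PATTERNS = [r":selected:", r":unselected:"]


def clean_text(text: str, patterns: Iterable[str] = NOISE_PATTERNS) -> str:
    """Remove recurring noise patterns, then collapse leftover runs of spaces."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def keep_paragraphs(paragraphs: list[dict], excluded: set = EXCLUDED_ROLES) -> list[str]:
    """Drop paragraphs whose role is excluded (for example page numbers and footers)."""
    return [p["content"] for p in paragraphs if p.get("role") not in excluded]


def format_table(rows: list[list[str]]) -> str:
    """Render rows as a pipe-delimited table, treating the first row as the header."""
    if not rows:
        return ""
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    lines.insert(1, "|" + "|".join(" --- " for _ in rows[0]) + "|")
    return "\n".join(lines)


def load_with_fallback(
    path: str,
    analyze: Callable[[str, str], str],
    models: tuple = ("prebuilt-layout", "prebuilt-read"),
) -> str:
    """Try each model in order and return the first successful result."""
    last_error = None
    for model in models:
        try:
            return analyze(path, model)
        except Exception as error:  # a real loader would catch narrower errors
            last_error = error
    raise RuntimeError(f"all models failed for {path}") from last_error


def load_many(paths: list[str], analyze: Callable[[str, str], str], max_workers: int = 4) -> list[str]:
    """Load several files in parallel; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: load_with_fallback(p, analyze), paths))
```

Passing `analyze` in as a parameter keeps the sketch independent of any particular SDK version and makes the fallback path easy to exercise in unit tests with a fake function.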

Checklist:

Do all the steps execute without errors with sentence-boundary-based chunking?
Do all the steps execute without errors with Document Intelligence based chunking?
Do all the steps execute without errors using an Azure Machine Learning cluster?
Do all the steps execute without errors using a sample of data?
Have you updated or added unit tests for changes applicable to the PR?
Do all the unit tests pass?
Have you updated the readme.md file based on the changes in the PR?
Have you reverted the config.json to its original state?

ritesh-modi

LGTM

shivam-51

Thanks for the PR. LGTM. Please fix the conflicts.

yuvalyaron added 13 commits March 12, 2024 15:47

make document intelligence loader use prebuilt-layout

ee4fa17

add option to split by page to doc intelligence loader

1525a5a

add unit tests

3d0fdf8

add unit test

7f01c5e

add unit test

f7b59b0

remove formatting of spanning rows

a440d2d

add unit test for tables without headers

87c0680

add unit tests for multipages

4196209

fix test

ebb7b52

add test for excluding roles

735318a

add test for get_file_paths

1dac9f0

fix format

c531668

Merge branch 'development' of https://github.com/microsoft/rag-experi…

77f37d0

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

yuvalyaron linked an issue Mar 18, 2024 that may be closed by this pull request

Replace Langchain's Azure-Document-Intelligence Loader with a custom loader #321

Closed

Merge branch 'development' of https://github.com/microsoft/rag-experi…

bc33025

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

ritesh-modi self-requested a review March 19, 2024 09:12

yuvalyaron and others added 10 commits March 19, 2024 09:30

remove duplicate dependency

92ebab5

Merge branch 'development' into yuval/add-azure-document-intelligence…

ab2ce9e

…-prebuilt-layout-loader

Merge branch 'development' of https://github.com/microsoft/rag-experi…

5e03673

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

Merge branch 'development' of https://github.com/microsoft/rag-experi…

6541f8a

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

fix conflicts with development

ebf96f8

Merge branch 'development' of https://github.com/microsoft/rag-experi…

b8a3857

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

Merge branch 'development' into yuval/add-azure-document-intelligence…

f400e84

…-prebuilt-layout-loader

Merge branch 'development' of https://github.com/microsoft/rag-experi…

63dbdea

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

Merge branch 'yuval/add-azure-document-intelligence-prebuilt-layout-l…

fc632c5

…oader' of https://github.com/microsoft/rag-experiment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

remove progress bar

916d7e0

ritesh-modi requested a review from shivam-51 April 3, 2024 08:47

ritesh-modi approved these changes Apr 3, 2024

shivam-51 approved these changes Apr 3, 2024

Merge branch 'development' of https://github.com/microsoft/rag-experi…

cc21def

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader

Update README

4622b68

yuvalyaron merged commit c3ad1bb into development Apr 3, 2024
3 checks passed

yuvalyaron deleted the yuval/add-azure-document-intelligence-prebuilt-layout-loader branch April 3, 2024 17:05
