Replace Langchain's Azure-Document-Intelligence Loader with a custom loader #410

Collaborator

@yuvalyaron yuvalyaron commented Mar 18, 2024

Resolves #321

Prior to these changes, the accelerator used LangChain's Azure Document Intelligence loader, which is fairly basic and returns only the raw response from the Document Intelligence API.
In addition, the code used the "prebuilt-read" model, which performs plain OCR text extraction, instead of the more capable "prebuilt-layout" model, which also analyzes document structure such as tables.

These changes allow for more flexibility in loading the documents and include:

  • Formatting of tables (to make them more readable for the LLM)
  • Support for excluding parts of the file that are not relevant to the LLM (such as page numbers and footers)
  • Removal of recurring patterns in the file with regex (for example, Document Intelligence sometimes adds ":selected:" to the content)
  • Ability to load the file into a single LangChain document (as LangChain's Azure Document Intelligence implementation does) or into one document per page (as other LangChain loaders do). This matters for use cases that need the relevant page number
  • Support for parallel processing of files
  • Use of the simpler "prebuilt-read" model as a fallback, because the more complex model sometimes fails to load files
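
For readers who have not opened the diff, the behaviors listed above can be sketched roughly as follows. This is an illustrative sketch only, not the code merged in this PR: `Paragraph`, `Document`, `to_documents`, `analyze_with_fallback`, `load_many`, and the `analyze(path, model)` callable are hypothetical stand-ins, and the real Document Intelligence response is richer than the simplified structure assumed here. The role names `pageNumber` and `pageFooter` are assumed to be the labels used for the excluded parts.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Paragraph:
    """Hypothetical, simplified view of one analyzed paragraph."""
    content: str
    page_number: int
    role: Optional[str] = None  # assumed labels, e.g. "pageNumber", "pageFooter"


@dataclass
class Document:
    """Stand-in for a LangChain document: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


EXCLUDED_ROLES = {"pageNumber", "pageFooter"}  # parts not useful to the LLM
PATTERNS_TO_REMOVE = [r":selected:"]           # recurring artifacts to strip


def clean(text: str) -> str:
    """Remove the configured patterns, then collapse leftover double spaces."""
    for pattern in PATTERNS_TO_REMOVE:
        text = re.sub(pattern, "", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def format_table(rows: List[List[str]]) -> str:
    """Render rows of cell strings as pipe-delimited lines."""
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


def to_documents(paragraphs: List[Paragraph], source: str, per_page: bool) -> List[Document]:
    """Build one document per page, or a single document for the whole file."""
    kept = [p for p in paragraphs if p.role not in EXCLUDED_ROLES]
    if not per_page:
        text = "\n".join(clean(p.content) for p in kept)
        return [Document(text, {"source": source})]
    pages: Dict[int, List[str]] = {}
    for p in kept:
        pages.setdefault(p.page_number, []).append(clean(p.content))
    return [
        Document("\n".join(lines), {"source": source, "page": number})
        for number, lines in sorted(pages.items())
    ]


def analyze_with_fallback(analyze: Callable[[str, str], List[Paragraph]], path: str) -> List[Paragraph]:
    """Try the layout model first; fall back to the read model if it fails."""
    try:
        return analyze(path, "prebuilt-layout")
    except Exception:
        return analyze(path, "prebuilt-read")


def load_many(analyze, paths: List[str], per_page: bool = True, max_workers: int = 4) -> List[Document]:
    """Analyze files in parallel threads, keeping the input order of paths."""
    def load_one(path: str) -> List[Document]:
        return to_documents(analyze_with_fallback(analyze, path), path, per_page)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [doc for docs in pool.map(load_one, paths) for doc in docs]
```

Passing the service call in as `analyze` keeps the sketch independent of any particular SDK. Threads are a reasonable fit for the parallel step under the assumption that each call mostly waits on the network. Per-page output carries a `page` entry in the metadata, which is what makes the page number available downstream.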

Checklist:

  • Do all the steps execute without errors with sentence-boundary-based chunking?
  • Do all the steps execute without errors with Document Intelligence-based chunking?
  • Do all the steps execute without errors using an Azure Machine Learning cluster?
  • Do all the steps execute without errors using a sample of data?
  • Have you updated or added unit tests for the changes in this PR?
  • Do all the unit tests pass?
  • Have you updated the readme.md file based on the changes in the PR?
  • Have you reverted the config.json to its original state?

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader
@ritesh-modi ritesh-modi self-requested a review March 19, 2024 09:12
@ritesh-modi ritesh-modi requested a review from shivam-51 April 3, 2024 08:47
Collaborator

@ritesh-modi ritesh-modi left a comment

LGTM

Collaborator

@shivam-51 shivam-51 left a comment

Thanks for the PR. LGTM. Please fix the conflicts.

…ment-accelerator into yuval/add-azure-document-intelligence-prebuilt-layout-loader
@yuvalyaron yuvalyaron merged commit c3ad1bb into development Apr 3, 2024
3 checks passed
@yuvalyaron yuvalyaron deleted the yuval/add-azure-document-intelligence-prebuilt-layout-loader branch April 3, 2024 17:05