Options to do table, form and query extraction using get_document_analysis #13

Open
simonw opened this issue Jun 30, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Jun 30, 2022

Using these (more expensive) APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis

@simonw simonw added the enhancement New feature or request label Jun 30, 2022
@simonw
Owner Author

simonw commented Jul 8, 2022

@simonw
Owner Author

simonw commented Jul 14, 2022

Form extraction looks very interesting too: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html
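As a rough illustration of what consuming form output could look like, here is a minimal sketch that turns the `Blocks` list from a `FORMS` analysis into a plain dictionary. It assumes the block layout described in the Textract key-value documentation linked above (`KEY_VALUE_SET` blocks whose `EntityTypes` is `KEY` or `VALUE`, joined by `VALUE` and `CHILD` relationships). The function names `block_text` and `extract_key_values` are made up for this example and are not part of s3-ocr.

```python
def block_text(block, blocks_by_id):
    """Join the text of a block's CHILD blocks (words or checkboxes)."""
    parts = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id.get(child_id)
            if child is None:
                continue
            if child.get("BlockType") == "WORD":
                parts.append(child["Text"])
            elif child.get("BlockType") == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED / NOT_SELECTED instead of text
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks):
    """Return {key text: value text} for every KEY block in the response."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block.get("BlockType") != "KEY_VALUE_SET":
            continue
        if "KEY" not in (block.get("EntityTypes") or []):
            continue
        values = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_block = blocks_by_id.get(value_id)
                    if value_block is not None:
                        values.append(block_text(value_block, blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(values)
    return pairs
```

Note that repeated keys on a form would overwrite each other in this sketch; collecting values into a list per key would avoid that.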

@fgregg

fgregg commented Aug 22, 2022

form extraction is super great! can confirm that when it works, it's pretty magical.

@simonw
Owner Author

simonw commented Nov 2, 2022

I'm particularly interested in QueriesConfig - which lets you ask human-language queries of a document ("what is the date of the event?"). https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.analyze_document

@simonw
Owner Author

simonw commented Nov 2, 2022

Here's my initial prototype for using queries config:

diff --git a/s3_ocr/cli.py b/s3_ocr/cli.py
index d0741cd..78e5947 100644
--- a/s3_ocr/cli.py
+++ b/s3_ocr/cli.py
@@ -106,8 +106,11 @@ def cli():
     "--dry-run", is_flag=True, help="Show what this would do, but don't actually do it"
 )
 @click.option("--no-retry", is_flag=True, help="Don't retry failed requests")
+@click.option(
+    "queries", "-q", "--query", multiple=True, help="Query to answer about the document"
+)
 @common_boto3_options
-def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
+def start(bucket, keys, all, prefix, dry_run, no_retry, queries, **boto_options):
     """
     Start OCR tasks for PDF files in an S3 bucket
 
@@ -163,13 +166,23 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
         for item in items:
             click.echo(item["Key"])
         return
+
+    method = start_document_text_extraction
+    extra_kwargs = {}
+    if queries:
+        method = start_document_analysis
+        extra_kwargs["FeatureTypes"] = ["QUERIES"]
+        extra_kwargs["QueriesConfig"] = {
+            "Queries": [{"Text": query, "Pages": ["*"]} for query in queries]
+        }
+
     for item in pdf_items:
         key = item["Key"]
         if key not in keys_with_s3_ocr_files:
             sleep = 1
             while True:
                 try:
-                    response = start_document_text_extraction(
+                    response = method(
                         textract,
                         DocumentLocation={
                             "S3Object": {
@@ -181,6 +194,7 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
                             "S3Bucket": bucket,
                             "S3Prefix": "textract-output",
                         },
+                        **extra_kwargs,
                     )
                     break
                 except textract.exceptions.LimitExceededException as ex:
@@ -574,3 +588,8 @@ def paginate(service, method, list_key, **kwargs):
 def start_document_text_extraction(textract, **kwargs):
     # Wrapper function to make this easier to mock in tests
     return textract.start_document_text_detection(**kwargs)
+
+
+def start_document_analysis(textract, **kwargs):
+    # Wrapper function to make this easier to mock in tests
+    return textract.start_document_analysis(**kwargs)

I ran it like this:

s3-ocr start bln-boarddocs --all -q 'What are the approvals?'

Against this PDF: https://go.boarddocs.com/il/gsd34/Board.nsf/files/BJTJ6F4B9FB4/$file/1_27_20%20Regular%20Board%20Agenda.pdf

It produced this JSON file (it's big - I pretty printed it): https://gist.github.com/simonw/1898d5e07b656ef1c98dd851a0127d69

It seems to have all of the stuff I get for regular OCR, plus the following section at the bottom:

[
        {
            "BlockType": "QUERY",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "EntityTypes": null,
            "Geometry": null,
            "Hint": null,
            "Id": "54c47e41-b6ef-4399-8dd5-3d743953d0af",
            "Page": 1,
            "PageClassification": null,
            "Query": {
                "Alias": null,
                "Pages": null,
                "Text": "What are the approvals?"
            },
            "Relationships": [
                {
                    "Ids": [
                        "e10b9d3d-fb79-4d49-a023-d67295d1daca"
                    ],
                    "Type": "ANSWER"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        },
        {
            "BlockType": "QUERY_RESULT",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": 9,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.04619427025318146,
                    "Left": 0.23476172983646393,
                    "Top": 0.32628726959228516,
                    "Width": 0.5292477011680603
                },
                "Polygon": [
                    {
                        "X": 0.23476172983646393,
                        "Y": 0.3263976275920868
                    },
                    {
                        "X": 0.7639786005020142,
                        "Y": 0.32628726959228516
                    },
                    {
                        "X": 0.764009416103363,
                        "Y": 0.3723294138908386
                    },
                    {
                        "X": 0.23480695486068726,
                        "Y": 0.3724815547466278
                    }
                ]
            },
            "Hint": null,
            "Id": "e10b9d3d-fb79-4d49-a023-d67295d1daca",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": null,
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": "Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)",
            "TextType": null
        }
]

The interesting bit is this:

Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)

That does indeed seem to answer my question!
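The `QUERY` block points at its `QUERY_RESULT` through an `ANSWER` relationship, so pulling the answers out of the full block list comes down to following those IDs. A minimal sketch, assuming the block shape shown in the JSON above (`extract_query_answers` is a hypothetical helper name, not something s3-ocr provides):

```python
def extract_query_answers(blocks):
    """Map each query's text to a list of {"text", "confidence"} answers.

    A query with no ANSWER relationship maps to an empty list.
    """
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        results = []
        # "Relationships" can be null when Textract found no answer
        for rel in block.get("Relationships") or []:
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = blocks_by_id.get(result_id)
                if result is not None:
                    results.append(
                        {
                            "text": result.get("Text"),
                            "confidence": result.get("Confidence"),
                        }
                    )
        answers[question] = results
    return answers
```

Called with the `Blocks` list from the job output, this would return one entry per `-q` option that was passed.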

@simonw
Owner Author

simonw commented Nov 2, 2022

One problem to solve here is that s3-ocr has a mechanism to avoid running the same OCR job more than once.

But if we are asking questions we need to be able to ignore that, in order to re-submit new questions against documents that have been previously processed.

May need to redesign the whole .s3-ocr.json mechanism to support this.
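One possible direction, sketched here purely as an illustration and not as how s3-ocr works today: key each submitted job on a signature built from the file's ETag plus the set of queries, so that a document is only skipped when that exact combination has already been run. The names `job_signature` and `should_submit`, and the idea of keeping a signature-to-job mapping per document, are assumptions made for this example.

```python
import hashlib
import json


def job_signature(etag, queries=()):
    """Stable hash of a file version plus the questions asked of it."""
    # Sort and de-duplicate so the order of -q options does not matter
    normalized = sorted({query.strip() for query in queries})
    payload = json.dumps({"etag": etag, "queries": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_submit(existing_jobs, etag, queries=()):
    """existing_jobs is a hypothetical {signature: job_id} record for one PDF."""
    return job_signature(etag, queries) not in existing_jobs
```

With this shape, plain OCR (no queries) and each distinct set of questions get their own entry, so asking a new question re-submits the document while repeating an old one does not.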

@simonw
Owner Author

simonw commented Nov 2, 2022

I tried asking three questions at once:

s3-ocr start bln-boarddocs --all \
  -q 'What are the approvals?' \
  -q 'What date was the meeting?' \
  -q 'Who was at the meeting?'

And got back https://gist.github.com/simonw/1bdadd2dcf341b9e8c75486f1c685f7a#file-three-questions-json-L14399 - which included answers for the first two questions and null for the third (because the document doesn't actually say who was at the meeting):

  • "Text": "Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)"
  • "Text": "January 27, 2019"

@simonw simonw changed the title Options to do table extraction / other document analysis Options to do table, form and query extraction using get_document_analysis Nov 2, 2022