Options to do table, form and query extraction using get_document_analysis #13

simonw · 2022-06-30T01:50:55Z

Using these (more expensive) APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis

simonw · 2022-07-08T16:32:07Z

Great example for trying out table extraction: https://www.sos.ms.gov/elections-voting/2022-republican-primary - example PDF: https://www.sos.ms.gov/elections/electionResults/2022RepublicanPrimary/Adams.pdf
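A rough sketch of how tables could be pulled out of the `Blocks` list that get_document_analysis returns when `FeatureTypes` includes `"TABLES"`. It assumes the documented block layout, where a TABLE block points at CELL blocks through CHILD relationships and each CELL points at its WORD blocks the same way. The function name `tables_from_blocks` is made up for illustration and is not part of s3-ocr.

```python
def tables_from_blocks(blocks):
    """Return each TABLE block as a list of rows; each row is a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Relationships may be missing or null, so fall back to an empty list
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue  # skip merged-cell and other non-CELL children
            words = [
                by_id[word_id].get("Text") or ""
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            tables.append([])
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append(
            [
                [cells.get((row, col), "") for col in range(1, n_cols + 1)]
                for row in range(1, n_rows + 1)
            ]
        )
    return tables
```

Each returned table is a plain list of lists, so it could be passed straight to `csv.writer(...).writerows(table)`.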

simonw · 2022-07-14T16:08:02Z

Form extraction looks very interesting too: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html
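As a sketch of what consuming that output might look like: with `"FORMS"` in `FeatureTypes`, the response contains KEY_VALUE_SET blocks whose `EntityTypes` is either `["KEY"]` or `["VALUE"]`. A KEY block links to its VALUE block through a VALUE relationship, and both link to their words through CHILD relationships. The helper below is hypothetical (the name `form_fields_from_blocks` is not from s3-ocr) and only handles WORD and SELECTION_ELEMENT children.

```python
def form_fields_from_blocks(blocks):
    """Return a list of (key text, value text) pairs from KEY_VALUE_SET blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == rel_type:
                ids.extend(rel["Ids"])
        return ids

    def text_of(block):
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    fields = []
    for block in blocks:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in (
            block.get("EntityTypes") or []
        )
        if not is_key:
            continue
        value_text = " ".join(
            text_of(by_id[value_id]) for value_id in related_ids(block, "VALUE")
        )
        fields.append((text_of(block), value_text))
    return fields
```

A list of pairs is used rather than a dict because the same key label can appear more than once on a form.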

fgregg · 2022-08-22T17:53:21Z

form extraction is super great! can confirm that when it works, it's pretty magical.

simonw · 2022-11-02T19:37:03Z

I'm particularly interested in QueriesConfig - which lets you ask human-language queries of a document ("what is the date of the event?"): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.analyze_document

simonw · 2022-11-02T19:57:00Z

Here's my initial prototype for using queries config:

diff --git a/s3_ocr/cli.py b/s3_ocr/cli.py
index d0741cd..78e5947 100644
--- a/s3_ocr/cli.py
+++ b/s3_ocr/cli.py
@@ -106,8 +106,11 @@ def cli():
     "--dry-run", is_flag=True, help="Show what this would do, but don't actually do it"
 )
 @click.option("--no-retry", is_flag=True, help="Don't retry failed requests")
+@click.option(
+    "queries", "-q", "--query", multiple=True, help="Query to answer about the document"
+)
 @common_boto3_options
-def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
+def start(bucket, keys, all, prefix, dry_run, no_retry, queries, **boto_options):
     """
     Start OCR tasks for PDF files in an S3 bucket
 
@@ -163,13 +166,23 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
         for item in items:
             click.echo(item["Key"])
         return
+
+    method = start_document_text_extraction
+    extra_kwargs = {}
+    if queries:
+        method = start_document_analysis
+        extra_kwargs["FeatureTypes"] = ["QUERIES"]
+        extra_kwargs["QueriesConfig"] = {
+            "Queries": [{"Text": query, "Pages": ["*"]} for query in queries]
+        }
+
     for item in pdf_items:
         key = item["Key"]
         if key not in keys_with_s3_ocr_files:
             sleep = 1
             while True:
                 try:
-                    response = start_document_text_extraction(
+                    response = method(
                         textract,
                         DocumentLocation={
                             "S3Object": {
@@ -181,6 +194,7 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
                             "S3Bucket": bucket,
                             "S3Prefix": "textract-output",
                         },
+                        **extra_kwargs,
                     )
                     break
                 except textract.exceptions.LimitExceededException as ex:
@@ -574,3 +588,8 @@ def paginate(service, method, list_key, **kwargs):
 def start_document_text_extraction(textract, **kwargs):
     # Wrapper function to make this easier to mock in tests
     return textract.start_document_text_detection(**kwargs)
+
+
+def start_document_analysis(textract, **kwargs):
+    # Wrapper function to make this easier to mock in tests
+    return textract.start_document_analysis(**kwargs)

I ran it like this:

s3-ocr start bln-boarddocs --all -q 'What are the approvals?'

Against this PDF: https://go.boarddocs.com/il/gsd34/Board.nsf/files/BJTJ6F4B9FB4/$file/1_27_20%20Regular%20Board%20Agenda.pdf

It produced this JSON file (it's big - I pretty printed it): https://gist.github.com/simonw/1898d5e07b656ef1c98dd851a0127d69

It seems to have all of the stuff I get for regular OCR, plus the following section at the bottom:

[
        {
            "BlockType": "QUERY",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "EntityTypes": null,
            "Geometry": null,
            "Hint": null,
            "Id": "54c47e41-b6ef-4399-8dd5-3d743953d0af",
            "Page": 1,
            "PageClassification": null,
            "Query": {
                "Alias": null,
                "Pages": null,
                "Text": "What are the approvals?"
            },
            "Relationships": [
                {
                    "Ids": [
                        "e10b9d3d-fb79-4d49-a023-d67295d1daca"
                    ],
                    "Type": "ANSWER"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        },
        {
            "BlockType": "QUERY_RESULT",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": 9,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.04619427025318146,
                    "Left": 0.23476172983646393,
                    "Top": 0.32628726959228516,
                    "Width": 0.5292477011680603
                },
                "Polygon": [
                    {
                        "X": 0.23476172983646393,
                        "Y": 0.3263976275920868
                    },
                    {
                        "X": 0.7639786005020142,
                        "Y": 0.32628726959228516
                    },
                    {
                        "X": 0.764009416103363,
                        "Y": 0.3723294138908386
                    },
                    {
                        "X": 0.23480695486068726,
                        "Y": 0.3724815547466278
                    }
                ]
            },
            "Hint": null,
            "Id": "e10b9d3d-fb79-4d49-a023-d67295d1daca",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": null,
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": "Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)",
            "TextType": null
        }
]

The interesting bit is this:

Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)

That does indeed seem to answer my question!
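Pulling that answer out programmatically means following the ANSWER relationship from each QUERY block to its QUERY_RESULT block, as in the JSON above. Here is a minimal sketch, assuming the blocks have been loaded into a Python list (for example from the `Blocks` key of the saved JSON). The `query_answers` name is made up for this example.

```python
def query_answers(blocks):
    """Map each question to a list of (answer text, confidence) tuples."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        question = block["Query"]["Text"]
        # The same question may appear once per page, so accumulate results
        results = answers.setdefault(question, [])
        for rel in block.get("Relationships") or []:
            if rel["Type"] != "ANSWER":
                continue
            for answer_id in rel["Ids"]:
                result = by_id.get(answer_id)
                if result and result["BlockType"] == "QUERY_RESULT":
                    results.append((result["Text"], result["Confidence"]))
    return answers
```

A question with no ANSWER relationship ends up with an empty list, which makes unanswered queries easy to spot.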

simonw · 2022-11-02T20:02:40Z

One problem to solve here is that s3-ocr has a mechanism to avoid running the same OCR job more than once.

But if we are asking questions we need to be able to ignore that, in order to re-submit new questions against documents that have been previously processed.

May need to redesign the whole .s3-ocr.json mechanism to support this.
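One possible direction, sketched here purely as an illustration: derive the marker file name from the set of questions as well as the PDF key, so a new set of questions counts as a new job while repeating the same questions is still skipped. This assumes the existing marker is the PDF key with `.s3-ocr.json` appended. The `marker_key` function and the 12-character digest are hypothetical, not something s3-ocr does today.

```python
import hashlib
import json


def marker_key(pdf_key, queries=()):
    """Return the S3 key of the marker file for a PDF and an optional set of queries."""
    if not queries:
        # Plain OCR keeps the existing naming scheme (assumed here)
        return pdf_key + ".s3-ocr.json"
    # Sort and de-duplicate so the order of -q options does not change the key
    canonical = json.dumps(sorted(set(queries)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{pdf_key}.{digest}.s3-ocr.json"
```

With this scheme `marker_key("agenda.pdf", ["What are the approvals?"])` and `marker_key("agenda.pdf")` produce different keys, so the query job would not be blocked by an earlier plain OCR run.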

simonw · 2022-11-02T20:57:08Z

I tried asking three questions at once:

s3-ocr start bln-boarddocs --all \
  -q 'What are the approvals?' \
  -q 'What date was the meeting?' \
  -q 'Who was at the meeting?'

And got back https://gist.github.com/simonw/1bdadd2dcf341b9e8c75486f1c685f7a#file-three-questions-json-L14399 - which included answers for the first two questions and null for the third (because the document doesn't actually say who was at the meeting):

"Text": "Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)"
"Text": "January 27, 2019"

simonw added the "enhancement" (New feature or request) label Jun 30, 2022

simonw changed the title ~~Options to do table extraction / other document analysis~~ Options to do table, form and query extraction using get_document_analysis Nov 2, 2022
