Options to do table, form and query extraction using get_document_analysis #13
Comments
Great example for trying out table extraction: https://www.sos.ms.gov/elections-voting/2022-republican-primary - example PDF: https://www.sos.ms.gov/elections/electionResults/2022RepublicanPrimary/Adams.pdf |
Form extraction looks very interesting too: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html |
form extraction is super great! can confirm that when it works, it's pretty magical. |
I'm particularly interested in |
Here's my initial prototype for using queries config:
diff --git a/s3_ocr/cli.py b/s3_ocr/cli.py
index d0741cd..78e5947 100644
--- a/s3_ocr/cli.py
+++ b/s3_ocr/cli.py
@@ -106,8 +106,11 @@ def cli():
     "--dry-run", is_flag=True, help="Show what this would do, but don't actually do it"
 )
 @click.option("--no-retry", is_flag=True, help="Don't retry failed requests")
+@click.option(
+    "queries", "-q", "--query", multiple=True, help="Query to answer about the document"
+)
 @common_boto3_options
-def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
+def start(bucket, keys, all, prefix, dry_run, no_retry, queries, **boto_options):
     """
     Start OCR tasks for PDF files in an S3 bucket
@@ -163,13 +166,23 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
         for item in items:
             click.echo(item["Key"])
         return
+
+    method = start_document_text_extraction
+    extra_kwargs = {}
+    if queries:
+        method = start_document_analysis
+        extra_kwargs["FeatureTypes"] = ["QUERIES"]
+        extra_kwargs["QueriesConfig"] = {
+            "Queries": [{"Text": query, "Pages": ["*"]} for query in queries]
+        }
+
     for item in pdf_items:
         key = item["Key"]
         if key not in keys_with_s3_ocr_files:
             sleep = 1
             while True:
                 try:
-                    response = start_document_text_extraction(
+                    response = method(
                         textract,
                         DocumentLocation={
                             "S3Object": {
@@ -181,6 +194,7 @@ def start(bucket, keys, all, prefix, dry_run, no_retry, **boto_options):
                             "S3Bucket": bucket,
                             "S3Prefix": "textract-output",
                         },
+                        **extra_kwargs,
                     )
                     break
                 except textract.exceptions.LimitExceededException as ex:
@@ -574,3 +588,8 @@ def paginate(service, method, list_key, **kwargs):
 def start_document_text_extraction(textract, **kwargs):
     # Wrapper function to make this easier to mock in tests
     return textract.start_document_text_detection(**kwargs)
+
+
+def start_document_analysis(textract, **kwargs):
+    # Wrapper function to make this easier to mock in tests
+    return textract.start_document_analysis(**kwargs)
I ran it like this:
Against this PDF: https://go.boarddocs.com/il/gsd34/Board.nsf/files/BJTJ6F4B9FB4/$file/1_27_20%20Regular%20Board%20Agenda.pdf
It produced this JSON file (it's big - I pretty printed it): https://gist.github.com/simonw/1898d5e07b656ef1c98dd851a0127d69
It seems to have all of the stuff I get for regular OCR, plus the following section at the bottom:
[
{
"BlockType": "QUERY",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": null,
"EntityTypes": null,
"Geometry": null,
"Hint": null,
"Id": "54c47e41-b6ef-4399-8dd5-3d743953d0af",
"Page": 1,
"PageClassification": null,
"Query": {
"Alias": null,
"Pages": null,
"Text": "What are the approvals?"
},
"Relationships": [
{
"Ids": [
"e10b9d3d-fb79-4d49-a023-d67295d1daca"
],
"Type": "ANSWER"
}
],
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": null,
"TextType": null
},
{
"BlockType": "QUERY_RESULT",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 9,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.04619427025318146,
"Left": 0.23476172983646393,
"Top": 0.32628726959228516,
"Width": 0.5292477011680603
},
"Polygon": [
{
"X": 0.23476172983646393,
"Y": 0.3263976275920868
},
{
"X": 0.7639786005020142,
"Y": 0.32628726959228516
},
{
"X": 0.764009416103363,
"Y": 0.3723294138908386
},
{
"X": 0.23480695486068726,
"Y": 0.3724815547466278
}
]
},
"Hint": null,
"Id": "e10b9d3d-fb79-4d49-a023-d67295d1daca",
"Page": 1,
"PageClassification": null,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "Approval of Recording Secretary Pro Tem 2. Approval of Agenda 3. Approval of December 16, 2019 Executive Session and Regular Board Meeting Minutes (Tax Levy)",
"TextType": null
}
]
The interesting bit is this:
That does indeed seem to answer my question! |
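For pulling the answers out of a block list like the one above, here is a minimal sketch. It follows each `QUERY` block's `ANSWER` relationship to the matching `QUERY_RESULT` block by `Id`. The function name `query_answers` is made up for illustration and is not part of s3-ocr. It also assumes an unanswered query shows up as a `QUERY` block with a missing or null `Relationships` value, which I have not verified.

```python
def query_answers(blocks):
    """Map each question text to a list of (answer text, confidence) pairs.

    `blocks` is a list of Textract block dictionaries shaped like the JSON
    above. Hypothetical helper, not part of s3-ocr.
    """
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        results = answers.setdefault(question, [])
        # Assumption: no answer means Relationships is missing or null
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "ANSWER":
                continue
            for answer_id in relationship.get("Ids", []):
                result = by_id.get(answer_id)
                if result is not None:
                    results.append((result.get("Text"), result.get("Confidence")))
    return answers
```

Passing in the parsed contents of the JSON file should give a dictionary keyed by question, with an empty list for any question that has no linked `QUERY_RESULT`.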
One problem to solve here is that But if we are asking questions we need to be able to ignore that, in order to re-submit new questions against documents that have been previously processed. May need to redesign the whole |
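One possible shape for that, as a rough sketch only: decide which keys to submit based on whether any queries were passed. `keys_with_s3_ocr_files` is the name used in the diff above. `keys_to_submit` is a hypothetical helper that does not exist in s3-ocr, and this does not address where the output for different sets of questions should be stored.

```python
def keys_to_submit(pdf_keys, keys_with_s3_ocr_files, queries=()):
    """Return the PDF keys that should be sent to Textract.

    Hypothetical helper, not part of s3-ocr.
    Without queries, keys that already have results are skipped.
    With queries, every key is submitted so new questions can be asked
    of documents that were processed before.
    """
    if queries:
        return list(pdf_keys)
    return [key for key in pdf_keys if key not in keys_with_s3_ocr_files]
```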
I tried asking three questions at once:
And got back https://gist.github.com/simonw/1bdadd2dcf341b9e8c75486f1c685f7a#file-three-questions-json-L14399 - which included answers for the first two questions and
|
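For reference, with the prototype diff above, several `-q` options end up in a single `start_document_analysis` request. A small sketch of the extra arguments it builds, pulled out into a function so it can be inspected on its own. `analysis_kwargs` is a made-up name, and the second and third questions in the usage note are placeholders, not the ones I actually asked.

```python
def analysis_kwargs(queries):
    """Build the extra start_document_analysis arguments for a list of questions.

    Mirrors the logic in the prototype diff; returns an empty dict when no
    queries are given so plain text detection can be used instead.
    Hypothetical helper, not part of s3-ocr.
    """
    if not queries:
        return {}
    return {
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            # "*" asks Textract to look for the answer on every page
            "Queries": [{"Text": query, "Pages": ["*"]} for query in queries]
        },
    }
```

Calling `analysis_kwargs(["What are the approvals?", "placeholder question 2", "placeholder question 3"])` gives one `Queries` entry per question, each scoped to all pages.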
Using these (more expensive) APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis
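A sketch of reading results back with `get_document_analysis`, assuming the job has already finished and that `textract` is a boto3 Textract client. Results come back in pages, so the loop follows `NextToken` until there is none. `fetch_analysis_blocks` is a hypothetical name, and this may not be how s3-ocr ends up reading results, since the diff above writes output to S3 via the output configuration.

```python
def fetch_analysis_blocks(textract, job_id):
    """Collect every block from a finished document analysis job.

    `textract` is assumed to be a boto3 Textract client (or anything with a
    compatible get_document_analysis method). Hypothetical helper.
    """
    response = textract.get_document_analysis(JobId=job_id)
    status = response.get("JobStatus")
    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Job {job_id} is not ready, status: {status}")
    blocks = list(response.get("Blocks", []))
    # Keep requesting pages until Textract stops returning a NextToken
    while response.get("NextToken"):
        response = textract.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```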