parsers: lines: support multiple occurrence of blocks to parse
Until now the lines parser looked for only one block, delimited by the
"start" and "end" regexes. Some invoices spread lines of the same set
across multiple blocks, which can be separated by arbitrary content or by
a page footer and header.

To support such cases, use "start" and "end" to find as many blocks to
parse as possible.

This is (hopefully) cleanly implemented by:
1. Renaming parse() to parse_block() and making it work with a single
   block (already extracted from invoice content)
2. Making new parse() find blocks one by one
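
The two steps above can be sketched in isolation roughly as follows. This is a minimal illustration, not the code from this commit: `find_blocks` is a hypothetical helper name, and it only collects the raw text of each block, where the real parse() hands each block to parse_block().

```python
import re


def find_blocks(content, start_pattern, end_pattern):
    """Return the text between each start/end pair, in document order.

    Hypothetical helper for illustration; not part of invoice2data.
    """
    blocks = []
    while True:
        start = re.search(start_pattern, content)
        if not start:
            break
        # Drop everything up to and including the start marker.
        content = content[start.end():]

        end = re.search(end_pattern, content)
        if not end:
            # An unterminated block is ignored in this sketch.
            break
        blocks.append(content[:end.start()])

        # Keep searching in whatever follows the end marker.
        content = content[end.end():]
    return blocks


sample = (
    "Lines start\n1. Cat\n2. Dog\nLines end\n"
    "page footer\n"
    "Lines start\n3. Frog\nLines end\n"
)
blocks = find_blocks(sample, r"Lines start", r"Lines end")
print(blocks)
```

Because the remaining content is re-sliced after every match, text between an "end" and the next "start" (such as the page footer in the sample) never reaches the block parser.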

This feature has been requested as a way of dealing with some multi-page
invoices.

Signed-off-by: Rafał Miłecki <[email protected]>
Rafał Miłecki authored and bosd committed Oct 22, 2022
1 parent 4acb2df commit 64783bd
Showing 4 changed files with 124 additions and 15 deletions.
53 changes: 38 additions & 15 deletions src/invoice2data/extract/parsers/lines.py
@@ -21,24 +21,10 @@ def parse_line(patterns, line):
     return None
 
 
-def parse(template, field, _settings, content):
-    """Try to extract lines from the invoice"""
-
-    # First apply default options.
-    settings = DEFAULT_OPTIONS.copy()
-    settings.update(_settings)
-
+def parse_block(template, field, settings, content):
     # Validate settings
-    assert "start" in settings, "Lines start regex missing"
-    assert "end" in settings, "Lines end regex missing"
     assert "line" in settings, "Line regex missing"
 
-    start = re.search(settings["start"], content)
-    end = re.search(settings["end"], content)
-    if not start or not end:
-        logger.warning(f"No lines found. Start match: {start}. End match: {end}")
-        return
-    content = content[start.end() : end.start()]
     lines = []
     current_row = {}
 
@@ -131,6 +117,43 @@ def parse(template, field, _settings, content):
     return lines
 
 
+def parse(template, field, _settings, content):
+    # First apply default options.
+    settings = DEFAULT_OPTIONS.copy()
+    settings.update(_settings)
+
+    # Validate settings
+    assert "start" in settings, "Lines start regex missing"
+    assert "end" in settings, "Lines end regex missing"
+
+    blocks_count = 0
+    lines = []
+
+    # Try finding & parsing blocks of lines one by one
+    while True:
+        start = re.search(settings["start"], content)
+        if not start:
+            break
+        content = content[start.end():]
+
+        end = re.search(settings["end"], content)
+        if not end:
+            logger.warning("Failed to find lines block end")
+            break
+
+        blocks_count += 1
+        lines += parse_block(template, field, settings, content[0:end.start()])
+
+        content = content[end.end():]
+
+    if blocks_count == 0:
+        logger.warning("Failed to find any matching block (part) of invoice for \"%s\"", field)
+    elif not lines:
+        logger.warning("Failed to find any lines for \"%s\"", field)
+
+    return lines
+
+
 def parse_current_row(match, current_row):
     # Parse the current row data
     for field, value in match.groupdict().items():
17 changes: 17 additions & 0 deletions tests/custom/lines-blocks.json
@@ -0,0 +1,17 @@
+[
+    {
+        "issuer": "Lines Tests",
+        "date": "2022-10-15",
+        "invoice_number": "1234/10/2022",
+        "amount": 99.99,
+        "lines": [
+            { "pos": 1, "name": "Cat" },
+            { "pos": 2, "name": "Dog" },
+            { "pos": 3, "name": "Frog" },
+            { "pos": 4, "name": "Lizard" },
+            { "pos": 5, "name": "Unicorn" }
+        ],
+        "currency": "EUR",
+        "desc": "Invoice from Lines Tests"
+    }
+]
39 changes: 39 additions & 0 deletions tests/custom/lines-blocks.txt
@@ -0,0 +1,39 @@
+Issue date: 2022-10-15
+Issuer: Lines Tests
+Invoice number: 1234/10/2022
+Total: 99.99 EUR
+
+Lines in multiple blocks
+
+Lines start
+1. Cat
+2. Dog
+Lines end
+
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus quis metus sagittis, fermentum
+risus et, vulputate orci. Curabitur id pellentesque mi, vel euismod nulla. Morbi tincidunt ipsum
+eu volutpat dictum. Nam hendrerit varius mauris, a venenatis ligula lacinia et. Sed blandit
+lobortis facilisis. Donec efficitur metus ac sapien luctus, eget facilisis dolor eleifend. In sapien
+erat, vestibulum in sollicitudin a, euismod nec nunc.
+
+Lines start
+3. Frog
+Lines end
+
+Nulla elit dui, dictum in augue ac, rutrum mollis risus. In hac habitasse platea dictumst. Phasellus
+quis eros ac elit iaculis vehicula et vel nunc. Aenean consequat in velit vel luctus. Proin vel
+sapien cursus, ultrices turpis vel, fringilla dolor. Vestibulum ex leo, ullamcorper a quam quis,
+molestie convallis est. Nulla egestas posuere purus, eget viverra elit dapibus et. Pellentesque
+habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis posuere eros
+dui.
+
+Lines start
+4. Lizard
+5. Unicorn
+Lines end
+
+In varius nulla arcu, ac interdum velit ornare vel. Mauris a placerat lacus. Nam porta metus eget
+arcu mattis, non iaculis elit luctus. Etiam rutrum volutpat arcu, vitae semper turpis mollis id.
+Fusce orci dui, pellentesque et ipsum eget, pellentesque luctus leo. Nullam non mollis mi. In
+semper, ex sed mollis dapibus, lectus metus vestibulum turpis, vitae convallis mauris eros in orci.
+Interdum et malesuada fames ac ante ipsum primis in faucibus.
30 changes: 30 additions & 0 deletions tests/custom/templates/lines-blocks.yml
@@ -0,0 +1,30 @@
+# -*- coding: utf-8 -*-
+# SPDX-License-Identifier: MIT
+issuer: Lines Tests
+keywords:
+  - Lines Tests
+  - Lines in multiple blocks
+fields:
+  date:
+    parser: regex
+    regex: Issue date:\s*(\d{4}-\d{2}-\d{2})
+    type: date
+  invoice_number:
+    parser: regex
+    regex: Invoice number:\s*([\d/]+)
+  amount:
+    parser: regex
+    regex: Total:\s*(\d+\.\d\d)
+    type: float
+  lines:
+    parser: lines
+    start: Lines start
+    end: Lines end
+    line: ^(?P<pos>\d+)\.\s+(?P<name>.+)$
+    types:
+      pos: int
+options:
+  currency: EUR
+  date_formats:
+    - '%Y-%m-%d'
+  decimal_separator: '.'
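
As a rough illustration of what the "line" regex and the "types" mapping in this template do to one extracted block, the sketch below applies them with plain `re`. It is a simplified stand-in, not the library's parse_block(): `parse_block_lines` and `TYPES` are hypothetical names, and it assumes the parser tests each text line of a block separately and ignores lines that do not match.

```python
import re

# Regex copied from the template's "line" setting.
LINE = r"^(?P<pos>\d+)\.\s+(?P<name>.+)$"

# Stand-in for the template's "types" mapping (pos: int).
TYPES = {"pos": int}


def parse_block_lines(block):
    """Turn the text of one block into a list of row dicts.

    Hypothetical, simplified helper; the real parse_block() supports
    more settings than this sketch covers.
    """
    rows = []
    for text_line in block.splitlines():
        match = re.search(LINE, text_line)
        if not match:
            # Blank or unrelated lines inside a block are skipped.
            continue
        row = {}
        for key, value in match.groupdict().items():
            # Convert typed fields; everything else stays a string.
            row[key] = TYPES.get(key, str)(value)
        rows.append(row)
    return rows


rows = parse_block_lines("\n1. Cat\n2. Dog\nSubtotal\n")
print(rows)
```

Under these assumptions, running the same function over the three blocks of lines-blocks.txt and concatenating the results yields the five rows listed in lines-blocks.json.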
