Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parsers: lines: support multiple occurrence of blocks to parse #423

Merged
merged 1 commit into from
Oct 23, 2022

Conversation

rmilecki
Copy link
Collaborator

So far lines parser was looking for only 1 block defined by "start" and
"end" RegEx-es. Some invoices may have lines of the same set in muliple
blocks. They can be separated by some random content or page footer &
header.

To support such cases use "start" and "end" to find as many blocks to
parse as possible.

This is (hopefully) cleanly implemented by:
1. Renaming parse() to parse_block() and making it work with a single
    block (already extracted from invoice content)
2. Making new parse() find blocks one by one

This feature has been requested as a way of dealing with some multi-page
invoices.

@rmilecki rmilecki requested review from m3nu and bosd October 16, 2022 09:55
@rmilecki rmilecki self-assigned this Oct 16, 2022
@rmilecki
Copy link
Collaborator Author

rmilecki commented Oct 16, 2022

This feature was requested by @noxqs in the #372

I added new test to:

  1. Make this use case easier to understand
  2. Make sure it gets tested, it works and doesn't get broken later

@bosd
Copy link
Collaborator

bosd commented Oct 21, 2022

Thanks for this contribution,
Since, there is already some work ongoing on the lines parser for multiline support in #378 and #417.
I will refrain from reviewing this, for now. until that functionality has been merged.
(will check then, if these code changes are compatible with each other)

One suggestion though, It would be nice to have some kind of feedback in the logger when when there is no match on the line start or line end regex. (This is also unavailable before this PR)

@bosd
Copy link
Collaborator

bosd commented Oct 21, 2022

@rmilecki Can you fix the conflicts?

@rmilecki rmilecki force-pushed the lines-multiple-blocks branch 2 times, most recently from 402c23a to e366025 Compare October 22, 2022 13:34
@rmilecki
Copy link
Collaborator Author

@bosd: rebased & added warnings for parsing problems

@rmilecki
Copy link
Collaborator Author

@bosd: I see you rebased this again, my first thought was you want to merge or approve this. Can I ask if you had a chance to review those changes?

@bosd
Copy link
Collaborator

bosd commented Oct 23, 2022

@rmilecki I'm about to give my approval on this one.
My desire is to have the ability to parse multiple different blocks by the line parser.
propably better to have that in a separate pr, to keep small incemental updates. As this one is about parsing multiple similar blocks. Will make a separate pr, with the real invoice sample files, to show you the challenge I'm trying to attack.

Copy link
Collaborator

@bosd bosd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

So far lines parser was looking for only 1 block defined by "start" and
"end" RegEx-es. Some invoices may have lines of the same set in muliple
blocks. They can be separated by some random content or page footer &
header.

To support such cases use "start" and "end" to find as many blocks to
parse as possible.

This is (hopefully) cleanly implemented by:
1. Renaming parse() to parse_block() and making it work with a single
   block (already extracted from invoice content)
2. Making new parse() find blocks one by one

This feature has been requested as a way of dealing with some multi-page
invoices.

Signed-off-by: Rafał Miłecki <[email protected]>
@rmilecki rmilecki force-pushed the lines-multiple-blocks branch from 64783bd to ce652c5 Compare October 23, 2022 20:58
@rmilecki rmilecki merged commit 01dff0c into invoice-x:master Oct 23, 2022
@rmilecki
Copy link
Collaborator Author

Will make a separate pr, with the real invoice sample files, to show you the challenge I'm trying to attack.

Please do! Waiting for it!

@rmilecki rmilecki deleted the lines-multiple-blocks branch October 24, 2022 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants