I would like to propose a data cleansing / sanitization step after matching, as commented in #106 (comment).

Use case:
I would like to match a Netherlands VAT number.
Format: 'NL' + 9 digits + 'B' + 2-digit company index, e.g. NL999999999B01
Which translates to a regex like (NL\d{9}B\d{2})\s

Input string from the OCR'd PDF:
VAT NUMBER NL.999,999.999,B01

We get the data, but it includes . and , so the previously mentioned regex won't match 😞
Capturing something like that would need a regex that also allows the . and , separators, or maybe multiple capturing groups without the . and , together with group: join

As writing templates is very hard, I prefer to make it as easy as possible.
The ideal regex template for the input string is:
regex: VAT NUMBER\s+(\S+)
which results in vat: ['NL.999,999.999,B01']
and then a sanitization function strips out the unwanted characters.
As we know the value of the VAT number should only contain letters and digits, we can remove everything else:
re.sub(r'\W+', '', vat)
which results in vat: ['NL999999999B01']

What would be the best way to implement this in code?

fields:
  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s
    type: str
    # 1. Make replace function available on field level
    replace: ['\W+', '']
    # 2. Make a new sanitize option
    sanitize: any_word_character

Option 1 is still not easy to include in a template, but it is very powerful and flexible.
Option 2 is easier to include in the template.

This would be a very cool feature! Please also add it to the different plugins like 'tables' and 'lines'. Because suppliers use different names for units, I want to be able to replace their units with the names I use in my system. This also means that you need a list of possible replacements. Below is part of the Farnell template with the replace added.
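As a rough illustration of the two options proposed in this issue (this is not the Farnell template mentioned above), the post-match step could look something like the sketch below. The function name `sanitize_field`, the `SANITIZE_PRESETS` table, and the idea of passing the field settings as a plain dict are all assumptions made for the example, not existing invoice2data API. Only the `re` calls and the `replace` / `sanitize` keys come from the discussion.

```python
import re

# Hypothetical presets for the proposed `sanitize` option (option 2).
# The preset name is taken from the issue; the table itself is an assumption.
SANITIZE_PRESETS = {
    "any_word_character": (r"\W+", ""),  # keeps letters, digits and "_"
}


def sanitize_field(values, settings):
    """Apply the proposed per-field `replace` / `sanitize` settings to matches."""
    rules = []
    replace = settings.get("replace")
    if replace:
        # Accept one [pattern, repl] pair or a list of such pairs.
        if isinstance(replace[0], str):
            replace = [replace]
        rules.extend((pattern, repl) for pattern, repl in replace)
    preset = settings.get("sanitize")
    if preset:
        rules.append(SANITIZE_PRESETS[preset])

    cleaned = []
    for value in values:
        for pattern, repl in rules:
            value = re.sub(pattern, repl, value)
        cleaned.append(value)
    return cleaned


text = "VAT NUMBER NL.999,999.999,B01"

# The strict regex does not match the OCR'd text ...
assert re.search(r"(NL\d{9}B\d{2})\s", text) is None

# ... but the loose regex does, and the raw match is cleaned afterwards.
raw = re.findall(r"VAT NUMBER\s+(\S+)", text)
print(raw)                                                      # ['NL.999,999.999,B01']
print(sanitize_field(raw, {"replace": [r"\W+", ""]}))           # ['NL999999999B01']
print(sanitize_field(raw, {"sanitize": "any_word_character"}))  # ['NL999999999B01']

# A list of pairs would also cover the unit-renaming case from the comment
# (the unit names here are made up for the example).
print(sanitize_field(["pcs", "Each"], {"replace": [[r"(?i)^(pcs|each)$", "PCE"]]}))
```

Keeping both options behind one function means a template author could start with a named preset and fall back to an explicit pattern only when no preset fits. Whether the step should run before or after type conversion is left open here.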