Update HTML tag parsing to work with Acunetix 360 #40
base: main
lib/acunetix/concerns/cleanup.rb
Outdated
result.gsub!(/<h[0-9] >(.*?)<\/h[0-9]>/) { "\n\n*#{$1.strip}*\n\n" }
result.gsub!(/<b>(.*?)<\/b>/) { "*#{$1.strip}*" }
result.gsub!(/<br\/>/, "\n")
result.gsub!(/<br\/>|<br \/>/, "\n")
Can you simplify this with an optional space?
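One way to read this suggestion, as a minimal sketch: a `?` after the space makes it optional, so a single pattern covers both `<br/>` and `<br />` without alternation. The helper name `normalize_breaks` is hypothetical and only wraps the substitution so it can be run on its own; in the PR the call would stay a `result.gsub!` line.

```ruby
# Hypothetical helper (not part of the PR) wrapping the simplified pattern.
def normalize_breaks(text)
  # " ?" makes the space before the slash optional, so this one pattern
  # matches both "<br/>" and "<br />".
  text.gsub(/<br ?\/>/, "\n")
end

puts normalize_breaks("first<br/>second<br />third")
```

Note that this only tolerates zero or one space; `\s*` would be the looser choice if the scanner output ever contains more whitespace.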
lib/acunetix/concerns/cleanup.rb
Outdated
result.gsub!(/<span.*?>(.*?)<\/span>/m){"#{$1.strip}"}
result.gsub!(/<span.*?>|<\/span>/, '') #repeating again to deal with nested/empty/incomplete span tags

result.gsub!(/<a (.*?)href='(.*?)'><i(.*?)><\/i>(.*?)<\/a>/m) { "\"#{$4}\":#{$2}" }
Similarly, this one should cover the next one (L19) so we don't need multiple regexes to parse the same tag.
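A sketch of how the two link patterns might be merged, assuming the second pattern (not shown in this thread) handles plain `<a href='…'>text</a>` links without the icon: wrapping the `<i …></i>` part in an optional non-capturing group lets one regex handle both shapes. The helper name `textile_links` is hypothetical, and the single-quoted `href` is taken from the pattern in the diff.

```ruby
# Hypothetical helper (not part of the PR): one pattern for links with or
# without a leading <i ...></i> icon element.
def textile_links(text)
  pattern = /<a [^>]*?href='(.*?)'[^>]*>(?:<i\b[^>]*><\/i>)?(.*?)<\/a>/m
  text.gsub(pattern) do
    # Group 1 is the URL, group 2 is the link text; emit a Textile link.
    "\"#{Regexp.last_match(2)}\":#{Regexp.last_match(1)}"
  end
end

puts textile_links("<a target='_blank' href='https://example.com'><i class='icon'></i>Example</a>")
puts textile_links("<a href='https://example.com'>Example</a>")
```

Using `[^>]*` instead of `(.*?)` for the attribute filler also drops two capture groups, so the replacement no longer needs to skip to `$2` and `$4`.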
lib/acunetix/concerns/cleanup.rb
Outdated
result.gsub!(/<strong>(.*?)<\/strong>/) { "*#{$1.strip}*" }
result.gsub!(/<span.*?>(.*?)<\/span>/m){"#{$1.strip}\n"}
# Cleanup lingering <p></p>
@rachkor is this really an issue? We have this and the L16 "cleanup" lines. Is the code so bad that they include random <span> and <p> tags all over the place? It seems we're doing something wrong with our parsing.
I believe this will only catch the nested <p> tags, and it doesn't seem like there are any in the sample files. Removed ✅
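For the nested-tag concern raised in this thread, one possible alternative to stacking several "cleanup" passes is to unwrap only the innermost tag and repeat until nothing changes. This is a sketch for `<span>`, not the approach the PR takes; the helper name `unwrap_spans` is hypothetical. It relies on `gsub!` returning `nil` when no substitution was made.

```ruby
# Hypothetical helper (not part of the PR): repeatedly unwrap innermost spans.
def unwrap_spans(text)
  result = text.dup
  # The negative lookahead keeps the body from containing another opening
  # <span>, so each pass only matches spans with no spans inside them.
  innermost = /<span\b[^>]*>((?:(?!<span\b).)*?)<\/span>/m
  loop do
    # gsub! returns nil once no span pair is left, which ends the loop.
    break unless result.gsub!(innermost) { Regexp.last_match(1).strip }
  end
  result
end

puts unwrap_spans("<span class='a'><span>inner</span> outer</span>")
```

Unmatched opening or closing tags are left in place by this sketch, so incomplete markup would still need a separate rule.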
Summary
The following fields in Acunetix 360 Issues had a ton of HTML tags in the output but our parser wasn't removing them:
This PR resolves these 3 fields and adds new cleanup for <em> tags and links with <i> tags.