Potential for URI parsing performance improvement. #151

Closed
samoconnor opened this issue Jan 19, 2018 · 3 comments

Comments

@samoconnor
Contributor

HTTP.jl uses the http_parser_parse_url function to parse URLs.

function http_parser_parse_url(url::String)

I believe this code is based on ngx_http_parse.c from NGINX. @quinnj is that right?

I recently added some more URI parsing tests based on https://github.com/cweb/url-testing/blob/master/urls.json and, while debugging, wrote a simple regex pattern based on the regex from RFC 3986.
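
For reference, the regex given in RFC 3986 (Appendix B) can be tried directly in Julia. This is only a sketch of the starting point, not the HTTP.jl pattern; the names `rfc3986_regex` and `rfc3986_split` are made up for illustration, and it assumes a Julia version with named tuples (0.7 or later).

```julia
# Regex from RFC 3986, Appendix B (not the HTTP.jl pattern).
const rfc3986_regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# Return the five components that the RFC assigns to groups 2, 4, 5, 7 and 9.
# A group that did not take part in the match comes back as `nothing`.
function rfc3986_split(str::AbstractString)
    m = match(rfc3986_regex, str)   # every part is optional, so this always matches
    return (scheme = m[2], authority = m[4], path = m[5],
            query = m[7], fragment = m[9])
end

# Example URI taken from RFC 3986, Appendix B.
parts = rfc3986_split("http://www.ics.uci.edu/pub/ietf/uri/#Related")
println(parts)
```

The RFC pattern leaves the authority as one piece; the pattern in URIs.jl additionally splits it into userinfo, host and port.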

It turns out that the simple regex parser is faster than http_parser_parse_url.

Running test/uri_benchmark.jl shows that the regex parser runs in 47% of the time taken by http_parser_parse_url:

  3.058562 seconds (19.64 M allocations: 748.444 MiB, 2.00% gc time)
http_parser_parse_url parsed 204 urls 10000 times in 3059.0 ms
  1.436758 seconds (18.69 M allocations: 1.159 GiB, 6.28% gc time)
regex_parse parsed 204 urls 10000 times in 1437.0 ms (47.0%)
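
A rough sketch of how timings like these could be collected and how the percentage in parentheses is derived. This is not the contents of test/uri_benchmark.jl; `time_parser` and `relative_percent` are hypothetical names, and the two URLs and the stand-in parser are placeholders.

```julia
# Hypothetical harness: time `parse_fn` over `urls`, `n` times, in milliseconds.
function time_parser(parse_fn, urls, n)
    parse_fn(first(urls))             # warm up so JIT compilation is not timed
    t = @elapsed begin
        for _ in 1:n, u in urls
            parse_fn(u)
        end
    end
    return 1000 * t
end

# Relative cost as a percentage of the baseline, rounded to one decimal place.
relative_percent(new_ms, old_ms) = round(100 * new_ms / old_ms, digits = 1)

urls = ["http://example.com/a?b=c#d", "https://[::1]:8080/"]
ms = time_parser(u -> match(r"^([^:/?#]+):", u), urls, 1000)   # stand-in parser
println("parsed $(length(urls)) urls 1000 times in $(round(ms, digits = 1)) ms")

# The figures reported above: 1437.0 ms against 3059.0 ms.
println(relative_percent(1437.0, 3059.0), "%")
```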

The regex parser is in URIs.jl here:

HTTP.jl/src/URIs.jl

Lines 101 to 121 in 6ee7083

const uri_reference_regex =
r"""^
(?: ([^:/?#]+) :) ? # 1. scheme
(?: // (?: ([^/?#@]*) @) ? # 2. userinfo
(?| (?: \[ ([^\]]+) \] ) # 3. host (ipv6)
| ([^:/?#\[]*) ) # 3. host
(?: : ([^/?#]+) ) ? ) ? # 4. port
([^?#]*) # 5. path
(?: \?([^#]*) ) ? # 6. query
(?: [#](.*) ) ? # 7. fragment
$"""x
const empty = SubString("", 1, 0)
function regex_parse(::Type{URI}, str::AbstractString)
    m = match(uri_reference_regex, str)
    if m == nothing
        return emptyuri
    end
    return URI(str, (c = m[1]) == nothing ? empty : c,
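
Since the excerpt above is cut off, here is a self-contained sketch of the same idea that can be run on its own. It reuses the quoted pattern but returns a plain named tuple instead of constructing a `URI`; `uri_regex_sketch` and `parse_sketch` are hypothetical names, and how the real code fills in the `URI` fields is not shown here.

```julia
# Same pattern as quoted above; `(?|` is a branch reset, so both host
# alternatives are captured as group 3.
const uri_regex_sketch = r"""^
    (?: ([^:/?#]+) :) ?                 # 1. scheme
    (?: // (?: ([^/?#@]*) @) ?          # 2. userinfo
           (?| (?: \[ ([^\]]+) \] )     # 3. host (ipv6)
             | ([^:/?#\[]*) )           # 3. host
           (?: : ([^/?#]+) ) ? ) ?      # 4. port
    ([^?#]*)                            # 5. path
    (?: \?([^#]*) ) ?                   # 6. query
    (?: [#](.*) ) ?                     # 7. fragment
    $"""x

# Hypothetical helper: return the seven captures as a named tuple, using ""
# for groups that did not participate, or `nothing` if the pattern fails.
function parse_sketch(str::AbstractString)
    m = match(uri_regex_sketch, str)
    m === nothing && return nothing
    part(i) = m[i] === nothing ? "" : String(m[i])
    return (scheme = part(1), userinfo = part(2), host = part(3),
            port = part(4), path = part(5), query = part(6), fragment = part(7))
end

u = parse_sketch("http://user@example.com:8080/p?q=1#f")
v = parse_sketch("https://[::1]:8080/")
println(u)
println(v)
```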

@samoconnor
Contributor Author

Hacking the benchmark script to use a pre-#135 version of HTTP.jl, and removing a few Unicode URLs that the old HTTP.jl could not handle, produces a very similar result:

  3.003586 seconds (48.00 M allocations: 2.367 GiB, 10.48% gc time)
http_parser_parse_url parsed 192 urls 10000 times in 3004.0 ms
  1.449821 seconds (17.56 M allocations: 1.090 GiB, 6.50% gc time)
regex_parse parsed 192 urls 10000 times in 1450.0 ms (48.3%)

So, this doesn't look like a regression caused by #135.

The only advantage I can see in the slower http_parser_parse_url function is that its error messages say "encountered invalid url character" and tell you which character was unexpected, whereas I assume the regex will simply not capture any characters for groups that don't match.
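
To illustrate the trade-off, the sketch below shows a permissive regex accepting a URL that contains a space, alongside a separate scan that reports the offending character. It uses the RFC 3986 Appendix B regex as a stand-in rather than the HTTP.jl pattern, and the definition of an "invalid" character (whitespace or control character) is an assumption made purely for the example; `first_invalid_char` is a hypothetical helper.

```julia
# RFC 3986 Appendix B regex, used here only as a stand-in for a regex parser.
const permissive_regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# Assumption for illustration: whitespace and control characters are invalid.
is_invalid_url_char(c::Char) = isspace(c) || iscntrl(c)

# Return (index, char) of the first invalid character, or `nothing` if none.
function first_invalid_char(str::AbstractString)
    i = findfirst(is_invalid_url_char, str)
    return i === nothing ? nothing : (i, str[i])
end

url = "http://exa mple.com/path"
m = match(permissive_regex, url)
println(m[4])                      # the regex still captures an authority
println(first_invalid_char(url))   # the separate scan locates the space
```

A check like this could run only when a caller asks for diagnostics, which would keep the common parsing path as cheap as the regex match.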

samoconnor added a commit that referenced this issue Jan 23, 2018
 - Remove args -> string -> parse -> URI round-trip from constructors & merge()
 - Use parse_uri_reference() instead of slower http_parser_parse_url()
@samoconnor
Contributor Author

With this change, URI parsing is about 2x faster again: 0aef98a

  3.130020 seconds (19.64 M allocations: 748.444 MiB, 2.20% gc time)
http_parser_parse_url parsed 204 urls 10000 times in 3130.0 ms
  0.640859 seconds (10.53 M allocations: 470.428 MiB, 5.82% gc time)
regex_parse parsed 204 urls 10000 times in 641.0 ms (20.5%)

@samoconnor
Contributor Author

Something in the latest v0.7 has made the old http_parser_parse_url parser even slower.
But the new code is still fast:

188.405103 seconds (152.43 M allocations: 21.468 GiB, 1.07% gc time)
http_parser_parse_url parsed 204 urls 10000 times in 188397.0 ms
  0.778016 seconds (10.53 M allocations: 470.428 MiB, 6.09% gc time)
regex_parse parsed 204 urls 10000 times in 778.0 ms (0.4%)
