
Memory leak while unlink! multithreaded !? #204

Open
Para7etamol opened this issue Nov 29, 2024 · 4 comments

Comments

@Para7etamol

Hi there,

I think there's a problem using unlink! in a multithreaded context. The real case is reading and modifying relatively big XML files (hundreds of megabytes); after a while I run out of memory and the process crashes. One can catch a glimpse of the problem (though I'm not sure it is the same one) by adding this test:

using EzXML, Test  # already loaded when this runs inside the package's test suite

@testset "multithreaded unlink!" begin
    valid_file = joinpath(dirname(@__FILE__), "sample1.xml")
    @info "free mem before readxml: $(Sys.free_memory() / 10^9)"
    # First pass: each iteration parses its own document and unlinks one node.
    @time begin
        Threads.@threads for i in 1:5_000_000
            doc = readxml(valid_file)
            body_nodes = findall("/story/body", doc)
            EzXML.unlink!(body_nodes[1])
        end
    end
    @info "free mem before GC.gc(): $(Sys.free_memory() / 10^9)"
    GC.gc()
    @info "free mem after GC.gc(): $(Sys.free_memory() / 10^9)"

    # Second pass: identical work, to see whether memory keeps shrinking.
    @time begin
        Threads.@threads for i in 1:5_000_000
            doc = readxml(valid_file)
            body_nodes = findall("/story/body", doc)
            EzXML.unlink!(body_nodes[1])
        end
    end
    @info "free mem before GC.gc(): $(Sys.free_memory() / 10^9)"
    GC.gc()
    @info "free mem after GC.gc(): $(Sys.free_memory() / 10^9)"
end

It seems to lose memory (~20 GB) in the first loop, but does not consume more memory in the second loop.

[ Info: free mem before readxml: 123.12614912
  9.797982 seconds (20.38 M allocations: 941.318 MiB, 0.31% gc time, 50.85% compilation time)
[ Info: free mem before GC.gc(): 102.463021056
[ Info: free mem after GC.gc(): 102.915862528
  9.685490 seconds (20.02 M allocations: 917.150 MiB, 3.86% compilation time)
[ Info: free mem before GC.gc(): 101.926653952
[ Info: free mem after GC.gc(): 102.736408576

Without Threads.@threads it's different:

[ Info: free mem before readxml: 123.22193408
106.633343 seconds (20.00 M allocations: 915.562 MiB, 0.22% gc time, 0.01% compilation time)
[ Info: free mem before GC.gc(): 121.620242432
[ Info: free mem after GC.gc(): 121.569968128
110.108763 seconds (20.00 M allocations: 915.527 MiB, 0.23% gc time)
[ Info: free mem before GC.gc(): 120.292847616
[ Info: free mem after GC.gc(): 121.1896832

If I finalize the unlinked node after unlinking, free memory stays almost stable in both cases (although it's very slow of course).
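A minimal sketch of that workaround, assuming the node is never touched again after it is released. The helper name `unlink_and_release!` is hypothetical and not part of EzXML; `finalize` is the Base function that runs an object's finalizers immediately. Whether this is a sound general fix is not established in this thread; it only mirrors the experiment described above.

```julia
using EzXML

# Hypothetical helper (not part of EzXML): detach a node, then release it
# eagerly instead of waiting for the garbage collector.
function unlink_and_release!(node::EzXML.Node)
    unlink!(node)    # detach the node from its document
    finalize(node)   # run the node's finalizer now; do not use `node` afterwards
    return nothing
end

doc = parsexml("<story><body>text</body></story>")
unlink_and_release!(findfirst("/story/body", doc))
```

The document itself stays valid; only the detached subtree is released.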

Does unlink! require certain follow-up actions?

Please help :-)

Para

Para7etamol changed the title from "Memory leak while unlink! multihreaded !?" to "Memory leak while unlink! multithreaded !?" on Nov 29, 2024
@nhz2
Member

nhz2 commented Nov 29, 2024

Yes, this package is not threadsafe.

@Para7etamol
Author

Completely? libxml2 itself supports threads as long as different threads operate on different documents:
https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/threads.html
(Sadly my MWE above does not fulfill this condition, but my production code does.)

Concerning the second condition in the above link:
"call xmlInitParser() in the "main" thread before using any of the libxml API (except possibly selecting a different memory allocator)"
I didn't find a call to xmlInitParser in EzXML.jl. Is that the reason it's not threadsafe? Couldn't this be fixed?

Or is it the Julia code in this package that is not threadsafe?

@nhz2
Member

nhz2 commented Dec 1, 2024

Yes, this package can be made threadsafe, but doing so will require significant work. To start, IIUC, the build script needs to be updated to "configure the library accordingly using the --with-threads options"
https://github.com/JuliaPackaging/Yggdrasil/blob/8141aa3972694d57a0db7e1d4bfb63610ac34a3e/X/XML2/build_tarballs.jl

Another option is to use multiprocessing instead of multithreading with a library like https://github.com/JuliaPluto/Malt.jl
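As a rough illustration of the multiprocessing route, here is a sketch that uses Julia's standard `Distributed` library as a stand-in; Malt.jl has its own API, which is not shown here. The helper `strip_body` and the inline sample XML are made up for the example, and it assumes EzXML is installed in the environment the worker processes start in.

```julia
using Distributed

addprocs(2)               # two worker processes; adjust to the machine
@everywhere using EzXML   # every process loads its own copy of libxml2

# Hypothetical helper: parse, drop /story/body, report how many child
# elements remain. It returns a plain Int because EzXML nodes wrap C
# pointers and cannot be sent to another process.
@everywhere function strip_body(xml::AbstractString)
    doc = parsexml(xml)
    body = findfirst("/story/body", doc)
    body === nothing || unlink!(body)
    return countelements(root(doc))
end

inputs = fill("<story><body>text</body></story>", 4)
results = pmap(strip_body, inputs)   # each input is handled by some worker
rmprocs(workers())
```

Because each worker is a separate process with its own heap, no libxml2 state is shared between them; the cost is that inputs and results have to be serialized across process boundaries.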

You can also try https://github.com/JuliaComputing/XML.jl

I don't have the bandwidth to work on this right now, but any PRs to document good workarounds or fix multithreading issues are greatly appreciated.

@Para7etamol
Author

Para7etamol commented Dec 1, 2024

Apparently --with-threads is already the default in libxml2. See https://github.com/GNOME/libxml2, which lists:
--with-threads multithreading support (on)

So the only action needed may be calling xmlInitParser() in __init__().
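What that could look like, as an untested sketch: the module name `InitSketch` is hypothetical, and it assumes `XML2_jll` provides the same libxml2 binary that EzXML loads. `xmlInitParser` takes no arguments and returns nothing in libxml2's C API. Whether this call alone is enough to make the package threadsafe is an open question here.

```julia
module InitSketch  # hypothetical module name, for illustration only

using XML2_jll: libxml2  # assumption: the libxml2 binary that EzXML also uses

# Julia calls `__init__` once, right after the module has been loaded.
function __init__()
    # C signature: void xmlInitParser(void);
    ccall((:xmlInitParser, libxml2), Cvoid, ())
    return nothing
end

end # module
```

In EzXML itself the call would go into the package's own `__init__` rather than into a separate module.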

I'm going to try that, although I have no experience calling C from Julia.

Thanks for the pointer to Malt!
