Memory leak (even after webclient.close()) #729

Open
fleboulch opened this issue Feb 19, 2024 · 14 comments

@fleboulch

Hello,

I want to thank you for your amazing work. I have been using your library for almost a year now and it's really nice.

I'm having an issue with heap memory.

Showcase 1: starting my app without doing any scraping

Heap: 74 MB

Showcase 2: starting my app and doing one scrape, followed by close

Heap: 256 MB

import org.htmlunit.SilentCssErrorHandler
import org.htmlunit.WebClient
import org.htmlunit.html.HtmlElement
import org.htmlunit.html.HtmlPage
import org.htmlunit.javascript.SilentJavaScriptErrorListener

class ArkeaArenaFetcher {

    fun fetch(): List<EventJpa> {
        // A fresh WebClient per fetch, with CSS and JavaScript enabled and their errors silenced
        val webClient = WebClient().apply {
            options.isCssEnabled = true
            options.isJavaScriptEnabled = true
            cssErrorHandler = SilentCssErrorHandler()
            javaScriptErrorListener = SilentJavaScriptErrorListener()
            options.isThrowExceptionOnFailingStatusCode = false
        }

        return try {
            val page = webClient.getPage<HtmlPage>("https://www.arkeaarena.com/fr/programmation/tous-les-evenements/#")
            // Wait up to 4 seconds for background JavaScript to finish
            webClient.waitForBackgroundJavaScript(4000)
            val container: HtmlElement = page.getFirstByXPath("//div[@class='events-list ajaxed']/div[@class='container']")
            val rawEvents = container.getByXPath<HtmlElement>("a")
            rawEvents.map(::htmlToInfra)
        } catch (e: Exception) {
            emptyList()
        } finally {
            webClient.close()
        }
    }

    private fun htmlToInfra(html: HtmlElement): EventJpa {
        // convert html to Kotlin object
        ...
    }
}

Showcase 3: starting my app and doing one scrape, followed by close + additional cleanup + gc

Heap: 166 MB

The code is the same as in showcase 2; only the finally clause changes, as shown below:

        finally {
            webClient.cache.clear()
            webClient.topLevelWindows.forEach { it.close(false) }
            webClient.topLevelWindows.forEach { it.jobManager.removeAllJobs() }
            webClient.cookieManager.clearCookies()
            webClient.close()
            System.gc()
        }

The issue here is even when I'm closing the webclient instance there is still memory which is not released. Here in my example code I'm dealing with a single source but in production I'm dealing with multiple sources.

I also tried:

  • .use in Kotlin (the equivalent of try-with-resources)
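
For readers unfamiliar with `.use`, here is a minimal sketch of what it does. It assumes a hypothetical `FakeClient` class as a stand-in for HtmlUnit's `WebClient` (which also implements `AutoCloseable`); `FakeClient`, `load` and `fetchWithUse` are illustration names, not part of HtmlUnit. It only shows that `close()` is called when the block ends; it says nothing about what `close()` releases.

```kotlin
// Hypothetical stand-in for a closeable client; NOT an HtmlUnit class.
class FakeClient : AutoCloseable {
    var closed = false
        private set

    // Pretend to load a page; refuses to work once the client is closed.
    fun load(url: String): String {
        check(!closed) { "client already closed" }
        return "<html>$url</html>"
    }

    override fun close() {
        closed = true
    }
}

// `use` calls close() when the lambda returns or throws,
// which is the same guarantee as try { ... } finally { client.close() }.
fun fetchWithUse(client: FakeClient, url: String): String =
    client.use { it.load(url) }

fun main() {
    val client = FakeClient()
    println(fetchWithUse(client, "https://example.org"))
    println("closed after use: ${client.closed}")
}
```

Since `.use` and an explicit `finally { webClient.close() }` end up calling the same `close()`, switching between them would not be expected to change how much memory is retained.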

Other info

  • Language: Kotlin
  • HtmlUnit version: 3.9.0 (the behaviour is the same on older versions)

Article read about the memory subject:

Similar issues

@rbri
Member

rbri commented Feb 19, 2024

@fleboulch - first of all - great to see that this is of some use for you; thanks for the feedback

Can you please add

webClient.cookieManager.clearCookies()

to your second case, because this is not part of the close process.

And can you please try HtmlUnit 3.11.0?

@rbri
Member

rbri commented Feb 19, 2024

> The issue here is even when I'm closing the webclient instance there is still memory which is not released. Here in my example code I'm dealing with a single source but in production I'm dealing with multiple sources.

I think a lot of things are created and stored, but the point is: if you create a webClient several times and do some scraping, then after closing the client the memory should go back to the level it had after the first round.

@fleboulch
Author

I would like to use version 3.11.0, but my test suite has been failing since 3.10.0.
I added a comment here

@fleboulch
Author

Yes, you are correct! Even with a single webclient instance the memory rises quite fast, and in production I don't have a large setup (1 GB of memory).

@fleboulch
Author

fleboulch commented Feb 19, 2024

I found a second issue when trying to migrate from 3.9.0 to 3.11.0 (comment).
The issue was introduced in 3.10.0.

@fleboulch
Author

Hello @rbri,

I see you are preparing a 4.0.0 version. That's great news!
Did you have time to check the regressions I mentioned in my comments here?

@fleboulch
Author

fleboulch commented Apr 8, 2024

I tried v4.0.0 and the regressions I mentioned earlier have disappeared!
Thanks for the amazing work @rbri 🎉
Nevertheless, I still have my original memory leak issue (I tried some of the things you suggested above, but they did not help).

@rbri
Member

rbri commented May 12, 2024

@fleboulch sorry for the long pause

> The issue here is even when I'm closing the webclient instance there is still memory which is not released. Here in my example code I'm dealing with a single source but in production I'm dealing with multiple sources.

There are some internal (class based) caches that might be the reason.

I think a valid test scenario looks like this:

  • warmup
    • create a new web client and open URL A
    • call waitForBackgroundJavaScript for at least a few seconds
    • close the web client
    • have a look at the memory
  • retry A
    • if you now open A again and again (new webclient, open A, close client), there should be no major change in the memory consumed after the client is closed (and maybe a gc performed)
    • this should be the case at least if there are not that many changes on the page itself
  • try B
    • if you now open another web page in another client, this might lead to another memory increase because more js classes are loaded or there is more css to parse
  • retry B
    • this should not allocate significantly more memory

So much for the theory; I will try to find some time to check the code again.
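
The warmup/retry scenario above could be driven by a small harness along these lines. This is only a sketch under stated assumptions: `usedHeapMbAfterGc` and `measureRounds` are hypothetical helper names, the workload in `main` is a stand-in allocation rather than a real scrape, and `System.gc()` is only a hint to the JVM, so the numbers are indicative rather than exact.

```kotlin
// Hypothetical helpers for the warmup / retry scenario; not part of HtmlUnit.

// Requests a GC (a hint only), waits briefly, then returns the used heap in MB.
fun usedHeapMbAfterGc(): Long {
    val runtime = Runtime.getRuntime()
    System.gc()
    Thread.sleep(200)
    return (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
}

// Runs `scrape` the given number of times and records the used heap after each round.
fun measureRounds(label: String, rounds: Int, scrape: () -> Unit): List<Long> =
    (1..rounds).map { round ->
        scrape()
        usedHeapMbAfterGc().also { println("$label round $round: $it MB used after gc") }
    }

fun main() {
    // Stand-in workload so the sketch runs without HtmlUnit on the classpath.
    // With HtmlUnit, a round could instead look like this (urlA is a placeholder):
    //   WebClient().use { client ->
    //       client.getPage<HtmlPage>(urlA)
    //       client.waitForBackgroundJavaScript(4000)
    //   }
    val scrapeA: () -> Unit = { ByteArray(8 * 1024 * 1024).fill(1) }

    measureRounds("warmup A", 1, scrapeA)
    measureRounds("retry A", 3, scrapeA)
}
```

If the theory holds, the values printed for the retry rounds should stay close to the value after the warmup round, and the same pattern should repeat when a second workload for page B is added.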

@fleboulch
Author

Thanks for your reply, @rbri! I really appreciate your in-depth investigation.
I will try your scenario on my code to check whether your assumptions hold.
I'm using different webclients because at the beginning I was parallelizing the calls.

@fleboulch
Author

fleboulch commented May 13, 2024

I checked your comment and it seems correct!
In my app I need to scrape multiple sites/external sources and I don't need any cache mechanism (even less so after a close). I'm scraping these websites once a day and currently the memory used stays high.
What are your recommendations for my use case?

@fleboulch
Author

Hello @rbri,
Do you have any news about this issue?

@rbri
Member

rbri commented Dec 1, 2024

@fleboulch sorry, far too many things on my desk.
I will try to think a bit more about this over the next few days. Sorry.

@fleboulch
Author

Happy new year! I will try to dig deeper into this point, but I'm not sure I will succeed.

@rbri
Member

rbri commented Jan 3, 2025

@fleboulch great, there is already some interesting stuff in the rhino queue to reduce the memory footprint a bit (mozilla/rhino#1783)
