
RAM usage #32

Open
gabrik opened this issue Dec 4, 2024 · 12 comments

Comments


gabrik commented Dec 4, 2024

Hi, first thank you for this crate.

I've noticed a memory usage issue when pushing logs to Loki using this crate.

I can see that the RAM continues to grow until the process gets killed due to OOM.
[Screenshot 2024-12-04 11:32:54: RAM usage growing]

while when not using it, RAM usage stays stable:
[Screenshot 2024-12-04 11:33:35: stable RAM usage]

I can see the same behaviour using the Instruments application on macOS:
[Screenshot 2024-12-04 11:34:19: Instruments allocations trace showing the same growth]

This is how I'm configuring it:

```rust
match (
    get_loki_endpoint(),
    get_loki_apikey(),
    get_loki_apikey_header(),
) {
    (Some(loki_url), Some(apikey), Some(header)) => {
        let builder = tracing_loki::builder();

        let builder = match metadata.as_object() {
            Some(metadata) => metadata
                .into_iter()
                .try_fold(builder, |builder, (k, v)| match v.as_str() {
                    Some(v) => builder.label(k, v),
                    None => Ok(builder),
                }),
            None => Ok(builder),
        }?;

        let (loki_layer, task) = builder
            .http_header(header, apikey)?
            .build_url(Url::parse(&loki_url)?)?;

        tracing_sub.with(loki_layer).init();
        tokio::spawn(task);
        return Ok(());
    }
    _ => {
        tracing::warn!("Missing one of the required header for Loki!")
    }
};
```
    
Any suggestions?

If you need the Allocations trace file, I can share it.

gabrik commented Dec 4, 2024

Maybe it comes from this line: https://github.com/hrxi/tracing-loki/blob/master/src/lib.rs#L513
as this queue seems to be just a couple of `Vec<LokiEvent>` that keep growing in size.

I see that here https://github.com/hrxi/tracing-loki/blob/master/src/lib.rs#L542 something should get dropped, but given the memory usage that does not seem to happen.
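
Not the crate's actual code, just a toy sketch of what I suspect is happening: a `Vec`-backed queue that is pushed faster than it is drained simply keeps growing.

```rust
fn main() {
    // Stand-in for the per-level event queues; not the real LokiEvent type.
    let mut queue: Vec<String> = Vec::new();
    for i in 0..500_000 {
        // Producer side: every log event gets pushed onto the queue.
        queue.push(format!("log event {i}"));
        // If draining only happens when an HTTP request is started, and the
        // request rate lags behind the push rate, both `queue.len()` and the
        // Vec's allocated capacity keep growing without bound.
    }
    println!("queued events: {}", queue.len());
}
```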


hrxi (Owner) commented Dec 4, 2024

Permalinks (press 'y' anywhere on GitHub):

`Some(Some(item)) => self.queues[item.level].push(item),`

`if drop_outstanding {`

Can you tell me a bit about your setup? How many log messages do you log per second? Do any logs make it to Loki?


gabrik commented Dec 4, 2024

Logs do reach Loki, and there I can see up to 50 logs per second.

Loki is in the cloud; latency to reach the server is around 100 ms.

By the way, I was digging into the code and I do not understand why the background task does not send the events directly instead of pushing them into two Vecs and starting a task per HTTP request.

It seems to batch some of those logs, but as a result it makes the whole architecture quite complex.


hrxi (Owner) commented Dec 4, 2024

> By the way, I was digging into the code and I do not understand why the background task does not send the events directly instead of pushing them into two Vecs and starting a task per HTTP request.
>
> It seems to batch some of those logs, but as a result it makes the whole architecture quite complex.

Yes, the idea is to have at most one outgoing HTTPS request at a time, and always push all available log messages when starting a new HTTPS request.
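
Roughly this, as a hand-wavy sketch (not the actual implementation; `send_batch` and the channel type here are placeholders):

```rust
use tokio::sync::mpsc;

async fn sender_loop(mut rx: mpsc::UnboundedReceiver<String>) {
    let mut buffer: Vec<String> = Vec::new();
    while let Some(event) = rx.recv().await {
        buffer.push(event);
        // Grab everything else that is already queued, without waiting.
        while let Ok(event) = rx.try_recv() {
            buffer.push(event);
        }
        // Send one batch; no new request is started until this one finishes,
        // so at most one HTTP request is outstanding at any time.
        send_batch(&buffer).await;
        buffer.clear();
    }
}

async fn send_batch(batch: &[String]) {
    // Placeholder for the actual HTTP POST to Loki.
    let _ = batch;
}
```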


gabrik commented Dec 4, 2024

I tried to change the code to push the log line directly after the `receiver.next()` and the RAM usage has been reduced significantly.

From almost 100 MB in 15 minutes to 15 MB in 15 minutes, and in general it seems very stable.

I'm wondering if the `BackgroundTask` should run periodically, emptying the channel and sending everything at once.
There would still be at most one outstanding request, with all the events that were in the channel.

Of course the period should be configurable.


gabrik commented Dec 4, 2024

Here's some pseudocode:

```rust
loop {
    sleep(period);
    let events = receiver.drain();
    let http_request = make_request(events);
    http_client.send(http_request);
}
```

Still at most one outstanding request, but memory should be bounded by the channel size (which should also be configurable).
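
A slightly more concrete sketch of that idea, assuming a tokio bounded channel (`send_batch` is just a placeholder, not this crate's API):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

async fn periodic_sender(mut rx: mpsc::Receiver<String>, period: Duration) {
    let mut interval = tokio::time::interval(period);
    loop {
        interval.tick().await;
        // Drain everything currently buffered in the channel.
        let mut batch = Vec::new();
        while let Ok(event) = rx.try_recv() {
            batch.push(event);
        }
        if !batch.is_empty() {
            // At most one request in flight, since we await it here.
            send_batch(&batch).await;
        }
    }
}

async fn send_batch(batch: &[String]) {
    // Placeholder for the actual HTTP POST to Loki.
    let _ = batch;
}
```

The channel capacity bounds memory, and the period bounds how long a log line can sit in the buffer before being sent.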


gabrik commented Dec 4, 2024

@hrxi I made the changes in #33; we can discuss them there.
To me it seems like a nice improvement.


ammario commented Dec 6, 2024

FWIW, I started using this crate today and experienced similar uncontrolled memory growth, although I haven't yet proven the connection.


gabrik commented Dec 6, 2024

@ammario do you mind trying the fix I made in #33 and checking whether it improves things in your case as well?

It is enough to add this to your Cargo.toml:

```toml
[patch.crates-io]
tracing-loki = { git = "https://github.com/gabrik/tracing-loki", branch = "fix/ram-usage" }
```


hrxi (Owner) commented Dec 9, 2024

> Here's some pseudocode:
>
> ```rust
> loop {
>     sleep(period);
>     let events = receiver.drain();
>     let http_request = make_request(events);
>     http_client.send(http_request);
> }
> ```

This seems to add some (IMO) unnecessary sleeping that delays log events even when nothing else is happening. I don't quite like introducing artificial latency.

I'll take a look at your PR.


ammario commented Dec 11, 2024

> @ammario do you mind trying the fix I made in #33 and checking whether it improves things in your case as well?
>
> It is enough to add this to your Cargo.toml:
>
> ```toml
> [patch.crates-io]
> tracing-loki = { git = "https://github.com/gabrik/tracing-loki", branch = "fix/ram-usage" }
> ```

Been running it for a few days at some solid volumes (~10K TPM) without any issues. Thank you.


gabrik commented Dec 12, 2024

@hrxi this does indeed seem to solve the issue. In the PR I made the period configurable anyway, and we have also been using it internally for a week now.
Would it be possible to get a review on the PR?
