
Avoid generating JSON with duplicate keys #721

Open
nhmarujo opened this issue Dec 20, 2021 · 7 comments

Comments

@nhmarujo

Describe the bug
Right now, if someone adds a logstash marker with the same name as one of the default fields, the encoder will generate JSON containing both keys, which doesn't comply with JSON rules (duplicate keys are not allowed).

To Reproduce
Create a log line with a logstash marker with a name you know will collide with a default one and then observe the generated JSON. Example:

log.info(append("message", "This will generate another key 'message' on the JSON"), "This will be added on 'message' by default");
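For illustration, the serialized event would then look roughly like this (the exact set of default fields and the relative order of the two entries depend on the encoder configuration):

```json
{
  "@timestamp": "2021-12-20T10:00:00.000Z",
  "message": "This will be added on 'message' by default",
  "message": "This will generate another key 'message' on the JSON"
}
```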

Expected behavior
Invalid JSON (with duplicate keys) will cause log ingestion issues with many systems (we had this with fluentbit). I would expect that, when constructing the JSON, any fields that would result in duplicate keys be dropped (only the first one added would be included). Additionally, I would suggest printing some sort of warning to draw people's attention to the mapping mistake they introduced, so they can fix it ASAP.

  • logstash-logback-encoder 7.0.1
  • logback 1.2.7
  • jackson 2.13.0
  • java 17
@nhmarujo
Author

nhmarujo commented Dec 21, 2021

Worth checking whether the same won't happen with a field added automatically from the MDC, for instance. Bottom line: this applies to any field being added.

@brenuart
Collaborator

Hi Nuno,

Glad you raised this issue... funny enough we were debating about this exact same issue last week with @philsttr and decided to wait until someone complains about it ;-)

Jackson already comes with a similar feature: JsonGenerator.Feature#STRICT_DUPLICATE_DETECTION. This feature is disabled by default but can easily be activated with a FeatureJsonGeneratorDecorator as described in https://github.com/logfellow/logstash-logback-encoder#customizing-json-factory-and-generator. When this feature is enabled, Jackson throws an exception when attempting to write a duplicate property. This will cause LLE to drop the current event and emit a WARN logback status. This is of course not (exactly) what you expect... but at least you know from the statuses if you have duplicate keys.
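For reference, a minimal sketch of that configuration, assuming a LogstashEncoder and the decorator described in the README linked above:

```xml
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
  <jsonGeneratorDecorator class="net.logstash.logback.decorate.FeatureJsonGeneratorDecorator">
    <feature>STRICT_DUPLICATE_DETECTION</feature>
  </jsonGeneratorDecorator>
</encoder>
```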

We thought about different strategies on how to handle duplicates:

  • keep the first and drop subsequent duplicate entries
  • serialize duplicate entries under a different name (e.g. with an incremental suffix)
  • drop the entire event
  • etc

Whatever strategy is chosen operates at runtime. Some duplicates could be detected earlier, at configuration time (e.g. two JsonProviders with the same fieldName...). These cases should ideally be detected during startup as well.
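As a rough illustration only (not LLE's actual implementation), the first strategy, keep the first value and drop subsequent duplicates, could be sketched around a LinkedHashMap; the class name KeepFirstFields is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "keep first, drop subsequent duplicates"
// strategy: field writes are collected into a LinkedHashMap so the first
// value for a key wins and insertion order is preserved.
public class KeepFirstFields {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    public void write(String key, Object value) {
        // putIfAbsent keeps the first value and silently ignores later duplicates
        fields.putIfAbsent(key, value);
    }

    public Map<String, Object> fields() {
        return fields;
    }

    public static void main(String[] args) {
        KeepFirstFields f = new KeepFirstFields();
        f.write("message", "default message");
        f.write("message", "marker message"); // duplicate, dropped
        System.out.println(f.fields()); // prints {message=default message}
    }
}
```

A real implementation would also have to emit a WARN status when a duplicate is dropped, as suggested above, so the collision doesn't fail silently.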

Any additional thought? Your feedback is welcome...

PS: yes, MDC entries can cause duplicates as well...

@nhmarujo
Author

nhmarujo commented Dec 21, 2021

Hi Bertrand,

Really happy to see such a warm reception to this issue 🙂

I think detecting these during startup would be really difficult, since only when you try to build the messages can you pinpoint whether the issue will occur, no?

I think enabling JsonGenerator.Feature#STRICT_DUPLICATE_DETECTION, although not ideal, would already be better than the current behaviour. We would lose the log message, but we would get a warning stating what went wrong.

What I really fear is silent failure, which is what we experienced. In our case only a single log line contained this issue, but it was enough to make almost 100% of the logs be dropped, and we were clueless about it. Why? Well, it turns out fluentbit processes logs in batches, and a single one of these log entries inside a batch was enough to make the entire batch fail and be dropped. And these log lines were quite frequent, so the chance of having at least one in each batch was quite high. If instead of the logs with the duplicate fields we had a warning, I'm sure we would have detected it sooner 🙂

Honestly, I like the incremental suffix idea. It is the one that presents the most advantages:

  1. you don't lose the log entry
  2. you don't lose the marker/mdc entry/whatever
  3. it is clear, by seeing the suffix, that something is wrong

Wondering if it would still be worth raising a warning in that situation anyway...
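For what it's worth, the suffix idea could be sketched like this (a rough illustration only; the SuffixingFields class and the `_N` suffix format are hypothetical, not an actual LLE API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "incremental suffix" strategy: the first
// occurrence of a key is written as-is, later duplicates get _1, _2, ...
public class SuffixingFields {
    private final Map<String, Integer> seen = new HashMap<>();

    public String uniqueKey(String key) {
        // merge returns the updated occurrence count for this key
        int count = seen.merge(key, 1, Integer::sum);
        return count == 1 ? key : key + "_" + (count - 1);
    }

    public static void main(String[] args) {
        SuffixingFields f = new SuffixingFields();
        System.out.println(f.uniqueKey("message")); // message
        System.out.println(f.uniqueKey("message")); // message_1
        System.out.println(f.uniqueKey("message")); // message_2
    }
}
```

Note this keeps both the log entry and the duplicate value, and the suffix itself flags the mapping mistake, matching the three advantages listed above.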

@brenuart
Collaborator

I'm 100% with you...
Checking for duplicates has a cost and may reduce performance a bit. This feature would probably be disabled by default.

I'll try to focus on this issue as soon as I have some time. Stay tuned.

@nhmarujo
Author

Thank you so much @brenuart . I will be watching closely 😉

@OksiBlack

Hello, any news on the issue?

@antoniosg88

We faced the same problem, any update?


4 participants