simdutf: simdutf_connector: in_tail: Implement UTF-16LE/UTF-16BE encoder #9468

Open
wants to merge 23 commits into
base: master


cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Oct 7, 2024

On Windows, many programs write UTF-16LE, because "Unicode" on Windows conventionally means UTF-16LE with a BOM (Byte Order Mark).
In addition, UTF-16LE/UTF-16BE differ substantially from UTF-8: they use 16-bit code units, depend on byte order, and are not ASCII-compatible.
I added C, J, and subdivision flag test cases to the in_tail plugin's unit tests, covering conversion from UTF-16LE/UTF-16BE to UTF-8. in_tail is the main place where non-UTF-8 encodings need to be processed.
As a first step, we need to handle the UTF-16LE and UTF-16BE encodings.
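
To illustrate the conversion described above, here is a minimal sketch of BOM detection followed by UTF-16 to UTF-8 transcoding. It uses only Python's standard `codecs` module and is not the simdutf-based code in this PR; the helper names `detect_utf16_bom` and `utf16_to_utf8`, and the little-endian fallback when no BOM is present, are assumptions made for the example.

```python
import codecs


def detect_utf16_bom(data: bytes):
    """Return (codec_name, bom_length) for a leading UTF-16 BOM, else (None, 0)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    return None, 0


def utf16_to_utf8(data: bytes, fallback: str = "utf-16-le") -> bytes:
    """Strip a BOM if present, decode as UTF-16, and re-encode as UTF-8.

    The fallback codec for BOM-less input is an assumption of this sketch.
    """
    codec, bom_len = detect_utf16_bom(data)
    text = data[bom_len:].decode(codec or fallback)
    return text.encode("utf-8")


sample = "梦😀"  # one BMP character and one character outside the BMP
le_bytes = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
be_bytes = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

# Both byte orders should yield the same UTF-8 output.
print(utf16_to_utf8(le_bytes) == utf16_to_utf8(be_bytes))
# Encoded sizes differ between UTF-16 and UTF-8 for the same text.
print(len(sample.encode("utf-16-le")), len(sample.encode("utf-8")))
```

The character outside the BMP is stored as a surrogate pair in UTF-16, which is why a converter cannot simply map each 16-bit unit to UTF-8 independently.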

Note that the simdutf library is written in C++, so we also provide an option (FLB_UNICODE_ENCODER) to turn this feature on or off.

Closes #9321


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
   flush           1
   log_level       trace

[INPUT]
   Name              tail
   Path              <path/to/non-UTF-16_encoded_file.log>
   Read_from_Head    True
   Unicode.Encoding  auto

[OUTPUT]
   Name  stdout
   Match *
  • Debug log output from testing the change
Fluent Bit v3.2.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _           _____  _____ 
|  ___| |                | |   | ___ (_) |         |____ |/ __  \
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`' / /'
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \  / /  
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /./ /___
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)_____/


[2024/10/07 16:01:31] [ info] Configuration:
[2024/10/07 16:01:31] [ info]  flush time     | 1.000000 seconds
[2024/10/07 16:01:31] [ info]  grace          | 5 seconds
[2024/10/07 16:01:31] [ info]  daemon         | 0
[2024/10/07 16:01:31] [ info] ___________
[2024/10/07 16:01:31] [ info]  inputs:
[2024/10/07 16:01:31] [ info]      tail
[2024/10/07 16:01:31] [ info] ___________
[2024/10/07 16:01:31] [ info]  filters:
[2024/10/07 16:01:31] [ info] ___________
[2024/10/07 16:01:31] [ info]  outputs:
[2024/10/07 16:01:31] [ info]      stdout.0
[2024/10/07 16:01:31] [ info] ___________
[2024/10/07 16:01:31] [ info]  collectors:
[2024/10/07 16:01:31] [ info] [fluent bit] version=3.2.0, commit=c9e98dac6a, pid=3617897
[2024/10/07 16:01:31] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/10/07 16:01:31] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/10/07 16:01:31] [ info] [cmetrics] version=0.9.6
[2024/10/07 16:01:31] [ info] [ctraces ] version=0.5.6
[2024/10/07 16:01:31] [ info] [input:tail:tail.0] initializing
[2024/10/07 16:01:31] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/10/07 16:01:31] [debug] [tail:tail.0] created event channels: read=25 write=26
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] inotify watch fd=31
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] scanning path unicode_c.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode unicode_c.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] inode=42736996 with offset=0 appended as unicode_c.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] scan_glob add(): unicode_c.log, inode 42736996
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] 1 new files found on path 'unicode_c.log'
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] scanning path ../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode ../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] inode=43166609 with offset=0 appended as ../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] scan_glob add(): ../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log, inode 43166609
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] 1 new files found on path '../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log'
[2024/10/07 16:01:31] [ info] [output:stdout:stdout.0] worker #0 started
[2024/10/07 16:01:31] [debug] [stdout:stdout.0] created event channels: read=34 write=35
[2024/10/07 16:01:31] [ info] [sp] stream processor started
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] [static files] processed 155b
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] inode=42736996 file=unicode_c.log promote to TAIL_EVENT
[2024/10/07 16:01:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=42736996 watch_fd=1 name=unicode_c.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] inode=43166609 file=../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log promote to TAIL_EVENT
[2024/10/07 16:01:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43166609 watch_fd=2 name=../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log
[2024/10/07 16:01:31] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2024/10/07 16:01:31] [debug] [task] created task=0x616b610 id=0 OK
[0] tail.0: [[1728284491.453318636, {}], {"log"=>"用汉字在 Fluent Bit 中处理日志,就像是一个梦一样😀"}]
[1] tail.0: [[1728284491.485253961, {}], {"log"=>"🏴󠁧󠁢󠁷󠁬󠁳󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁥󠁮󠁧󠁿"}]
[2024/10/07 16:01:31] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2024/10/07 16:01:31] [debug] [out flush] cb_destroy coro_id=0
[2024/10/07 16:01:31] [debug] [task] destroy task=0x616b610 (task_id=0)
^C[2024/10/07 16:01:32] [engine] caught signal (SIGINT)
[2024/10/07 16:01:32] [ warn] [engine] service will shutdown in max 5 seconds
[2024/10/07 16:01:32] [ info] [input] pausing tail.0
[2024/10/07 16:01:32] [ info] [engine] service has stopped (0 pending tasks)
[2024/10/07 16:01:32] [ info] [input] pausing tail.0
[2024/10/07 16:01:32] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2024/10/07 16:01:32] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2024/10/07 16:01:32] [debug] [input:tail:tail.0] inode=42736996 removing file name unicode_c.log
[2024/10/07 16:01:32] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=42736996 watch_fd=1
[2024/10/07 16:01:32] [debug] [input:tail:tail.0] inode=43166609 removing file name ../tests/runtime/data/tail/log/unicode_subdivision_flags_be.log
[2024/10/07 16:01:32] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43166609 watch_fd=2
  • Attached Valgrind output that shows no leaks or memory corruption was found
==3616809== 
==3616809== HEAP SUMMARY:
==3616809==     in use at exit: 0 bytes in 0 blocks
==3616809==   total heap usage: 3,183 allocs, 3,183 frees, 989,783 bytes allocated
==3616809== 
==3616809== All heap blocks were freed -- no leaks are possible
==3616809== 
==3616809== For lists of detected and suppressed errors, rerun with: -s
==3616809== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
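
For anyone who wants to reproduce a test like the one above, the following sketch writes small UTF-16LE and UTF-16BE sample files, each with a BOM, that could be used as the tail input `Path`. The file names, the temporary directory, and the `write_utf16_sample` helper are hypothetical; they are not the test fixtures shipped with this PR. The sample text is a subdivision flag (Wales), written with escapes because its tag characters are invisible.

```python
import codecs
import os
import tempfile

# Wales flag: WAVING BLACK FLAG, tag characters "gbwls", then CANCEL TAG.
# Every code point here lies outside the BMP, so each needs a surrogate pair.
WALES_FLAG = (
    "\U0001F3F4\U000E0067\U000E0062\U000E0077"
    "\U000E006C\U000E0073\U000E007F"
)


def write_utf16_sample(path: str, text: str, big_endian: bool = False,
                       with_bom: bool = True) -> bytes:
    """Write text as UTF-16LE or UTF-16BE, optionally prefixed with a BOM."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    payload = (bom if with_bom else b"") + text.encode(codec)
    with open(path, "wb") as f:
        f.write(payload)
    return payload


# Hypothetical output location and file names.
tmp_dir = tempfile.mkdtemp()
le_path = os.path.join(tmp_dir, "sample_le.log")
be_path = os.path.join(tmp_dir, "sample_be.log")

le_payload = write_utf16_sample(le_path, WALES_FLAG + "\n")
be_payload = write_utf16_sample(be_path, WALES_FLAG + "\n", big_endian=True)

print(le_path, len(le_payload))
print(be_path, len(be_payload))
```

Note that the line terminator is also two bytes wide in these files (`0A 00` in little-endian, `00 0A` in big-endian), so a line-oriented reader has to account for the encoding when splitting lines.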

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1471

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from d1b404a to 4053bbd Compare October 7, 2024 07:13
@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from 4053bbd to 2a515ea Compare October 7, 2024 07:17
Conversion to UTF-8 is supported from UTF-16LE and UTF-16BE,
each with or without a BOM.
This could be useful for Windows logs written as "Unicode",
which usually means UTF-16LE with a BOM.

Signed-off-by: Hiroshi Hatake <[email protected]>
…s not fully support C++11

Signed-off-by: Hiroshi Hatake <[email protected]>
Labels
docs-required ok-package-test Run PR packaging tests
Development

Successfully merging this pull request may close these issues.

Add support for reading files encoded in UTF-16 for Tail Input
5 participants