Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Merged
merged 13 commits into from
Jan 12, 2025

Conversation

paulirwin
Copy link
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT

Fixes #1076

Description

This adds an extension method to make it easy to match Java's CodingErrorAction.REPORT use in CharSet where it will throw an exception upon encountering invalid character sequences. All places where this is used in Lucene have been updated to use this extension method.

This should improve reliability of ensuring valid input in cases where Lucene would throw on invalid input, particularly with IOUtils.GetDecodingReader which, per its docs, was intended to throw on invalid input, but was doing a character replacement in Lucene.NET previously.

@paulirwin paulirwin added the notes:improvement An enhancement to an existing feature label Jan 6, 2025
@paulirwin paulirwin requested a review from NightOwl888 January 6, 2025 19:36
paulirwin and others added 3 commits January 6, 2025 16:50
…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.
Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

I went ahead and pushed an optimization to the Term.ToString() method, since System.Text.Unicode.Utf8 supports fallback without an exception.

I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM to:

    private static UTF8Encoding CreateUtf8Encoding()
    {
        // Create a UTF8Encoding instance
        UTF8Encoding encoding = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
            throwOnInvalidBytes: false             // Match Java's silent handling of invalid bytes
        );

        // Set the DecoderFallback to use '\uFFFD' as the replacement character
        encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");

        // Set the EncoderFallback to use '\uFFFD' for invalid character encoding
        encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");

        return encoding;
    }

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

Note that the fallback character has been defined in other places in the codebase as ?, as well. Those should be reviewed.

Since this PR is encoding-related, I think it would be fine to put them here, but feel free to open separate issues if you want to keep them separate.

src/Lucene.Net/Support/Text/EncodingExtensions.cs Outdated Show resolved Hide resolved
@paulirwin
Copy link
Contributor Author

@NightOwl888 In regards to:

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

This is only true for Encoding.ASCII (at least amongst those defined on Encoding). Encoding.UTF8 by default returns \uFFFD (from csharprepl):

> Encoding.ASCII.GetString(new byte[] { 0xc3 })
"?"
> Encoding.UTF8.GetString(new byte[] { 0xc3 })
"�"
> Encoding.UTF8.GetString(new byte[] { 0xc3 })[0] == '\ufffd'
true

I confirmed this is the case on net462 as well.

The only place in non-test code where Encoding.ASCII is used is in ConnectionCostsBuilder, and that doesn't matter because this PR changes it to throw on invalid characters anyways.

Given that the rest of the places we use UTF8, I don't think we need to make that change.

@NightOwl888 NightOwl888 self-requested a review January 10, 2025 18:30
Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a more exhaustive review and I found some additional issues that we should address.

Exception Handling

We need to fix the ExceptionExtensions class so that any existing catch blocks will account for gaps between encoding fallback exceptions between Java and .NET.

I already checked where we are catching ArgumentException explicitly, and I think we are good there.

  • IsIOException() should be changed to include both DecoderFallbackException and EncoderFallbackException.
  • IsIllegalArgumentException() should be changed to exclude both DecoderFallbackException and EncoderFallbackException.

Explicit Encoding Char Replacement

As I previously mentioned, there are a few places where we are using ? as the replacement character instead of \uFFFD. I located them and they are all in tests.

  • // This is essentially the equivalent of
    // CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    // .onUnmappableCharacter(CodingErrorAction.REPLACE)
    // .onMalformedInput(CodingErrorAction.REPLACE);
    //
    // Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    // new EncoderReplacementFallback("?"),
    // new DecoderReplacementFallback("?"));
    for (int i = 0; i < n; i++)
    {
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));
    string s = decoder.GetString(buffer, 0, size);
    array.Append(s);
    builder.Append(s);
    }
    for (int i = 0; i < n; i++)
    {
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));
    string s = decoder.GetString(buffer, 0, size);
    array.Append(s);
    builder.Append(s);
    }
    for (int i = 0; i < n; i++)
    {
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));
  • // This is essentially the equivalent of
    // CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    // .onUnmappableCharacter(CodingErrorAction.REPLACE)
    // .onMalformedInput(CodingErrorAction.REPLACE);
    //
    // Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    // new EncoderReplacementFallback("?"),
    // new DecoderReplacementFallback("?"));
    Random random = Random;
    for (int i = 0; i < numUniqueValues;)
    {
    random.NextBytes(buffer);
    int size = 1 + random.Next(buffer.Length);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));

While technically, they are fine because they are in tests, during porting what often happens when trying to figure out how to do something is to search the codebase to see how it was done elsewhere. And leaving it like this could cause it to be copied incorrectly elsewhere.

So, let's change those tests to use StandardCharsets.UTF_8 and add a comment to indicate that configuring CodingErrorAction.REPLACE explicitly is unnecessary because it is the default behavior of StandardCharsets.UTF_8 (in .NET, anyway).

Default Encoding

During review, I noticed that we have translated Charset.defaultCharset() to Encoding.GetEncoding(0). It would be better to change this to Encoding.Default. We have comments to indicate Encoding.GetEncoding(0) represents the system default in most cases, but using the property would make it clearer.

ChatGPT recommends using Encoding.Default instead of Encoding.GetEncoding(0) because apparently the latter will throw an exception in some environments on .NET Core. I tried on my system using net8.0, but it didn't throw. However, the property definitely won't throw even if ChatGPT is wrong about this.

Default Encoding with System.Console

Note that there is one place where the default encoding wasn't ported correctly.

This line can be changed to:

using TextWriter logger = new StreamWriter(Console.OpenStandardOutput(), Encoding.Default) { AutoFlush = true };

ISO 8859-14 Encoding

This encoding (only the decoder) was ported from Hunspell in Lucene because neither .NET nor Java have an implementation.

Caching

Unlike the other encodings it is not cached in .NET. Let's turn ISO8859_14Encoding into a singleton by adding a static Default property.

Fallback Behavior

I ran some tests in Java to find that this neither throws nor replaces invalid input characters. Instead, it simply leaves them as-is. I checked the iso-8859-1 charset (which the Lucene implementation of iso-8859-14 extends) and it does the same thing by default in Java.

But since the user is supplying dictionaries (usually standardized ones that are downloaded from websites that gather them) it makes me wonder whether Lucene should be throwing on invalid input characters. Certainly, leaving them as-is is not a typical option in a .NET Encoding class. Maybe we should report this upstream to find out whether this was intentional or an oversight.

NOTE: Hunspell was among the most complicated parts to port in the Lucene.Net.Analysis.Common package, so it might be worth the extra effort to at least make sure users aren't feeding it invalid input.

Also, I downloaded all of the Hunspell dictionaries that Lucene uses in tests and put them in a repository here: https://github.com/NightOwl888/lucenenet-hunspell. These are not part of the main repository because it is a huge amount of binary data. The tests currently must be run manually after downloading the files. IIRC, there are still a few dictionaries that don't pass the tests in Lucene.NET, but I don't see that as a showstopper for the release because most of them work.

GB2312 Encoding

Encoding.GetEncoding("GB2312") is being called inside of a tight loop in multiple places in Lucene.Net.Analysis.SmartCn. While the instance returned from Encoding.GetEncoding("GB2312") is cached by the encoding provider, it still needs to do a dictionary lookup every time it is called. Let's change it to statically cache this instance in AbstractDictionary so the dictionary lookup only occurs once.

src/Lucene.Net/Support/Text/EncodingExtensions.cs Outdated Show resolved Hide resolved
@paulirwin paulirwin marked this pull request as draft January 11, 2025 04:06
@paulirwin
Copy link
Contributor Author

The change to the tests exposed something else to fix: Java's standard Charsets will throw on malformed input and unmappable characters by default, meaning CodingErrorAction.REPORT is already the default. In .NET, the default encodings replace instead of throw, but our StandardCharsets.UTF_8/IOUtils.ENCODING_UTF_8_NO_BOM is already set up to match Java in this behavior. That means we do need to explicitly use DecoderReplacementFallback for those tests mentioned above in "Explicit Encoding Char Replacement" if we use StandardCharsets.UTF_8, and will likely need another extension method. But more importantly, this means that any place in Lucene that uses a Charset without specifying CodingErrorAction.REPLACE, we need to use one that throws (such as our StandardCharsets.UTF_8 support value), or else we are not throwing on invalid byte sequences when Lucene is (i.e. if we use Encoding.UTF8). I'll get this change made tomorrow.

@NightOwl888
Copy link
Contributor

It appears that there is a difference between the default when using:

  • StandardCharsets.UTF_8
  • StandardCharsets.UTF_8.newDecoder()

The default behavior of the latter is CodingErrorAction.REPORT, but the same is not true when using the former (which is how Lucene normally passes it to other APIs).

  @Test
  public void testDefaultUTF8DecoderWithInvalidBytes() {
      // Define a mix of valid and invalid UTF-8 bytes
      byte[] testBytes = {
              (byte) 0xE2, (byte) 0x82, (byte) 0xAC, // Valid UTF-8 for €
              (byte) 0xC3, (byte) 0xA9,             // Valid UTF-8 for é
              (byte) 0xFF,                          // Invalid byte
              (byte) 0x41,                          // Valid ASCII for 'A'
              (byte) 0x80,                          // Invalid byte
              (byte) 0xF0, (byte) 0x9F, (byte) 0x92, (byte) 0x96 // Valid UTF-8 for 💖
      };

      // Decode the bytes using the default UTF-8 behavior
      String decodedString = new String(testBytes, StandardCharsets.UTF_8);

      // Print each character and its code point
      System.out.println("Decoded String Code Points: ");
      for (int i = 0; i < decodedString.length(); i++) {
          System.out.printf("Character at position %d: %d%n", i, (int) decodedString.charAt(i));
      }
      
      // Verify the decoded string (replacement character appears at invalid bytes)
      assertEquals("€é\uFFFDA\uFFFD💖", decodedString);
  }

@paulirwin
Copy link
Contributor Author

@NightOwl888 Good catch, thanks. I'll look out for any newDecoder/newEncoder usage that might use the defaults.

@paulirwin paulirwin marked this pull request as ready for review January 11, 2025 19:09
@paulirwin paulirwin requested a review from NightOwl888 January 11, 2025 19:09
@paulirwin
Copy link
Contributor Author

@NightOwl888 Updated with feedback. There were no other cases where newDecoder/newEncoder were used that needed to be addressed.

@NightOwl888 NightOwl888 self-requested a review January 12, 2025 03:34
Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@paulirwin paulirwin merged commit 4bf492c into apache:master Jan 12, 2025
267 checks passed
@paulirwin paulirwin deleted the issue/1076 branch January 12, 2025 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notes:improvement An enhancement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review use of Encoding fallback handling
2 participants