Thai character set order misalignment when parsing or writing string #30683

Closed
chrisbloe opened this issue Mar 16, 2022 · 3 comments
chrisbloe commented Mar 16, 2022

Terraform Version

I'm running on Ubuntu under WSL:

Terraform v1.1.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/local v2.2.2

Terraform Configuration Files

The following configuration's content argument contains three Thai characters:

resource "local_file" "test" {
  content = "ฺุู"
  filename = "./created.txt"
}

When pasted into an editor (e.g. Notepad++), it should look like this:
[screenshot: the three characters rendered in their original order]

Debug Output

I haven't done this, but it's fairly trivial to reproduce.

Expected Behavior

The 3 characters should stay in the same order.

Actual Behavior

The last character moves to the front of the text:
[screenshot: the output file with the last character reordered to the front]

Steps to Reproduce

  1. terraform init
  2. terraform apply

Additional Context

I discovered this while doing some testing with archive_file, as seen below. One of the files I was zipping was langthaimodel.py (see References); I narrowed the problem down to the three characters included in the example above.

I tested with local_file and saw it had the same problem, so I think this is a problem with Terraform itself, or a library or platform, and not a specific provider.

# This doesn't exhibit the problem
data "archive_file" "zip" {
  type             = "zip"
  output_file_mode = "0777"
  source_dir       = "${path.module}/files/stuff"
  output_path      = "${path.module}/files/stuff.zip"
}

# This does exhibit the problem
data "archive_file" "zip_TEST" {
  type        = "zip"
  output_path = "${path.module}/files/stuff_TEST.zip"

  dynamic "source" {
    for_each = fileset("${path.module}/files/stuff", "**")

    content {
      content  = file("${path.module}/files/stuff/${source.value}")
      filename = source.value
    }
  }
}

References

@chrisbloe added the bug and new (new issue not yet triaged) labels Mar 16, 2022
@kmoe self-assigned this Mar 16, 2022

apparentlymart commented Mar 16, 2022

Hi @chrisbloe! Thanks for reporting this.

In the Terraform language, strings are treated as Unicode text and are therefore subject to normalization per UAX #15, so Terraform treats as equal any two strings that normalize to the same form. This behavior is particularly important where Terraform needs to compare strings for its own work, such as deciding whether a string given in the configuration is equal to a string returned by the remote system, which might encode the same characters using a different (but equivalent) sequence of Unicode code points.

Although I'm not familiar with these characters in particular, from how they are rendered it seems like they are combining characters, intended to combine with whatever comes before them. Characters of that class are typically the ones most affected by Unicode text normalization[1], because in practice the preceding code point and all of the combining code points that follow it form a single "user-perceived character" as far as Unicode is concerned, and so from a Unicode text perspective there is no significance to the order in which the individual combining code points appear.
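As a quick illustration of that equality behavior, here's a minimal sketch of my own (the specific code points are chosen on the assumption that U+0E3A THAI CHARACTER PHINTHU has a lower Canonical_Combining_Class than U+0E38 THAI CHARACTER SARA U): two literals spelling the same pair of combining marks in a different source order should compare as equal once both are normalized.

# Sketch only: if Terraform normalizes both literals as described above,
# the two source orders become the same code point sequence.
output "treated_as_equal" {
  value = "\u0E38\u0E3A" == "\u0E3A\u0E38"
}

Applying that should show treated_as_equal = true, which is exactly what makes the reordering invisible at the language level.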

Given that, I think Terraform is working as intended here (indirectly: it is relying on Unicode specifications which themselves seem to intend this behavior). That raises the question of what you could have done differently in order to preserve your input byte-for-byte rather than having it interpreted as Unicode text, because your source file seems to intentionally represent particular non-canonical Unicode sequences for another piece of software which presumably implements the Unicode specifications (or something similar) itself.

The typical way we represent raw bytes in the Terraform language is as base64-encoded strings, and the function filebase64 is an equivalent of file that doesn't try to interpret the file contents as UTF-8-encoded Unicode text and instead just returns a base64 encoding of the raw bytes.
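For example, I believe the local_file resource in the hashicorp/local provider also accepts a content_base64 argument, so a byte-for-byte version of the original example could look something like this (a sketch only; original.txt is a hypothetical file holding the three characters in their intended order):

# Sketch: filebase64 reads the raw bytes without normalization, and
# content_base64 (hashicorp/local) writes those bytes back out verbatim.
resource "local_file" "test" {
  content_base64 = filebase64("${path.module}/original.txt")
  filename       = "./created.txt"
}

Note that base64encode would not help here: it operates on a Terraform string that has already been normalized, so the raw bytes have to come from outside the language, e.g. via filebase64.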

However, the archive_file data source (part of the hashicorp/archive provider rather than of Terraform Core itself) seems to lack any way to specify the content of a file in base64 form, which I would typically expect for functionality like this so that a file can travel from disk into the provider without being forced through interpretation as Unicode text. Perhaps an answer here would be to extend the hashicorp/archive provider so that the source block has an optional content_base64 argument, mutually exclusive with content, to which you could pass the result of filebase64 in order to put the literal content of that file into your .zip archive:

  content_base64 = filebase64("${path.module}/files/stuff/${source.value}")
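In the context of your second example above, the whole source block might then look like this (again only a sketch, since content_base64 does not currently exist in the archive provider):

# Hypothetical: assumes the proposed content_base64 argument were added to
# the archive_file source block, mutually exclusive with content.
data "archive_file" "zip_TEST" {
  type        = "zip"
  output_path = "${path.module}/files/stuff_TEST.zip"

  dynamic "source" {
    for_each = fileset("${path.module}/files/stuff", "**")

    content {
      content_base64 = filebase64("${path.module}/files/stuff/${source.value}")
      filename       = source.value
    }
  }
}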

If we take that approach then the change would live in the hashicorp/archive provider's repository rather than in this one, but I won't proactively open an issue over there just yet because there might be other options to consider for how to proceed here.

Thanks again for reporting this!


[1] UAX #15 Section 1.3 includes the following, which I think is the most relevant part for explaining the behavior you observed:

Once a string has been fully decomposed, any sequences of combining marks that it contains are put into a well-defined order. This rearrangement of combining marks is done according to a subpart of the Unicode Normalization Algorithm known as the Canonical Ordering Algorithm. That algorithm sorts sequences of combining marks based on the value of their Canonical_Combining_Class (ccc) property, whose values are also defined in UnicodeData.txt. Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters.

Following this specification's terminology, a different way to state my point above is that I think the characters you identified here are "non-starters" and are therefore subject to reordering during normalization. We should be able to confirm that by checking the Canonical_Combining_Class values assigned to those characters in UnicodeData.txt.

chrisbloe (Author) commented

Thanks for such a quick reply! It seems this issue is a red herring for another problem I've been having, but your answer may well help others if they run up against this in the future.

@crw added the question label and removed the bug and new (new issue not yet triaged) labels Mar 18, 2022
github-actions bot commented

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2022