Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LC_ALL Environment Variable Not Applied to sort Command in Galaxy Sort Tool #18967

Closed
lybCNU opened this issue Oct 10, 2024 · 6 comments
Closed

Comments

@lybCNU
Copy link

lybCNU commented Oct 10, 2024

Describe the bug
The LC_ALL environment variable is only applied to sed when the Galaxy sort tool is used with a non-zero number of header lines. This leads to inconsistent sorting results as the LC_ALL environment variable is not applied to the sort command.

This bug was discovered while using the Data Manipulation Olympics tutorial , created by @shiltemann, from training.galaxyproject.org as a teaching resource.

Galaxy Version and/or server at which you observed the bug
Galaxy Version: [24.1.2.dev0] at https://usegalaxy.eu/

To Reproduce
Steps to reproduce the behavior:

  1. Use the Galaxy sort tool.
  2. Set Number of header lines to 1.
  3. Observe the generated command:
    ( LC_ALL=C sed -u '1'q && sort --stable -t '	' -k '12n,12' -k '2,2' ) < '/data/dnb10/galaxy_db/files/0/e/4/dataset_0e43156d-5e72-4eb0-b590-6718f5fa7fd2.dat' > '/data/jwd05e/main/074/257/74257660/outputs/dataset_fc8da397-9f0b-4106-b2e4-0932b7ad446e.dat'
  4. Set Number of header lines to 0 and observe the following generated command:
    ( LC_ALL=C sort --stable -t '	' -k '12n,12' -k '2,2' ) < '/data/dnb10/galaxy_db/files/0/e/4/dataset_0e43156d-5e72-4eb0-b590-6718f5fa7fd2.dat' > '/data/jwd02f/main/074/257/74257664/outputs/dataset_1ff92d15-59bd-4c08-83c5-60136ebc601b.dat'
  5. Notice the inconsistent sorting results: with Number of header lines set to 1, the athlete listed at the top is Adolf Schmal instead of A. Grigoriadis.
    image
    Expected behavior
    The LC_ALL environment variable should be applied to both sed and sort commands, ensuring consistent sorting results. The expected command should be:
    ( LC_ALL=C sed -u '1'q && LC_ALL=C sort --stable -t '	' -k '12n,12' -k '2,2' )
    After sorting, the first athlete should be A. Grigoriadis.

Additional context
The issue occurs because LC_ALL=C is not applied to the sort command when Number of header lines is set to 1. This inconsistency results in incorrect sorting behavior. Applying LC_ALL=C to the entire command resolves the issue. The test dataset used is olympics.tsv.

@bernt-matthias
Copy link
Contributor

Good find @lybCNU!

@bgruening can you transfer the issue to your repo? It refers to https://github.com/bgruening/galaxytools/blob/master/tools/text_processing/text_processing/sort.xml

I guess just adding a semicolon after LC_ALL=C should also do the trick.. Maybe export to be sure 🤔

@bgruening
Copy link
Member

I only see repos from this organisation, not my repo. I don't get why a semicolon should help. We need to make sure we have a test for this.

@lybCNU
Copy link
Author

lybCNU commented Oct 10, 2024

Currently, I have only tested on usegalaxy.eu and usegalaxy.org, and the bug was only found on usegalaxy.eu. By accessing the command line via Jupyter notebook, I observed that on usegalaxy.org, echo $LC_ALL returns C, while on usegalaxy.eu, it returns en_US.UTF-8. This could be the cause. I also tested on usegalaxy.eu with Jupyter command line, and adding a semicolon does seem to solve the issue.

According to ChatGPT, the reason is that with (LC_ALL=C; sed -u '1'q && sort), LC_ALL=C is applied globally within the parentheses, so it affects both sed and sort. In contrast, (LC_ALL=C sed -u '1'q && sort) only applies LC_ALL to sed, so sort does not inherit it.

@bgruening
Copy link
Member

This makes sense. But we have LC_ALL set for every command ( LC_ALL=C sed -u '1'q && LC_ALL=C sort --stable -t ' ' -k '12n,12' -k '2,2' ) so before sed and before sort.

@lybCNU
Copy link
Author

lybCNU commented Oct 10, 2024

@bgruening

That makes sense. However, for some reason, the actual command being run does not have LC_ALL=C before sort. Please see the attached screenshot for reference.

image

@bgruening
Copy link
Member

Ah gotcha, PR with the fix is here: bgruening/galaxytools#1520
We might want to add a simple test for it. Lets discuss in the PR.

Thanks for reporting @lybCNU!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants