Microsatbed: new tool for reporting short tandem repeats as bed track features. #6145

Merged Jul 21, 2024 (30 commits)


@fubar2 fubar2 commented Jul 13, 2024

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection

This PR proposes a new tool, and suggestions from anyone interested in microsatellites and STRs would be welcome.

The motivation was to recreate some of the NIH MARBL T2T assembly polishing browser tracks in Galaxy workflows for the VGP. Those tracks display the density of specific dinucleotides such as CG in 128nt windows. This helps identify problematic regions where those dinucleotides are over-represented and may introduce technical alignment errors for different kinds of sequencing methods.
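To make the windowed density idea concrete, here is a minimal Python sketch. It is not the tool's code: the function name `windowed_motif_counts` and the choice to count motif start positions per window are assumptions made for illustration only.

```python
def windowed_motif_counts(seq, motif="CG", window=128):
    """Count motif start positions in consecutive fixed-size windows.

    Returns a list of (start, end, count) tuples using 0-based,
    half-open coordinates. Hypothetical helper, for illustration only.
    """
    if not motif or window < 1:
        raise ValueError("motif must be non-empty and window must be >= 1")
    seq = seq.upper()
    motif = motif.upper()
    n_windows = (len(seq) + window - 1) // window
    counts = [0] * n_windows
    pos = seq.find(motif)
    while pos != -1:
        # Assign each occurrence to the window containing its first base.
        counts[pos // window] += 1
        pos = seq.find(motif, pos + 1)
    return [
        (i * window, min((i + 1) * window, len(seq)), count)
        for i, count in enumerate(counts)
    ]


if __name__ == "__main__":
    # Tiny toy sequence with an 8nt window instead of 128nt.
    for row in windowed_motif_counts("CGCGCGCGAAAAAAAA", motif="CG", window=8):
        print(row)
```

Windows with unusually high counts are the regions such a track is meant to highlight.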

The tool can be configured to output all motifs of one or more lengths from 1-6nt.
Alternatively, specific motifs can be provided as a comma separated string.
Two or more sequential repeats can be required although dimers can be reported as singletons.
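As a rough illustration of the "specific motif, minimum repeats" mode described above, the following sketch uses a plain regular expression rather than pytrf. The helper names and the `MOTIF_count` feature name are hypothetical and do not reflect the tool's real output format.

```python
import re


def find_tandem_repeats(seq, motif, min_repeats=2):
    """Find runs of at least min_repeats exact copies of motif.

    Returns (start, end, motif, n_repeats) tuples with 0-based,
    half-open coordinates, as used by BED. Hypothetical helper.
    """
    if not motif or min_repeats < 1:
        raise ValueError("motif must be non-empty and min_repeats >= 1")
    motif = motif.upper()
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_repeats))
    hits = []
    for match in pattern.finditer(seq.upper()):
        n_repeats = (match.end() - match.start()) // len(motif)
        hits.append((match.start(), match.end(), motif, n_repeats))
    return hits


def to_bed_lines(chrom, hits):
    """Format hits as tab-separated BED lines (name format is an assumption)."""
    return [
        "%s\t%d\t%d\t%s_%d" % (chrom, start, end, motif, n)
        for start, end, motif, n in hits
    ]


if __name__ == "__main__":
    hits = find_tandem_repeats("TTACACACGGACTT", "AC", min_repeats=2)
    print("\n".join(to_bed_lines("chr1", hits)))
```

Setting `min_repeats=1` in this sketch also reports isolated single copies, which mirrors the option of reporting dimers as singletons.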

This makes it potentially applicable for visualising the distribution of short tandem repeats and other kinds of microsatellites as bed or gff tracks on a reference fasta. For downstream processing of all exact STRs, the underlying Python tool pytrf can be run using the findstr option, producing either gff, tsv or csv outputs as described at the end of the documentation.

[Image: JBrowse2 bigwig sample outputs for the 4 dinucleotide density tracks using the HG002 assembly]

@fubar2 fubar2 requested review from bgruening and removed request for bgruening July 17, 2024 04:03
fubar2 added 5 commits July 17, 2024 16:50
fix dimer minimum of 1
fix tests to conform to new structures
…size for size selected and specified motifs

not so easy for native pytrf operation
@fubar2 fubar2 marked this pull request as ready for review July 19, 2024 03:50
Seven review threads on tools/microsatbed/microsatbed.xml (outdated, resolved)
fubar2 and others added 3 commits July 20, 2024 09:30
fubar2 commented Jul 20, 2024

Thanks @bgruening - lots of warts fixed!
I'm using it in workflows and it does exactly what the VGP are looking for. I've added an optional windowed bigwig density output as an alternative to bed for downstream conversion, because it turns out that the bed files are huge and not really useful as such for VGP, although someone might need them...
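For anyone curious how bed features can be collapsed into a much smaller windowed density track, here is a hedged sketch. It assumes non-overlapping, 0-based half-open intervals on a single contig, and the function names are hypothetical rather than taken from the tool.

```python
def bed_to_window_density(intervals, chrom_len, window=128):
    """Summarise BED-style intervals as the covered fraction of each window.

    intervals: iterable of non-overlapping (start, end) pairs, 0-based half-open.
    Returns a list of (start, end, fraction) rows, one per window.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    n_windows = (chrom_len + window - 1) // window
    covered = [0] * n_windows
    for start, end in intervals:
        start = max(0, start)
        end = min(end, chrom_len)
        w = start // window
        while start < end:
            # Split the interval at window boundaries.
            piece_end = min((w + 1) * window, end)
            covered[w] += piece_end - start
            start = piece_end
            w += 1
    rows = []
    for i, bases in enumerate(covered):
        w_start = i * window
        w_end = min(w_start + window, chrom_len)
        rows.append((w_start, w_end, bases / (w_end - w_start)))
    return rows


def to_bedgraph_lines(chrom, rows):
    """Format rows as bedGraph text: chrom, start, end, value."""
    return ["%s\t%d\t%d\t%.4f" % (chrom, s, e, v) for s, e, v in rows]


if __name__ == "__main__":
    rows = bed_to_window_density([(2, 6), (6, 12)], chrom_len=20, window=8)
    print("\n".join(to_bedgraph_lines("chr1", rows)))
```

The resulting bedGraph text is the kind of input that UCSC bedGraphToBigWig converts to bigwig, given a chromosome sizes file.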

<param name="tetramin" value="20"/>
<param name="pentamin" value="20"/>
<param name="hexamin" value="20"/>
<output name="bed" value="dibed_wig_sample" compare="sim_size" delta="10"/>
Member

could you add here some asserts? eg. column_count, row_count etc?

Member

Ping

Member Author

@fubar2 fubar2 Jul 21, 2024

@bgruening: Yes, thanks - good idea - sorry for the delay - am travelling and distracted - in marvelous Melbourne and beyond...

Have added assertions to all the tests and fixed the leftover merge conflict marker.

Also figured out how to implement your suggestion to make the command section less complex with "--foo" values for the di/tri... etc selector which is now a macro since it is needed twice.

Member

@bgruening bgruening left a comment

Awesome. Thanks @fubar2

@bgruening bgruening merged commit 275acb7 into galaxyproject:main Jul 21, 2024
11 checks passed
nilchia pushed a commit to pavanvidem/tools-iuc that referenced this pull request Aug 24, 2024
… features. (galaxyproject#6145)

* Preparing draft PR

* typo

* typo

* redundant copy

* fix flakery

* sheesh. local flake8 was fine, I swear....

* Add comments explaining two non-obvious issues being dealt with - 1 based reporting and decorated sorting for bed output

* flake8 strikes again.

* flake8 a trailing space. Oy.

* added a native pytrf mode to run findstr and make csv or tsv or gtf format output for all perfect STRs.
This makes the tool more generic and makes better use of pytrf...

* update readme with stuff from the tool help

* forgot the new test output for the new native pytrf findstr command line test

* make native test output smaller than 43MB. Eeesh.

* add missing smaller test fa

* remove bogus print

* Add test for built-in genome and paraphenaliae

* reverted the inbuilt test because cannot get the output labelled informatively about the provenance otherwise.
Cannot pass the element_identifier from a test parameter it seems - it's a built-in so not surprisingly over-ridden.

* rationalise minima for everything.
fix dimer minimum of 1
fix tests to conform to new structures

* make bed sample small enough

* add option for windowed density bigwig output with selectable window size for size selected and specified motifs
not so easy for native pytrf operation

* fix flake8 issue

* more flakery

* remove pybigtools to use ucsc-bedgraphtobigwig on a bedgraph instead
jbrowse2 simply will not read the bigwigs made using pybigtools :(

* Fixes suggested by Bjoern's review

* Update tools/microsatbed/microsatbed.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update microsatbed.xml

* add help for windowed density bigwig option

* fix logic for multiple flags
add contents assertions to all tests

* remove bogus old merge mark

---------

Co-authored-by: Björn Grüning <[email protected]>