Merge branch 'slee/release-v0.8.0' into 'master'
v0.8.0 documentation updates

See merge request machine-learning/dorado!1199
susie-ont committed Sep 16, 2024
2 parents a69c0a2 + 86abc4c commit 16b0362
Showing 3 changed files with 57 additions and 20 deletions.
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,40 @@

All notable changes to Dorado will be documented in this file.

# [0.8.0] (16 Sept 2024)

This release of Dorado adds v5.1 RNA models with new `inosine_m6A` and `m5C` RNA modified base models, updates existing modified base models, improves the speed of v5 SUP basecalling models on A100/H100 GPUs, and enhances the flexibility and stability of `dorado correct`. It also introduces per-barcode configuration for poly(A) estimation with interrupted tails, adds new `--output-dir` and `--bed-file` arguments to Dorado basecalling commands, and includes a variety of other improvements for stability and usability.

* a69c0a2987e60f3889cc56cd820e8a7713887f33 - Add v5.1.0 RNA basecalling models, including new `inosine_m6A` and `m5C` modified base models, and updated existing DNA and RNA modified base models
* 8e3a8707be5248d7bcc47d3e89b80c0bdc9c2f36 - Improve speed of v5 SUP basecalling models on A100 and H100 GPUs
* 6ee90189197d11bfe50e919067582da6eccf513e - Reduce false positive calls from v5 DNA modified base models
* 69cb26032d8393a781a9a3d32aa2ceb13ec65491 - Fix bug causing intermittent crashing with v5 SUP models
* e9dec497a38fa2a1935f64d30e35db246da58a08 - Add `--resume-from` functionality to `dorado correct`
* cb6eee1c3d63da2f1f11fb8fcc63418908154f81 - Decouple alignment and inference stages in `dorado correct`
* df861db10d77b4056702857ca11d2e50b63946af - Prevent segfaults in `dorado correct`
* f35c8cc3ebf900cfd6e19cf6ebaebc94fbb8619b - Fix bug when downloading models for `dorado correct`
* 66467011c1f22f7037e1055bb435bef090790dd1 - Add per-barcode poly(A) configuration for interrupted tails
* 0b79407afdfc8f4fa66d5393a722e29358a6302d - Improve poly(A) length estimation for RNA and DNA
* df614abee24523abc2858b02d18205b9ebca53fe - Add `--output-dir` argument to `dorado basecaller` and `dorado duplex`
* f9beb393cd8237a142dba43f3b04a77cb688d1c0 - Add `--bed-file` argument to `dorado basecaller` and `dorado duplex`
* 1fc6f1eb5a535262ef601a8fe4674edc87a137c9 - Add `--models-directory` option to `basecaller`, `duplex`, and `download` to download and reuse models
* 966c2ca38369a21855cdd491b025979a9628b5b5 - Update POD5 version to v0.3.15
* 6ec77c8b6cfc3a53433ae27f7a5383f77097eefa - Fix errors when performing duplex calling with modified bases
* 4a28d589d5e244f62543ebb4d744e8c2843bde93 - Always trim DNA adapter signal before processing RNA reads
* a90fbf9729a1791be5e7da0f3aacc9d5c20135a8 - Fix loading of FASTQ files containing RNA with U bases
* 9e5db84725635ceaa282691e8e430dd56851ffa2 - Fix duplicated alignment tags in re-aligned files
* 3cc4de3c941601fad906c80b6c770fef2814ad9c - Prevent "Too many open files" error when using `--sort-bam` with `dorado demux`
* b53191858fd33e7a0b4832df6e9e38cf5af22add - Prevent `dorado basecaller` crash when signal-space trimming removes all raw data
* adc60bae22648fef6521608a625c0e5bc842ac2f - Package `libcupti.so` into ARM Linux builds
* 667d16001845c8f173ae44ecdef0befaabf2af10 - Remove kit name requirement in custom barcode configuration
* e9281fa6d9ff36a7fb51efed0caf2c776b7d0c33 - Emit an error message if header from input HTS file cannot be read
* 7f42b8fd869da5210f9a60a83a6656173023dcb0 - Warn and exit instead of crashing if a model path does not exist
* 7d7424615830f46f7246a20125b73573c78ef7c9 - Improve index file error handling
* c77733a9d5ec054a426147cdcd9f6e8a03399aff - Add a mechanism to cache auto batch size calculations
* a674dadec1b3feb1ed8e8a6421d5d3fc17b0b5bd - Update `--help` documentation for `basecaller`, `duplex`, and `correct`
* 022901e29864fdeb9a99c4961c679882fe4a6b34 - Fix JSON output when using `--list-structured` with `dorado download`


# [0.7.3] (1 Aug 2024)

This release of Dorado updates `dorado correct` to fix handling of high copy repeats and avoid shutdown hanging. It also includes `dorado demux` improvements to reduce false matches in midstrand barcode detection and ensure correct file naming, along with other fixes.
35 changes: 19 additions & 16 deletions README.md
@@ -22,10 +22,10 @@ If you encounter any problems building or running Dorado, please [report an issu

First, download the relevant installer for your platform:

-- [dorado-0.7.3-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.7.3-linux-x64.tar.gz)
-- [dorado-0.7.3-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.7.3-linux-arm64.tar.gz)
-- [dorado-0.7.3-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.7.3-osx-arm64.zip)
-- [dorado-0.7.3-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.7.3-win64.zip)
+- [dorado-0.8.0-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.0-linux-x64.tar.gz)
+- [dorado-0.8.0-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.0-linux-arm64.tar.gz)
+- [dorado-0.8.0-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.0-osx-arm64.zip)
+- [dorado-0.8.0-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.0-win64.zip)

Once the relevant `.tar.gz` or `.zip` archive is downloaded, extract the archive to your desired location.

@@ -363,7 +363,7 @@ $ dorado download --model all
The names of Dorado models are systematically structured, each segment corresponding to a different aspect of the model, which include both chemistry and run settings. Below is a sample model name explained:
-`dna_r10.4.1_e8.2_400bps_hac@v4.3.0`
+`dna_r10.4.1_e8.2_400bps_hac@v5.0.0`
- **Analyte Type (`dna`)**: This denotes the type of analyte being sequenced. For DNA sequencing, it is represented as `dna`. If you are using a Direct RNA Sequencing Kit, this will be `rna002` or `rna004`, depending on the kit.
@@ -375,7 +375,7 @@ The names of Dorado models are systematically structured, each segment correspon
- **Model Type (`hac`)**: This represents the size of the model, where larger models yield more accurate basecalls but take more time. The three types of models are `fast`, `hac`, and `sup`. The `fast` model is the quickest, `sup` is the most accurate, and `hac` provides a balance between speed and accuracy. For most users, the `hac` model is recommended.
-- **Model Version Number (`v4.3.0`)**: This denotes the version of the model. Model updates are regularly released, and higher version numbers typically signify greater accuracy.
+- **Model Version Number (`v5.0.0`)**: This denotes the version of the model. Model updates are regularly released, and higher version numbers typically signify greater accuracy.
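
As a rough illustration of the naming scheme described above, the following Python sketch splits a model name into its segments. `ModelName` and `parse_model_name` are hypothetical helpers and not part of Dorado. The sketch assumes only what the text states: the first underscore-separated segment is the analyte type, the last is the model type (`fast`, `hac`, or `sup`), and everything after `@` is the version. The segments in between (chemistry and run settings) are kept as an uninterpreted tuple.

```python
from dataclasses import dataclass
from typing import Tuple

# Model types named in the README text.
MODEL_TYPES = ("fast", "hac", "sup")


@dataclass(frozen=True)
class ModelName:
    """Hypothetical container for the segments of a Dorado model name."""
    analyte: str               # e.g. "dna", "rna002", "rna004"
    settings: Tuple[str, ...]  # chemistry and run settings, left uninterpreted
    model_type: str            # "fast", "hac", or "sup"
    version: str               # e.g. "v5.0.0"


def parse_model_name(name: str) -> ModelName:
    """Split a model name such as 'dna_r10.4.1_e8.2_400bps_hac@v5.0.0'."""
    stem, sep, version = name.partition("@")
    if not sep or not version:
        raise ValueError(f"missing '@<version>' in model name: {name!r}")

    parts = stem.split("_")
    if len(parts) < 2:
        raise ValueError(f"expected at least analyte and model type: {name!r}")

    # First segment is the analyte, last is the model type,
    # anything in between describes chemistry and run settings.
    analyte, *settings, model_type = parts
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type {model_type!r} in {name!r}")

    return ModelName(analyte, tuple(settings), model_type, version)


if __name__ == "__main__":
    print(parse_model_name("dna_r10.4.1_e8.2_400bps_hac@v5.0.0"))
```

Keeping the middle segments as a tuple avoids guessing at their meaning, which also lets the same function handle shorter names such as `rna002_70bps_fast@v3`.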
### **DNA models:**
@@ -387,8 +387,8 @@ The versioning of modification models is bound to the basecalling model. This me
| Basecalling Models | Compatible<br />Modifications | Modifications<br />Model<br />Version | Data<br />Sampling<br />Frequency |
| :-------- | :------- | :--- | :--- |
| **[email protected]** | | | 5 kHz |
-| **[email protected]** | 4mC_5mC<br />5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v1<br />v1 | 5 kHz |
-| **[email protected]** | 4mC_5mC<br />5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v1<br />v1 | 5 kHz |
+| **[email protected]** | 4mC_5mC<br />5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v2<br />v2<br />v2<br />v2 | 5 kHz |
+| **[email protected]** | 4mC_5mC<br />5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v2<br />v2<br />v2<br />v2 | 5 kHz |
| [email protected] | | | 5 kHz |
| [email protected] | 5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v2 | 5 kHz |
| [email protected] | 5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v2 | 5 kHz |
@@ -420,18 +420,21 @@ The versioning of modification models is bound to the basecalling model. This me
### **RNA models:**
-**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types. The same applies to parsing inputs. Any input HTS file (e.g. FASTQ generated by `guppy`/`basecall_server`) with `U` bases is not handled by `dorado`.
+**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types.
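
If a downstream tool expects RNA letters, the `T`-for-`U` substitution noted above can be reversed after basecalling. The following is a minimal sketch, assuming plain sequence strings that have already been extracted from the output; `to_rna_alphabet` is a hypothetical helper and not a Dorado feature.

```python
# Translation table mapping T/t to U/u; all other characters pass through.
_T_TO_U = str.maketrans("Tt", "Uu")


def to_rna_alphabet(sequence: str) -> str:
    """Return the sequence with every T replaced by U (case preserved)."""
    return sequence.translate(_T_TO_U)


if __name__ == "__main__":
    print(to_rna_alphabet("ACGTTGCA"))
```

This should only be applied to sequence fields. Quality strings and read names may legitimately contain the letter `T` and must be left untouched.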
| Basecalling Models | Compatible<br />Modifications | Modifications<br />Model<br />Version | Data<br />Sampling<br />Frequency |
| :-------- | :------- | :--- | :--- |
-| **[email protected]** | N/A | N/A | 4 kHz |
-| **[email protected]** | m6A<br />pseU | v1<br />v1<br />v1 | 4 kHz |
-| **[email protected]** | m6A<br />pseU | v1<br />v1<br />v1 | 4 kHz |
-| [email protected] | N/A | N/A | 4 kHz |
-| [email protected] | N/A | N/A | 4 kHz |
+| **[email protected]** | | | 4 kHz |
+| **[email protected]** | m5C<br />m6A_DRACH<br />inosine_m6A<br />pseU | v1<br />v1<br />v1<br />v1 | 4 kHz |
+| **[email protected]** | m5C<br />m6A_DRACH<br />inosine_m6A<br />pseU | v1<br />v1<br />v1<br />v1 | 4 kHz |
+| [email protected] | | | 4 kHz |
+| [email protected] | m6A<br />m6A_DRACH<br />pseU | v1<br />v1<br />v1 | 4 kHz |
+| [email protected] | m6A<br />m6A_DRACH<br />pseU | v1<br />v1<br />v1 | 4 kHz |
+| [email protected] | | | 4 kHz |
+| [email protected] | | | 4 kHz |
| [email protected] | m6A_DRACH | v1 | 4 kHz |
-| rna002_70bps_fast@v3 | N/A | N/A | 3 kHz |
-| rna002_70bps_hac@v3 | N/A | N/A | 3 kHz |
+| rna002_70bps_fast@v3 | | | 3 kHz |
+| rna002_70bps_hac@v3 | | | 3 kHz |
## Automatic model selection complex
8 changes: 4 additions & 4 deletions dorado/cli/basecaller.cpp
@@ -251,14 +251,14 @@ void set_dorado_basecaller_args(utils::arg_parse::ArgParser& parser, int& verbos
                 .default_value(false);
     }
     {
-        parser.visible.add_group("Poly-a arguments");
+        parser.visible.add_group("Poly(A) arguments");
         parser.visible.add_argument("--estimate-poly-a")
-                .help("Estimate poly-A/T tail lengths (beta feature). Primarily meant for cDNA and "
-                      "dRNA use cases.")
+                .help("Estimate poly(A)/poly(T) tail lengths (beta feature). Primarily meant for "
+                      "cDNA and dRNA use cases.")
                 .default_value(false)
                 .implicit_value(true);
         parser.visible.add_argument("--poly-a-config")
-                .help("Configuration file for PolyA estimation to change default behaviours")
+                .help("Configuration file for poly(A) estimation to change default behaviours")
                 .default_value(std::string(""));
     }
{