
Mbio airflow dags rx test #4

Open · wants to merge 12 commits into base: main

Conversation

ruicatxiao (Collaborator):

Validated as working on 4 sets of 16S amplicon-seq datasets, consisting of both single-end and paired-end reads.

@ruicatxiao ruicatxiao requested a review from d-callan July 15, 2024 14:59
@@ -0,0 +1,5 @@
study,studyPath
Collaborator:

So I don't want references to real data in here; those should live in https://github.com/microbiomeDB/amplicon_sequencing. I'm OK with test data, so long as we're very clear that's what it is. Maybe rename this file to include the word 'test' explicitly.

Collaborator Author:

Got it. Yes, these are all test data.

@@ -0,0 +1,5 @@

Collaborator:

This file name should probably include the word 'template'.

Collaborator Author:

Roger

@@ -0,0 +1 @@
study,timestamp,code_revision
Collaborator:

And this file name should probably also include the word 'test'.

dags/ampliseq.py Outdated
# When hitting the "cannot read served logs" bug, clear the failed tasks and they will attempt to re-run; this completes most of the time


# Potential to-do: schedule the DAG to run and check on a daily or weekly basis. Alternatively, keep it on manual triggering
Collaborator:

I'd leave it manually triggered for now; this is still in dev.
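For reference, keeping it manual just means leaving the DAG unscheduled; a minimal sketch (the dag_id and dates here are illustrative, not the actual values):

```python
import pendulum
from airflow import DAG

# Sketch: with no schedule, the DAG only runs when triggered manually from the UI/CLI.
with DAG(
    dag_id="automated_ampliseq",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval=None,  # use `schedule=None` on Airflow >= 2.4
    catchup=False,
) as dag:
    ...
```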

dags/ampliseq.py Outdated
# i think we want a dict of dicts

with DAG(
dag_id="rx_test_automated_ampliseq_v6",
Collaborator:

We should change this dag_id back to automated_ampliseq, or whatever it was. We could drop the word 'automated' too, I suppose, if we'd like.

dags/ampliseq.py Outdated
'path': path,
'current_timestamp': current_timestamp
})
return studies
Collaborator:

Similar naming issue here, I think. It looks like we only add to `studies` if we plan to run it again, so this should be named something like `studies_to_run` or similar.
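A minimal sketch of the rename; the loop body, condition, and helpers here are placeholders rather than the actual code:

```python
def load_studies():
    # Only studies that still need (re)processing get collected,
    # so name the list to say exactly that.
    studies_to_run = []
    for study, path in read_study_table():   # hypothetical helper
        if needs_processing(study):          # hypothetical condition
            studies_to_run.append({
                'study': study,
                'path': path,
                'current_timestamp': current_timestamp,
            })
    return studies_to_run
```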

Collaborator:

This looks like it still needs to be resolved. It's not clear which studies these are just by reading the name of the variable: all of them? Ones already processed? Ones we still need to process?

dags/ampliseq.py Outdated
nextflow_task = BashOperator.partial(
task_id='nextflow_task',
bash_command=textwrap.dedent("""\
nextflow run nf-core/ampliseq -with-tower -r 2.9.0 \
Collaborator:

The 2.9.0 here should be replaced with AMPLISEQ_VERSION.
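i.e., something along these lines, assuming the command string is built as an f-string (a sketch; the remaining pipeline flags are elided):

```python
import textwrap
from airflow.operators.bash import BashOperator

AMPLISEQ_VERSION = '2.9.0'

# Sketch: interpolate the pinned pipeline version instead of hard-coding it.
nextflow_task = BashOperator.partial(
    task_id='nextflow_task',
    bash_command=textwrap.dedent(f"""\
        nextflow run nf-core/ampliseq -with-tower -r {AMPLISEQ_VERSION} \\
        -profile singularity
        echo "nextflow_done"
    """),
)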

dags/ampliseq.py Outdated

loaded_studies = load_studies()

with TaskGroup("nextflow_tasks", tooltip="Nextflow processing tasks") as nextflow_tasks:
Collaborator:

I'd prefer not to make one task group of all the Nextflow tasks and another of all the R tasks. I'd rather have a task group called something like run_ampliseq_with_post_processing, containing two tasks per study: one for the Nextflow task and one for the Rscript task.

As this currently stands, I believe all Nextflow tasks will have to complete successfully before any Rscript ones execute. Instead, we want any given study to move on to its own Rscript task as soon as its Nextflow task completes.
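A rough sketch of that shape, assuming the study list is available at DAG-parse time (e.g., load_studies is a plain function rather than a mapped @task) and with the command builders as hypothetical helpers:

```python
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

# Sketch: one group per study, so each study's Rscript step can start
# as soon as that same study's Nextflow step finishes.
for study in load_studies():
    with TaskGroup(group_id=f"run_ampliseq_with_post_processing_{study['study']}"):
        run_nextflow = BashOperator(
            task_id="run_nextflow",
            bash_command=build_nextflow_command(study),  # hypothetical helper
        )
        run_rscript = BashOperator(
            task_id="run_rscript",
            bash_command=build_rscript_command(study),   # hypothetical helper
        )
        run_nextflow >> run_rscript
```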

@d-callan d-callan (Collaborator) left a comment:

I'm going to approve, because I don't see anything really conceptually or functionally wrong here. But I'd like to see some things renamed for clarity before you merge, please.

@@ -0,0 +1 @@
# For proper runs, the files need to be renamed without "test_" in the file name
Collaborator:

This could be a README file, and could explicitly mention the paired data repo as an example of how to use this.

dags/ampliseq.py Outdated
'path': path,
'current_timestamp': current_timestamp
})
return studies
Collaborator:

This looks like it still needs to be resolved. It's not clear which studies these are just by reading the name of the variable: all of them? Ones already processed? Ones we still need to process?

dags/ampliseq.py Outdated

commands.append(command)
with TaskGroup("processing_tasks", tooltip="Processing tasks") as processing_tasks: # Merged task groups
Collaborator:

I think we can improve the naming here as well. What is the 'processing_task' doing? Assume this DAG grows one day and we have multiple processing tasks; always favor overly specific, long names over short or vague ones.

Collaborator:

Some specific suggestions: the task group could be called 'process_amplicon_studies', the Nextflow task within that group could be called 'run_ampliseq', and the Rscript task could be called 'run_r_postprocessing'.

Collaborator:

I'll add another general rule for naming things: putting 'task' in the name of a task doesn't give a reader (who might be your future self) any further information, so I wouldn't bother, personally. You could try always starting task names with a verb, if that helps.

@ruicatxiao ruicatxiao requested a review from d-callan August 21, 2024 18:23
@d-callan d-callan (Collaborator) left a comment:

I know I made a lot of comments, but I hope it's not overwhelming. I think overall you've done a really great job of exploring Airflow for the first time, and have tried and discovered some very worthwhile things.

I think my chief comment currently is that we should work on code factoring, so as to produce code that is easier to maintain.

self.log.info("Jobs still running. Waiting...")
return False
else:
self.log.info("No unfinished job found. Task complete.")
Collaborator:

So this is just looking for any running jobs, not a specific one, it looks like? I think if someone triggers multiple ampliseq studies running remotely at the same time, this will mean none of them copies back from the remote until all of them are finished. That seems less than ideal.
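One way to scope the check, assuming the sensor can query by a per-study job name; the `job_name` parameter and the query helper below are hypothetical, not existing code:

```python
from airflow.sensors.base import BaseSensorOperator

class JobStatusSensor(BaseSensorOperator):
    # Sketch: check only the job submitted for one study, not any running job.
    def __init__(self, job_name: str, **kwargs):
        super().__init__(**kwargs)
        self.job_name = job_name

    def poke(self, context) -> bool:
        if remote_job_is_running(self.job_name):  # hypothetical helper querying the cluster
            self.log.info("Job %s still running. Waiting...", self.job_name)
            return False
        self.log.info("Job %s finished. Task complete.", self.job_name)
        return True
```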

logging.info(f"Executing command: {self.command}")
return super().execute(context)

class JobStatusSensor(BaseSensorOperator):
Collaborator:

I don't like that this is inside a DAG. Classes should have their own homes and be reusable across DAGs.

Also, these classes are conceptually similar to the cluster manager and cluster job sensor. Is there a reason not to use those?
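For example, the sensor (and the SSH operator subclass) could live in a shared module that every DAG imports; the module path and placement below are hypothetical:

```python
# plugins/remote_jobs.py -- hypothetical shared home; Airflow adds the plugins/
# folder to sys.path, so DAGs can import from it directly.
from airflow.sensors.base import BaseSensorOperator

class JobStatusSensor(BaseSensorOperator):
    """Reusable sensor; individual DAGs import it instead of defining it inline."""
    def poke(self, context) -> bool:
        ...

# dags/ampliseq.py
# from remote_jobs import JobStatusSensor   # hypothetical import
```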

USERNAME = 'ruicatx'
KEY_FILE = os.path.join(BASE_PATH, "chmi_rsa")
REMOTE_HOST_SFTP = 'mercury.pmacs.upenn.edu'
REMOTE_HOST_SSH = 'consign.pmacs.upenn.edu'
Collaborator:

A lot of this information I'd prefer not be the DAG's responsibility to track, like what the remote SFTP and SSH servers are, and the login credentials for the same. The idea behind the cluster manager was to delegate managing that information to a reusable class. If you prefer using SSHOperator, SFTPOperator, and similar, we can always consider modifying the cluster manager.

Collaborator:

I am glad you investigated these operators, though. Are they base operators?

REMOTE_HOST_SSH = 'consign.pmacs.upenn.edu'
REMOTE_PATH = '<PATH TO REMOTE DATA DIRECTORY>'
REMOTE_CONFIG = '<PATH TO REMOTE CONFIG>'
POKE_INTERVAL = 600 # Interval in seconds for checking job status
Collaborator:

I wouldn't be inclined to make the poke interval a module-level variable like this. It seems to me that different tasks will likely require different poke intervals if this DAG grows.
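For reference, poke_interval is already a per-task argument on sensors, so each instantiation can carry its own value (a sketch; the sensor class matches the one in this DAG, but the task ids and other arguments are placeholders):

```python
# Sketch: give each sensor its own interval instead of one shared constant.
wait_for_nextflow = JobStatusSensor(
    task_id="wait_for_nextflow_job",
    poke_interval=600,        # check every 10 minutes for the long-running pipeline
    timeout=60 * 60 * 24,     # give up after a day
)

wait_for_transfer = JobStatusSensor(
    task_id="wait_for_file_transfer",
    poke_interval=60,         # check every minute for the quick copy-back step
    timeout=60 * 60,
)
```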

--outdir {REMOTE_PATH}/{study['study']}/out \
-c {REMOTE_CONFIG}/ampliseq.config \
-profile singularity
echo "nextflow_done"
Collaborator:

Looks like this gets launched from your home dir? Nextflow will write logs to a file called .nextflow.log in whatever directory it is launched from. As a consequence, it's very valuable to launch Nextflow from the study directory; that lets you find the log file for a particular study easily later, if needed.

I would add a line something like `cd {REMOTE_PATH}/{study['study']}` to this script before starting Nextflow.
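Roughly like this, inside the generated shell script (a sketch; the variable name `script_body` is illustrative, and the pinned version is written via AMPLISEQ_VERSION per the earlier comment):

```python
script_body = textwrap.dedent(f"""\
    # Launch from the study directory so .nextflow.log ends up alongside that study's data.
    cd {REMOTE_PATH}/{study['study']}
    nextflow run nf-core/ampliseq -with-tower -r {AMPLISEQ_VERSION} \\
        --outdir {REMOTE_PATH}/{study['study']}/out \\
        -c {REMOTE_CONFIG}/ampliseq.config \\
        -profile singularity
    echo "nextflow_done"
""")
```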

loaded_studies = load_studies()

with TaskGroup("process_amplicon_studies") as process_amplicon_studies:
script_names_nextflow = create_shell_script_nextflow.expand(study=loaded_studies)
Collaborator:

So this is interesting. It looks like rather than attempting to expand on a task group, you've expanded every task within the task group. That probably avoids some of the frustrating issues I was seeing, but it is also probably going to produce a very confusing-looking graph.

Is there a reason (until Airflow decides to better support using dynamic task mapping, task groups, and branching operators together) not to build the DAG manually in a loop? I know you were doing exactly that before, and I was the one who suggested trying something different, but that's clearly been more trouble than it's worth, particularly if it has also caused you to split this DAG in two for remote vs. local.
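For example, something like this in place of the .expand() calls, assuming the study list can be read at DAG-parse time and with the task factories as placeholders:

```python
loaded_studies = load_studies()  # assumed readable at parse time in this sketch

# Sketch: build the graph explicitly per study instead of relying on dynamic
# task mapping, so task groups, sensors, and branching compose predictably.
for study in loaded_studies:
    with TaskGroup(group_id=f"process_{study['study']}"):
        make_script = make_create_shell_script_task(study)   # hypothetical factory
        run_remote = make_remote_nextflow_task(study)         # hypothetical factory
        copy_back = make_copy_results_task(study)             # hypothetical factory
        make_script >> run_remote >> copy_back
```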

@@ -0,0 +1,5 @@
study,studyPath
Collaborator:

I don't mind having the example files here, but I wonder if there is a reason?

It also seems worth giving this examples dir its own README file, pointing people to the GitHub repos where they can see these files as they are in production.

CONFIG_PATH = os.path.join(BASE_PATH, "ampliseq.config")
AMPLISEQ_VERSION = '2.9.0'

def create_dag():
Collaborator:

So, if we're going to have two ampliseq DAGs (which I'd rather avoid but won't push you on, so long as things work and you and Dan are both happy), they should probably be refactored some. We want to avoid duplicated code as much as possible in a production system, because it's impossible to maintain (ask me how I know, lol). So I'd recommend considering some alternatives:

1. Writing the tasks you mean to reuse across both DAGs into a separate file, wrapping each in a method like 'make_task_X' that returns a reference to a task (a sketch follows below).
2. Looking into TriggerDagRunOperator.
3. Just writing a single DAG that uses a loop to generate the tasks necessary for each study.
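A minimal sketch of option 1, with hypothetical module, helper, and task names:

```python
# dags/common/ampliseq_tasks.py -- hypothetical shared module
from airflow.operators.bash import BashOperator

def make_run_ampliseq_task(study: dict, remote: bool = False) -> BashOperator:
    """Return a ready-to-wire task for one study; both DAGs import this."""
    return BashOperator(
        task_id=f"run_ampliseq_{study['study']}",
        bash_command=build_nextflow_command(study, remote=remote),  # hypothetical helper
    )

# dags/ampliseq.py
# from common.ampliseq_tasks import make_run_ampliseq_task
# run_ampliseq = make_run_ampliseq_task(study)
```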

@@ -0,0 +1,313 @@
from airflow import DAG
Collaborator:

Do we need this file?

@@ -0,0 +1,462 @@
import os, csv, logging, tarfile, pendulum, textwrap
Collaborator:

Also wondering if we want this file? To me, these types of things, if they don't work out but you don't want to lose them, are the kinds of things you let your Git commit history remember for you.
