
Minnow Segmented Traits

We use a segmentation model to extract traits from minnows (Family: Cyprinidae).

This repository serves as a case study of an automated workflow and extraction of morphological traits using machine learning on image data.

We expand upon work already done by BGNN, including metadata collection by the Tulane Team and the Drexel Team (see Leipzig et al. 2021, Pepper et al. 2021, and Narnani et al. 2022), and a segmentation model developed by the Virginia Tech Team. We developed morphology extraction tools (Morphology-analysis) with the help of the Tulane Team. We incorporate these tools into BGNN_Core_Workflow.

Finally, with the help of the Duke Team, we create an automated workflow.

[Figure: workflow diagram]

Goals

  • Create a use case for using an automated workflow
  • Show best practices for interacting with other repositories
  • Show utility of using a machine learning segmentation model to accelerate trait extraction from images of specimens

Organization

Scripts

Files

Results

  • a folder for the outputs from the workflow
    1. tables of results from analyses
    2. /Figures contains all figures created from analyses

Config

  • contains the config.yaml file
    • the user can change the input files or the number of images (under limit_images)

Inputs

Data Files

The Previous_Measurements file is included in this repository.

The Fish-AIR input files are downloaded from the Fish-AIR API, which requires adding a Fish-AIR API key to Fish_AIR_API_Key in config/config.yaml. Alternatively, you can download the Fish-AIR input files from Zenodo and place them in the Files/Fish-AIR/Tulane directory.

Components

The total size of the components is 5.6 GB (as of 5 May 2023).

All weights and dependencies for all components of the workflow are uploaded to Hugging Face or Zenodo.

  • Metadata by Drexel Team

  • Reformatting of metadata

    • Trim metadata output from Metadata step to only the values necessary for this project
    • Repository
    • Code Archive
  • Crop Image

    • Extract bounding box information from metadata file
    • Resize and crop the fish from the image
    • Repository
    • Code Archive
  • Segmentation Model by Virginia Tech Team

  • Morphology analysis by Tulane Team and Battelle Team

  • Machine Learning Workflow by Battelle Team and Duke Team

Images

The fish images are from the Great Lakes Invasives Network (GLIN) and stored on Fish-AIR. We are using images specifically from the Illinois Natural History Survey (INHS images).

Image Selection

R code (Minnow_Selection_Image_Quality_Metadata.R) was used to select high-quality minnow images using the IQM and IM metadata files.

All image metadata files are downloaded from Fish-AIR, and the version used is stored on the OSC data commons under the Fish Traits dataverse. The metadata files were generated using the Tulane workflow.

The selection criteria were based on findings from Pepper et al. 2021.

Criteria chosen:

  • family == "Cyprinidae"
  • specimenView == "left"
  • specimenCurved == "straight"
  • allPartsVisible == "True"
  • partsOverlapping == "True"
  • partsFolded == "False"
  • uniformBackground == "True"
  • partsMissing == "False"
  • brightness == "normal"
  • onFocus == "True"
  • colorIssues == "none"
  • containsScaleBar == "True"
  • from either INHS or UWZM institutions
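
As an illustration of how these criteria combine, the following minimal Python sketch applies them to metadata rows represented as dictionaries. It is not the repository's R script (Minnow_Selection_Image_Quality_Metadata.R). The "institution" and "image" keys, the select_images helper, and the sample rows are assumptions made for illustration only.

```python
# Sketch only: keys follow the criteria list above; the "institution" and
# "image" keys, the helper name, and the sample rows are hypothetical.
CRITERIA = {
    "family": "Cyprinidae",
    "specimenView": "left",
    "specimenCurved": "straight",
    "allPartsVisible": "True",
    "partsOverlapping": "True",
    "partsFolded": "False",
    "uniformBackground": "True",
    "partsMissing": "False",
    "brightness": "normal",
    "onFocus": "True",
    "colorIssues": "none",
    "containsScaleBar": "True",
}
INSTITUTIONS = {"INHS", "UWZM"}


def select_images(rows):
    """Return rows that meet every criterion and come from an accepted institution."""
    selected = []
    for row in rows:
        meets_criteria = all(row.get(key) == value for key, value in CRITERIA.items())
        if meets_criteria and row.get("institution") in INSTITUTIONS:
            selected.append(row)
    return selected


# Hypothetical rows: the first passes, the second fails on specimenView.
good = dict(CRITERIA, institution="INHS", image="example_1.jpg")
bad = dict(CRITERIA, institution="INHS", image="example_2.jpg", specimenView="right")
print([row["image"] for row in select_images([good, bad])])
```

Every criterion must hold at once, so a single mismatching field is enough to exclude an image.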

Analysis

See more details in Morphology-analysis.

Each segmented image has the following traits: trunk, head, eye, dorsal fin, caudal fin, anal fin, pelvic fin, and pectoral fin. For each segmented trait, there may be more than one "blob", or group of pixels identifying a trait. We created a presence/absence matrix of the traits (presence.absence.matrix.csv).

For each trait, we counted the number of "blobs" and calculated the largest blob as a percentage of all blobs for that trait.
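
The blob statistics described above can be sketched in Python as follows. This is a minimal illustration, not the Morphology-analysis implementation. It assumes each trait is available as a binary mask (a list of rows of 0/1 values), treats a blob as a 4-connected group of pixels, and measures blob size in pixels. The function names and the example mask are hypothetical.

```python
from collections import deque


def blob_sizes(mask):
    """Return the pixel count of each 4-connected blob in a binary mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited trait pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def blob_summary(mask):
    """Summarize a trait mask: presence, blob count, largest blob as % of trait pixels."""
    sizes = blob_sizes(mask)
    total = sum(sizes)
    return {
        "present": int(total > 0),
        "blob_count": len(sizes),
        "largest_blob_percent": 100.0 * max(sizes) / total if total else 0.0,
    }


# Hypothetical 4x5 mask with two blobs: one of 6 pixels and one of 2 pixels.
example_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(blob_summary(example_mask))
```

Under these assumptions, a trait segmented as a single clean region yields a blob count of 1 and a largest-blob percentage of 100, while fragmented segmentations yield lower percentages.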

All intermediate tables will be saved in the folder "Results".

Figures

We created a heat map to show the success of the segmentation to detect traits from the images.

Figures are in the folder "Results".

Running the Workflow

Instructions are provided for running the workflow on a single computer or a SLURM cluster.

The run time for 20 images is about 45 minutes and the run time for all the images is about 2 hours.

Software Requirements

To run the workflow, conda and Singularity (now Apptainer) must be installed.

Component Software Dependencies

This workflow will automatically download and setup the software dependencies required by the workflow components. These dependencies are provided using either Singularity Containers or Conda Environments. Singularity Containers are used to provide the machine learning components essential to this workflow. Singularity Containers enable highly reproducible and portable software components. However, using Singularity Containers can pose challenges for script development by domain scientists. Therefore, we use Conda Environments for the domain scientist scripts included in this workflow.

Hardware Requirements

At minimum, the workflow requires 1 CPU, 5 GB of memory, and 30 GB of disk space. A Linux machine is required because the workflow relies on Singularity containerization.

Install Workflow Runner

To run the workflow, Snakemake v7 with mamba must be installed. (The workflow definition is not compatible with Snakemake v8+.) To handle this, we create a new conda environment named "snakemake".

If you are running the workflow on a cluster that provides a conda environment module, you should load that module (e.g., module load miniconda3).

Run the following command to create a conda environment named "snakemake" with the required workflow dependencies.

conda create -c conda-forge -c bioconda -n snakemake snakemake=7 mamba

Enter "Y" when prompted to install snakemake and mamba.

If you loaded an environment module, you should unload it (e.g., module purge).

See the official instructions for installing snakemake for more options.

Limit images

In the config/config.yaml file, the user can limit the number of images for a test run by changing the integer under limit_images, or run them all by entering "". Be aware that downloading all the images and running the workflow takes a couple of hours.
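
One way to read this setting: an integer keeps only that many images, while an empty string keeps them all. The Python sketch below illustrates that behavior. The apply_image_limit helper is hypothetical and is not taken from the workflow definition, and keeping the first N images (rather than some other subset) is an assumption.

```python
def apply_image_limit(images, limit_images):
    """Keep the first `limit_images` items, or all items if the limit is "" or None."""
    if limit_images in ("", None):
        return list(images)
    return list(images)[: int(limit_images)]


images = [f"image_{i}.jpg" for i in range(5)]  # hypothetical file names
print(apply_image_limit(images, 2))   # a short test run on two images
print(apply_image_limit(images, ""))  # no limit: all five images
```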

Run snakemake

Run the following commands to activate the conda environment and run the workflow:

source activate snakemake
snakemake --jobs 1 --use-singularity --use-conda

The --jobs argument specifies how many jobs Snakemake can run at a time.

Run snakemake on a SLURM Cluster

Running the workflow on a SLURM cluster enables scaling beyond a single machine. The run-workflow.sh sbatch script is provided to run the workflow using sbatch and will process up to 20 jobs simultaneously.

If your SLURM cluster provides a conda environment module, you should load that module before running the next step (e.g., module load miniconda3).

Run the following command to activate the snakemake conda environment:

source activate snakemake

Run the workflow in the background:

sbatch run-workflow.sh

Then you can monitor the job progress as you would with any SLURM background job. Some SLURM clusters require providing sbatch a SLURM account name via the --account command line argument.

See the Run-on-OSC wiki article for the commands used to run the workflow on OSC.

Run on Docker

In some cases it is possible to run the workflow using Docker. See the experimental Docker Instructions for more details.