Skip to content

Commit

Permalink
Ab tensorize speed up (#552)
Browse files Browse the repository at this point in the history
* Changed parallelization within tensorize.sh so it runs in parallel within docker rather than parallel docker - about a 10x speed up.  Added hdf5-tools to the docker for testing the hd5 files.  Added validate_tensors.sh that can test that a generated hd5 file is complete.

* Made parallel use the NUM_JOBS flag to see how many run in parallel.

* Fixed defaults for min and max sample id when not provided.

* Fixed permissions on validate_tensors.sh so it is executable

* Removed redundant line in tensorize.sh and update documentation about scaling instance types and debugging tensorization.
  • Loading branch information
abaumann authored Feb 8, 2024
1 parent 4975c72 commit 6fda09c
Show file tree
Hide file tree
Showing 6 changed files with 123 additions and 40 deletions.
2 changes: 1 addition & 1 deletion docker/vm_boot_images/config/ubuntu.sh
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@
# Other necessities
apt-get update
echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections
apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg
apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg hdf5-tools
51 changes: 46 additions & 5 deletions ml4h/tensorize/TENSORIZE.md
Original file line number Diff line number Diff line change
Expand Up @@ -148,11 +148,52 @@ scripts/tensorize.sh -t /mnt/disks/my-disk/ -i log -n 4 -s 1000000 -e 1030000 -
```

This creates and writes tensors to `/mnt/disks/my-disk` (logs can be found at `/mnt/disks/my-disk/log`)
by running `4` jobs in parallel (recommended to match that number to your VM's number of vCPUs)
starting with the sample ID `1000000` and ending with the sample ID `1004000` using both the `EKG` (field ID `20205`)
and the `MRI` (field ID `20209`) data. The tensors will be `.hd5` files named after corresponding sample IDs
(e.g. `/mnt/disks/my-disk/1002798.hd5`).

by running `4` jobs in parallel (the `-n 4`) starting with the sample ID `1000000` and ending with the sample ID `1004000`
using both the `EKG` (field ID `20205`) and the `MRI` (field ID `20209`) data. The tensors will be `.hd5`
files named after corresponding sample IDs (e.g. `/mnt/disks/my-disk/1002798.hd5`).

### Determining machine size
It is recommended to match the `-n` value to your VM's number of vCPUs since these jobs run efficiently.
If we run too many in parallel, we will hit up against other limits on the machine, most notably disk speed.
SSD is highly recommended to make these jobs run faster, but at a certain num of vCPUs and associated -n values
disk speed will be the bottleneck. You can check the resource usage for your VM in the
Google Cloud Console's VM Observability Page, which you can access from the Observability tab within the VM
instance details page. It is recommended to install the Ops Agent in order to check memory usage if you are looking
to optimize the speed/resource usage, or if you are finding these jobs taking longer than expected and you want
to figure out why.

As an example use case to help figure out how to understand usage or scale the VM properly, we had ~100k5k nifti
zips to process, producing ~50k tensors. On a machine with 22 cores and the data on SSD, this took ~20 hours to run.
This instance type was a `c3-standard-22` which has 88 GB of memory. Tensorization never went about 20% of memory usage,
but SSD speed and CPU speed were at near 100%, suggesting we could run in the same time on a machine with 22 cores and
~18GB of ram.

### Validating and checking progress on tensors
While tensorization runs on the VM, you will see output from each tensorization job, but as these run in parallel
it's a bit hard to understand how far along we are. If you leave the `tmux` session where this is running, you
can check how far along we are on tensorization (how many hd5s have been generated) by running something like
`ls /mnt/disks/output_tensors_directory | wc -l`. In order to validate the hd5s that we've generated, you can
use the following script: `./scripts/validate_tensors.sh /mnt/disks/output_tensors_directory 20 | tee completed_tensors.txt`
where the number 20 should be the same value you used for tensorization above (usually number of vCPUs). This
will output a file called completed_tensors.txt that says "OK" or "BAD" - you can then filter down on BAD files and
determine what went wrong.

### Notes on tensorizing UKBB data
The `DOWNLOAD.bulk` file or downloaded zips will tell us what samples/files we're going to process, you can get the
unique set of sample ids with a command like this: `cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq`. This can be
helpful in validating we have all our output tensors.
We can check whether we have all input data turned into tensors with a command like this:
`comm -23 <(cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq) <(ls /mnt/disks/output_tensors_directory | cut -d '.' -f 1 | sort | uniq)`
That will look for all sample ids in the bulk download file and compare that to the output tensors directory, only outputting
samples that tensors were not found for.

Note that for ecgs and mris there are some instances within the UKBB where data
was only collected at instance 3 and not instance 2 (whereas most samples have instance 2 and sometimes instance 3) - tensorization
will skip these cases where only instance 3 data exists. You can check if a given sample only has instance 2 data using a command
like this `comm -23 <(cat DOWNLOAD.bulk | grep "3_0" | cut -d ' ' -f 1 | sort | uniq) <(cat DOWNLOAD.bulk | grep "2_0" | cut -d ' ' -f 1 | sort | uniq)`
This will output sample ids that only have instance 2 data.

### Managing the generated tensors
If you're happy with the results and would like to share them with your collaborators, don't forget to make your disk
`read-only` so they can attach their VMs to it as well. You can do this by

Expand Down
7 changes: 6 additions & 1 deletion ml4h/tensorize/tensor_writer_ukbb.py
Original file line number Diff line number Diff line change
Expand Up @@ -132,12 +132,17 @@ def write_tensors(

start_time = timer() # Keep track of elapsed execution time
tp = os.path.join(tensors, str(sample_id) + TENSOR_EXT)

if os.path.exists(tp):
raise Exception(f"File already exists: {tp} - please use merge_hd5s.sh to merge data from two hd5 files.")

if not os.path.exists(os.path.dirname(tp)):
os.makedirs(os.path.dirname(tp))
if _prune_sample(sample_id, min_sample_id, max_sample_id, mri_field_ids, xml_field_ids, zip_folder, xml_folder):
continue

try:
with h5py.File(tp, 'a') as hd5:
with h5py.File(tp, 'w') as hd5:
_write_tensors_from_zipped_dicoms(write_pngs, tensors, mri_unzip, mri_field_ids, zip_folder, hd5, sample_id, stats)
_write_tensors_from_zipped_niftis(zip_folder, mri_field_ids, hd5, sample_id, stats)
_write_tensors_from_xml(xml_field_ids, xml_folder, hd5, sample_id, write_pngs, stats, continuous_stats)
Expand Down
84 changes: 53 additions & 31 deletions scripts/tensorize.sh
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
################### VARIABLES ############################################

TENSOR_PATH=
NUM_JOBS=96
NUM_JOBS=20
SAMPLE_IDS_START=1000000
SAMPLE_IDS_END=6030000
XML_FIELD= # exclude ecg data
Expand Down Expand Up @@ -116,43 +116,65 @@ shift $((OPTIND - 1))
START_TIME=$(date +%s)

# Variables used to bin sample IDs so we can tensorize them in parallel
INCREMENT=$(( ( $SAMPLE_IDS_END - $SAMPLE_IDS_START ) / $NUM_JOBS ))
COUNTER=1
MIN_SAMPLE_ID=$SAMPLE_IDS_START
MAX_SAMPLE_ID=$(( $MIN_SAMPLE_ID + $INCREMENT - 1 ))

# Run every parallel job within its own container -- 'tf.sh' handles the Docker launching
while [[ $COUNTER -lt $(( $NUM_JOBS + 1 )) ]]; do
echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"

cat <<LAUNCH_CMDLINE_MESSAGE
$HOME/ml4h/scripts/tf.sh -c $HOME/ml4h/ml4h/recipes.py
--mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $MIN_SAMPLE_ID \
--max_sample_id $MAX_SAMPLE_ID &
LAUNCH_CMDLINE_MESSAGE
MAX_SAMPLE_ID=$SAMPLE_IDS_END

# We want to get the zip folder that was passes to recipes.py - look for the --zip_folder argument and extract the value passed after that
ZIP_FOLDER=$(echo ${PYTHON_ARGS} | sed 's/--zip_folder \([^ ]*\).*/\1/')
if [ ! -e $ZIP_FOLDER ]; then
echo "ERROR: Zip folder passed was not valid, found $ZIP_FOLDER but expected folder path." 1>&2
exit 1
fi

$HOME/ml4h/scripts/tf.sh -c $HOME/ml4h/ml4h/recipes.py \
--mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $MIN_SAMPLE_ID \
--max_sample_id $MAX_SAMPLE_ID &
# create a directory in the /tmp/ folder to store some utilities for use later
mkdir -p /tmp/ml4h
# Write out a file with the ids of every sample in the input folder
echo "Gathering list of input zips to process between $MIN_SAMPLE_ID and $MAX_SAMPLE_ID, this takes several seconds..."
find $ZIP_FOLDER -name '*.zip' | xargs -I {} basename {} | cut -d '_' -f 1 \
| awk -v min="$MIN_SAMPLE_ID" -v max="$MAX_SAMPLE_ID" '$1 > min && $1 < max' \
| sort | uniq > /tmp/ml4h/sample_ids_trimmed.txt

NUM_SAMPLES_TO_PROCESS=$(cat /tmp/ml4h/sample_ids_trimmed.txt | wc -l)
echo "Including $NUM_SAMPLES_TO_PROCESS samples in this tensorization job."


echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"

# we need to run a command using xargs in parallel, and it gets rather complex and messy unless we can just run
# a shell script - the string below is written to a shell script that takes in positional arguments and sets
# min and max sample id to be the incoming sample id (min) to incoming sample id + 1 (max) - this lets us
# run on a single sample at a time
SINGLE_SAMPLE_SCRIPT='HOME=$1
TENSORIZE_MODE=$2
TENSOR_PATH=$3
SAMPLE_ID=$4
# drop those first 4 above and get all the rest of the arguments
shift 4
PYTHON_ARGS="$@"
python $HOME/ml4h/ml4h/recipes.py --mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $SAMPLE_ID \
--max_sample_id $(($SAMPLE_ID+1))'
echo "$SINGLE_SAMPLE_SCRIPT" > /tmp/ml4h/tensorize_single_sample.sh
chmod +x /tmp/ml4h/tensorize_single_sample.sh

# NOTE: the < " --version; > below is very much a hack - it's a way to escape tf.sh's running "python" followed by
# whatever you pass with -c. This causes it to run "python --version; " and then whatever you have after the semicolon.
read -r -d '' TF_COMMAND <<LAUNCH_CMDLINE_MESSAGE
$HOME/ml4h/scripts/tf.sh -m "/tmp/ml4h/" -c " --version; \
cat /tmp/ml4h/sample_ids_trimmed.txt | \
xargs -P $NUM_JOBS -I {} /tmp/ml4h/tensorize_single_sample.sh $HOME $TENSORIZE_MODE $TENSOR_PATH {} $PYTHON_ARGS"
LAUNCH_CMDLINE_MESSAGE

let COUNTER=COUNTER+1
let MIN_SAMPLE_ID=MIN_SAMPLE_ID+INCREMENT
let MAX_SAMPLE_ID=MAX_SAMPLE_ID+INCREMENT
echo "Executing command within tf.sh: $TF_COMMAND"
bash -c "$TF_COMMAND"

sleep 4s
done

################### DISPLAY TIME #########################################

END_TIME=$(date +%s)
ELAPSED_TIME=$(($END_TIME - $START_TIME))
printf "\nDispatched $((COUNTER - 1)) tensorization jobs in "
printf "\nDispatched $NUM_SAMPLES_TO_PROCESS tensorization jobs in "
display_time $ELAPSED_TIME
4 changes: 2 additions & 2 deletions scripts/tf.sh
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,6 @@ ${GPU_DEVICE} \
-v ${WORKDIR}/:${WORKDIR}/ \
-v ${HOME}/:${HOME}/ \
${MOUNTS} \
${DOCKER_IMAGE} /bin/bash -c "pip3 install --upgrade pip
pip install ${WORKDIR};
${DOCKER_IMAGE} /bin/bash -c "pip install --quiet --upgrade pip
pip install --quiet ${WORKDIR};
eval ${CALL_DOCKER_AS_USER} ${PYTHON_COMMAND} ${PYTHON_ARGS}"
15 changes: 15 additions & 0 deletions scripts/validate_tensors.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# use this script to validate the tensors created by the tensorize.sh script
# expects two positional arguments: directory containing the tensors and the number of threads to use
# example: ./validate_tensors.sh /mnt/disks/tensors/ 20 | tee completed_tensors.txt
# the output will be in the following form:
# OK - /mnt/disks/tensors/ukb1234.hd5
# BAD - /mnt/disks/tensors/ukb5678.hd5


INPUT_TENSORS_DIR=$1
NUMBER_OF_THREADS=$2


find ${INPUT_TENSORS_DIR}/*.hd5 | \
xargs -P ${NUMBER_OF_THREADS} -I {} \
bash -c "h5dump -n {} | (grep -q 'HDF5 \"{}\"' && echo 'OK - {}' || echo 'BAD - {}')"

0 comments on commit 6fda09c

Please sign in to comment.