
Ab tensorize speed up #552

Merged
merged 5 commits into master on Feb 8, 2024
Conversation

@abaumann abaumann (Collaborator) commented Feb 6, 2024

Modified how tensorize.sh works to speed it up about 10x. Originally it ran a batch of samples serially inside a docker container and ran multiple docker containers in parallel; I noticed this used very little CPU, memory, or disk. Changed it to instead use xargs -P to run the samples in parallel within a single docker container - with the right combination of settings the CPU/disk/mem all stay nearly maxed out (e.g. on a 22-core machine I ran 20 in parallel). Memory usage was low, so I'll update the docs about what machine specs and flags to use.
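
For illustration, here is a minimal sketch of the pattern described above (one container, with xargs -P fanning the work out inside it). This is not the actual tensorize.sh change; the image name, directories, and per-sample script are hypothetical placeholders.

#!/usr/bin/env bash
# Sketch only: start ONE docker container and parallelize per-sample work
# inside it with xargs -P, instead of running one container per batch.
set -euo pipefail

DOCKER_IMAGE="ml4h:latest"        # placeholder image name
SAMPLES_DIR="/path/to/samples"    # placeholder host directory with one sub-directory per sample
NUM_WORKERS=20                    # e.g. 20 parallel jobs on a 22-core machine

# tensorize_one_sample.sh is a hypothetical per-sample worker assumed to be on the image's PATH
docker run --rm -v "${SAMPLES_DIR}":/data "${DOCKER_IMAGE}" /bin/bash -c "
  find /data -mindepth 1 -maxdepth 1 -type d |
    xargs -P ${NUM_WORKERS} -I {} tensorize_one_sample.sh {}
"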

As part of this, I added a few other things:

  • a script to validate the generated tensors - sometimes something goes wrong and they come out corrupt, and that script helps you find the corrupt ones
  • changed the pip installs of ml4h into the docker to use --quiet mode so it's less noisy (if it fails you will still see that)
  • added hdf5-tools to the docker for use in validating or looking at hd5 files
  • changed the tensor writer to open the file in write rather than append mode and to throw an error if someone tries to overwrite a tensor that already exists - it had been happily appending the same data to existing files, which would cause downstream effects

Commit message (truncated): …thin docker rather than parallel docker - about a 10x speed up. Added hdf5-tools to the docker for testing the hd5 files. Added validate_tensors.sh that can test that a generated hd5 file is complete.
@lucidtronix lucidtronix (Collaborator) left a comment

Thanks @abaumann, one question inline but feel free to merge!


From validate_tensors.sh:

find ${INPUT_TENSORS_DIR}/*.hd5 | \
xargs -P ${NUMBER_OF_THREADS} -I {} \
bash -c "h5dump -n {} | (grep -q 'HDF5 \"{}\"' && echo 'OK - {}' || echo 'BAD - {}')"
@lucidtronix (Collaborator):

How does the script know if an HD5 is bad here?

@abaumann (Collaborator, Author):

Yeah, this was one thing I wanted to check on. Basically it looks for output that starts with "HDF5", then the file path, then a listing of what's inside the file, so all it really does is check that h5dump doesn't exit with an error - that seems to indicate the file isn't corrupt, but it doesn't check anything about the actual contents.
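
For reference, on a readable file h5dump -n prints a contents listing whose first line is HDF5 followed by the quoted file path, which is what the grep above matches. A rough example (the dataset names are made up):

$ h5dump -n /data/sample_123.hd5
HDF5 "/data/sample_123.hd5" {
FILE_CONTENTS {
 group      /
 dataset    /continuous/some_field
 }
}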

- ${DOCKER_IMAGE} /bin/bash -c "pip3 install --upgrade pip
- pip install ${WORKDIR};
+ ${DOCKER_IMAGE} /bin/bash -c "pip install --quiet --upgrade pip
+ pip install --quiet ${WORKDIR};
@lucidtronix (Collaborator):

<3

@abaumann abaumann merged commit 6fda09c into master Feb 8, 2024
3 checks passed