Implementing Tests for Distributed Ops #22

Open
EiffL opened this issue Nov 9, 2019 · 1 comment
Assignees
Labels
enhancement New feature or request Mesh TensorFlow

Comments


EiffL commented Nov 9, 2019

We currently have tests that check the TensorFlow version of FlowPM against our reference FastPM simulation code, and they run automatically on Travis CI.
The problem is that we now want to do the same thing with the Mesh TensorFlow implementation, which requires running the ops on a TensorFlow cluster.

We need to figure out how to run those tests automatically. The most likely answer is to spawn a TF cluster on Travis, with something like 4 CPU processes, and run the tests by connecting to this local cluster.
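To make the idea concrete, here is a rough sketch of how a local cluster could be described and started for the tests. This is only an illustration under assumptions, not what FlowPM does: the helper names (`pick_free_ports`, `local_cluster_def`, `start_local_servers`) and the job name `"mesh"` are made up here, and for simplicity the servers are started inside a single Python process rather than as 4 separate CPU processes.

```python
import socket


def pick_free_ports(n):
    """Ask the OS for n distinct, currently unused TCP ports on localhost."""
    sockets = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
            sockets.append(s)  # keep it open so the next port is different
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()


def local_cluster_def(num_workers=4, job_name="mesh"):
    """Build a cluster dictionary with num_workers tasks on localhost.

    The job name "mesh" is a hypothetical choice for this sketch.
    """
    ports = pick_free_ports(num_workers)
    return {job_name: ["localhost:%d" % p for p in ports]}


def start_local_servers(cluster_def, job_name="mesh"):
    """Start one in-process TensorFlow server per task (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helpers above stay TF-free

    spec = tf.train.ClusterSpec(cluster_def)
    return [
        tf.distribute.Server(spec, job_name=job_name, task_index=i)
        for i in range(len(cluster_def[job_name]))
    ]
```

A test session could then connect to the address given by `servers[0].target`, where `servers` is the list returned by `start_local_servers`. Whether in-process servers are close enough to a real multi-process cluster for these tests is an open question.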

There are some caveats with that approach, though: as we have already seen, some ops behave differently on CPU and GPU, so...

@EiffL EiffL added enhancement New feature or request Mesh TensorFlow labels Nov 9, 2019
@EiffL EiffL self-assigned this Aug 13, 2020

EiffL commented Aug 13, 2020

This is high priority. @modichirag, I'm assigning myself, but feel free to think about this too.
