Python Batch Runner: Beam Portability API Integration
Past due by over 2 years
50% complete
This milestone tracks all work required to run batch pipelines on Ray through integration with the Beam portability API: https://beam.apache.org/roadmap/portability/.
The Beam Portability API provides several advantages vs. the current BundleBasedDirectRunner implementation, including:
- Improved cross-language portability posture due to the use of languag…
This milestone tracks all work required to run batch pipelines on Ray through integration with the Beam portability API: https://beam.apache.org/roadmap/portability/.
The Beam Portability API provides several advantages vs. the current BundleBasedDirectRunner implementation, including:
- Improved cross-language portability posture due to the use of language-agnostic constructs between the SDK and runner.
- Improved performance via built-in pipeline execution graph optimizers.
- Reduced maintainability burden via more future proof APIs and reusable components like execution graph schedulers and state managers.
Successful completion of this milestone will allow users to:
- Define and run any valid Python Beam batch pipeline either locally or on a remote cluster using Ray.
- Integrate Beam batch pipelines with existing machine learning and data science applications on Ray (e.g. via TFX, Beam Pandas DataFrames, etc.).