Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker #6911
Comments
It sounds like there is some path related problem. The Dockerfile looks reasonable. Have you tried debugging this by running the container locally and then confirming that there is a module named models?
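One way to run that check inside the container is a small script along the lines below. It is only a sketch: the helper name `module_is_importable` is made up for illustration, and `models` is the module name taken from the traceback in this issue. It asks the import system whether each module can be located, without importing it.

```python
import importlib.util


def module_is_importable(name: str) -> bool:
    """Return True if `name` can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return False


if __name__ == "__main__":
    # `models` comes from the traceback in this issue; adjust as needed.
    for name in ("models", "apache_beam"):
        print(f"{name}: importable={module_is_importable(name)}")
```

Run it with the same interpreter and working directory the worker would use, otherwise the result may not reflect the worker's environment.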
Sorry, my description above was a bit obfuscated, let me clarify.
Attempt 1: Extend official Docker Image and add my custom code
So if I use the following
Attempt 2: use --prebuild_sdk_container_engine
So if I use the following
However, this isn't a valid solution, because these flags pre-build the SDK container using Cloud Build, and the result does not contain any of my custom code required for downstream components (Transform, Trainer), hence the
Attempt 3: use custom container, without Dataflow
Which tells me there isn't a bug with how the TFX library leverages Apache Beam in general, but with how it leverages Apache Beam with Dataflow.
OK, that narrows it down a bit. We also extend the TFX base image and successfully use the custom image with Dataflow. I don't have the time to try to reproduce your ticket (I am a user of the project, not a dev), but I can share our working config. Here is what we pass into
Hope that helps.
Unfortunately no luck @IzakMaraisTAL, my Dataflow worker still gives the following logs / error in an endless loop:
Thanks though for the suggestions! What TFX version are you using? If I roll back to 1.12.0, my exact same code works on Dataflow. Any version greater than 1.12.0 seems to hang with the same code / setup.
Aah, another clue. This might be related to one of these tickets: #6368, #6386? We currently use TFX 1.14 with the following Dockerfile:
FROM tensorflow/tfx:1.14.0
WORKDIR /pipeline
COPY requirements.txt requirements.txt
# Fix pip in the tfx:1.14.0 base image to use the correct python environment (3.8, not 3.10)
RUN sed -i 's/python3/python/g' /usr/bin/pip
RUN pip install -r requirements.txt
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
# Fix dataflow job sometimes not starting
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY version \
vertex_runner.py \
config.py ./
COPY ml_pipeline ./ml_pipeline
If I roll back to 1.14.0 this worked for me! However, using the same Dockerfile and extending from 1.15.1 instead still gives me this endless loop in my Dataflow worker log:
I think I am going to roll back to 1.14.0 for now since it seems stable, but I am going to leave this issue open because there still seems to be a bug in 1.15.1.
@IzakMaraisTAL thank you for all your help! Really really appreciated :)
That is a pretty solid approach in my opinion - my approach was:
In this case, since the logs indicate we are stuck in an endless loop on worker startup, the idea would be to hop into the code and see how the respective component launches the Dataflow job and what is going on, which is something I am currently struggling with (hence the bug report here).
System information
Describe the current behavior
As per the following links and issues:
I am using a custom docker container, extended from the corresponding version of the official tensorflow/tfx image, as my Pipeline default image and Dataflow sdk_container_image.
I have:
The worker will start, however the worker container hangs with the following error:
I am not sure why Apache Beam is reported as not installed in the runtime environment. I can inspect the built image using:
and I can confirm that apache-beam exists within the container:
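Since one of the comments in this thread notes that `pip` in a TFX base image pointed at a different Python environment than expected, it may be worth confirming that the package is installed for the interpreter the worker actually runs, not just somewhere in the image. The sketch below is an assumption about how one could check this; `describe_package` is a hypothetical helper, not part of TFX or Beam.

```python
import sys
from importlib import metadata


def describe_package(dist_name: str) -> dict:
    """Report which interpreter is running and the installed version of a
    distribution as seen by that interpreter (None if it is not installed)."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "interpreter": sys.executable,
        "python": sys.version.split()[0],
        "package": dist_name,
        "version": version,
    }


if __name__ == "__main__":
    print(describe_package("apache-beam"))
```

If the reported interpreter differs from the one `pip` installs into, `pip list` can show a package that the running interpreter cannot see.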
Describe the expected behavior
The worker stages should start successfully, write worker logs, and complete. For example:
Note the image above was achieved using the following beam args:
However, this is not feasible because I need to get my custom code into the image for the components which reference it (Transform, Trainer, Evaluator). For example, I get the following error with the Transform component (as I should):
Error processing instruction process_bundle-789893474543504259-101. Original traceback is
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/apache_beam/internal/dill_pickler.py", line 418, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'models'
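The traceback ends in `find_class`, which is where an unpickler imports the module a pickled object refers to. A minimal sketch of that mechanism follows, using the standard library `pickle` as a stand-in for the `dill` pickler Beam uses here, and a throwaway module named `models_demo` (a made-up name standing in for `models`). It shows that an object pickled by reference can only be loaded where its module is importable.

```python
import pickle
import sys
import types

# Build a throwaway module at runtime; `models_demo` is a hypothetical
# stand-in for the user's `models` package.
demo = types.ModuleType("models_demo")
exec("class Config:\n    def __init__(self, n):\n        self.n = n\n", demo.__dict__)
sys.modules["models_demo"] = demo

# The payload records the class as "models_demo.Config", not its source code.
payload = pickle.dumps(demo.Config(3))

# Module available (like the submitting machine): loading succeeds.
restored = pickle.loads(payload)

# Module missing (like a worker image without the custom code): loading fails.
del sys.modules["models_demo"]
try:
    pickle.loads(payload)
    error_name = None
except ModuleNotFoundError as exc:
    error_name = type(exc).__name__

print(restored.n, error_name)
```

This is why the prebuilt SDK image, which lacks the custom code, fails in Transform even though the worker itself starts.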
Standalone code to reproduce the issue
My files are arranged as such:
Dockerfile:
Apache Beam Args:
KubeFlow Runner & Config:
Component init in pipeline.py: