This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

job failed after shuffle pod restart #606

Open
ChenLingPeng opened this issue Jan 24, 2018 · 2 comments


ChenLingPeng commented Jan 24, 2018

How to reproduce

  1. Submit a job (e.g. PageRank) using the external shuffle service
  2. After the executors are running, stop one of the external-shuffle-service pods on an executor's host
  3. The external-shuffle-service pod restarts with a new pod IP
  4. The driver exits with a failed status

The driver/executor logs show that executors keep trying to fetch blocks using the old shuffle pod IP.
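The steps above can be sketched roughly as follows. This is an illustrative reproduction only: the `spark.kubernetes.shuffle.*` keys follow this fork's convention, and the image paths, labels, and node/apiserver placeholders are assumptions, not exact values from the report.

```shell
# 1. Submit a PageRank-style job with dynamic allocation and the
#    external shuffle service enabled (values are illustrative).
bin/spark-submit \
  --class org.apache.spark.examples.SparkPageRank \
  --master k8s://https://<k8s-apiserver>:443 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.kubernetes.shuffle.namespace=default \
  --conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service" \
  local:///opt/spark/examples/jars/spark-examples.jar <input-file> 10

# 2. While executors are running, delete a shuffle-service pod on one
#    executor's node; its DaemonSet restarts it with a new pod IP.
kubectl delete pod <shuffle-pod-on-executor-node>

# 3. Executor logs then show block-fetch failures against the old pod IP.
```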

@liyinan926
Member

Yes, we use the shuffle pod IP to identify the shuffle pod and set spark.shuffle.service.host to that IP. So it seems the shuffle pods need a sticky network identity.
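In Kubernetes, a sticky network identity is what a StatefulSet behind a headless Service provides: each pod gets a stable DNS name that survives restarts, unlike its IP. A hedged sketch of that idea (the names and image are placeholders; note this is only an illustration of stable identity, since a StatefulSet does not pin one pod per node the way the shuffle DaemonSet does):

```yaml
# Illustrative only: a headless Service plus StatefulSet gives each pod a
# stable DNS name, e.g. spark-shuffle-0.spark-shuffle.default.svc.cluster.local,
# which does not change when the pod restarts.
apiVersion: v1
kind: Service
metadata:
  name: spark-shuffle
spec:
  clusterIP: None              # headless: DNS resolves directly to pod IPs
  selector:
    app: spark-shuffle-service
  ports:
    - port: 7337               # default external shuffle service port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-shuffle
spec:
  serviceName: spark-shuffle
  selector:
    matchLabels:
      app: spark-shuffle-service
  template:
    metadata:
      labels:
        app: spark-shuffle-service
    spec:
      containers:
        - name: shuffle
          image: <shuffle-service-image>   # placeholder
          ports:
            - containerPort: 7337
```

With this, spark.shuffle.service.host could point at a stable DNS name rather than a pod IP.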

@weixiuli

Could this issue be avoided by using hostNetwork?
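The idea here would be that with hostNetwork the shuffle pod binds directly into the node's network namespace, so it is reachable at the node IP, which stays the same across pod restarts. A hedged sketch of such a DaemonSet (the fields are standard Kubernetes; the manifest itself is illustrative, and the image is a placeholder):

```yaml
# Illustrative DaemonSet fragment: hostNetwork makes the shuffle service
# reachable at the node's IP, which survives pod restarts.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-shuffle-service
spec:
  selector:
    matchLabels:
      app: spark-shuffle-service
  template:
    metadata:
      labels:
        app: spark-shuffle-service
    spec:
      hostNetwork: true                    # use the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS working with hostNetwork
      containers:
        - name: shuffle
          image: <shuffle-service-image>   # placeholder
          ports:
            - containerPort: 7337          # default shuffle service port
```

The trade-off is that the shuffle port must then be free on every node, and port conflicts with other host-networked workloads become possible.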
