This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

job failed after shuffle pod restart #606

Open
ChenLingPeng opened this issue Jan 24, 2018 · 2 comments


ChenLingPeng commented Jan 24, 2018

How to reproduce

  1. Submit a job (e.g. PageRank) using the external shuffle service
  2. After the executors are running, stop one of the external-shuffle-service pods on an executor's host
  3. The external-shuffle-service pod restarts with a new pod IP
  4. The driver exits with a failed status

The driver/executor logs show that executors keep trying to fetch blocks using the old shuffle pod IP.
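The steps above can be sketched roughly as follows. This is an illustrative reproduction only: the `spark.kubernetes.shuffle.*` keys follow this fork's convention, and the image paths, labels, and node/apiserver placeholders are assumptions, not exact values from the report.

```shell
# 1. Submit a PageRank-style job with dynamic allocation and the
#    external shuffle service enabled (values are illustrative).
bin/spark-submit \
  --class org.apache.spark.examples.SparkPageRank \
  --master k8s://https://<k8s-apiserver>:443 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.kubernetes.shuffle.namespace=default \
  --conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service" \
  local:///opt/spark/examples/jars/spark-examples.jar <input-file> 10

# 2. While executors are running, delete a shuffle-service pod on one
#    executor's node; its DaemonSet restarts it with a new pod IP.
kubectl delete pod <shuffle-pod-on-executor-node>

# 3. Executor logs then show block-fetch failures against the old pod IP.
```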

@liyinan926
Member

Yes, we use the shuffle pod IP to identify the shuffle pod and set spark.shuffle.service.host to that IP. So it seems the shuffle pods need a sticky network identity.
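In Kubernetes, a sticky network identity is what a StatefulSet behind a headless Service provides: each pod gets a stable DNS name that survives restarts, unlike its IP. A hedged sketch of that idea (the names and image are placeholders; note this is only an illustration of stable identity, since a StatefulSet does not pin one pod per node the way the shuffle DaemonSet does):

```yaml
# Illustrative only: a headless Service plus StatefulSet gives each pod a
# stable DNS name, e.g. spark-shuffle-0.spark-shuffle.default.svc.cluster.local,
# which does not change when the pod restarts.
apiVersion: v1
kind: Service
metadata:
  name: spark-shuffle
spec:
  clusterIP: None              # headless: DNS resolves directly to pod IPs
  selector:
    app: spark-shuffle-service
  ports:
    - port: 7337               # default external shuffle service port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-shuffle
spec:
  serviceName: spark-shuffle
  selector:
    matchLabels:
      app: spark-shuffle-service
  template:
    metadata:
      labels:
        app: spark-shuffle-service
    spec:
      containers:
        - name: shuffle
          image: <shuffle-service-image>   # placeholder
          ports:
            - containerPort: 7337
```

With this, spark.shuffle.service.host could point at a stable DNS name rather than a pod IP.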

@weixiuli

Could this issue be avoided by using hostNetwork?
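The idea here would be that with hostNetwork the shuffle pod binds directly into the node's network namespace, so it is reachable at the node IP, which stays the same across pod restarts. A hedged sketch of such a DaemonSet (the fields are standard Kubernetes; the manifest itself is illustrative, and the image is a placeholder):

```yaml
# Illustrative DaemonSet fragment: hostNetwork makes the shuffle service
# reachable at the node's IP, which survives pod restarts.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-shuffle-service
spec:
  selector:
    matchLabels:
      app: spark-shuffle-service
  template:
    metadata:
      labels:
        app: spark-shuffle-service
    spec:
      hostNetwork: true                    # use the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS working with hostNetwork
      containers:
        - name: shuffle
          image: <shuffle-service-image>   # placeholder
          ports:
            - containerPort: 7337          # default shuffle service port
```

The trade-off is that the shuffle port must then be free on every node, and port conflicts with other host-networked workloads become possible.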
