Arroyo Deployment Issue in Kubernetes #728

Open

tjqc0512 opened this issue Aug 30, 2024 · 1 comment

Comments

@tjqc0512
I deployed Arroyo in Kubernetes using Helm. After creating a pipeline, I manually deleted the worker pod, but it did not restart successfully.
Arroyo version: 0.11.3
Kubernetes version: 1.23.4
The arroyo-controller logs reported the following errors:

{"timestamp":"2024-08-30T02:15:47.893698Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Running","to":"Recovering","duration_ms":"301261"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:15:49.627405Z","level":"INFO","fields":{"message":"stopping job","job_id":"job_IEzecx5ee6"},"target":"arroyo_controller::states::recovering"}
{"timestamp":"2024-08-30T02:17:19.629363Z","level":"WARN","fields":{"message":"failed to stop job","error":"status: Cancelled, message: \"Timeout expired\", details: [], metadata: MetadataMap { headers: {} }\n\nCaused by:\n 0: transport error\n 1: Timeout expired","job_id":"job_IEzecx5ee6"},"target":"arroyo_controller::states::recovering"}
{"timestamp":"2024-08-30T02:17:19.679602Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Recovering","to":"Compiling","duration_ms":"91785"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:17:19.690315Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Compiling","to":"Scheduling","duration_ms":"10"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:17:19.710782Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting workers on k8s","replicas":1,"task_slots":1},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:19.711137Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting workers on k8s","replicas":1,"task_slots":1},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:19.711144Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting worker","pod":"my-arroyo-worker-job-iezecx5ee6-2-0"},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:34.109304Z","level":"INFO","fields":{"message":"Worker registered: RegisterWorkerReq { worker_id: 2793387234007737385, node_id: 0, job_id: \"job_IEzecx5ee6\", rpc_address: \"http://172.16.136.62:6900\", data_address: \"172.16.136.62:38869\", resources: Some(WorkerResources { slots: 8 }), slots: 1 } -- Some(172.16.136.62:44582)"},"target":"arroyo_controller"}
{"timestamp":"2024-08-30T02:17:34.109414Z","level":"INFO","fields":{"message":"connecting to worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"rpc_address":"http://172.16.136.62:6900"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.110356Z","level":"INFO","fields":{"message":"restoring checkpoint","job_id":"job_IEzecx5ee6","epoch":18,"min_epoch":12},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.111128Z","level":"INFO","fields":{"message":"starting execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.115330Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":0,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(1), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.217581Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":1,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(3), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.319368Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":2,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(5), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.421507Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":3,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(7), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.523622Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":4,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(9), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.625573Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":5,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(11), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.727750Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":6,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(13), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.829990Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":7,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(15), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.932011Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":8,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(17), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:35.034056Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":9,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(19), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:35.135364Z","level":"ERROR","fields":{"message":"panicked at crates/arroyo-controller/src/states/scheduling.rs:488:21:\nFailed to start execution on workers WorkerId(2793387234007737385)","panic.file":"crates/arroyo-controller/src/states/scheduling.rs","panic.line":488,"panic.column":21},"target":"arroyo_server_common"}
{"timestamp":"2024-08-30T02:17:35.135414Z","level":"ERROR","fields":{"message":"fatal state error","job_id":"job_IEzecx5ee6","state":"Scheduling","error_message":"Failed to start cluster for pipeline","error":"task 13920 panicked"},"target":"arroyo_controller::states"}
@Cirr0e commented Nov 28, 2024
Based on the error logs and similar issues, this looks like a problem with the worker-pod recovery process in a distributed Kubernetes environment. Let me help you resolve it.

The key issue is that when the worker pod restarts, it fails to establish communication with the controller and restore from its checkpoints. This is often caused by storage configuration in distributed environments: checkpoints written to pod-local storage are lost when the pod is deleted, so the replacement worker has nothing to recover from.
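
Before changing any configuration, it's worth confirming the failure from the worker's side. These are standard kubectl commands; the pod name is taken from the controller log above:

kubectl describe pod my-arroyo-worker-job-iezecx5ee6-2-0
kubectl logs my-arroyo-worker-job-iezecx5ee6-2-0 --previous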

Here's what we need to do:

  1. First, validate your storage configuration. Distributed Kubernetes deployments (unlike single-node setups) need storage that every pod can reach. You should be using either:

    • an S3-compatible object store, or
    • a distributed filesystem (like NFS)
  2. Update your Helm values to use S3 for checkpoint storage. Here's an example configuration:

controller:
  env:
    - name: CHECKPOINT_STORAGE_BACKEND
      value: "s3"
    - name: AWS_ACCESS_KEY_ID
      value: "your-access-key"
    - name: AWS_SECRET_ACCESS_KEY
      value: "your-secret-key"
    - name: AWS_REGION
      value: "your-region"
    - name: CHECKPOINT_S3_BUCKET
      value: "your-bucket-name"
  3. Make sure the worker service account has proper permissions:
controller:
  env:
    - name: K8S_WORKER_SERVICE_ACCOUNT_NAME
      value: "arroyo-worker"
  4. After updating the configuration, roll it out:
    helm upgrade arroyo arroyo/arroyo -f values.yaml
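
Once the upgrade completes, you can re-run your original test and watch recovery from the controller's side. The label selector and deployment name below are illustrative assumptions; adjust them to whatever your release actually creates:

kubectl get pods -l app.kubernetes.io/name=arroyo
kubectl delete pod my-arroyo-worker-job-iezecx5ee6-2-0
kubectl logs deploy/arroyo-controller -f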

Important considerations:

  • Make sure your S3 bucket is accessible from all Kubernetes nodes
  • Verify network policies allow communication between worker and controller pods (see the example policy below)
  • Check that your worker service account has necessary permissions
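
If your cluster enforces NetworkPolicies, a minimal sketch of a policy permitting that traffic looks like the following. The name and labels are illustrative assumptions; match them to the labels your chart actually applies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-arroyo-worker-controller
spec:
  podSelector:
    matchLabels:
      app: arroyo-worker          # assumed worker label; check your pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: arroyo-controller   # assumed controller label
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: arroyo-controller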

References:

  1. A similar issue was resolved in Pipeline stucks in "Scheduling" #262, where a local storage configuration didn't work in a distributed environment
  2. The error "stream error received: stream no longer needed" points to a connection problem between the worker and the controller (a quick connectivity check is sketched below)
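
As a rough reachability test of the worker's RPC port from inside the cluster (the IP and port are taken from the log above; busybox is just a convenient throwaway image):

kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet 172.16.136.62 6900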

Let me know if you need help with any of these steps or encounter any issues during the implementation.

A quick note about the risks:

  • Changing storage configuration might require pipeline redeployment
  • Existing checkpoints might need migration if changing storage backends
  • Service account changes might require cluster admin intervention
