Arroyo Deployment Issue in Kubernetes #728

Open

tjqc0512 opened this issue Aug 30, 2024 · 1 comment

Comments

@tjqc0512
I deployed Arroyo in Kubernetes using Helm. After creating a pipeline, I manually deleted the worker pod, but it did not restart successfully.
Arroyo version: 0.11.3
Kubernetes version: 1.23.4
The arroyo-controller logs reported the following errors:

{"timestamp":"2024-08-30T02:15:47.893698Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Running","to":"Recovering","duration_ms":"301261"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:15:49.627405Z","level":"INFO","fields":{"message":"stopping job","job_id":"job_IEzecx5ee6"},"target":"arroyo_controller::states::recovering"}
{"timestamp":"2024-08-30T02:17:19.629363Z","level":"WARN","fields":{"message":"failed to stop job","error":"status: Cancelled, message: \"Timeout expired\", details: [], metadata: MetadataMap { headers: {} }\n\nCaused by:\n 0: transport error\n 1: Timeout expired","job_id":"job_IEzecx5ee6"},"target":"arroyo_controller::states::recovering"}
{"timestamp":"2024-08-30T02:17:19.679602Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Recovering","to":"Compiling","duration_ms":"91785"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:17:19.690315Z","level":"INFO","fields":{"message":"state transition","job_id":"job_IEzecx5ee6","from":"Compiling","to":"Scheduling","duration_ms":"10"},"target":"arroyo_controller::states"}
{"timestamp":"2024-08-30T02:17:19.710782Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting workers on k8s","replicas":1,"task_slots":1},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:19.711137Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting workers on k8s","replicas":1,"task_slots":1},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:19.711144Z","level":"INFO","fields":{"job_id":"job_IEzecx5ee6","message":"starting worker","pod":"my-arroyo-worker-job-iezecx5ee6-2-0"},"target":"arroyo_controller::schedulers::kubernetes"}
{"timestamp":"2024-08-30T02:17:34.109304Z","level":"INFO","fields":{"message":"Worker registered: RegisterWorkerReq { worker_id: 2793387234007737385, node_id: 0, job_id: \"job_IEzecx5ee6\", rpc_address: \"http://172.16.136.62:6900\", data_address: \"172.16.136.62:38869\", resources: Some(WorkerResources { slots: 8 }), slots: 1 } -- Some(172.16.136.62:44582)"},"target":"arroyo_controller"}
{"timestamp":"2024-08-30T02:17:34.109414Z","level":"INFO","fields":{"message":"connecting to worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"rpc_address":"http://172.16.136.62:6900"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.110356Z","level":"INFO","fields":{"message":"restoring checkpoint","job_id":"job_IEzecx5ee6","epoch":18,"min_epoch":12},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.111128Z","level":"INFO","fields":{"message":"starting execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.115330Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":0,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(1), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.217581Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":1,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(3), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.319368Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":2,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(5), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.421507Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":3,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(7), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.523622Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":4,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(9), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.625573Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":5,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(11), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.727750Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":6,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(13), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.829990Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":7,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(15), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:34.932011Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":8,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(17), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:35.034056Z","level":"ERROR","fields":{"message":"failed to start execution on worker","job_id":"job_IEzecx5ee6","worker_id":2793387234007737385,"attempt":9,"error":"Status { code: Cancelled, message: \"h2 protocol error: http2 error: stream error received: stream no longer needed\", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(19), CANCEL, Remote) }))) }"},"target":"arroyo_controller::states::scheduling"}
{"timestamp":"2024-08-30T02:17:35.135364Z","level":"ERROR","fields":{"message":"panicked at crates/arroyo-controller/src/states/scheduling.rs:488:21:\nFailed to start execution on workers WorkerId(2793387234007737385)","panic.file":"crates/arroyo-controller/src/states/scheduling.rs","panic.line":488,"panic.column":21},"target":"arroyo_server_common"}
{"timestamp":"2024-08-30T02:17:35.135414Z","level":"ERROR","fields":{"message":"fatal state error","job_id":"job_IEzecx5ee6","state":"Scheduling","error_message":"Failed to start cluster for pipeline","error":"task 13920 panicked"},"target":"arroyo_controller::states"}
@Cirr0e commented Nov 28, 2024
Based on the error logs and similar issues, this looks like a problem with the worker-pod recovery process in a distributed Kubernetes environment. Let me help you resolve it.

The key issue is that when the worker pod restarts, it fails to establish communication with the controller and restore from its checkpoints. This is often caused by storage configuration in distributed environments: checkpoints written to pod-local storage are lost when the pod is deleted, so the replacement worker has nothing to recover from.
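
Before changing any configuration, it's worth confirming the failure from the worker's side. These are standard kubectl commands; the pod name is taken from the controller log above:

kubectl describe pod my-arroyo-worker-job-iezecx5ee6-2-0
kubectl logs my-arroyo-worker-job-iezecx5ee6-2-0 --previous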

Here's what we need to do:

  1. First, validate your storage configuration. Distributed Kubernetes deployments (unlike single-node setups) need storage that every pod can reach. You should be using either:

    • an S3-compatible object store, or
    • a distributed filesystem (like NFS)
  2. Update your Helm values to use S3 for checkpoint storage. Here's an example configuration:

controller:
  env:
    - name: CHECKPOINT_STORAGE_BACKEND
      value: "s3"
    - name: AWS_ACCESS_KEY_ID
      value: "your-access-key"
    - name: AWS_SECRET_ACCESS_KEY
      value: "your-secret-key"
    - name: AWS_REGION
      value: "your-region"
    - name: CHECKPOINT_S3_BUCKET
      value: "your-bucket-name"
  3. Make sure the worker service account has proper permissions:
controller:
  env:
    - name: K8S_WORKER_SERVICE_ACCOUNT_NAME
      value: "arroyo-worker"
  4. After updating the configuration, roll it out:
    helm upgrade arroyo arroyo/arroyo -f values.yaml
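
Once the upgrade completes, you can re-run your original test and watch recovery from the controller's side. The label selector and deployment name below are illustrative assumptions; adjust them to whatever your release actually creates:

kubectl get pods -l app.kubernetes.io/name=arroyo
kubectl delete pod my-arroyo-worker-job-iezecx5ee6-2-0
kubectl logs deploy/arroyo-controller -f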

Important considerations:

  • Make sure your S3 bucket is accessible from all Kubernetes nodes
  • Verify network policies allow communication between worker and controller pods (see the example policy below)
  • Check that your worker service account has necessary permissions
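
If your cluster enforces NetworkPolicies, a minimal sketch of a policy permitting that traffic looks like the following. The name and labels are illustrative assumptions; match them to the labels your chart actually applies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-arroyo-worker-controller
spec:
  podSelector:
    matchLabels:
      app: arroyo-worker          # assumed worker label; check your pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: arroyo-controller   # assumed controller label
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: arroyo-controller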

References:

  1. A similar issue was resolved in Pipeline stucks in "Scheduling" #262, where a local storage configuration didn't work in a distributed environment
  2. The error "stream error received: stream no longer needed" points to a connection problem between the worker and the controller (a quick connectivity check is sketched below)
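
As a rough reachability test of the worker's RPC port from inside the cluster (the IP and port are taken from the log above; busybox is just a convenient throwaway image):

kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet 172.16.136.62 6900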

Let me know if you need help with any of these steps or encounter any issues during the implementation.

A quick note about the risks:

  • Changing storage configuration might require pipeline redeployment
  • Existing checkpoints might need migration if changing storage backends
  • Service account changes might require cluster admin intervention
