IMPORTANT: Metrics issue (Abnormal status of task) - master node automatically restarting, each worker running tasks generate abnormal tasks #1539

Open
KrystianJanas opened this issue Jan 7, 2025 · 3 comments
Labels
bug Something isn't working performance Performance related

Comments

KrystianJanas commented Jan 7, 2025

Describe the bug
For a long time we have been seeing a problem caused by the refresh of the metrics that the master-node collects from the worker-nodes. We run Crawlab in Docker containers on AWS, with 1 master-node and 8 worker-nodes.

The master-node often restarts for no apparent reason. After a longer analysis, it turned out that the cause is the metrics, which cannot be turned off in any way because no such option has been implemented. An option like that would be very useful in the application.

Sometimes it is possible to "bug" the metrics by restarting the entire infrastructure or by adding one more worker-node. But this is not a permanent solution, because it only makes the problem go away for 1-2 days.

Because of the metrics, a worker-node often loses its connection to the master-node when a task is started. The task then gets the status "abnormal", and we have to check manually whether it has already completed or is still running. This is currently very burdensome for us, as each worker has at least 4-5 tasks running.

We're running master-node and each worker-node on the crawlab-pro:latest image.

Master-node configuration:

version: '3.4'
services:
  crawlab:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab
    restart: always
    environment:
      - CRAWLAB_LICENSE
      - CRAWLAB_NODE_MASTER
      - CRAWLAB_MONGO_DB
      - CRAWLAB_MONGO_URI
      - CRAWLAB_DISABLE_METRICS
    volumes:
      - "/opt/.crawlab/master:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/master:/data"  # persistent crawlab data
    ports:
      - "9666:9666"  # exposed grpc port
    mem_limit: 7G
    logging:
      options:
        max-size: "15g"
        max-file: "4"


  auth:
    build: .
    container_name: auth
    environment:
      - CRAWLAB_FORWARD_PORT
      - HTPASSWD
    ports:
      - "80:8080"  # crawlab
    depends_on:
      - crawlab
    mem_limit: 1G
    logging:
      options:
        max-size: "2g"
        max-file: "5"

Worker-node configuration:

version: '3.5'
services:
  worker:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab_worker
    restart: always
    environment:
      CRAWLAB_LICENSE: "${CRAWLAB_LICENSE}"
      CRAWLAB_NODE_MASTER: "N"  # N: worker node
      CRAWLAB_GRPC_ADDRESS: "${MASTER_NODE_IP}:9666"  # grpc address
      CRAWLAB_FS_FILER_URL: "http://${MASTER_NODE_IP}/api/filer"  # seaweedfs api
    volumes:
      - "/opt/.crawlab/worker:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/worker:/data"  # persistent crawlab data
      - "/opt/crawlab/worker/download:/download" # folder for storing downloaded files
    mem_limit: 7G
    logging:
      options:
        max-size: "3g"
        max-file: "3"

Expected behavior
Add an option to enable or disable metrics, or fix the underlying issue.

Screenshots
[two images attached to the original issue]

KrystianJanas added the bug label Jan 7, 2025
KrystianJanas (Author) commented

@tikazyq please take a look at this. We opened a similar issue a few months ago, but unfortunately it was forgotten.

tikazyq (Collaborator) commented Jan 9, 2025

Hi @KrystianJanas, thanks for using Crawlab Pro; I really appreciate your feedback. I have noticed the issue as well, but unfortunately there is no quick fix for the performance problem potentially caused by the metrics module, because the engine behind it is Prometheus. If you can, please record the resource consumption (memory, CPU, disk I/O) of the main processes such as crawlab-server, prometheus, weed, etc., so that we can precisely locate the root cause.
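One possible way to collect the numbers requested above is a small sampling script run inside the master-node container. This is only a sketch under assumptions: it assumes a Linux container with a procps-style `ps`, that the process names shown by `ps` match `crawlab-server`, `prometheus` and `weed` (taken from the comment above, not verified against the image), and that `/proc/<pid>/io` is readable by the user running the script. The function names `parse_ps`, `read_io_bytes` and `record` are made up for this illustration.

```python
import subprocess
import time

# Process names from the maintainer's comment; what `ps` actually reports
# inside the container may differ (assumption).
WATCHED = ("crawlab-server", "prometheus", "weed")


def parse_ps(output, watched=WATCHED):
    """Parse `ps -eo pid,comm,%cpu,rss` text into rows for watched processes."""
    wanted = {name[:15] for name in watched}  # Linux truncates comm to 15 chars
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 4 or parts[1] not in wanted:
            continue
        pid, comm, cpu, rss_kib = parts
        rows.append({
            "pid": int(pid),
            "comm": comm,
            "cpu_pct": float(cpu),
            "rss_mib": round(int(rss_kib) / 1024, 1),  # ps reports RSS in KiB
        })
    return rows


def read_io_bytes(pid):
    """Return (read_bytes, write_bytes) from /proc/<pid>/io, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/io") as fh:
            fields = dict(line.strip().split(": ", 1) for line in fh if ": " in line)
        return int(fields["read_bytes"]), int(fields["write_bytes"])
    except (OSError, KeyError, ValueError):
        return None


def record(samples=60, interval=10.0):
    """Print one CSV line per watched process, `samples` times."""
    print("time,pid,comm,cpu_pct,rss_mib,read_bytes,write_bytes")
    for _ in range(samples):
        out = subprocess.run(
            ["ps", "-eo", "pid,comm,%cpu,rss"],
            capture_output=True, text=True, check=True,
        ).stdout
        now = time.strftime("%Y-%m-%dT%H:%M:%S")
        for row in parse_ps(out):
            io = read_io_bytes(row["pid"]) or ("", "")
            print(f'{now},{row["pid"]},{row["comm"]},{row["cpu_pct"]},'
                  f'{row["rss_mib"]},{io[0]},{io[1]}')
        time.sleep(interval)
```

Calling `record(samples=360, interval=10)` and redirecting the output to a file would give roughly an hour of samples to attach to the issue. Note that the `%cpu` column of `ps` is an average over the lifetime of the process rather than an instantaneous value, and the disk counters are cumulative, so differences between consecutive samples are what matter.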

In the meantime, we are close to a new major release (0.7.0), which is in the final stage of testing before the formal announcement. It should address the issue you mentioned, since we have removed most third-party middleware dependencies such as Prometheus and SeaweedFS and replaced them with native Golang code. If you are interested in the EA, please let me know and I'll push the latest "test" version for your trial.

tikazyq added the performance label Jan 9, 2025
KrystianJanas (Author) commented

Thanks for your reply, @tikazyq.
Yes, I'm interested in EA testing. Please push the changes and let me know how to use them. We would really appreciate it!
