Slice error when writing to rolling opensearch cluster #970

godber · 2022-11-08T22:20:30Z

We were rolling an internal 1.3.* Opensearch cluster today and noticed that we started getting slice errors during the roll. Specifically we were getting this error:

class NoLivingConnectionsError extends OpenSearchClientError {
  constructor(message, meta) {
    super(message);
    Error.captureStackTrace(this, NoLivingConnectionsError);
    this.name = 'NoLivingConnectionsError';
    this.message =
      message ||
      'Given the configuration, the ConnectionPool was not able to find a usable Connection for this request.';
    this.meta = meta;
  }
}

https://github.com/opensearch-project/opensearch-js/blob/871a6669c9153d8161b3bbbce8747b86f9e6f758/lib/errors.js#L66

The text was updated successfully, but these errors were encountered:

godber · 2022-11-08T23:25:58Z

The settings on the connector are

          es_os_data1:
            node:
            - http://es-os-data1.qa:9200
            sniffInterval: 60000
            sniffOnConnectionFault: false
            sniffOnStart: false
            suggestCompression: false

And the problem isn't just "during roll" ... slice errors can happen AFTER a roll is complete on a job that has been paused, then resumed well after the roll is complete. The annotations below note when the cluster roll starts and is completed:

So it's like the client behavior has changed and we need to set sniffOnConnectionFault: true now.

Realistically, this is not that big of a surprise that a behavior like this would change. We'll test these scenarios when someone frees up.

kstaken · 2022-11-09T00:35:22Z

That makes sense, if you have both sniffing options turned off then it would likely never know about any of the other nodes beyond the one it connects to so there would be nothing to fail over to. However, I would expect the old client to fail in this case too so if it doesn't that's surprising to me.

godber · 2022-11-09T00:47:21Z

This is probably relevant code, and this is the release we use:

https://github.com/opensearch-project/opensearch-js/blob/1.1.0/lib/Transport.js#L495-L534

It's possible we just need to make a config change here. I had been reluctant to enable sniff on fault before because I didn't know what error handling/retry situation might exist in the client code ... and I didn't want an ES/OS cluster problem to cause ALL of the Teraslice workers to start sniffing at once and exacerbate things. Though that fear was always hypothetical.

We'll dig in with more details tomorrow, as well as testing the ES6 and ES7 specific cases with the new 3.3.0 asset.

godber · 2022-12-21T00:11:23Z

Description

These instructions should make it possible to reproduce this ES asset issue. There is a branch on https://github.com/terascope/teraslice that has the following things to help you reproduce this:

updated docker-compose.yaml file
test jobs
test configs

You can start by updating teraslice and checking out the testRollBug2 branch as follows then following the rest these instructions. Skim down to the Errors if you just want to see whats in the logs.

Setup

git fetch
git checkout testRollBug2

Start up necessary containers:

docker compose down -v  # clean up any old containers from this docker-compose file
docker compose up --detach --force-recreate  # start containers, always recreated

Wait for kafka broker to come up then create kafka topic:

docker exec -it teraslice-kafka-1 kafka-topics.sh --create --partitions 10 --replication-factor 1 --topic test1 --zookeeper kafka:2181

Setup teraslice-cli (earl) and load Teraslice load assets

earl aliases add localhost http://localhost:5678
earl assets deploy localhost terascope/kafka-assets --bundle
earl assets deploy localhost terascope/[email protected] --bundle
earl assets deploy localhost terascope/[email protected] --bundle
earl assets deploy localhost terascope/standard-assets --bundle

Check ES Assets

curl -sS localhost:5678/txt/assets

Register jobs

earl tjm register localhost examples/jobs/data_generator_to_kafka.json
earl tjm register localhost examples/jobs/kafka_to_es2.json
earl tjm register localhost examples/jobs/kafka_to_es3.json

Check Teraslice Jobs and Elasticsearch Clusters

# check teraslice jobs
curl -sS localhost:5678/txt/jobs

# Check ES Data Cluster (empty)
curl -sS localhost:9200/_cat/indices?v
# Check ES State Cluster (state indices)
curl -sS localhost:9201/_cat/indices?v

The initial setup of the test environment is now complete.

Failing Test Scenario

This is the scenario that doesn't work. When using the new version of the ES
asset, (> v3.0), the teraslice job will start having slice errors and still
commit offsets when the ES node is restarting.

earl tjm start examples/jobs/data_generator_to_kafka.json
earl tjm start examples/jobs/kafka_to_es3.json

Now if you restart the ES data cluster node, you will see the job accumulating
slice errors.

docker restart teraslice-elasticsearch-data1-1

# watch logs, note that the job continues
docker logs teraslice-teraslice-master-1 -f | bunyan
docker logs teraslice-teraslice-worker-1 -f | bunyan

Now you can see the errors on the execution controller (found in either the
teraslice-teraslice-master-1 or teraslice-teraslice-worker-1) logs.

[2022-12-20T23:49:45.860Z] ERROR: teraslice/221 on 92c9409f4559: (assignment=execution_controller, module=execution_controller, worker_id=pvcsXGNo, ex_id=a5011bcf-4767-4cd0-bd43-469a4bdf44b4, job_id=f4dfafab-73c8-4b81-95dc-b0037da03d1a)
    worker: 172.30.0.6__oToqavyH has failure completing its slice {
      analytics: {
        time: [ 2393, 13, 1130 ],
        memory: [ 5804000, 179424, 25901352 ],
        size: [ 10000, 10000, 10000 ]
      },
      error: 'TSError: Slice failed processing, caused by TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
        '    at Slice._markFailed (/app/source/packages/teraslice/lib/workers/worker/slice.js:137:15)\n' +
        '    at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:48:17)\n' +
        '    at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)\n' +
        '    at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)\n' +
        'Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
        '    at pRetry (/app/source/packages/utils/dist/src/promises.js:109:21)\n' +
        '    at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:39:22)\n' +
        '    at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)\n' +
        '    at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)\n' +
        '    at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)\n' +
        '    at processTicksAndRejections (internal/process/task_queues.js:95:5)\n' +
        'Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
        '    at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)\n' +
        '    at processTicksAndRejections (internal/process/task_queues.js:95:5)\n' +
        'Caused by: ConnectionError: connect ECONNREFUSED 172.30.0.5:9200\n' +
        '    at ClientRequest.onError (/app/source/node_modules/@opensearch-project/opensearch/lib/Connection.js:126:16)\n' +
        '    at ClientRequest.emit (events.js:400:28)\n' +
        '    at Socket.socketErrorListener (_http_client.js:475:9)\n' +
        '    at Socket.emit (events.js:400:28)\n' +
        '    at emitErrorNT (internal/streams/destroy.js:106:8)\n' +
        '    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n' +
        '    at processTicksAndRejections (internal/process/task_queues.js:82:21)',
      slice: {
        slice_id: '8714f369-9348-40a9-a2ce-a7b158eb16f0',
        slicer_id: 0,
        slicer_order: 11,
        request: {},
        _created: '2022-12-20T23:49:32.959Z'
      }
    }

The error on a single slice is as follows:

curl -sS localhost:9201/teracluster__state-2022.12/_search?q=state:error | jq -r .hits.hits[0]._source.error
TSError: connect ECONNREFUSED 172.30.0.5:9200
    at pRetry (/app/source/packages/utils/dist/src/promises.js:109:21)
    at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:39:22)
    at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)
    at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)
    at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200
    at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
Caused by: ConnectionError: connect ECONNREFUSED 172.30.0.5:9200
    at ClientRequest.onError (/app/source/node_modules/@opensearch-project/opensearch/lib/Connection.js:126:16)
    at ClientRequest.emit (events.js:400:28)
    at Socket.socketErrorListener (_http_client.js:475:9)
    at Socket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)

There were about five slices with errors in my test setup. If you run the
Non-Failing Test Scenario (which uses an older ES asset) you will see that the
job just pauses until Elasticsearch comes back up. It continues without any
slice errors.

Cleaning up you can now shut down the jobs.

earl tjm stop examples/jobs/data_generator_to_kafka.json
earl tjm stop examples/jobs/kafka_to_es3.json

Non Failing Test Scenario

Now we can start with the scenario where the old version of the Elasticsearch
asset behaves as desired. In this case, when we disrupt (restart) the
Elasticsearch data node, the Teraslice job continues to run without erroring
until the ES node comes back up.

earl tjm start examples/jobs/data_generator_to_kafka.json
earl tjm start examples/jobs/kafka_to_es2.json

docker restart teraslice-elasticsearch-data1-1

# watch logs, note that the job continues
docker logs teraslice-teraslice-master-1 -f | bunyan
docker logs teraslice-teraslice-worker-1 -f | bunyan

Shutdown the test jobs

earl tjm stop examples/jobs/data_generator_to_kafka.json
earl tjm stop examples/jobs/kafka_to_es2.json

godber · 2023-01-04T20:38:19Z

@jsnoble in the "Failing Test Scenario" I describe above, I call docker restart teraslice-elasticsearch-data1-1 to restart the Elasticsearch. Doublechecking the config in the Draft PR Here:

https://github.com/terascope/teraslice/pull/3329/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R76

I can confirm that the problem I describe was reproduced specifically with Elasticsearch 7.9.3.

godber · 2023-01-05T18:36:38Z

Just to offer a little clarification on which ES cluster is restarted ... and which Teraslice job gets errors, check this out ... I had TWO ES clusters, "state" and "data". I restart JUST the data cluster and see the errors on the downstream job that writes to that data cluster.

jsnoble · 2023-01-19T22:02:07Z

This should be addressed in v3..4.0

godber · 2023-01-20T00:06:49Z

I've repeated my tests using the testRollBug2 as described above, but using the new 3.4.0 asset and the job resumes correctly now.

cp examples/jobs/kafka_to_es3.json examples/jobs/kafka_to_es3-new-asset.json
# edit new job to use 3.4.0 asset, update all `3-3-0` reference too
vim examples/jobs/kafka_to_es3-new-asset.json
earl assets deploy localhost terascope/[email protected] --bundle
earl tjm register localhost examples/jobs/kafka_to_es3-new-asset.json
earl tjm start examples/jobs/kafka_to_es3-new-asset.json

Now you can restart the elasticsearch data node container and see the job pause ... it doesn't raise any errors in its logs, but the job does recover and continue.

Great job Jared!

godber · 2023-03-01T20:26:51Z

I think we're still seeing this when the data cluster in the diagram above is Opensearch 1.3.*.

We'll come back with more info when we have it.

godber · 2023-03-03T17:57:08Z

@briend can you rework my example on the testRollBug2 branch to test OpenSearch instead of Elasticsearch and see if this issue still exists in that scenario?

briend · 2023-03-06T23:07:43Z

I made a branch here based on testRollBug2 that uses opensearch:1.3.6: https://github.com/terascope/teraslice/tree/testRollBug3.4.1

I couldn't yet reproduce the error following the same steps you did above, when using the es 3.4.1 asset. I was able to reproduce the same error when using the 3.3.0 es asset with opensearch, however.

I even tried adjusting the connection options to match the environment we saw the error, but that didn't change the outcome.

######  Slicers  ######

name                     job_id                                workers_available  workers_active  failed  queued  processed
-----------------------  ------------------------------------  -----------------  --------------  ------  ------  ---------
Kafka To ES 3.4.1        0b4dd7be-681a-419d-b618-d3b457480309  0                  1               0       2       230      
Data Generator To Kafka  fbb1893c-8c62-44e8-a3db-803b5584fb08  0                  1               0       3       93

curl -Ss localhost:9200
{
  "name" : "f043fa3d5f29",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "6PyE2WmqRwi6MjM1XDNe7w",
  "version" : {
    "distribution" : "opensearch",
    "number" : "1.3.6",
    "build_type" : "tar",
    "build_hash" : "cbf74db21db3eb4d79c43caeafc23eec592bf697",
    "build_date" : "2022-10-04T20:26:16.847170Z",
    "build_snapshot" : false,
    "lucene_version" : "8.10.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

I figured it might be a timing/chance thing, so I restarted the data cluster every 45 seconds for a while...

while true; do docker restart teraslice-elasticsearch-data1-1; sleep 45; done
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1
teraslice-elasticsearch-data1-1

The job was still able to plug along without errors, however:

######  Slicers  ######

name                     job_id                                workers_available  workers_active  failed  queued  processed
-----------------------  ------------------------------------  -----------------  --------------  ------  ------  ---------
Kafka To ES 3.4.1        0b4dd7be-681a-419d-b618-d3b457480309  0                  1               0       2       262      
Data Generator To Kafka  fbb1893c-8c62-44e8-a3db-803b5584fb08  0                  1               0       3       176

godber added the bug Something isn't working label Nov 8, 2022

kstaken assigned jsnoble Nov 8, 2022

kstaken added the priority:high label Nov 8, 2022

godber mentioned this issue Jan 4, 2023

Test roll bug2 terascope/teraslice#3329

Closed

jsnoble closed this as completed Jan 19, 2023

godber reopened this Mar 1, 2023

godber assigned briend Mar 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slice error when writing to rolling opensearch cluster #970

Slice error when writing to rolling opensearch cluster #970

godber commented Nov 8, 2022

godber commented Nov 8, 2022

kstaken commented Nov 9, 2022

godber commented Nov 9, 2022

godber commented Dec 21, 2022 •

edited

Loading

godber commented Jan 4, 2023

godber commented Jan 5, 2023

jsnoble commented Jan 19, 2023

godber commented Jan 20, 2023

godber commented Mar 1, 2023

godber commented Mar 3, 2023

briend commented Mar 6, 2023

Slice error when writing to rolling opensearch cluster #970

Slice error when writing to rolling opensearch cluster #970

Comments

godber commented Nov 8, 2022

godber commented Nov 8, 2022

kstaken commented Nov 9, 2022

godber commented Nov 9, 2022

godber commented Dec 21, 2022 • edited Loading

Description

Setup

Failing Test Scenario

Non Failing Test Scenario

godber commented Jan 4, 2023

godber commented Jan 5, 2023

jsnoble commented Jan 19, 2023

godber commented Jan 20, 2023

godber commented Mar 1, 2023

godber commented Mar 3, 2023

briend commented Mar 6, 2023

godber commented Dec 21, 2022 •

edited

Loading