Slice error when writing to rolling opensearch cluster #970
The settings on the connector are
And the problem isn't just "during roll" ... slice errors can happen AFTER a roll is complete, on a job that has been paused and then resumed well after the roll is complete. The annotations below note when the cluster roll starts and is completed.

So it's like the client behavior has changed and we need to adjust the connection settings. Realistically, it's not that big of a surprise that a behavior like this would change. We'll test these scenarios when someone frees up.
That makes sense. If you have both sniffing options turned off, the client would likely never know about any of the other nodes beyond the one it connects to, so there would be nothing to fail over to. However, I would expect the old client to fail in this case too, so if it doesn't, that's surprising to me.
This is probably the relevant code, and this is the release we use: https://github.com/opensearch-project/opensearch-js/blob/1.1.0/lib/Transport.js#L495-L534

It's possible we just need to make a config change here. I had been reluctant to enable sniff on fault before because I didn't know what error handling/retry situation might exist in the client code ... and I didn't want an ES/OS cluster problem to cause ALL of the Teraslice workers to start sniffing at once and exacerbate things. Though that fear was always hypothetical. We'll dig in with more details tomorrow, as well as testing the ES6 and ES7 specific cases with the new 3.3.0 asset.
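For reference, here is a rough sketch of the kind of sniffing configuration being discussed, using the opensearch-js client directly. The node URL and option values below are illustrative assumptions, not the actual connector settings from this issue:

```js
const { Client } = require('@opensearch-project/opensearch');

// Illustrative values only -- not the connector settings used in this issue.
const client = new Client({
  node: 'http://localhost:9200',
  // Re-discover cluster nodes when a connection fails, so requests can
  // fail over to the surviving nodes during a rolling restart.
  sniffOnConnectionFault: true,
  // Sniffing can also be done at startup or on a fixed interval (ms);
  // both are left off here.
  sniffOnStart: false,
  sniffInterval: false,
  maxRetries: 3,
});

module.exports = client;
```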
### Description

These instructions should make it possible to reproduce this ES asset issue. There is a branch on https://github.com/terascope/teraslice with changes that will help you reproduce this.
You can start by updating your local checkout.

### Setup

```sh
git fetch
git checkout testRollBug2
```

Start up necessary containers:

```sh
docker compose down -v # clean up any old containers from this docker-compose file
docker compose up --detach --force-recreate # start containers, always recreated
```

Wait for the kafka broker to come up, then create the kafka topic:

```sh
docker exec -it teraslice-kafka-1 kafka-topics.sh --create --partitions 10 --replication-factor 1 --topic test1 --zookeeper kafka:2181
```

Setup earl:

```sh
earl aliases add localhost http://localhost:5678
earl assets deploy localhost terascope/kafka-assets --bundle
earl assets deploy localhost terascope/[email protected] --bundle
earl assets deploy localhost terascope/[email protected] --bundle
earl assets deploy localhost terascope/standard-assets --bundle
```

Check ES Assets:

```sh
curl -sS localhost:5678/txt/assets
```

Register jobs:

```sh
earl tjm register localhost examples/jobs/data_generator_to_kafka.json
earl tjm register localhost examples/jobs/kafka_to_es2.json
earl tjm register localhost examples/jobs/kafka_to_es3.json
```

Check Teraslice Jobs and Elasticsearch Clusters:

```sh
# check teraslice jobs
curl -sS localhost:5678/txt/jobs
# Check ES Data Cluster (empty)
curl -sS localhost:9200/_cat/indices?v
# Check ES State Cluster (state indices)
curl -sS localhost:9201/_cat/indices?v
```

The initial setup of the test environment is now complete.

### Failing Test Scenario

This is the scenario that doesn't work, using the new (3.3.0) version of the ES asset. Start the jobs:

```sh
earl tjm start examples/jobs/data_generator_to_kafka.json
earl tjm start examples/jobs/kafka_to_es3.json
```

Now if you restart the ES data cluster node, you will see the job accumulating slice errors:

```sh
docker restart teraslice-elasticsearch-data1-1
# watch logs, note that the job continues
docker logs teraslice-teraslice-master-1 -f | bunyan
docker logs teraslice-teraslice-worker-1 -f | bunyan
```

Now you can see the errors on the execution controller (found in either of the logs above):

```
[2022-12-20T23:49:45.860Z] ERROR: teraslice/221 on 92c9409f4559: (assignment=execution_controller, module=execution_controller, worker_id=pvcsXGNo, ex_id=a5011bcf-4767-4cd0-bd43-469a4bdf44b4, job_id=f4dfafab-73c8-4b81-95dc-b0037da03d1a)
worker: 172.30.0.6__oToqavyH has failure completing its slice {
analytics: {
time: [ 2393, 13, 1130 ],
memory: [ 5804000, 179424, 25901352 ],
size: [ 10000, 10000, 10000 ]
},
error: 'TSError: Slice failed processing, caused by TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
' at Slice._markFailed (/app/source/packages/teraslice/lib/workers/worker/slice.js:137:15)\n' +
' at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:48:17)\n' +
' at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)\n' +
' at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)\n' +
'Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
' at pRetry (/app/source/packages/utils/dist/src/promises.js:109:21)\n' +
' at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:39:22)\n' +
' at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)\n' +
' at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)\n' +
' at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)\n' +
' at processTicksAndRejections (internal/process/task_queues.js:95:5)\n' +
'Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200\n' +
' at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)\n' +
' at processTicksAndRejections (internal/process/task_queues.js:95:5)\n' +
'Caused by: ConnectionError: connect ECONNREFUSED 172.30.0.5:9200\n' +
' at ClientRequest.onError (/app/source/node_modules/@opensearch-project/opensearch/lib/Connection.js:126:16)\n' +
' at ClientRequest.emit (events.js:400:28)\n' +
' at Socket.socketErrorListener (_http_client.js:475:9)\n' +
' at Socket.emit (events.js:400:28)\n' +
' at emitErrorNT (internal/streams/destroy.js:106:8)\n' +
' at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n' +
' at processTicksAndRejections (internal/process/task_queues.js:82:21)',
slice: {
slice_id: '8714f369-9348-40a9-a2ce-a7b158eb16f0',
slicer_id: 0,
slicer_order: 11,
request: {},
_created: '2022-12-20T23:49:32.959Z'
}
}
```

The error on a single slice is as follows:

```sh
curl -sS localhost:9201/teracluster__state-2022.12/_search?q=state:error | jq -r .hits.hits[0]._source.error
```

```
TSError: connect ECONNREFUSED 172.30.0.5:9200
at pRetry (/app/source/packages/utils/dist/src/promises.js:109:21)
at async Slice.run (/app/source/packages/teraslice/lib/workers/worker/slice.js:39:22)
at async Worker.runOnce (/app/source/packages/teraslice/lib/workers/worker/index.js:164:13)
at async _run (/app/source/packages/teraslice/lib/workers/worker/index.js:111:17)
at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Caused by: TSError: connect ECONNREFUSED 172.30.0.5:9200
at _errorHandlerFn (/app/assets/608dc2dcfe2e76854e9e209e0dae8a43832a9ea5/index.js:341394:15)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Caused by: ConnectionError: connect ECONNREFUSED 172.30.0.5:9200
at ClientRequest.onError (/app/source/node_modules/@opensearch-project/opensearch/lib/Connection.js:126:16)
at ClientRequest.emit (events.js:400:28)
at Socket.socketErrorListener (_http_client.js:475:9)
at Socket.emit (events.js:400:28)
at emitErrorNT (internal/streams/destroy.js:106:8)
at emitErrorCloseNT (internal/streams/destroy.js:74:3)
at processTicksAndRejections (internal/process/task_queues.js:82:21)
```

There were about five slices with errors in my test setup.

### Cleaning up

You can now shut down the jobs:

```sh
earl tjm stop examples/jobs/data_generator_to_kafka.json
earl tjm stop examples/jobs/kafka_to_es3.json
```

### Non Failing Test Scenario

Now we can repeat the scenario with the old (2.7.0) version of the Elasticsearch asset:

```sh
earl tjm start examples/jobs/data_generator_to_kafka.json
earl tjm start examples/jobs/kafka_to_es2.json
docker restart teraslice-elasticsearch-data1-1
# watch logs, note that the job continues
docker logs teraslice-teraslice-master-1 -f | bunyan
docker logs teraslice-teraslice-worker-1 -f | bunyan
```

Shutdown the test jobs:

```sh
earl tjm stop examples/jobs/data_generator_to_kafka.json
earl tjm stop examples/jobs/kafka_to_es2.json
```
@jsnoble in the "Failing Test Scenario" I describe above, I can confirm that the problem was reproduced specifically with Elasticsearch 7.9.3.
This should be addressed in v3.4.0.
I've repeated my tests using the new 3.4.0 asset.
Now you can restart the elasticsearch data node container and see the job pause ... it doesn't raise any errors in its logs, but the job does recover and continue. Great job Jared!
I think we're still seeing this when the data cluster in the diagram above is Opensearch 1.3.*. We'll come back with more info when we have it.
@briend can you rework my example on the testRollBug2 branch to use an Opensearch 1.3.* data cluster?
I made a branch here based on testRollBug2. I couldn't yet reproduce the error following the same steps you did above when using the es asset. I even tried adjusting the connection options to match the environment where we saw the error, but that didn't change the outcome.
I figured it might be a timing/chance thing, so I restarted the data cluster every 45 seconds for a while...
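For reference, a minimal sketch of that kind of restart loop (an illustration, not the exact command used; the container name is the one from the setup above):

```js
// restart-loop.js -- repeatedly restart the ES/OS data node container to try
// to catch the job mid-connection and trigger the slice error by chance.
const { execSync } = require('child_process');

const CONTAINER = 'teraslice-elasticsearch-data1-1';
const INTERVAL_MS = 45 * 1000;

setInterval(() => {
  console.log(`${new Date().toISOString()} restarting ${CONTAINER}`);
  execSync(`docker restart ${CONTAINER}`, { stdio: 'inherit' });
}, INTERVAL_MS);
```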
The job was still able to plug along without errors, however.
We were rolling an internal 1.3.* Opensearch cluster today and noticed that we started getting slice errors during the roll. Specifically we were getting this error: https://github.com/opensearch-project/opensearch-js/blob/871a6669c9153d8161b3bbbce8747b86f9e6f758/lib/errors.js#L66
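That link points at the client's `ConnectionError` class. As a rough illustration (with an assumed index name and document, not taken from the actual job), this is the kind of failure the client surfaces when the node it is talking to goes away mid-roll:

```js
const { Client, errors } = require('@opensearch-project/opensearch');

const client = new Client({ node: 'http://localhost:9200' });

async function indexOne() {
  try {
    // Placeholder index and document, purely for illustration.
    await client.index({ index: 'test-index', body: { ok: true } });
  } catch (err) {
    if (err instanceof errors.ConnectionError) {
      // The error type linked above, e.g. "connect ECONNREFUSED ..." while
      // the data node is restarting during a cluster roll.
      console.error('connection-level failure:', err.message);
      return;
    }
    throw err;
  }
}

indexOne();
```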