Found that the spaces_reader returns too many records unless using 1 worker and 1 slicer. Reducing the interval to 1s, independent of the number of workers or slicers, returns close to the correct number of records, but the count is still slightly high.
Used a control data set of 6.95M records in all the tests.
Tests were run with elasticsearch-asset version 2.6.2, node-12 on dataeng3, teraslice version 0.76.1.
| workers | slicers | interval | docs returned (M) |
|---------|---------|----------|-------------------|
| 20      | 10      | auto     | 8.38              |
| 20      | 1       | auto     | 7.81              |
| 1       | 1       | auto     | 6.95              |
| 20      | 10      | 1s       | 6.97              |
| 20      | 10      | 1m       | 8.19              |
| 20      | 10      | 1hr      | 8.48              |
Ran a job with 20 workers, 10 slicers, and interval auto that deduped the records; the count came to 6.95M, so it looks like the reader is picking up duplicate records.
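As a rough illustration of that verification step (not the actual dedup job — the field name `_key` is a hypothetical stand-in for whatever unique identifier the records carry), comparing the total fetched count against the unique count looks like:

```python
# Hypothetical sample of fetched records; in the real test these came
# from the spaces_reader job output. "_key" stands in for whatever
# unique identifier the records carry.
records = [
    {"_key": "a1", "value": 1},
    {"_key": "a2", "value": 2},
    {"_key": "a1", "value": 1},  # duplicate picked up by an overlapping slice
]

total = len(records)
unique = len({r["_key"] for r in records})
duplicates = total - unique

print(f"total={total} unique={unique} duplicates={duplicates}")
```

A gap between the total and unique counts is what the dedup job surfaced here: 8.38M records fetched versus 6.95M after dedup.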