
Detected PII word's "start" and "end" are returning the wrong positions #96

Closed
flaviabeo opened this issue Jun 25, 2024 · 5 comments · Fixed by #98 or #102
Assignees
Labels
bug Something isn't working

Comments


flaviabeo commented Jun 25, 2024

Describe the bug

The start and end fields returned differ from the expected values. For example, the e-mail address is not at the position indicated by the offsets in the response for the detected PII.

Platform

GR Version 2.0 NLP Client, TLS

Sample Code

POST call to /api/v1/task/classification-with-text-generation with the payload:

{
    "inputs": "I hate cats they are stupid. I love dogs [email protected]. Rabbits are pretty but",
    "model_id": "bloom-560m",
    "guardrail_config": {
        "input": {
            "models": {
                "en_syntax_rbr_pii": {
                    "threshold": 0.8
                }
            }
        },
        "output": {
            "models": {}
        }
    },
    "text_gen_parameters": {
        "preserve_input_text": true,
        "max_new_tokens": 99,
        "min_new_tokens": 1,
        "truncate_input_tokens": 500,
        "decoding_method": "SAMPLING",
        "top_k": 2,
        "top_p": 0.9,
        "typical_p": 0.5,
        "temperature": 0.8,
        "seed": 42,
        "repetition_penalty": 2,
        "max_time": 0,
        "stop_sequences": [
            "42"
        ]
    }
}

Expected behavior

{
    "start": 41,
    "end": 54,
    "word": "[email protected]",
    "entity": "EmailAddress",
    "entity_group": "",
    "score": 0.8
}

Observed behavior

{
    "token_classification_results": {
        "input": [
            {
                "start": 12,
                "end": 25,
                "word": "they are stup",
                "entity": "EmailAddress",
                "entity_group": "pii",
                "score": 0.8
            }
        ]
    },
    "input_token_count": 21,
    "warnings": [
        {
            "id": "UNSUITABLE_INPUT",
            "message": "Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."
        }
    ]
}
@gkumbhat gkumbhat added the bug Something isn't working label Jun 25, 2024
@gkumbhat

Thanks @flaviabeo. This should have been fixed with #47.


evaline-ju commented Jun 25, 2024

What version of the orchestrator is being used here? Edit: I was able to reproduce on the latest (b3231a5).

The results are correct if a sentence chunker is not used, which suggests the issue may still lie with the offsets...


evaline-ju commented Jun 25, 2024

Root cause: the detectors in use do not return one output per input, i.e., each text in contents did not necessarily return a result. Since the orchestrator cannot tell which "chunk" (text) a result corresponds to, the offset is calculated incorrectly. This has to be fixed on the detector server side.
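For illustration, the offset arithmetic can be sketched as follows (the function and variable names here are hypothetical, not the orchestrator's actual code): each chunk carries its start offset in the original text, and a detector result must be shifted by the offset of the chunk it came from. If results arrive without a chunk identifier and some chunks produce none, that pairing, and hence the shift, becomes ambiguous.

```python
def to_global_offsets(chunks, results_by_chunk):
    """Shift chunk-local (start, end) spans into offsets in the original text.

    chunks: list of (chunk_start, chunk_text) pairs from the chunker.
    results_by_chunk: dict mapping chunk index -> list of (start, end, word);
    keying by chunk index makes missing chunks unambiguous (the detector-side fix).
    """
    out = []
    for idx, (chunk_start, _chunk_text) in enumerate(chunks):
        for start, end, word in results_by_chunk.get(idx, []):
            out.append((chunk_start + start, chunk_start + end, word))
    return out

# Two sentence chunks; only the second yields a detection.
chunks = [
    (0, "I hate cats they are stupid."),
    (29, "I love dogs [email protected]."),
]
# Chunk-local span of the address inside the second chunk is 12..25.
results = {1: [(12, 25, "[email protected]")]}
print(to_global_offsets(chunks, results))  # [(41, 54, '[email protected]')]
```

Note that the buggy observed span (12..25) is consistent with the chunk-local offsets being applied against the full input, i.e., the shift by the second chunk's start was never applied because the result could not be attributed to its chunk.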

In the meantime, it was noticed that with #76 we are doing unnecessary codepoint slicing again to determine the text.

@evaline-ju

With the merge of #102, there will be a clear error message and a 500 response until this is updated for the requested detector(s).

@flaviabeo

Fixed by PR #102!
