
Detected PII word's "start" and "end" are returning the wrong positions #96

Closed
flaviabeo opened this issue Jun 25, 2024 · 5 comments · Fixed by #98 or #102
Assignees
Labels
bug Something isn't working

Comments


flaviabeo commented Jun 25, 2024

Describe the bug

The start and end fields returned differ from the expected values. For example, the e-mail address is not at the position indicated by the offsets in the response for the detected PII.

Platform

GR Version 2.0 NLP Client, TLS

Sample Code

POST call to /api/v1/task/classification-with-text-generation with the payload:

{
    "inputs": "I hate cats they are stupid. I love dogs [email protected]. Rabbits are pretty but",
    "model_id": "bloom-560m",
    "guardrail_config": {
        "input": {
            "models": {
                "en_syntax_rbr_pii": {
                    "threshold": 0.8
                }
            }
        },
        "output": {
            "models": {}
        }
    },
    "text_gen_parameters": {
        "preserve_input_text": true,
        "max_new_tokens": 99,
        "min_new_tokens": 1,
        "truncate_input_tokens": 500,
        "decoding_method": "SAMPLING",
        "top_k": 2,
        "top_p": 0.9,
        "typical_p": 0.5,
        "temperature": 0.8,
        "seed": 42,
        "repetition_penalty": 2,
        "max_time": 0,
        "stop_sequences": [
            "42"
        ]
    }
}

Expected behavior

{
    "start": 41,
    "end": 54,
    "word": "[email protected]",
    "entity": "EmailAddress",
    "entity_group": "",
    "score": 0.8
}

Observed behavior

{
    "token_classification_results": {
        "input": [
            {
                "start": 12,
                "end": 25,
                "word": "they are stup",
                "entity": "EmailAddress",
                "entity_group": "pii",
                "score": 0.8
            }
        ]
    },
    "input_token_count": 21,
    "warnings": [
        {
            "id": "UNSUITABLE_INPUT",
            "message": "Unsuitable input detected. Please check the detected entities on your input and try again with the unsuitable input removed."
        }
    ]
}
@gkumbhat gkumbhat added the bug Something isn't working label Jun 25, 2024
@gkumbhat

Thanks @flaviabeo. This should have been fixed with #47.


evaline-ju commented Jun 25, 2024

What version of the orchestrator is being used here? Edit: I was able to reproduce on the latest (b3231a5).

The results are correct if a sentence chunker is not used, which suggests the issue may still lie with the offsets...


evaline-ju commented Jun 25, 2024

Root cause: the detectors in use do not return one output per input, i.e., each text in contents did not necessarily return a result. Since the orchestrator cannot tell which "chunk" (text) a result corresponds to, the offset is calculated incorrectly. This has to be fixed on the detector server side.
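For illustration, the offset arithmetic can be sketched as follows (the function and variable names here are hypothetical, not the orchestrator's actual code): each chunk carries its start offset in the original text, and a detector result must be shifted by the offset of the chunk it came from. If results arrive without a chunk identifier and some chunks produce none, that pairing, and hence the shift, becomes ambiguous.

```python
def to_global_offsets(chunks, results_by_chunk):
    """Shift chunk-local (start, end) spans into offsets in the original text.

    chunks: list of (chunk_start, chunk_text) pairs from the chunker.
    results_by_chunk: dict mapping chunk index -> list of (start, end, word);
    keying by chunk index makes missing chunks unambiguous (the detector-side fix).
    """
    out = []
    for idx, (chunk_start, _chunk_text) in enumerate(chunks):
        for start, end, word in results_by_chunk.get(idx, []):
            out.append((chunk_start + start, chunk_start + end, word))
    return out

# Two sentence chunks; only the second yields a detection.
chunks = [
    (0, "I hate cats they are stupid."),
    (29, "I love dogs [email protected]."),
]
# Chunk-local span of the address inside the second chunk is 12..25.
results = {1: [(12, 25, "[email protected]")]}
print(to_global_offsets(chunks, results))  # [(41, 54, '[email protected]')]
```

Note that the buggy observed span (12..25) is consistent with the chunk-local offsets being applied against the full input, i.e., the shift by the second chunk's start was never applied because the result could not be attributed to its chunk.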

In the meantime, it was noticed that with #76 we are doing unnecessary codepoint slicing again to determine the text.

@evaline-ju

With the merge of #102, there will be a clear error message and a 500 response until this is updated for the requested detector(s).

@flaviabeo

Fixed by PR #102!
