Search is not correctly working for words containing non-common characters like Turkish "İ" #33003

akolhun · 2024-12-05T15:51:27Z

Describe the bug
While a document is stored with the field value "ÜRÜNLERİ", it cannot then be found by exactly the same keyword "ÜRÜNLERİ" (but it is found by "ÜRÜNLERI").

To Reproduce
Given the schema as

schema test_schema {
    document test_schema {

        field sku type string {
            indexing: summary | attribute
            match {
              word
            }
        }

    }
}

And document indexed as :

{
    "fields": {
        "sku": "ÜRÜNLERİ"
    }
}

Then the following search query does not return the doc

"yql": "select * from test_schema where sku contains 'ÜRÜNLERİ'",

but this one with "incorrect" "I" returns the doc:

"yql": "select * from test_schema where sku contains 'ÜRÜNLERI'",

Expected behavior
Search returns the doc for the search term "ÜRÜNLERİ".

Environment
docker image: vespaengine/vespa:8.452.13

Vespa version
8.452.13

Additional context
The issue can be reproduced with the attached app package:
vespa_encoding_issue.zip

Indexing request:

curl --location 'http://localhost:8080/document/v1/test/test_schema/docid/test_doc_123' \
--header 'Content-Type: application/json' \
--data '{
    "fields": {
        "sku": "ÜRÜNLERİ"
    }
}
'

Search request:

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "user": "ak",
    "yql": "select * from test_schema where sku contains '\''ÜRÜNLERİ'\''"
}'


akolhun · 2024-12-05T16:35:54Z

We also tried an approach with an explicit language set during both feeding and search:

...
        field language type string {
            indexing: summary | attribute | set_language
            attribute: fast-search
            match: word
        }

Then saving the doc with language = 'tr-TR' and searching it as:

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "user": "ak",
    "yql": "select * from test_schema where sku contains '\''ÜRÜNLERİ'\''",
    "language": "tr-TR"
}'

does not succeed either

We assume the issue happens at the feeding level while lowercasing the value.
See the UTF-16 decimal codes of "İ" and the related letters:

I: 73
ı: 305
İ: 304
i: 105

Now, in the Turkish alphabet, lowercased I (73) is ı (305),
while in English, lowercased I (73) is i (105).
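
To make the two conventions concrete, here is a minimal Python sketch. It is not Vespa code: `turkish_lower` and `TURKISH_LOWER` are hypothetical helpers written only for this illustration, and Python's built-in `str.lower()` stands in for a locale-independent lowercasing.

```python
# Code points of the four "i" letters listed above.
LETTERS = ["I", "\u0131", "\u0130", "i"]  # I, ı, İ, i
code_points = {ch: ord(ch) for ch in LETTERS}

# Hypothetical mapping (an assumption for illustration): Turkish lowercasing
# differs from the default only for these two capital letters.
TURKISH_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> ı, İ -> i


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i convention."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in text)


def default_lower(text: str) -> str:
    """Lowercase text with Python's locale-independent Unicode mapping."""
    return text.lower()


if __name__ == "__main__":
    for ch, cp in code_points.items():
        print(ch, cp)
    # Default mapping turns İ into "i" followed by U+0307 (combining dot above).
    print([hex(ord(c)) for c in default_lower("\u0130")])
    print(turkish_lower("\u00DCR\u00DCNLER\u0130"))
```

With the default mapping, "İ" becomes two code points ("i" followed by the combining dot above, U+0307), while the Turkish convention yields a plain "i".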

jobergum · 2024-12-06T09:23:53Z

Hey, thanks for the detailed ticket. Attribute fields are not subject to linguistic processing at indexing or query time, so this is unrelated to language settings/set_language. This issue is related to case folding, as using match:cased works well.

field sku type string {
            indexing: summary | attribute
            match:cased

 }

Tracing with tracelevel=9 shows that cased matching avoids the faulty lowercasing in the container:

vespa query 'yql=select * from msmarco where sku contains "ÜRÜNLERİ"' 'tracelevel=9'

 {
                                "message": "msmarco.num0 search to dispatch: query=[sku:ÜRÜNLERİ] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=c168cd80-a971-4c95-bb58-e058dcd61332.1733475402182.9.default grouping=0 :  restrict=[msmarco]"
                            },
                            {
                                "message": "Current state of query tree: EXACTSTRING[fromSegmented=false index=\"sku\" origin=null segmentIndex=0 stemmed=false uniqueID=1 words=true]{\n  \"ÜRÜNLERİ\"\n}\n"
                            },
  "attribute": {
                                                                "[type]": "IAttributeVector",
                                                                "name": "sku",
                                                                "type": "string",
                                                                "fast_search": false,
                                                                "filter": false
                                                            },
                                                            "query_term": "\u00DCR\u00DCNLER\u0130"
                                                        },

Without cased matching (the default), you get the following trace:

 {
                                "message": "msmarco.num0 search to dispatch: query=[sku:ürünleri̇] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=c168cd80-a971-4c95-bb58-e058dcd61332.1733475567095.11.default grouping=0 :  restrict=[msmarco]"
                            },
                            {
                                "message": "Current state of query tree: EXACTSTRING[fromSegmented=false index=\"sku\" origin=null segmentIndex=0 stemmed=false uniqueID=1 words=true]{\n  \"ürünleri̇\"\n}\n"
                            },
                            {
                                "message": "YQL+ representation: select * from msmarco where sku contains ({normalizeCase: false, id: 1}\"\\u00FCr\\u00FCnleri\\u0307\") timeout 9998"
                            },

So it looks like the lowercasing in the stateless container layer is the issue here.
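
The lowercased term in the trace can be reproduced outside Vespa with a small Python sketch. This assumes that Python's built-in `str.lower()` behaves like the lowercasing applied in the container for this input; it is an illustration only, and `escape_non_ascii` is a hypothetical helper, not part of any Vespa API.

```python
def escape_non_ascii(text: str) -> str:
    """Render non-ASCII characters as \\uXXXX escapes, like the YQL trace does."""
    return "".join(ch if ord(ch) < 128 else f"\\u{ord(ch):04X}" for ch in text)


query = "\u00DCR\u00DCNLER\u0130"   # "ÜRÜNLERİ", ending in dotted capital İ
plain = "\u00DCR\u00DCNLERI"        # "ÜRÜNLERI", ending in ASCII capital I

lowered_query = query.lower()       # İ becomes "i" + U+0307 (combining dot above)
lowered_plain = plain.lower()       # I becomes a plain "i"

if __name__ == "__main__":
    print(escape_non_ascii(lowered_query))  # \u00FCr\u00FCnleri\u0307
    print(escape_non_ascii(lowered_plain))  # \u00FCr\u00FCnleri
    print(len(query), len(lowered_query))   # 8 9
```

The first printed string has the same trailing `\u0307` as the term in the YQL+ representation above, so the lowercased query term is one code point longer than the original and differs from the lowercased "ÜRÜNLERI".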

akolhun changed the title ~~Search is not correctly working for words containing special charaters like Turkish "İ"~~ Search is not correctly working for words containing non-common charaters like Turkish "İ" Dec 5, 2024

bjormel assigned bjormel and unassigned bjormel Dec 6, 2024

kkraune added question and removed question labels Dec 10, 2024

hmusum added this to the soon milestone Dec 11, 2024

hmusum assigned toregge Dec 11, 2024

hmusum changed the title ~~Search is not correctly working for words containing non-common charaters like Turkish "İ"~~ Search is not correctly working for words containing non-common characters like Turkish "İ" Dec 11, 2024
