[DOC] Tokenizer - Edge-n-gram #8378
Conversation
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.
@udabhas Will you review this PR for technical accuracy, or have a peer review it? Thank you.
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Looks good to me.
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws Please see my comments and changes and tag me for approval when complete. Thanks!
# Edge n-gram tokenizer

The `edge_ngram` tokenizer generates partial word tokens, or n-grams, starting from the beginning of each word. It splits the text based on specified characters and produces tokens of lengths defined by the `min_gram` and `max_gram` parameters. This tokenizer is particularly useful for implementing search-as-you-type functionality.
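As an illustration (not part of the original excerpt), the tokenizer's behavior can be tested directly with the `_analyze` API; the `min_gram`, `max_gram`, and `token_chars` values below are assumed for demonstration:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 4,
    "token_chars": ["letter"]
  },
  "text": "Laptop"
}
```

With these settings, the word `Laptop` produces the tokens `La`, `Lap`, and `Lapt`: each token starts at the beginning of the word and grows by one character up to `max_gram`.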
Line 10: If my rewrite to the second sentence isn't quite right, then let's name what we're referring to: "produces tokens of lengths defined by `min_gram` and `max_gram`."
Edge n-grams are ideal for autocomplete searches where the order of the words may vary, such as with product names or addresses. For more information, see [Autocomplete]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/autocomplete/). However, for text with a fixed order, like movie or song titles, the completion suggester may be more efficient.
Line 12: "efficient" => "accurate"?
_analyzers/tokenizers/edge-n-gram.md
## Configuring search as you type
To implement search-as-you-type functionality, use the `edge_ngram` tokenizer during indexing and a simpler analyzer at search time. The following configuration demonstrates this approach:
Same comment re: "a simpler analyzer". What/which ones would constitute "a simpler analyzer"?
{% include copy-curl.html %}
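The full configuration was not captured in this excerpt. A minimal sketch of such a setup follows; the index name `products`, the analyzer names, and the gram sizes are all assumptions for illustration, not the exact values from the PR:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "my_edge_ngram",
          "filter": ["lowercase"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```

The key design choice is the split between `analyzer` (applied at index time, producing edge n-grams) and `search_analyzer` (applied at query time, producing whole lowercase terms), so that a short query prefix matches the pre-generated n-grams.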
Index a document containing a `product` field and refresh the index:
This line reads as though we're providing instructions, but line 128 implies merely a demonstration.
Reworded the previous line as instructions
{% include copy-curl.html %}
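The indexing request itself was not captured in this excerpt. A minimal sketch, assuming an index named `products` with a `product` text field, might look like the following; `?refresh=true` makes the document immediately searchable:

```json
PUT /products/_doc/1?refresh=true
{
  "product": "Laptop"
}
```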
This configuration ensures that the `edge_ngram` tokenizer breaks terms like `Laptop` into tokens such as `La`, `Lap`, and `Lapt`, allowing partial matches during search. At search time, the `standard` tokenizer simplifies queries while ensuring matches are case insensitive because of the lowercase filter.
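To see the partial matching in action, a query for a prefix can be run against the field. This sketch assumes an index named `products` whose `product` field was indexed with an edge n-gram analyzer and searched with a `standard`-tokenizer analyzer, as described above:

```json
GET /products/_search
{
  "query": {
    "match": {
      "product": "lap"
    }
  }
}
```

Because `lap` is lowercased at search time and the indexed n-grams were lowercased as well, this query matches the `lap` token generated from `Laptop`.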
Is "the standard tokenizer" what we mean by "a simpler analyzer" on lines 124 and 128?
No, the `standard` tokenizer.
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws LGTM!
* Adding documentation for n-gram tokenizer. Signed-off-by: [email protected] <[email protected]>
* Tokenizers edge-n-gram. Signed-off-by: [email protected] <[email protected]>
* format: formatting updates to page. Signed-off-by: [email protected] <[email protected]>
* format: formatting updates to page. Signed-off-by: [email protected] <[email protected]>
* format: formatting updates to page. Signed-off-by: [email protected] <[email protected]>
* Doc review. Signed-off-by: Fanit Kolchina <[email protected]>
* Apply suggestions from code review. Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
* Update _analyzers/tokenizers/edge-n-gram.md. Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
* Update _analyzers/tokenizers/edge-n-gram.md. Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
* Update _analyzers/tokenizers/edge-n-gram.md. Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit ca0cc23)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Adds the edge n-gram tokenizer documentation to the Analyzers section.
Issues Resolved
Part of #1483 addressed in this PR.
Version
All
Frontend features
n/a
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.