
[DOC] Tokenizer - Edge-n-gram #8378

Merged

Conversation

leanneeliatra
Contributor

Description

Addition of the Tokenizer Edge-n-gram documentation, to the Analyzers section.

Issues Resolved

Part of #1483 addressed in this PR.

Version

All

Frontend features

n/a

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Thank you for submitting your PR. PRs move through the following states: In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@kolchfa-aws kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Sep 25, 2024
@vagimeli vagimeli added the 2 - In progress Issue/PR: The issue or PR is in progress. label Sep 25, 2024
@leanneeliatra leanneeliatra marked this pull request as ready for review September 27, 2024 16:21
@vagimeli vagimeli added 3 - Tech review PR: Tech review in progress Needs SME Waiting on input from subject matter expert analyzers and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Sep 30, 2024
@vagimeli
Contributor

@udabhas Will you review this PR for technical accuracy, or have a peer review it? Thank you.

@udabhas udabhas left a comment
Looks good to me.

Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws kolchfa-aws assigned kolchfa-aws and unassigned vagimeli Jan 2, 2025
@kolchfa-aws kolchfa-aws added the backport 2.18 PR: Backport label for 2.18 label Jan 2, 2025
Collaborator

@natebower natebower left a comment

@kolchfa-aws Please see my comments and changes and tag me for approval when complete. Thanks!

_analyzers/tokenizers/edge-n-gram.md
# Edge n-gram tokenizer

The `edge_ngram` tokenizer generates partial word tokens, or n-grams, starting from the beginning of each word. It splits the text based on specified characters and produces tokens with lengths defined by a minimum and maximum length. This tokenizer is particularly useful for implementing search-as-you-type functionality.
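As a quick illustration (not part of the PR diff), a standalone `_analyze` request along these lines shows the tokens the tokenizer emits; the `min_gram`, `max_gram`, and `token_chars` values here are example choices, not values prescribed by the PR:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 4,
    "token_chars": ["letter"]
  },
  "text": "Laptop"
}
```

With these settings, `Laptop` produces the tokens `La`, `Lap`, and `Lapt`.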

Collaborator

Line 10: If my rewrite to the second sentence isn't quite right, then let's name what we're referring to: "produces tokens of lengths defined by min_gram and max_gram."

_analyzers/tokenizers/edge-n-gram.md

Edge n-grams are ideal for autocomplete searches where the order of the words may vary, such as with product names or addresses. For more information, see [Autocomplete]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/autocomplete/). However, for text with a fixed order, like movie or song titles, the completion suggester may be more efficient.

Collaborator

Line 12: "efficient" => "accurate"?

_analyzers/tokenizers/edge-n-gram.md

## Configuring search as you type

To implement search-as-you-type functionality, use the `edge_ngram` tokenizer during indexing and a simpler analyzer at search time. The following configuration demonstrates this approach:
Collaborator

Same comment re: "a simpler analyzer". What/which ones would constitute "a simpler analyzer"?

{% include copy-curl.html %}
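The PR's exact configuration snippet is not reproduced in this thread, but a sketch of the pattern being discussed, an `edge_ngram` analyzer at index time and the `standard` analyzer at search time, might look like the following; the index name `products`, the analyzer and tokenizer names, and the gram sizes are all assumptions for illustration:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Here `analyzer` applies edge n-grams when documents are indexed, while `search_analyzer` keeps query terms whole so the user's input is not itself split into n-grams.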

Index a document containing a `product` field and refresh the index:
Collaborator

This line reads as though we're providing instructions, but line 128 implies merely a demonstration.

Collaborator

Reworded the previous line as instructions

_analyzers/tokenizers/edge-n-gram.md
{% include copy-curl.html %}

This configuration ensures that the `edge_ngram` tokenizer breaks terms like `Laptop` into tokens such as `La`, `Lap`, and `Lapt`, allowing partial matches during search. At search time, the standard tokenizer simplifies queries, while the `lowercase` filter makes matching case insensitive.
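To sketch the query side of this pattern, a match query against the `product` field for a prefix fragment such as `lap` would hit the `lap` edge n-gram token generated at index time (the index name `products` is an assumption for this example):

```json
GET /products/_search
{
  "query": {
    "match": {
      "product": "lap"
    }
  }
}
```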

Collaborator

Is "the standard tokenizer" what we mean by "a simpler analyzer" on lines 124 and 128?

Collaborator

No, the standard tokenizer.

_analyzers/tokenizers/edge-n-gram.md
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Collaborator

@natebower natebower left a comment

@kolchfa-aws LGTM!

@kolchfa-aws kolchfa-aws merged commit ca0cc23 into opensearch-project:main Jan 3, 2025
5 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 3, 2025
* Adding documentation for the n-gram tokenizer

Signed-off-by: [email protected] <[email protected]>

* Tokenizers: edge n-gram

* Formatting updates to the page

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

* Update _analyzers/tokenizers/edge-n-gram.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit ca0cc23)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request Jan 3, 2025
Labels
3 - Tech review PR: Tech review in progress · analyzers · backport 2.18 PR: Backport label for 2.18 · Content gap · Needs SME Waiting on input from subject matter expert
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants