[DOC] Tokenizers - Character group #8350

leanneeliatra · 2024-09-23T15:26:39Z

Description

Adds documentation for the character group tokenizer to the Analyzers section.

Issues Resolved

Part of #1483 addressed in this PR.

Version

all

Frontend features

n/a

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: [email protected] <[email protected]>

github-actions · 2024-09-23T15:26:51Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

Signed-off-by: [email protected] <[email protected]>

vagimeli · 2024-09-24T15:55:27Z

@udabhas Will you review this PR for technical accuracy, or have a peer review it? Thank you.

Signed-off-by: [email protected] <[email protected]>

_analyzers/tokenizers/character-group-tokenizer.md

Signed-off-by: Melissa Vagi <[email protected]>

vagimeli

Doc review complete

Co-authored-by: Melissa Vagi <[email protected]> Signed-off-by: leanneeliatra <[email protected]>

udabhas · 2024-12-09T07:42:04Z

_analyzers/tokenizers/character-group-tokenizer.md

+The following response shows that the specified characters have been removed: 
+
+```
+Fast cars drive fast


This returns `fast!`, with the exclamation mark intact. All tokens:

Fast cars drive fast!
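To illustrate the point of this comment, here is a minimal Python sketch of how a character-group style tokenizer splits text. It is not the OpenSearch or Lucene implementation: the function name `char_group_tokenize` is hypothetical, the character-class checks are rough approximations, and the input sentence and the `["whitespace"]` setting are assumptions chosen to match the tokens quoted above. It shows that `!` stays attached to the last token unless `!` (or `punctuation`) is listed in `tokenize_on_chars`.

```python
import unicodedata


def char_group_tokenize(text, tokenize_on_chars):
    """Split text at every character that matches the configured set.

    Hypothetical helper for illustration only; the class checks below
    approximate the named character classes and are assumptions.
    """
    classes = {
        "whitespace": str.isspace,
        "letter": str.isalpha,
        "digit": str.isdigit,
        "punctuation": lambda c: unicodedata.category(c).startswith("P"),
        "symbol": lambda c: unicodedata.category(c).startswith("S"),
    }
    # Entries that are not class names are treated as literal characters.
    literals = {c for c in tokenize_on_chars if c not in classes}
    predicates = [classes[c] for c in tokenize_on_chars if c in classes]

    tokens, current = [], []
    for ch in text:
        if ch in literals or any(pred(ch) for pred in predicates):
            # A split character ends the current token and is dropped.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


text = "Fast cars drive fast!"  # assumed input
print(char_group_tokenize(text, ["whitespace"]))
# ['Fast', 'cars', 'drive', 'fast!']
print(char_group_tokenize(text, ["whitespace", "!"]))
# ['Fast', 'cars', 'drive', 'fast']
```

Under these assumptions, the response text in the docs would only read `fast` if the example's `tokenize_on_chars` list included `!` or `punctuation`.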

udabhas · 2024-12-09T09:05:33Z

_analyzers/tokenizers/character-group-tokenizer.md

+
+The character group tokenizer accepts the following parameters:
+
+1. `tokenize_on_chars`: Specifies a set of characters on which the text should be tokenized. The tokenizer creates a new token upon encountering any character from the specified set, for example, single characters `(e.g., -, @)` and character classes such as `whitespace`, `letter`, `digit`, `punctuation`, and `symbol`.


It would also be good to add that this tokenizer accepts escape characters.
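As a rough sketch of what this suggestion could look like in an example, the following Python snippet builds a settings body whose `tokenize_on_chars` list includes an escape character. The tokenizer name `my_char_group_tokenizer` and the chosen characters are assumptions for illustration and are not taken from the PR.

```python
import json

# Hypothetical settings body: the tokenizer name and the characters listed
# are illustrative assumptions, not the example from the PR.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_char_group_tokenizer": {
                    "type": "char_group",
                    # "\n" is a newline; json.dumps writes it as the
                    # two-character JSON escape sequence \n.
                    "tokenize_on_chars": ["-", "\n"],
                }
            }
        }
    }
}

body_json = json.dumps(settings, indent=2)
print(body_json)
```

Serializing with `json.dumps` keeps the newline as an escape sequence in the request body, so it does not have to be typed as a raw line break inside the JSON string.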

Signed-off-by: Fanit Kolchina <[email protected]>

natebower

@kolchfa-aws @leanneeliatra LGTM!

* finished tokenizers example
  Signed-off-by: [email protected] <[email protected]>
* updating nav order
  Signed-off-by: [email protected] <[email protected]>
* layout cleanup
  Signed-off-by: [email protected] <[email protected]>
* grammar fix
  Signed-off-by: [email protected] <[email protected]>
* doc: small update for page numbers
  Signed-off-by: [email protected] <[email protected]>
* layout fix: correct scentence case for all examples
  Signed-off-by: [email protected] <[email protected]>
* small update: adding copy tag for json segment
  Signed-off-by: [email protected] <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: leanneeliatra <[email protected]>
* Doc review
  Signed-off-by: Fanit Kolchina <[email protected]>
* Reorder index
  Signed-off-by: Fanit Kolchina <[email protected]>
* Add escape characters
  Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: leanneeliatra <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
(cherry picked from commit b52ec2f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* finished tokenizers example
  Signed-off-by: [email protected] <[email protected]>
* updating nav order
  Signed-off-by: [email protected] <[email protected]>
* layout cleanup
  Signed-off-by: [email protected] <[email protected]>
* grammar fix
  Signed-off-by: [email protected] <[email protected]>
* doc: small update for page numbers
  Signed-off-by: [email protected] <[email protected]>
* layout fix: correct scentence case for all examples
  Signed-off-by: [email protected] <[email protected]>
* small update: adding copy tag for json segment
  Signed-off-by: [email protected] <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/tokenizers/character-group-tokenizer.md
  Signed-off-by: Melissa Vagi <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: leanneeliatra <[email protected]>
* Doc review
  Signed-off-by: Fanit Kolchina <[email protected]>
* Reorder index
  Signed-off-by: Fanit Kolchina <[email protected]>
* Add escape characters
  Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: leanneeliatra <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>

finished tokenizers example

1fd9a7e

Signed-off-by: [email protected] <[email protected]>

leanneeliatra marked this pull request as ready for review September 23, 2024 15:26

leanneeliatra requested review from AMoo-Miki, Naarcha-AWS, dlvenable, epugh, kolchfa-aws, natebower, stephen-crawford and vagimeli as code owners September 23, 2024 15:26

github-actions bot assigned kolchfa-aws Sep 23, 2024

updating nav order

0a849c8

Signed-off-by: [email protected] <[email protected]>

kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Sep 23, 2024

Merge branch 'main' into tokenizer-

2a7b8d1

vagimeli added Tech review PR: Tech review in progress Needs SME Content gap labels Sep 24, 2024

leanneeliatra and others added 8 commits October 4, 2024 14:18

Merge branch 'main' into tokenizer-

87ddc07

Merge branch 'main' into tokenizer-

c99229c

layout cleanup

ecb5e5f

Signed-off-by: [email protected] <[email protected]>

grammar fix

828e7fc

Signed-off-by: [email protected] <[email protected]>

Merge branch 'main' into tokenizer-

540c6d7

doc: small update for page numbers

def3c62

Signed-off-by: [email protected] <[email protected]>

layout fix: correct scentence case for all examples

cef551a

Signed-off-by: [email protected] <[email protected]>

small update: adding copy tag for json segment

d4c1cc4

Signed-off-by: [email protected] <[email protected]>

vagimeli reviewed Oct 15, 2024
