
Commit 5598b2b

Edited 210_Identifying_words/20_Standard_tokenizer.asciidoc with Atlas code editor
1 parent eb309ad commit 5598b2b

1 file changed (+19, -22 lines)

210_Identifying_words/20_Standard_tokenizer.asciidoc
@@ -1,13 +1,13 @@
 [[standard-tokenizer]]
-=== standard tokenizer
+=== standard Tokenizer
 
-A tokenizer accepts a string as input, processes((("words", "identifying", "using standard tokenizer")))((("standard tokenizer")))((("tokenizers"))) the string to break it up
-into individual words or _tokens_ (perhaps discarding some characters like
+A _tokenizer_ accepts a string as input, processes((("words", "identifying", "using standard tokenizer")))((("standard tokenizer")))((("tokenizers"))) the string to break it
+into individual words, or _tokens_ (perhaps discarding some characters like
 punctuation), and emits a _token stream_ as output.
 
-What is interesting is the algorithm that is used to *identify* words. The
-`whitespace` tokenizer ((("whitespace tokenizer")))simply breaks on whitespace -- spaces, tabs, line
-feeds etc. -- and assumes that contiguous non-whitespace characters form a
+What is interesting is the algorithm that is used to _identify_ words. The
+`whitespace` tokenizer ((("whitespace tokenizer")))simply breaks on whitespace--spaces, tabs, line
+feeds, and so forth--and assumes that contiguous nonwhitespace characters form a
 single token. For instance:
 
 [source,js]
@@ -16,40 +16,39 @@ GET /_analyze?tokenizer=whitespace
 You're the 1st runner home!
 --------------------------------------------------
 
-The above request would return the following terms:
+This request would return the following terms:
 `You're`, `the`, `1st`, `runner`, `home!`
 
-The `letter` tokenizer, on the other hand, breaks on any character which is
+The `letter` tokenizer, on the other hand, breaks on any character that is
 not a letter, and so would ((("letter tokenizer")))return the following terms: `You`, `re`, `the`,
-`st`, `runner`,`home`
+`st`, `runner`, `home`.
 
 The `standard` tokenizer((("Unicode Text Segmentation algorithm"))) uses the Unicode Text Segmentation algorithm (as
 defined in http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) to
-find the boundaries *between* words,((("word boundaries"))) and emits everything in-between. Its
+find the boundaries _between_ words,((("word boundaries"))) and emits everything in-between. Its
 knowledge of Unicode allows it to successfully tokenize text containing a
 mixture of languages.
 
-Punctuation may((("punctuation", "in words"))) or may not be considered to be part of a word, depending on
+Punctuation may((("punctuation", "in words"))) or may not be considered part of a word, depending on
 where it appears:
 
 [source,js]
 --------------------------------------------------
 GET /_analyze?tokenizer=standard
-You're my 'favourite'.
+You're my 'favorite'.
 --------------------------------------------------
 
-In the above example, the apostrophe in `You're` is treated as part of the
-word while the single quotes in `'favourite'` are not, resulting in the
-following terms: `You're`, `my`, `favourite`.
+In this example, the apostrophe in `You're` is treated as part of the
+word, while the single quotes in `'favorite'` are not, resulting in the
+following terms: `You're`, `my`, `favorite`.
 
 [TIP]
-.`uax_url_email` tokenizer
 ==================================================
 
 The `uax_url_email` tokenizer works((("uax_url_email tokenizer"))) in exactly the same way as the `standard`
-tokenizer, except that it recognises((("email addresses and URLs, tokenizer for"))) email addresses and URLs as emits them as
+tokenizer, except that it recognizes((("email addresses and URLs, tokenizer for"))) email addresses and URLs and emits them as
 single tokens. The `standard` tokenizer, on the other hand, would try to
-break them up into individual words. For instance, the email address
+break them into individual words. For instance, the email address
 `joe-bloggs@foo-bar.com` would result in the tokens `joe`, `bloggs`, `foo`,
 `bar.com`.
 
@@ -58,7 +57,5 @@ break them up into individual words. For instance, the email address
 The `standard` tokenizer is a reasonable starting point for tokenizing most
 languages, especially Western languages. In fact, it forms the basis of most
 of the language-specific analyzers like the `english`, `french`, and `spanish`
-analyzers.
-
-Its support for Asian languages, however, is limited and you should consider
-using the `icu_tokenizer` instead,((("icu_tokenizer"))) which is available in the ICU plugin.
+analyzers. Its support for Asian languages, however, is limited, and you should consider
+using the `icu_tokenizer` instead,((("icu_tokenizer"))) which is available in the ICU plug-in.
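The `letter` tokenizer output quoted in the diff (`You`, `re`, `the`, `st`, `runner`, `home`) can be reproduced with the same kind of analyze request the chapter shows for the `whitespace` tokenizer. A minimal sketch, assuming the query-string form of `_analyze` used throughout this section:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=letter
You're the 1st runner home!
--------------------------------------------------

Because the `letter` tokenizer breaks on any non-letter character, the apostrophe splits `You're` into `You` and `re`, and the digit in `1st` is dropped, leaving `st`.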

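Similarly, the `uax_url_email` behaviour described in the tip can be checked with an analyze request of the same form; this sketch assumes the tokenizer is addressed by its name in the query string, and should emit the whole address as a single token rather than the `joe`, `bloggs`, `foo`, `bar.com` pieces produced by the `standard` tokenizer:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=uax_url_email
joe-bloggs@foo-bar.com
--------------------------------------------------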