A _tokenizer_ accepts a string as input, processes((("words", "identifying", "using standard tokenizer")))((("standard tokenizer")))((("tokenizers"))) the string to break it
into individual words, or _tokens_ (perhaps discarding some characters like
punctuation), and emits a _token stream_ as output.
What is interesting is the algorithm that is used to _identify_ words. The
`whitespace` tokenizer ((("whitespace tokenizer")))simply breaks on whitespace--spaces, tabs, line
feeds, and so forth--and assumes that contiguous nonwhitespace characters form a
single token. For instance:
[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=whitespace
You're the 1st runner home!
--------------------------------------------------
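Assuming the sample sentence above, the request would return a token stream along the following lines. The response is abridged to the fields of interest, and the offsets follow from the input string itself; exact position numbering varies across Elasticsearch versions:

[source,js]
--------------------------------------------------
{
   "tokens": [
      { "token": "You're", "start_offset": 0,  "end_offset": 6,  "type": "word" },
      { "token": "the",    "start_offset": 7,  "end_offset": 10, "type": "word" },
      { "token": "1st",    "start_offset": 11, "end_offset": 14, "type": "word" },
      { "token": "runner", "start_offset": 15, "end_offset": 21, "type": "word" },
      { "token": "home!",  "start_offset": 22, "end_offset": 27, "type": "word" }
   ]
}
--------------------------------------------------

Note that the apostrophe in `You're` and the exclamation mark in `home!` survive intact: the `whitespace` tokenizer makes no attempt to deal with punctuation, which is what distinguishes it from the `standard` tokenizer.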