This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit d8aed6c

Merge pull request #678 from joshuar/fix_analyze_API_usage
Update Getting Started with Languages section.
2 parents 40a746f + e926454 commit d8aed6c

7 files changed: +190 -130 lines

200_Language_intro/00_Intro.asciidoc

+7 -14
@@ -2,16 +2,10 @@
 == Getting Started with Languages
 
 Elasticsearch ships with a collection of language analyzers that provide
-good, basic, out-of-the-box ((("language analyzers")))((("languages", "getting started with")))support for many of the world's most common
-languages:
+good, basic, https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html[out-of-the-box support]
+for many of the world's most common languages.
 
-Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese,
-Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
-Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish,
-Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish,
-Turkish, and Thai.
-
-These analyzers typically((("language analyzers", "roles performed by"))) perform four roles:
+These analyzers typically perform four roles:
 
 * Tokenize text into individual words:
 +
@@ -30,19 +24,18 @@ These analyzers typically((("language analyzers", "roles performed by"))) perfor
 `foxes` -> `fox`
 
 Each analyzer may also apply other transformations specific to its language in
-order to make words from that((("language analyzers", "other transformations specific to the language"))) language more searchable:
+order to make words from that language more searchable:
 
-* The `english` analyzer ((("english analyzer")))removes the possessive `'s`:
+* The `english` analyzer removes the possessive `'s`:
 +
 `John's` -> `john`
 
-* The `french` analyzer ((("french analyzer")))removes _elisions_ like `l'` and `qu'` and
+* The `french` analyzer removes _elisions_ like `l'` and `qu'` and
 _diacritics_ like `¨` or `^`:
 +
 `l'église` -> `eglis`
 
-* The `german` analyzer normalizes((("german analyzer"))) terms, replacing `ä` and `ae` with `a`, or
+* The `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or
 `ß` with `ss`, among others:
 +
 `äußerst` -> `ausserst`
-
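The four roles above are easy to see directly. A minimal sketch against the `_analyze` API, using the request-body form this commit adopts (the sample text is illustrative, and exact tokens can vary by Elasticsearch version):

[source,js]
--------------------------------------------------
GET /_analyze
{
  "analyzer": "english",
  "text": "John's not happy about the foxes"
}
--------------------------------------------------
// CONSOLE
// Sketch only: expect tokens along the lines of `john`, `happi`, `about`, `fox`

The possessive `'s` is stripped, `not` and `the` are dropped as stopwords, and `foxes` stems to `fox`.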

200_Language_intro/10_Using.asciidoc

+26 -16
@@ -2,7 +2,7 @@
 === Using Language Analyzers
 
 The built-in language analyzers are available globally and don't need to be
-configured before being used.((("language analyzers", "using"))) They can be specified directly in the field
+configured before being used. They can be specified directly in the field
 mapping:
 
 [source,js]
@@ -13,26 +13,33 @@ PUT /my_index
     "blog": {
       "properties": {
         "title": {
-          "type": "string",
+          "type": "text",
           "analyzer": "english" <1>
         }
       }
     }
   }
 }
 --------------------------------------------------
+// CONSOLE
+
 <1> The `title` field will use the `english` analyzer instead of the default
 `standard` analyzer.
 
-Of course, by passing ((("english analyzer", "information lost with")))text through the `english` analyzer, we lose
-information:
+Of course, by passing text through the `english` analyzer, we lose information:
 
 [source,js]
 --------------------------------------------------
-GET /my_index/_analyze?field=title <1>
-I'm not happy about the foxes
+GET /my_index/_analyze
+{
+  "field": "title",
+  "text": "I'm not happy about the foxes" <1>
+}
 --------------------------------------------------
-<1> Emits token: `i'm`, `happi`, `about`, `fox`
+// CONSOLE
+// TEST[continued]
+
+<1> Emits the tokens: `i'm`, `happi`, `about`, `fox`
 
 We can't tell if the document mentions one `fox` or many `foxes`; the word
 `not` is a stopword and is removed, so we can't tell whether the document is
@@ -41,7 +48,7 @@ recall as we can match more loosely, but we have reduced our ability to rank
 documents accurately.
 
 To get the best of both worlds, we can use <<multi-fields,multifields>> to
-index the `title` field twice: once((("multifields", "using to index a field with two different analyzers"))) with the `english` analyzer and once with
+index the `title` field twice: once with the `english` analyzer and once with
 the `standard` analyzer:
 
 [source,js]
@@ -52,10 +59,10 @@ PUT /my_index
     "blog": {
       "properties": {
         "title": { <1>
-          "type": "string",
+          "type": "text",
           "fields": {
             "english": { <2>
-              "type": "string",
+              "type": "text",
               "analyzer": "english"
             }
           }
@@ -65,6 +72,8 @@ PUT /my_index
   }
 }
 --------------------------------------------------
+// CONSOLE
+
 <1> The main `title` field uses the `standard` analyzer.
 <2> The `title.english` subfield uses the `english` analyzer.
 
@@ -90,12 +99,13 @@ GET /_search
   }
 }
 --------------------------------------------------
+// CONSOLE
+// TEST[continued]
+
 <1> Use the <<most-fields,`most_fields`>> query type to match the
 same text in as many fields as possible.
 
-Even ((("most fields queries")))though neither of our documents contain the word `foxes`, both documents
-are returned as results thanks to the word stemming on the `title.english`
-field. The second document is ranked as more relevant, because the word `not`
-matches on the `title` field.
-
-
+Even though neither of our documents contain the
+word `foxes`, both documents are returned as results thanks to the word
+stemming on the `title.english` field. The second document is ranked as more
+relevant, because the word `not` matches on the `title` field.
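For reference, the `GET /_search` hunk above shows only the tail of the request; per the `<1>` callout it is a `most_fields` `multi_match` query, likely along these lines (a sketch; the query string follows the surrounding examples):

[source,js]
--------------------------------------------------
GET /_search
{
  "query": {
    "multi_match": {
      "type": "most_fields", <1>
      "query": "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
--------------------------------------------------
// CONSOLE
// Sketch of the elided request body, reconstructed from the callouts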

200_Language_intro/20_Configuring.asciidoc

+11 -7
@@ -2,13 +2,13 @@
 === Configuring Language Analyzers
 
 While the language analyzers can be used out of the box without any
-configuration, most of them ((("english analyzer", "configuring")))((("language analyzers", "configuring")))do allow you to control aspects of their
+configuration, most of them do allow you to control aspects of their
 behavior, specifically:
 
 [[stem-exclusion]]
 Stem-word exclusion::
 +
-Imagine, for instance, that users searching for((("language analyzers", "configuring", "stem word exclusion")))((("stemming words", "stem word exclusion, configuring"))) the ``World Health
+Imagine, for instance, that users searching for the ``World Health
 Organization'' are instead getting results for ``organ health.'' The reason
 for this confusion is that both ``organ'' and ``organization'' are stemmed to
 the same root word: `organ`. Often this isn't a problem, but in this
@@ -18,7 +18,7 @@ stemmed.
 
 Custom stopwords::
 
-The default list of stopwords((("stopwords", "configuring for language analyzers"))) used in English are as follows:
+The default list of stopwords used in English are as follows:
 +
 a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
 no, not, of, on, or, such, that, the, their, then, there, these,
@@ -54,13 +54,17 @@ PUT /my_index
   }
 }
 
-GET /my_index/_analyze?analyzer=my_english <3>
-The World Health Organization does not sell organs.
+GET /my_index/_analyze
+{
+  "analyzer": "my_english", <3>
+  "text": "The World Health Organization does not sell organs."
+}
 --------------------------------------------------
+// CONSOLE
+
 <1> Prevents `organization` and `organizations` from being stemmed
 <2> Specifies a custom list of stopwords
-<3> Emits tokens `world`, `health`, `organization`, `does`, `not`, `sell`, `organ`
+<3> Emits tokens `world`, `health`, `organization`, `doe`, `not`, `sell`, `organ`
 
 We discuss stemming and stopwords in much more detail in <<stemming>> and
 <<stopwords>>, respectively.
-
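The `PUT /my_index` hunk above likewise omits the top of the request. Judging from the `<1>` and `<2>` callouts, it configures a custom `my_english` analyzer through the `english` analyzer's `stem_exclusion` and `stopwords` parameters, roughly as follows (a sketch; the stopword list is abbreviated here):

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], <1>
          "stopwords": [ "a", "an", "and", "are", "as", "at", "the" ] <2>
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// Sketch of the elided settings, reconstructed from the callouts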

200_Language_intro/30_Language_pitfalls.asciidoc

+15 -15
@@ -1,9 +1,9 @@
 [[language-pitfalls]]
 === Pitfalls of Mixing Languages
 
-If you have to deal with only a single language,((("languages", "mixing, pitfalls of"))) count yourself lucky.
+If you have to deal with only a single language, count yourself lucky.
 Finding the right strategy for handling documents written in several languages
-can be challenging.((("indexing", "mixed languages, pitfalls of")))
+can be challenging.
 
 ==== At Index Time
 
@@ -21,11 +21,11 @@ separate. Mixing languages in the same inverted index can be problematic.
 ===== Incorrect stemming
 
 The stemming rules for German are different from those for English, French,
-Swedish, and so on.((("stemming words", "incorrect stemming in multilingual documents"))) Applying the same stemming rules to different languages
+Swedish, and so on. Applying the same stemming rules to different languages
 will result in some words being stemmed correctly, some incorrectly, and some
-not being stemmed at all. It may even result in words from different languages with different meanings
-being stemmed to the same root word, conflating their meanings and producing
-confusing search results for the user.
+not being stemmed at all. It may even result in words from different languages
+with different meanings being stemmed to the same root word, conflating their
+meanings and producing confusing search results for the user.
 
 Applying multiple stemmers in turn to the same text is likely to result in
 rubbish, as the next stemmer may try to stem an already stemmed word,
@@ -49,7 +49,7 @@ text.
 ===== Incorrect inverse document frequencies
 
 In <<relevance-intro>>, we explained that the more frequently a term appears
-in a collection of documents, the less weight that term has.((("inverse document frequency", "incorrect, in multilingual documents"))) For accurate
+in a collection of documents, the less weight that term has. For accurate
 relevance calculations, you need accurate term-frequency statistics.
 
 A short snippet of German appearing in predominantly English text would give
@@ -59,11 +59,11 @@ snippets now have much less weight.
 
 ==== At Query Time
 
-It is not sufficient just to think about your documents, though.((("queries", "mixed languages and"))) You also need
-to think about how your users will query those documents. Often you will be able
-to identify the main language of the user either from the language of that user's chosen
-interface (for example, `mysite.de` versus `mysite.fr`) or from the
-http://www.w3.org/International/questions/qa-lang-priorities.en.php[`accept-language`]
+It is not sufficient just to think about your documents, though. You also need
+to think about how your users will query those documents. Often you will be
+able to identify the main language of the user either from the language of that
+user's chosen interface (for example, `mysite.de` versus `mysite.fr`) or from
+the http://www.w3.org/International/questions/qa-lang-priorities.en.php[`accept-language`]
 HTTP header from the user's browser.
 
 User searches also come in three main varieties:
@@ -72,7 +72,8 @@ User searches also come in three main varieties:
 * Users search for words in a different language, but expect results in
 their main language.
 * Users search for words in a different language, and expect results in
-that language (for example, a bilingual person, or a foreign visitor in a web cafe).
+that language (for example, a bilingual person, or a foreign visitor in a web
+cafe).
 
 Depending on the type of data that you are searching, it may be appropriate to
 return results in a single language (for example, a user searching for products on
@@ -102,7 +103,7 @@ library from
 http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html[Mike McCandless],
 which uses the open source (http://www.apache.org/licenses/LICENSE-2.0[Apache License 2.0])
 https://code.google.com/p/cld2/[Compact Language Detector] (CLD) from Google. It is
-small, fast, ((("Compact Language Detector (CLD)")))and accurate, and can detect 160+ languages from as little as two
+small, fast, and accurate, and can detect 160+ languages from as little as two
 sentences. It can even detect multiple languages within a single block of
 text. Bindings exist for several languages including Python, Perl, JavaScript,
 PHP, C#/.NET, and R.
@@ -113,4 +114,3 @@ Shorter amounts of text, such as search keywords, produce much less accurate
 results. In these cases, it may be preferable to take simple heuristics into
 account such as the country of origin, the user's selected language, and the
 HTTP `accept-language` headers.
-
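The ``At Index Time'' discussion above boils down to keeping each language in its own index (or field) with the matching analyzer. A minimal sketch of the one-index-per-language approach (the index names here are hypothetical):

[source,js]
--------------------------------------------------
PUT /blogs-en
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { "type": "text", "analyzer": "english" }
      }
    }
  }
}

PUT /blogs-fr
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { "type": "text", "analyzer": "french" }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// Hypothetical index names, for illustration only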
