[[using-stopwords]]
=== Using stopwords

The removal of stopwords is handled by the
{ref}analysis-stop-tokenfilter.html[`stop` token filter], which can be used
when creating a `custom` analyzer (see <<stop-token-filter>> below).
However, some out-of-the-box analyzers come with the `stop` filter
pre-integrated:

{ref}analysis-lang-analyzer.html[Language analyzers]::

    Each language analyzer defaults to using the appropriate stopwords list
    for that language. For instance, the `english` analyzer uses the
    `_english_` stopwords list.

To use custom stopwords in conjunction with the `standard` analyzer, all we
need to do is to create a configured version of the analyzer and pass in the
list of `stopwords` that we require:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "standard", <2>
          "stopwords": [ "and", "the" ] <3>
        }
      }
    }
  }
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.

TIP: This same technique can be used to configure custom stopword lists for
any of the language analyzers.
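
For instance, a minimal sketch that configures the `english` analyzer with a
custom stopword list (the words shown are illustrative):

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}
---------------------------------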

[[maintaining-positions]]
==== Maintaining positions

The output from the `analyze` API is quite interesting:
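
Suppose, for example, that we analyze a phrase containing stopwords with the
`my_analyzer` defined above (the phrase here is just an illustration):

[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
---------------------------------

The response would contain something like the following, abbreviated to just
the `token` and `position` attributes:

[source,json]
---------------------------------
{
  "tokens": [
    { "token": "quick", "position": 2 },
    { "token": "dead",  "position": 5 }
  ]
}
---------------------------------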

The stopwords `and` and `the` have been removed, but the `position` of the
remaining tokens has been preserved: `quick` and `dead` keep the positions they
had in the original phrase instead of being renumbered. This is
important for phrase queries -- if the positions of each term had been
adjusted, then a phrase query for `"quick dead"` would have matched the above
example incorrectly.

[[specifying-stopwords]]
==== Specifying stopwords

Stopwords can be passed inline, as we did in the previous example, by
specifying an array of words. Alternatively, stopwords can be stored in a file
that every node in the cluster can access, and referenced with the
`stopwords_path` parameter:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" <1>
        }
      }
    }
  }
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch `config`
    directory.
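
The file itself should contain one stopword per line. For example, the first
few lines of a hypothetical `config/stopwords/english.txt` might look like
this:

[source,text]
---------------------------------
a
an
and
are
as
---------------------------------
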
[[stop-token-filter]]
==== Using the `stop` token filter

The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be combined
with a tokenizer and other token filters when you need to create a `custom`
analyzer. For instance, let's say that we wanted to create a Spanish analyzer
with:

* a custom stopwords list.
* the `light_spanish` stemmer.
* the <<asciifolding-token-filter,`asciifolding` filter>> to remove diacritics.

We could set that up as follows (the analyzer name and stopword list here are
illustrative):

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":      "stop",
          "stopwords": [ "si", "esta", "el", "la" ] <1>
        },
        "light_spanish": { <2>
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ <3>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
    parameters as the `standard` analyzer.
<2> See <<algorithmic-stemmers>>.
<3> The order of token filters is important, as explained below.

We have placed the `spanish_stop` filter after the `asciifolding` filter. This
means that `esta`, `ésta` and ++está++ will first have their diacritics
removed to become just `esta`, which will then be removed as a stopword. If,
instead, we wanted to remove `esta` and `ésta`, but not ++está++, then we
would have to put the `spanish_stop` filter *before* the `asciifolding`
filter, and specify both words in the stopwords list.
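
A sketch of that alternative ordering, showing only the parts of the
configuration that change, with an illustrative stopword list:

[source,json]
---------------------------------
"spanish_stop": {
  "type":      "stop",
  "stopwords": [ "esta", "ésta" ]
},
...
"filter": [
  "lowercase",
  "spanish_stop",
  "asciifolding",
  "light_spanish"
]
---------------------------------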

[[updating-stopwords]]
==== Updating stopwords

There are a few techniques which can be used to update the list of stopwords
used by an analyzer. Analyzers are instantiated at index creation time, when a
node is restarted, or when a closed index is reopened.

If you specify stopwords inline with the `stopwords` parameter, then your
only option is to close the index, update the analyzer configuration with the
{ref}indices-update-settings.html[update index settings API], and reopen
the index.

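A minimal sketch of that sequence, reusing the inline `my_analyzer`
configuration from earlier (the updated stopword list is illustrative):

[source,json]
---------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type":      "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open
---------------------------------
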
Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter. You can just update the file (on every node in
the cluster) and then force the analyzers to be recreated by either:

* closing and reopening the index
  (see {ref}indices-open-close.html[open/close index] and the example below), or
* restarting each node in the cluster, one by one.
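
For example, something like this would force the analyzers on `my_index` to be
recreated after the stopwords file has been updated:

[source,json]
---------------------------------
POST /my_index/_close
POST /my_index/_open
---------------------------------
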
Of course, updating the stopwords list will not change any documents that have
already been indexed -- it will only apply to searches and to new or updated
documents. To apply the changes to existing documents you will need to
reindex your data. See <<reindex>>.