
Commit ab0cdc9

Added Stopwords chapter
1 parent ae0fbfb commit ab0cdc9

9 files changed, +749 -41 lines changed

100_Full_Text_Search/10_Multi_word_queries.asciidoc (+1)

@@ -73,6 +73,7 @@ The important thing to take away from the above is that any document whose
 `title` field contains *at least one of the specified terms* will match the
 query. The more terms that match, the more relevant the document.
 
+[[match-improving-precision]]
 ==== Improving precision
 
 Matching any document which contains *any* of the query terms may result in a

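For orientation, the hunk above anchors the ``Improving precision'' section, which deals with multi-word `match` queries. A minimal example of the kind of query that section discusses (the index, type and field names here are illustrative, not taken from the book):

[source,json]
---------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "quick brown fox"
        }
    }
}
---------------------------------

With the default `or` operator, any document whose `title` contains `quick`, `brown` or `fox` will match; the anchored section covers tightening this with the `and` operator and `minimum_should_match`.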
240_Stopwords.asciidoc (+6 -6)

@@ -1,14 +1,14 @@
-
 include::240_Stopwords/10_Intro.asciidoc[]
 
 include::240_Stopwords/20_Using_stopwords.asciidoc[]
 
+include::240_Stopwords/30_Stopwords_and_performance.asciidoc[]
+
+include::240_Stopwords/40_Divide_and_conquer.asciidoc[]
 
-common terms query
-match query
+include::240_Stopwords/50_Phrase_queries.asciidoc[]
 
-relevance
+include::240_Stopwords/60_Common_grams.asciidoc[]
 
-bm25
+include::240_Stopwords/70_Relevance.asciidoc[]
 
-common grams token filter

240_Stopwords/10_Intro.asciidoc (+6 -4)

@@ -51,16 +51,18 @@ stopwords used in Elasticsearch are:
 These _stopwords_ can usually be filtered out before indexing with little
 negative impact on retrieval. But is it a good idea to do so?
 
+[[pros-cons-stopwords]]
 [float]
 === Pros and cons of stopwords
 
 We have more disk space, more RAM, and better compression algorithms than
 existed back in the day. Excluding the above 33 common words from the index
 will only save about 4MB per million documents. Using stopwords for the sake
-of reducing index size is no longer a valid reason.
+of reducing index size is no longer a valid reason. (Although, there is one
+caveat to this statement which we will discuss in <<stopwords-phrases>>.)
 
 On top of that, by removing words from the index we are reducing our ability
-to perform certain types of search. Filtering out the above stopwords
+to perform certain types of search. Filtering out the words listed above
 prevents us from:
 
 * distinguishing ``happy'' from ``not happy''.
@@ -78,8 +80,8 @@ the `_score` for all 1 million documents. This second query simply cannot
 perform as well as the first.
 
 Fortunately, there are techniques which we can use to keep common words
-searchable, while benefiting from the performance gain of stopwords. First,
-let's start with how to use stopwords.
+searchable, while still maintaining good performance. First, we'll start with
+how to use stopwords.
 
 
 
240_Stopwords/20_Using_stopwords.asciidoc (+35 -31)

@@ -1,13 +1,10 @@
-:ref: http://foo.com/
-
 [[using-stopwords]]
 === Using stopwords
 
 The removal of stopwords is handled by the
 {ref}analysis-stop-tokenfilter.html[`stop` token filter] which can be used
-when creating a `custom` analyzer, as described below in <<stop-token-filter>>.
-However, some out-of-the-box analyzers have the `stop` filter integrated
-already:
+when creating a `custom` analyzer (see <<stop-token-filter>> below).
+However, some out-of-the-box analyzers come with the `stop` filter pre-integrated:
 
 {ref}analysis-lang-analyzer.html[Language analyzers]::
 
@@ -28,7 +25,7 @@ already:
 
 To use custom stopwords in conjunction with the `standard` analyzer, all we
 need to do is to create a configured version of the analyzer and pass in the
-list of `stopwords that we require:
+list of `stopwords` that we require:
 
 [source,json]
 ---------------------------------
@@ -39,19 +36,21 @@ PUT /my_index
       "analyzer": {
         "my_analyzer": { <1>
           "type": "standard", <2>
-          "stopwords": [ <3>
-            "and",<3>
-            "the"
-          ]
-  }}}}}
+          "stopwords": [ "and", "the" ] <3>
+        }
+      }
+    }
+  }
+}
 ---------------------------------
 <1> This is a custom analyzer called `my_analyzer`.
 <2> This analyzer is the `standard` analyzer with some custom configuration.
 <3> The stopwords to filter out are `and` and `the`.
 
-TIP: The same technique can be used to configure custom stopword lists for
+TIP: This same technique can be used to configure custom stopword lists for
 any of the language analyzers.
 
+[[maintaining-positions]]
 ==== Maintaining positions
 
 The output from the `analyze` API is quite interesting:
@@ -92,6 +91,7 @@ important for phrase queries -- if the positions of each term had been
 adjusted, then a phrase query for `"quick dead"` would have matched the above
 example incorrectly.
 
+[[specifying-stopwords]]
 ==== Specifying stopwords
 
 Stopwords can be passed inline, as we did in the previous example, by
@@ -150,23 +150,27 @@ PUT /my_index
       "analyzer": {
         "my_english": {
           "type": "english",
-          "stopwords_path": "config/stopwords/english.txt" <1>
+          "stopwords_path": "stopwords/english.txt" <1>
         }
       }
     }
   }
 }
 ---------------------------------
-<1> The path to the stopwords file, relative to the Elasticsearch directory.
+<1> The path to the stopwords file, relative to the Elasticsearch `config`
+    directory.
 
 [[stop-token-filter]]
 ==== Using the `stop` token filter
 
-The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be used
-directly when you need to create a `custom` analyzer. For instance, let's say
-that we wanted to create a Spanish analyzer with a custom stopwords list
-and the `light_spanish` stemmer, which also
-<<asciifolding-token-filter,removes diacritics>>.
+The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be combined
+with a tokenizer and other token filters when you need to create a `custom`
+analyzer. For instance, let's say that we wanted to create a Spanish analyzer
+with:
+
+* a custom stopwords list.
+* the `light_spanish` stemmer.
+* the <<asciifolding-token-filter,`asciifolding` filter>> to remove diacritics.
 
 We could set that up as follows:

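The analyzer definition itself lives in lines that this commit does not change, so the hunk stops at ``We could set that up as follows:''. A rough sketch of what such a setup could look like -- the stopword list, tokenizer and filter choices below are illustrative assumptions rather than the book's actual listing:

[source,json]
---------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "spanish_stop": {
                    "type":      "stop",
                    "stopwords": [ "si", "esta", "el", "la" ] <1>
                },
                "light_spanish": {
                    "type":     "stemmer",
                    "language": "light_spanish"
                }
            },
            "analyzer": {
                "my_spanish": {
                    "type":      "custom",
                    "tokenizer": "standard", <2>
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "spanish_stop",
                        "light_spanish"
                    ]
                }
            }
        }
    }
}
---------------------------------
<1> A purely illustrative stopword list; the book's actual example may differ.
<2> The tokenizer choice is also an assumption for this sketch.

The point made in the next hunk depends only on the relative order of `asciifolding` and `spanish_stop` in the `filter` chain.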
@@ -203,22 +207,22 @@ PUT /my_index
 ---------------------------------
 <1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
     parameters as the `standard` analyzer.
-<2> See <<using-an-algorithmic-stemmer>>.
-<3> The order of token filters is important, see below.
+<2> See <<algorithmic-stemmers>>.
+<3> The order of token filters is important, as explained below.
 
-The `spanish_stop` filter comes after the `asciifolding` filter. This means
-that `esta`, `èsta` and ++està++ will first have their diacritics removed to
-become just `esta`, which is removed as a stopword. If, instead, we wanted to
-remove `esta` and `èsta`, but not ++està++, then we would have to put the
-`spanish_stop` filter *before* the `asciifolding` filter, and specify both
-words in the stopwords list.
+We have placed the `spanish_stop` filter after the `asciifolding` filter. This
+means that `esta`, `ésta` and ++está++ will first have their diacritics
+removed to become just `esta`, which will then be removed as a stopword. If,
+instead, we wanted to remove `esta` and `ésta`, but not ++está++, then we
+would have to put the `spanish_stop` filter *before* the `asciifolding`
+filter, and specify both words in the stopwords list.
 
 [[updating-stopwords]]
 ==== Updating stopwords
 
 There are a few techniques which can be used to update the list of stopwords
-in use. Analyzers are instantiated at index creation time, when a node is
-restarted, or when a closed index is reopened.
+used by an analyzer. Analyzers are instantiated at index creation time, when a
+node is restarted, or when a closed index is reopened.
 
 If you specify stopwords inline with the `stopwords` parameter, then your
 only option is to close the index, update the analyzer configuration with the
@@ -227,13 +231,13 @@ the index.
 
 Updating stopwords is easier if you specify them in a file with the
 `stopwords_path` parameter. You can just update the file (on every node in
-the cluster) then force the analyzers to be recreated by:
+the cluster) then force the analyzers to be recreated by either:
 
 * closing and reopening the index
   (see {ref}indices-open-close.html[open/close index]), or
 * restarting each node in the cluster, one by one.
 
 Of course, updating the stopwords list will not change any documents that have
-already been indexed. It will only apply to searches and to new or updated
+already been indexed -- it will only apply to searches and to new or updated
 documents. To apply the changes to existing documents you will need to
 reindex your data. See <<reindex>>
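The file-based reload described in the last hunk comes down to two standard index-level calls; a minimal sketch (the index name is illustrative, and the stopwords file must already have been updated on every node):

[source,js]
---------------------------------
POST /my_index/_close <1>
POST /my_index/_open  <2>
---------------------------------
<1> Closing the index unloads its analyzers.
<2> Reopening the index instantiates the analyzers again, and they reread the updated stopwords file.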
240_Stopwords/30_Stopwords_and_performance.asciidoc (+86)

@@ -0,0 +1,86 @@
+[[stopwords-performance]]
+=== Stopwords and performance
+
+The biggest disadvantage of keeping stopwords is that of performance. When
+Elasticsearch performs a full text search, it has to calculate the relevance
+`_score` on all matching documents in order to return the top 10 matches.
+
+While most words typically occur in much fewer than 0.1% of all documents, a
+few words like `the` may occur in almost all of them. Imagine you have an
+index of 1 million documents. A query for `quick brown fox` may match fewer
+than 1,000 documents. But a query for `the quick brown fox` has to score and
+sort almost all of the 1 million documents in your index, just in order to
+return the top 10!
+
+The problem is that `the quick brown fox` is really a query for `the OR quick
+OR brown OR fox` -- any document which contains nothing more than the almost
+meaningless term `the` is included in the resultset. What we need is a way of
+reducing the number of documents that need to be scored.
+
+[[stopwords-and]]
+==== `and` operator
+
+The easiest way to reduce the number of documents is simply to use the
+<<match-improving-precision,`and` operator>> with the `match` query, in order
+to make all words required.
+
+A `match` query like:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "the quick brown fox",
+            "operator": "and"
+        }
+    }
+}
+---------------------------------
+
+is rewritten as a `bool` query like:
+
+[source,json]
+---------------------------------
+{
+    "bool": {
+        "must": [
+            { "term": { "text": "the" }},
+            { "term": { "text": "quick" }},
+            { "term": { "text": "brown" }},
+            { "term": { "text": "fox" }}
+        ]
+    }
+}
+---------------------------------
+
+The `bool` query is intelligent enough to execute each `term` query in the
+optimal order -- it starts with the least frequent term. Because all terms
+are required, only documents that contain the least frequent term can possibly
+match. Using the `and` operator greatly speeds up multi-term queries.
+
+==== `minimum_should_match`
+
+In <<match-precision>> we discussed using the `minimum_should_match` operator
+to trim the long tail of less relevant results. It is useful for this purpose
+alone but, as a nice side effect, it offers a similar performance benefit to
+the `and` operator:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "the quick brown fox",
+            "minimum_should_match": "75%"
+        }
+    }
+}
+---------------------------------
+
+In this example, at least three out of the four terms must match. This means
+that the only docs that need to be considered are those that contain either the least or second least frequent terms.
+
+This offers a huge performance gain over a simple query with the default `or`
+operator! But we can do better yet...
+

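As a closing note on the `minimum_should_match` example in the new file above: Elasticsearch executes the `75%` query roughly as a `bool` query in which at least three of the four `should` clauses must match. A sketch of that rewritten form (the precise internal rewrite is not part of this commit; this is illustrative):

[source,json]
---------------------------------
{
    "bool": {
        "should": [
            { "term": { "text": "the" }},
            { "term": { "text": "quick" }},
            { "term": { "text": "brown" }},
            { "term": { "text": "fox" }}
        ],
        "minimum_should_match": 3 <1>
    }
}
---------------------------------
<1> 75% of four terms is three terms, so a document that contains only `the` is never scored.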