
Commit baecf3d
First round of phase 2 changes to sync up with version 2.x.
1 parent: e1688d9

16 files changed: +148 -179 lines

010_Intro/10_Installing_ES.asciidoc (+4 -1)

@@ -73,7 +73,10 @@ start experimenting with it. A _node_ is a running instance of Elasticsearch.
 ((("nodes", "defined"))) A _cluster_ is ((("clusters", "defined")))a group of
 nodes with the same `cluster.name` that are working together to share data
 and to provide failover and scale. (A single node, however, can form a cluster
-all by itself.)
+all by itself.) You can change the `cluster.name` in the `elasticsearch.yml` configuration
+file that's loaded when you start a node. More information about this and other
+<<important-configuration-changes, Important Configuration Changes>> is provided
+in the Production Deployment section at the end of this book.
 
 TIP: See that View in Sense link at the bottom of the example? <<sense, Install the Sense console>>
 to run the examples in this book against your own Elasticsearch cluster and view the results.
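The added paragraph points readers at `cluster.name` in `elasticsearch.yml`. A quick way to confirm which cluster name a restarted node actually picked up is the cluster health endpoint, which reports it in its response; a minimal sketch (the name `my_application` below is purely illustrative):

[source,js]
--------------------------------------------------
GET /_cluster/health <1>
--------------------------------------------------
<1> The response includes a `cluster_name` field, for example `"cluster_name": "my_application"`, so you can verify that the node loaded the `elasticsearch.yml` you edited.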

020_Distributed_Cluster/20_Add_failover.asciidoc (+3 -4)

@@ -13,11 +13,10 @@ in exactly the same way as you started the first one (see
 share the same directory.
 
 When you run a second node on the same machine, it automatically discovers
-and joins the cluster as long as it has the same `cluster.name` as the first node (see
-the `./config/elasticsearch.yml` file). However, for nodes running on different machines
+and joins the cluster as long as it has the same `cluster.name` as the first node.
+However, for nodes running on different machines
 to join the same cluster, you need to configure a list of unicast hosts the nodes can contact
-to join the cluster. For more information about how Elasticsearch nodes find eachother, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery]
-in the Elasticsearch Reference.
+to join the cluster. For more information, see <<unicast, Prefer Unicast over Multicast>>.
 
 ***************************************
 

052_Mapping_Analysis/25_Data_type_differences.asciidoc (+1 -1)

@@ -44,7 +44,7 @@ This gives us the following:
 "properties": {
    "date": {
       "type": "date",
-      "format": "dateOptionalTime"
+      "format": "strict_date_optional_time||epoch_millis"
    },
    "name": {
       "type": "string"

052_Mapping_Analysis/45_Mapping.asciidoc (+1 -1)

@@ -75,7 +75,7 @@ Elasticsearch generated dynamically from the documents that we indexed:
 "properties": {
    "date": {
       "type": "date",
-      "format": "dateOptionalTime"
+      "format": "strict_date_optional_time||epoch_millis"
    },
    "name": {
       "type": "string"

060_Distributed_Search.asciidoc (+1 -1)

@@ -6,5 +6,5 @@ include::060_Distributed_Search/10_Fetch_phase.asciidoc[]
 
 include::060_Distributed_Search/15_Search_options.asciidoc[]
 
-include::060_Distributed_Search/20_Scan_and_scroll.asciidoc[]
+include::060_Distributed_Search/20_Scroll.asciidoc[]
 

060_Distributed_Search/10_Fetch_phase.asciidoc (+1 -1)

@@ -57,7 +57,7 @@ culprits are usually bots or web spiders that tirelessly keep fetching page
 after page until your servers crumble at the knees.
 
 If you _do_ need to fetch large numbers of docs from your cluster, you can
-do so efficiently by disabling sorting with the `scan` search type,
+do so efficiently by disabling sorting with the `scroll` query,
 which we discuss <<scan-scroll,later in this chapter>>.
 
 ****

060_Distributed_Search/20_Scan_and_scroll.asciidoc (-81)
This file was deleted.

060_Distributed_Search/20_Scroll.asciidoc (+74)

@@ -0,0 +1,74 @@
+[[scroll]]
+=== Scroll
+
+A `scroll` query ((("scroll API"))) is used to retrieve
+large numbers of documents from Elasticsearch efficiently, without paying the
+penalty of deep pagination.
+
+Scrolling allows us to((("scrolled search"))) do an initial search and to keep pulling
+batches of results from Elasticsearch until there are no more results left.
+It's a bit like a _cursor_ in ((("cursors")))a traditional database.
+
+A scrolled search takes a snapshot in time. It doesn't see any changes that
+are made to the index after the initial search request has been made. It does
+this by keeping the old data files around, so that it can preserve its ``view''
+on what the index looked like at the time it started.
+
+The costly part of deep pagination is the global sorting of results, but if we
+disable sorting, then we can return all documents quite cheaply. To do this, we
+sort by `_doc`. This instructs Elasticsearch to just return the next batch of
+results from every shard that still has results to return.
+
+To scroll through results, we execute a search request and set the `scroll` value to
+the length of time we want to keep the scroll window open. The scroll expiry
+time is refreshed every time we run a scroll request, so it only needs to be long enough
+to process the current batch of results, not all of the documents that match
+the query. The timeout is important because keeping the scroll window open
+consumes resources and we want to free them as soon as they are no longer needed.
+Setting the timeout enables Elasticsearch to automatically free the resources
+after a small period of inactivity.
+
+[source,js]
+--------------------------------------------------
+GET /old_index/_search?scroll=1m <1>
+{
+    "query": { "match_all": {}},
+    "sort" : ["_doc"], <2>
+    "size": 1000
+}
+--------------------------------------------------
+<1> Keep the scroll window open for 1 minute.
+<2> `_doc` is the most efficient sort order.
+
+The response to this request includes a
+`_scroll_id`, which is a long Base-64 encoded((("scroll_id"))) string. Now we can pass the
+`_scroll_id` to the `_search/scroll` endpoint to retrieve the next batch of
+results:
+
+[source,js]
+--------------------------------------------------
+GET /_search/scroll
+{
+    "scroll": "1m", <1>
+    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
+}
+--------------------------------------------------
+<1> Note that we again set the scroll expiration to 1m.
+
+The response to this scroll request includes the next batch of results.
+Although we specified a `size` of 1,000, we get back many more
+documents.((("size parameter", "in scanning"))) When scrolling, the `size` is applied to each shard, so you will
+get back a maximum of `size * number_of_primary_shards` documents in each
+batch.
+
+NOTE: The scroll request also returns a _new_ `_scroll_id`. Every time
+we make the next scroll request, we must pass the `_scroll_id` returned by the
+_previous_ scroll request.
+
+When no more hits are returned, we have processed all matching documents.
+
+TIP: Some of the official Elasticsearch clients such as
+http://elasticsearch-py.readthedocs.org/en/master/helpers.html#scan[Python client] and
+https://metacpan.org/pod/Search::Elasticsearch::Scroll[Perl client] provide scroll helpers that
+provide easy-to-use wrappers around this functionality.
+
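The new page notes that an open scroll window holds resources until its timeout lapses. If you finish early, a scroll can also be released explicitly through the clear-scroll endpoint; a minimal sketch, with the caveat that the exact request form varies between Elasticsearch versions and the `_all` shortcut shown here clears every open scroll on the cluster:

[source,js]
--------------------------------------------------
DELETE /_search/scroll/_all <1>
--------------------------------------------------
<1> Frees the search contexts of all open scrolls immediately instead of waiting for their timeouts to expire.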

070_Index_Mgmt/50_Reindexing.asciidoc (+3 -2)

@@ -15,7 +15,7 @@ whole document available to you in Elasticsearch itself. You don't have to
 rebuild your index from the database, which is usually much slower.
 
 To reindex all of the documents from the old index efficiently, use
-<<scan-scroll,_scan-and-scroll_>> to retrieve batches((("scan-and-scroll", "using in reindexing documents"))) of documents from the old index,
+<<scan-scroll,_scroll_>> to retrieve batches((("using in reindexing documents"))) of documents from the old index,
 and the <<bulk,`bulk` API>> to push them into the new index.
 
 .Reindexing in Batches
@@ -27,7 +27,7 @@ jobs by filtering on a date or timestamp field:
 
 [source,js]
 --------------------------------------------------
-GET /old_index/_search?search_type=scan&scroll=1m
+GET /old_index/_search?scroll=1m
 {
     "query": {
         "range": {
@@ -37,6 +37,7 @@ GET /old_index/_search?search_type=scan&scroll=1m
             }
         }
     },
+    "sort": ["_doc"],
     "size": 1000
 }
 --------------------------------------------------
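The hunk above covers only the scroll half of the workflow; the prose also relies on the <<bulk,`bulk` API>> to push each retrieved batch into the new index. A rough sketch of that second half, where the index name, type, and fields are illustrative rather than taken from the book:

[source,js]
--------------------------------------------------
POST /new_index/_bulk
{ "index": { "_type": "my_type", "_id": "1" }} <1>
{ "title": "first document from the scrolled batch" }
{ "index": { "_type": "my_type", "_id": "2" }}
{ "title": "second document from the scrolled batch" }
--------------------------------------------------
<1> Reusing each hit's original `_type` and `_id` keeps the documents' identities intact in the new index.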

400_Relationships/25_Concurrency.asciidoc (+1 -1)

@@ -182,7 +182,7 @@ PUT /fs/file/1?version=2 <1>
 We can even rename a directory, but this means updating all of the files that
 exist anywhere in the path hierarchy beneath that directory. This may be
 quick or slow, depending on how many files need to be updated. All we would
-need to do is to use <<scan-scroll,scan-and-scroll>> to retrieve all the
+need to do is to use <<scan-scroll,`scroll`>> to retrieve all the
 files, and the <<bulk,`bulk` API>> to update them. The process isn't
 atomic, but all files will quickly move to their new home.
 

400_Relationships/26_Concurrency_solutions.asciidoc (+31 -23)

@@ -81,10 +81,9 @@ parallelism by making our locking more fine-grained.
 ==== Document Locking
 
 Instead of locking the whole filesystem, we could lock individual documents
-by using the same technique as previously described.((("locking", "document locking")))((("document locking"))) A process could use a
-<<scan-scroll,scan-and-scroll>> request to retrieve the IDs of all documents
-that would be affected by the change, and would need to create a lock file for
-each of them:
+by using the same technique as previously described.((("locking", "document locking")))((("document locking")))
+We can use a <<scroll,scrolled search>> to retrieve all documents that would be affected by the change and
+create a lock file for each one:
 
 [source,json]
 --------------------------
@@ -93,7 +92,6 @@ PUT /fs/lock/_bulk
 { "process_id": 123 } <2>
 { "create": { "_id": 2}}
 { "process_id": 123 }
-...
 --------------------------
 <1> The ID of the `lock` document would be the same as the ID of the file
     that should be locked.
@@ -135,41 +133,51 @@ POST /fs/lock/1/_update
 }
 --------------------------
 
-If the document doesn't already exist, the `upsert` document will be inserted--much the same as the `create` request we used previously. However, if the
-document _does_ exist, the script will look at the `process_id` stored in the
-document. If it is the same as ours, it aborts the update (`noop`) and
-returns success. If it is different, the `assert false` throws an exception
-and we know that the lock has failed.
+If the document doesn't already exist, the `upsert` document is inserted--much
+the same as the previous `create` request. However, if the
+document _does_ exist, the script looks at the `process_id` stored in the
+document. If the `process_id` matches, no update is performed (`noop`) but the
+script returns successfully. If it is different, `assert false` throws an exception
+and you know that the lock has failed.
+
+Once all locks have been successfully created, you can proceed with your changes.
+
+Afterward, you must release all of the locks, which you can do by
+retrieving all of the locked documents and performing a bulk delete:
 
-Once all locks have been successfully created, the rename operation can begin.
-Afterward, we must release((("delete-by-query request"))) all of the locks, which we can do with a
-`delete-by-query` request:
 
 [source,json]
 --------------------------
 POST /fs/_refresh <1>
 
-DELETE /fs/lock/_query
+GET /fs/lock/_search?scroll=1m <2>
 {
-  "query": {
-    "term": {
-      "process_id": 123
+  "sort" : ["_doc"],
+  "query": {
+    "match" : {
+      "process_id" : 123
+    }
   }
-  }
 }
+
+PUT /fs/lock/_bulk
+{ "delete": { "_id": 1}}
+{ "delete": { "_id": 2}}
 --------------------------
 <1> The `refresh` call ensures that all `lock` documents are visible to
-    the `delete-by-query` request.
+    the search request.
+<2> You can use a <<scan-scroll,`scroll`>> query when you need to retrieve large
+    numbers of results with a single search request.
 
 Document-level locking enables fine-grained access control, but creating lock
-files for millions of documents can be expensive. In certain scenarios, such
-as this example with directory trees, it is possible to achieve fine-grained
-locking with much less work.
+files for millions of documents can be expensive. In some cases,
+you can achieve fine-grained locking with much less work, as shown in the
+following directory tree scenario.
 
 [[tree-locking]]
 ==== Tree Locking
 
-Rather than locking every involved document, as in the previous option, we
+Rather than locking every involved document as in the previous example, we
 could lock just part of the directory tree.((("locking", "tree locking"))) We will need exclusive access
 to the file or directory that we want to rename, which can be achieved with an
 _exclusive lock_ document:
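The rewritten paragraph about `upsert` refers to an update request that sits just outside this hunk and is therefore not visible here. A rough sketch of the pattern it describes, reconstructed from the prose rather than copied from the book, so the script syntax and `ctx.op` value may differ from the actual source:

[source,json]
--------------------------
POST /fs/lock/1/_update
{
  "upsert": { "process_id": 123 }, <1>
  "script": "if ( ctx._source.process_id != process_id ) { assert false }; ctx.op = 'none';", <2>
  "params": { "process_id": 123 }
}
--------------------------
<1> If no lock document exists yet, create one owned by process `123`.
<2> If it does exist, `assert false` fails the update when the lock belongs to another process; otherwise the update is aborted and reported as a no-op.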

410_Scaling/45_Index_per_timeframe.asciidoc (+2 -19)

@@ -29,25 +29,8 @@ data.
 
 If we were to have one big index for documents of this type, we would soon run
 out of space. Logging events just keep on coming, without pause or
-interruption. We could delete the old events, with a `delete-by-query`:
-
-[source,json]
--------------------------
-DELETE /logs/event/_query
-{
-    "query": {
-        "range": {
-            "@timestamp": { <1>
-                "lt": "now-90d"
-            }
-        }
-    }
-}
--------------------------
-<1> Deletes all documents where Logstash's `@timestamp` field is
-    older than 90 days.
-
-But this approach is _very inefficient_. Remember that when you delete a
+interruption. We could delete the old events with a <<scan-scroll,`scroll`>>
+query and bulk delete, but this approach is _very inefficient_. When you delete a
 document, it is only _marked_ as deleted (see <<deletes-and-updates>>). It won't
 be physically deleted until the segment containing it is merged away.
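For readers who want to see the approach being dismissed, a rough sketch of the scroll-plus-bulk-delete pattern the new sentence describes; the document IDs in the bulk request are illustrative and would in practice come from the scrolled hits:

[source,json]
-------------------------
GET /logs/event/_search?scroll=1m
{
    "sort": ["_doc"],
    "query": {
        "range": {
            "@timestamp": { <1>
                "lt": "now-90d"
            }
        }
    },
    "size": 1000
}

POST /logs/event/_bulk
{ "delete": { "_id": 1 }}
{ "delete": { "_id": 2 }}
-------------------------
<1> Matches events whose Logstash `@timestamp` is older than 90 days.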
