This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit 641b303

Finished scaling chapter

1 parent 5701e7c commit 641b303

9 files changed: +440 −7 lines changed

410_Scaling.asciidoc (+7)

@@ -18,5 +18,12 @@ include::410_Scaling/50_Index_templates.asciidoc[]
 
 include::410_Scaling/55_Retiring_data.asciidoc[]
 
+include::410_Scaling/60_Index_per_user.asciidoc[]
 
+include::410_Scaling/65_Shared_index.asciidoc[]
 
+include::410_Scaling/70_Faking_it.asciidoc[]
+
+include::410_Scaling/75_One_big_user.asciidoc[]
+
+include::410_Scaling/80_Scale_is_not_infinite.asciidoc[]

410_Scaling/10_Intro.asciidoc (+7)

@@ -18,5 +18,12 @@ are aware of those limitations and work with them, the growing process will be
 pleasant. If you treat Elasticsearch badly, you could be in for a world of
 pain.
 
+The default settings in Elasticsearch will take you a long way but, to get the
+most bang for your buck, you need to think about how data flows through your
+system. We will talk about two common data flows: <<time-based>>, like log
+events or social network streams, where relevance is driven by recency, and
+<<user-based>>, where a large document corpus can be subdivided by user or
+customer.
+
 This chapter will help you to make the right decisions up front, to avoid
 nasty surprises later on.

410_Scaling/45_Index_per_timeframe.asciidoc (+7 −5)

@@ -1,5 +1,5 @@
 [[time-based]]
-=== Time-based documents
+=== Time-based data
 
 One of the most common use cases for Elasticsearch is for logging, so common
 in fact that Elasticsearch provides an integrated logging platform called the
@@ -48,16 +48,18 @@ DELETE /logs/event/_query
 <1> Deletes all documents where Logstash's `@timestamp` field is
     older than 90 days.
 
-But this approach is very inefficient. Remember that when you delete a
+But this approach is *very inefficient*. Remember that when you delete a
 document, it is only marked as deleted (see <<deletes-and-updates>>). It won't
 be physically deleted until the segment containing it is merged away.
 
 Instead, use an _index per-timeframe_. You could start out with an index per
 year -- `logs_2014` -- or per month -- `logs_2014-10`. Perhaps, when your
 website gets really busy, you need to switch to an index per day --
-`logs_2014-10-24`. The point is: you can adjust your indexing capacity as you
-go. There is no need to make a final decision up front.
+`logs_2014-10-24`. Purging old data is easy -- just delete old indices.
 
+This approach has the advantage of allowing you to scale as and when you need
+to. You don't have to make any difficult decisions up front. Every day is a
+new opportunity to change your indexing timeframes to suit the current demand.
 Apply the same logic to how big you make each index. Perhaps all you need is
 one primary shard per week initially. Later, maybe you need 5 primary shards
 per day. It doesn't matter -- you can adjust to new circumstances at any
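The index-per-timeframe naming scheme above can be sketched in Python. The helper below is hypothetical (it is not part of the book's examples); it derives an index name such as `logs_2014-10-24` from an event's timestamp and a chosen granularity, which is all an indexing pipeline needs in order to switch timeframes as demand changes:

```python
from datetime import datetime

# Hypothetical sketch: map an event timestamp to a time-based index name.
# Switching granularity only changes where *new* documents are indexed;
# existing indices are untouched, and old data is purged by deleting
# whole indices rather than deleting individual documents.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}

def index_for(timestamp: datetime, granularity: str = "month") -> str:
    return "logs_" + timestamp.strftime(FORMATS[granularity])

event = datetime(2014, 10, 24, 13, 15)
print(index_for(event, "year"))   # logs_2014
print(index_for(event, "month"))  # logs_2014-10
print(index_for(event, "day"))    # logs_2014-10-24
```

Because the name is computed per document at index time, moving from monthly to daily indices is a one-line configuration change rather than a migration.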
410_Scaling/60_Index_per_user.asciidoc (new file, +39)

@@ -0,0 +1,39 @@
[[user-based]]
=== User-based data

Often, users start using Elasticsearch because they need to add full-text
search or analytics to an existing application. They create a single index
which holds all of their documents. Gradually, others in the company realise
how much benefit Elasticsearch brings, and they want to add their data to
Elasticsearch as well.

Fortunately, Elasticsearch supports
http://en.wikipedia.org/wiki/Multitenancy[multitenancy], so each new user can
have their own index in the same cluster. Occasionally, somebody will want to
search across the documents for all users, which they can do by searching
across all indices, but most of the time, users are interested only in their
own documents.

Some users have more documents than others, and some users will have heavier
search loads than others, so the ability to specify how many primary shards
and replica shards each index should have fits well with the index-per-user
model. Similarly, busier indices can be allocated to stronger boxes with shard
allocation filtering. (See <<migrate-indices>>.)

TIP: Don't just use the default number of primary shards for every index.
Think about how much data that index needs to hold. It may be that all you
need is one shard -- any more is a waste of resources.

Most users of Elasticsearch can stop here. A simple index-per-user approach
is sufficient for the majority of cases.

In exceptional cases, you may find that you need to support a large number of
users, all with similar needs. An example might be hosting a search engine
for thousands of email forums. Some forums may have a huge amount of traffic,
but the majority of forums are quite small. Dedicating an index with a single
shard to a small forum is overkill -- a single shard could hold the data for
many forums.

What we need is a way to share resources across users, to give the impression
that each user has their own index without wasting resources on small users.

410_Scaling/65_Shared_index.asciidoc (new file, +152)

@@ -0,0 +1,152 @@
[[shared-index]]
=== Shared index

We can use a large shared index for the many smaller forums by indexing
the forum identifier in a field and using it as a filter:

[source,json]
------------------------------
PUT /forums
{
  "settings": {
    "number_of_shards": 10 <1>
  },
  "mappings": {
    "post": {
      "properties": {
        "forum_id": { <2>
          "type":  "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT /forums/post/1
{
  "forum_id": "baking", <2>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Create an index large enough to hold thousands of smaller forums.
<2> Each post must include a `forum_id` to identify which forum it belongs
    to.

We can use the `forum_id` as a filter to search within a single forum. The
filter will exclude most of the documents in the index (those from other
forums), and filter caching will ensure that responses are fast:

[source,json]
------------------------------
GET /forums/post/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <1>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The `term` filter is cached by default.

This approach works, but we can do better. The posts from a single forum
would fit easily onto one shard, but currently they are scattered across all
10 shards in the index. This means that every search request has to be
forwarded to a primary or replica of all 10 shards. What would be ideal is to
ensure that all the posts from a single forum are stored on the same shard.

In <<routing-value>>, we explained that a document is allocated to a
particular shard using this formula:

    shard = hash(routing) % number_of_primary_shards

The `routing` value defaults to the document's `_id`, but we can override that
and provide our own custom routing value, such as the `forum_id`. All
documents with the same `routing` value will be stored on the same shard:

[source,json]
------------------------------
PUT /forums/post/1?routing=baking <1>
{
  "forum_id": "baking", <1>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Using the `forum_id` as the routing value ensures that all posts from the
    same forum are stored on the same shard.
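The routing formula can be illustrated with a short sketch. Elasticsearch actually hashes the routing string with Murmur3; the CRC32 stand-in below exists only to keep the illustration deterministic, so treat the specific shard numbers as hypothetical:

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 10

def shard_for(routing: str) -> int:
    # Stand-in for Elasticsearch's internal hash (really Murmur3):
    #   shard = hash(routing) % number_of_primary_shards
    return zlib.crc32(routing.encode("utf-8")) % NUMBER_OF_PRIMARY_SHARDS

# The same routing value always maps to the same shard, which is why
# routing every post on its forum_id co-locates a forum's documents:
assert shard_for("baking") == shard_for("baking")

# With the default routing (the document's _id), posts scatter across shards:
print(sorted({shard_for(doc_id) for doc_id in "12345"}))
```

The formula also explains why the number of primary shards is fixed at index-creation time: changing the divisor would reroute every existing document.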

When we search for posts in a particular forum, we can pass the same `routing`
value to ensure that the search request is run only on the single shard that
holds our documents:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <2>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The query is run only on the shard that corresponds to this `routing` value.
<2> We still need the filter, as a single shard can hold posts from many forums.

Multiple forums can be queried by passing a comma-separated list of `routing`
values, and including each `forum_id` in a `terms` filter:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking,cooking,recipes <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "terms": {
          "forum_id": [ "baking", "cooking", "recipes" ]
        }
      }
    }
  }
}
------------------------------
<1> The query is run only on the shards corresponding to these `routing` values.

While this approach is technically efficient, it looks a bit clumsy because of
the need to specify `routing` values and `terms` filters on every query or
indexing request. Things look a lot better once we add index aliases into the
mix.

410_Scaling/70_Faking_it.asciidoc (new file, +72)

@@ -0,0 +1,72 @@
[[faking-it]]
=== Faking index-per-user with aliases

To keep things simple and clean, we would like our application to believe that
we have a dedicated index per user -- or per forum, in our example -- even if
the reality is that we are using one big <<shared-index,shared index>>. To do
that, we need some way to hide the `routing` value and the filter on
`forum_id`.

Index aliases allow us to do just that. When you associate an alias with an
index, you can also specify a filter and routing values:

[source,json]
------------------------------
PUT /forums/_alias/baking
{
  "routing": "baking",
  "filter": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

Now, we can treat the `baking` alias as if it were its own index. Documents
indexed into the `baking` alias automatically get the custom routing value
applied:

[source,json]
------------------------------
PUT /baking/post/1 <1>
{
  "forum_id": "baking", <1>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> We still need the `forum_id` field for the filter to work, but
    the custom routing value is now implicit.

Queries run against the `baking` alias are run just on the shard associated
with the custom routing value, and the results are automatically filtered by
the filter we specified:

[source,json]
------------------------------
GET /baking/post/_search
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------

Multiple aliases can be specified when searching across multiple forums:

[source,json]
------------------------------
GET /baking,recipes/post/_search <1>
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------
<1> Both `routing` values are applied, and results can match either filter.
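Provisioning one such alias per forum is easily scripted. The helper below is a hypothetical sketch that builds the request path and body shown above for any forum ID; actually sending the request (with the HTTP client of your choice) is left out:

```python
def alias_request(index: str, forum_id: str) -> tuple:
    # Hypothetical helper: build the PUT path and body that define a
    # filtered, routed alias for one forum on the shared index.
    path = "/{}/_alias/{}".format(index, forum_id)
    body = {
        "routing": forum_id,
        "filter": {"term": {"forum_id": forum_id}},
    }
    return path, body

path, body = alias_request("forums", "baking")
print(path)  # /forums/_alias/baking
```

Generating the alias at forum-creation time means the rest of the application never needs to know about routing or filters at all.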

410_Scaling/75_One_big_user.asciidoc (new file, +67)

@@ -0,0 +1,67 @@
[[one-big-user]]
=== One big user

Big, popular forums start out as small forums. One day, we will find that one
shard in our shared index is doing a lot more work than the other shards,
because it holds the documents for a forum which has become very popular. That
forum now needs its own index.

The index aliases that we're using to fake an index-per-user give us a clean
migration path for the big forum.

The first step is to create a new index dedicated to the forum, with the
appropriate number of shards to allow for expected growth:

[source,json]
------------------------------
PUT /baking_v1
{
  "settings": {
    "number_of_shards": 3
  }
}
------------------------------

The next step is to migrate the data from the shared index into the dedicated
index, which can be done using <<scan-scroll,scan and scroll>> and the
<<bulk,`bulk` API>>. As soon as the migration is finished, the index alias
can be updated to point to the new index:

[source,json]
------------------------------
POST /_aliases
{
  "actions": [
    { "remove": { "alias": "baking", "index": "forums"    }},
    { "add":    { "alias": "baking", "index": "baking_v1" }}
  ]
}
------------------------------

Updating the alias is atomic; it's like throwing a switch. Your application
continues talking to the `baking` alias and is completely unaware that it now
points to a new dedicated index.

The dedicated index no longer needs the filter or the routing values. We can
just rely on the default sharding that Elasticsearch does using each
document's `_id` field.

The last step is to remove the old documents from the shared index, which can
be done with a `delete-by-query` request, using the original routing value and
forum ID:

[source,json]
------------------------------
DELETE /forums/post/_query?routing=baking
{
  "query": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

The beauty of this index-per-user model is that it allows you to reduce
resources, keeping costs low, while still giving you the flexibility to scale
out when necessary, and with zero downtime.
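The atomic swap at the heart of this migration is a single `_aliases` request. As a hypothetical sketch, the request body can be built like this for any forum being promoted to its own index (the names `forums` and `baking_v1` follow the example above):

```python
def promote_forum(alias: str, shared_index: str, dedicated_index: str) -> dict:
    # Hypothetical sketch: build the POST /_aliases body that atomically
    # repoints an alias from the shared index to a dedicated one.
    # Both actions happen in one request, so no search ever sees zero
    # indices (or both indices) behind the alias.
    return {
        "actions": [
            {"remove": {"alias": alias, "index": shared_index}},
            {"add": {"alias": alias, "index": dedicated_index}},
        ]
    }

print(promote_forum("baking", "forums", "baking_v1"))
```

Because remove and add travel in the same request, clients querying the `baking` alias never observe an intermediate state during the switch.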
