@@ -5,7 +5,7 @@ the settings alone. We have witnessed countless dozens of clusters ruined
by errant settings because the administrator thought they could turn a knob
and gain a 100x improvement.
- [IMPORTANT]
+ [NOTE]
====
Please read this entire section! All configurations presented are equally
important, and are not listed in any particular "importance" order. Please read
@@ -79,7 +79,7 @@ path.logs: /path/to/logs
path.plugins: /path/to/plugins
----
<1> Notice that you can specify more than one directory for data using comma
- separated lists.
+ separated lists.
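
For example, a minimal sketch of what this could look like in `elasticsearch.yml` (the mount points below are placeholders, not recommendations):

----
# Hypothetical mount points -- spread data across three separate drives
path.data: /mnt/disk1/elasticsearch,/mnt/disk2/elasticsearch,/mnt/disk3/elasticsearch
----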
Data can be saved to multiple directories, and if each of these directories
is mounted on a different hard drive, this is a simple and effective way to
@@ -94,15 +94,15 @@ where two masters exist in a single cluster.
When you have a split-brain, your cluster is in danger of losing data. Because
the master is considered the "supreme ruler" of the cluster, it decides
- when new indices can be created, how shards are moved, etc. If you have _two_
+ when new indices can be created, how shards are moved, etc. If you have _two_
masters, data integrity becomes perilous, since you have two different nodes
that think they are in charge.
This setting tells Elasticsearch to not elect a master unless there are enough
master-eligible nodes available. Only then will an election take place.
This setting should always be configured to a quorum (majority) of your master-
- eligible nodes. A quorum is `(number of master-eligible nodes / 2) + 1`.
+ eligible nodes. A quorum is `(number of master-eligible nodes / 2) + 1`.
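
As a quick sketch (assuming the `discovery.zen.minimum_master_nodes` setting is what is being configured here), a cluster with three master-eligible nodes would use a quorum of two:

----
# Quorum for three master-eligible nodes: (3 / 2) + 1 = 2
# (integer division, so 3 / 2 rounds down to 1)
discovery.zen.minimum_master_nodes: 2
----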
Some examples:
- If you have ten regular nodes (can hold data, can become master), a quorum is
@@ -147,31 +147,31 @@ remove master-eligible nodes.
==== Recovery settings
There are several settings which affect the behavior of shard recovery when
- your cluster restarts. First, we need to understand what happens if nothing is
+ your cluster restarts. First, we need to understand what happens if nothing is
configured.
Imagine you have 10 nodes, and each node holds a single shard -- either a primary
or a replica -- in a 5 primary / 1 replica index. You take your
- entire cluster offline for maintenance (installing new drives, etc). When you
+ entire cluster offline for maintenance (installing new drives, etc). When you
restart your cluster, it just so happens that five nodes come online before
- the other five.
+ the other five.
Maybe the switch to the other five is being flaky and they didn't
- receive the restart command right away. Whatever the reason, you have five nodes
- online. These five nodes will gossip with each other, elect a master, and form a
+ receive the restart command right away. Whatever the reason, you have five nodes
+ online. These five nodes will gossip with each other, elect a master, and form a
cluster. They notice that data is no longer evenly distributed since five
nodes are missing from the cluster, and immediately start replicating new
shards between each other.
Finally, your other five nodes turn on and join the cluster. These nodes see
- that _their_ data is being replicated to other nodes, so they delete their local
+ that _their_ data is being replicated to other nodes, so they delete their local
data (since it is now redundant and may be outdated). Then the cluster starts
to rebalance even more, since the cluster size just went from five to 10.
During this whole process, your nodes are thrashing the disk and network, moving
- data around...for no good reason. For large clusters with terabytes of data,
- this useless shuffling of data can take a _really long time_. If all the nodes
- had simply waited for the cluster to come online, all the data would have been
+ data around...for no good reason. For large clusters with terabytes of data,
+ this useless shuffling of data can take a _really long time_. If all the nodes
+ had simply waited for the cluster to come online, all the data would have been
local and nothing would need to move.
Now that we know the problem, we can configure a few settings to alleviate it.
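
One hedged sketch of what such a configuration could look like, assuming the `gateway.recover_after_nodes`, `gateway.expected_nodes`, and `gateway.recover_after_time` settings and the ten-node cluster from the example above (the thresholds are illustrative, not prescriptive):

----
# Don't begin recovery until at least eight nodes are present
gateway.recover_after_nodes: 8
# Once that threshold is met, start recovering either when ten nodes
# have joined or after five minutes, whichever happens first
gateway.expected_nodes: 10
gateway.recover_after_time: 5m
----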
@@ -220,9 +220,9 @@ a few nodes on and they automatically find each other and form a cluster.
This ease of use is the exact reason you should disable it in production. The
last thing you want is for nodes to accidentally join your production network, simply
- because they received an errant multicast ping. There is nothing wrong with
+ because they received an errant multicast ping. There is nothing wrong with
multicast _per se_. Multicast simply leads to silly problems, and can be a bit
- more fragile (e.g. a network engineer fiddles with the network without telling
+ more fragile (e.g. a network engineer fiddles with the network without telling
you...and all of a sudden nodes can't find each other anymore).
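
As the next paragraph recommends, disabling multicast and listing your nodes explicitly is the safer route; a hedged sketch (the hostnames are placeholders, and this assumes the zen discovery settings shown) could look like:

----
# Turn off multicast so stray nodes can't wander into the cluster
discovery.zen.ping.multicast.enabled: false
# Explicit (unicast) list of nodes to contact -- placeholder hostnames
discovery.zen.ping.unicast.hosts: ["es-node1.example.com", "es-node2.example.com:9300"]
----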
In production, it is recommended to use unicast instead of multicast. This works