From fcdf493882144244bec3c204c94801ef7859ae61 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Tue, 26 Apr 2016 17:14:15 -0400 Subject: [PATCH 001/107] Typos, rewording --- 510_Deployment/50_heap.asciidoc | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index ca6a9b2ff..7bac00cb6 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -122,7 +122,7 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.8` java -Xmx32767m -XX:+PrintFlagsFinal bool UseCompressedOops = false ---- -The morale of the story is that the exact cutoff to leverage compressed oops +The moral of the story is that the exact cutoff to leverage compressed oops varies from JVM to JVM, so take caution when taking examples from elsewhere and be sure to check your system with your configuration and JVM. @@ -147,22 +147,23 @@ of RAM. First, we would recommend avoiding such large machines (see <>). -But if you already have the machines, you have two practical options: +But if you already have the machines, you have three practical options: - Are you doing mostly full-text search? Consider giving 4-32 GB to Elasticsearch and letting Lucene use the rest of memory via the OS filesystem cache. All that memory will cache segments and lead to blisteringly fast full-text search. - Are you doing a lot of sorting/aggregations? Are most of your aggregations on numerics, -dates, geo_points and `not_analyzed` strings? You're in luck! Give Elasticsearch -somewhere from 4-32 GB of memory and leave the rest for the OS to cache doc values -in memory. +dates, geo_points and `not_analyzed` strings? You're in luck, your aggregations will be done on +memory-friendly doc values! Give Elasticsearch somewhere from 4-32 GB of memory and leave the +rest for the OS to cache doc values in memory. - Are you doing a lot of sorting/aggregations on analyzed strings (e.g. for word-tags, or SigTerms, etc)? Unfortunately that means you'll need fielddata, which means you -need heap space. Instead of one node with more than 512 GB of RAM, consider running two or -more nodes on a single machine. Still adhere to the 50% rule, though. So if your -machine has 128 GB of RAM, run two nodes, each with just under 32 GB. This means that less +need heap space. Instead of one node with a huge amount of RAM, consider running two or +more nodes on a single machine. Still adhere to the 50% rule, though. ++ +So if your machine has 128 GB of RAM, run two nodes each with just under 32 GB. This means that less than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene. + If you choose this option, set `cluster.routing.allocation.same_shard.host: true` From 40417754c872410433e31af03663f64492fe9dce Mon Sep 17 00:00:00 2001 From: Robert Date: Mon, 23 May 2016 17:12:46 +0200 Subject: [PATCH 002/107] Aggregation response has wrong name (#540) Query defines aggregation with name *popular_colors* but response uses *colors*. 
--- 300_Aggregations/20_basic_example.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/300_Aggregations/20_basic_example.asciidoc b/300_Aggregations/20_basic_example.asciidoc index af0627226..2c3281000 100644 --- a/300_Aggregations/20_basic_example.asciidoc +++ b/300_Aggregations/20_basic_example.asciidoc @@ -99,7 +99,7 @@ Let's execute that aggregation and take a look at the results: "hits": [] <1> }, "aggregations": { - "colors": { <2> + "popular_colors": { <2> "buckets": [ { "key": "red", <3> @@ -119,7 +119,7 @@ Let's execute that aggregation and take a look at the results: } -------------------------------------------------- <1> No search hits are returned because we set the `size` parameter -<2> Our `colors` aggregation is returned as part of the `aggregations` field. +<2> Our `popular_colors` aggregation is returned as part of the `aggregations` field. <3> The `key` to each bucket corresponds to a unique term found in the `color` field. It also always includes `doc_count`, which tells us the number of docs containing the term. <4> The count of each bucket represents the number of documents with this color. From 138405f998afdd8841c8ffc96209c6628fe45cb0 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Tue, 24 May 2016 10:26:39 +0200 Subject: [PATCH 003/107] Update 30_Tutorial_Search.asciidoc --- 010_Intro/30_Tutorial_Search.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/010_Intro/30_Tutorial_Search.asciidoc b/010_Intro/30_Tutorial_Search.asciidoc index f9717422d..dfdd2781e 100644 --- a/010_Intro/30_Tutorial_Search.asciidoc +++ b/010_Intro/30_Tutorial_Search.asciidoc @@ -210,11 +210,11 @@ GET /megacorp/employee/_search { "query" : { "bool": { - "must": [ + "must": { "match" : { "last_name" : "smith" <1> } - ], + }, "filter": { "range" : { "age" : { "gt" : 30 } <2> From 07a4454c86f4aaeff013441fdb52b97ad4ae6d11 Mon Sep 17 00:00:00 2001 From: Olivier Bourgain Date: Tue, 24 May 2016 15:52:46 +0200 Subject: [PATCH 004/107] Fix nested sort doc (#532) --- 402_Nested/33_Nested_sorting.asciidoc | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/402_Nested/33_Nested_sorting.asciidoc b/402_Nested/33_Nested_sorting.asciidoc index 37adb15e0..c54119090 100644 --- a/402_Nested/33_Nested_sorting.asciidoc +++ b/402_Nested/33_Nested_sorting.asciidoc @@ -56,7 +56,8 @@ GET /_search "comments.stars": { <2> "order": "asc", <2> "mode": "min", <2> - "nested_filter": { <3> + "nested_path": "comments", <3> + "nested_filter": { "range": { "comments.date": { "gte": "2014-10-01", @@ -72,10 +73,10 @@ GET /_search comment in October. <2> Results are sorted in ascending (`asc`) order by the lowest value (`min`) in the `comment.stars` field in any matching comments. -<3> The `nested_filter` in the sort clause is the same as the `nested` query in +<3> The `nested_path` and `nested_filter` in the sort clause are the same as the `nested` query in the main `query` clause. The reason is explained next. -Why do we need to repeat the query conditions in the `nested_filter`? The +Why do we need to repeat the query conditions in the `nested_path` and `nested_filter`? The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. 
If we didn't include the `nested_filter` From f199aa39d13899e18af685de54a5cdac249404e5 Mon Sep 17 00:00:00 2001 From: Julien Pivotto Date: Tue, 31 May 2016 22:38:23 +0200 Subject: [PATCH 005/107] Add a note about the Reindex API (#531) Fixes #530 --- 070_Index_Mgmt/50_Reindexing.asciidoc | 3 +++ 1 file changed, 3 insertions(+) diff --git a/070_Index_Mgmt/50_Reindexing.asciidoc b/070_Index_Mgmt/50_Reindexing.asciidoc index 59b5ca0e7..a0d54ed14 100644 --- a/070_Index_Mgmt/50_Reindexing.asciidoc +++ b/070_Index_Mgmt/50_Reindexing.asciidoc @@ -18,6 +18,9 @@ To reindex all of the documents from the old index efficiently, use <> to retrieve batches((("using in reindexing documents"))) of documents from the old index, and the <> to push them into the new index. +Beginning with Elasticsearch v2.3.0, a {ref}/docs-reindex.html[Reindex API] has been introduced. It enables you +to reindex your documents without requiring any plugin nor external tool. + .Reindexing in Batches **** From 722eb135ce8c407509fd43c0d365bb22846db96e Mon Sep 17 00:00:00 2001 From: Chuck Date: Wed, 1 Jun 2016 04:48:56 +0800 Subject: [PATCH 006/107] Update 20_basic_example.asciidoc (#512) FIX: The example aggregation name in the request is 'popular_colors'. But it turned into 'colors' in the response. From 8ff2fbdcb6427ddc5088fc7aa56e97da4fd2a486 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Tue, 31 May 2016 16:51:05 -0400 Subject: [PATCH 007/107] Comment the PR template People keep sending PRs with the text intact, comment it out for my sanity :) --- .github/PULL_REQUEST_TEMPLATE.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 40d99cc66..3187fc69e 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,6 +1,8 @@ + From 8ad735a8480fc1d0de91d88bf01d92f48c2de14c Mon Sep 17 00:00:00 2001 From: ericamick Date: Tue, 24 May 2016 09:53:27 -0400 Subject: [PATCH 008/107] Update 50_Analysis_chain.asciidoc (#541) --- 260_Synonyms/50_Analysis_chain.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/260_Synonyms/50_Analysis_chain.asciidoc b/260_Synonyms/50_Analysis_chain.asciidoc index e7b642366..66ea49fb8 100644 --- a/260_Synonyms/50_Analysis_chain.asciidoc +++ b/260_Synonyms/50_Analysis_chain.asciidoc @@ -39,7 +39,7 @@ stemmer, and to list just the root words that would be emitted by the stemmer: Normally, synonym filters are placed after the `lowercase` token filter and so all synonyms are ((("synonyms", "and the analysis chain", "case-sensitive synonyms")))((("case-sensitive synonyms")))written in lowercase, but sometimes that can lead to odd conflations. For instance, a `CAT` scan and a `cat` are quite different, as -are `PET` (positron emmision tomography) and a `pet`. For that matter, the +are `PET` (positron emission tomography) and a `pet`. For that matter, the surname `Little` is distinct from the adjective `little` (although if a sentence starts with the adjective, it will be uppercased anyway). @@ -49,7 +49,7 @@ that your synonym rules would need to list all of the case variations that you want to match (for example, `Little,LITTLE,little`). Instead of that, you could have two synonym filters: one to catch the case-sensitive -synonyms and one for all the case-insentive synonyms. For instance, the +synonyms and one for all the case-insensitive synonyms. 
For instance, the case-sensitive rules could look like this: "CAT,CAT scan => cat_scan" @@ -57,7 +57,7 @@ case-sensitive rules could look like this: "Johnny Little,J Little => johnny_little" "Johnny Small,J Small => johnny_small" -And the case-insentive rules could look like this: +And the case-insensitive rules could look like this: "cat => cat,pet" "dog => dog,pet" From 2bbb25c53520935c9d170b1f03ed7be67c64d92b Mon Sep 17 00:00:00 2001 From: Brian Atwood Date: Tue, 31 May 2016 15:37:37 -0500 Subject: [PATCH 009/107] Fix typo (#544) --- 110_Multi_Field_Search/05_Multiple_query_strings.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc b/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc index ef2f0c54e..322a9c964 100644 --- a/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc +++ b/110_Multi_Field_Search/05_Multiple_query_strings.asciidoc @@ -73,7 +73,7 @@ would have reduced the contribution of the title and author clauses to one-quart It is likely that an even one-third split between clauses is not what we need for the preceding query. ((("multifield search", "multiple query strings", "prioritizing query clauses")))((("bool query", "prioritizing clauses"))) Probably we're more interested in the title and author -clauses then we are in the translator clauses. We need to tune the query to +clauses than we are in the translator clauses. We need to tune the query to make the title and author clauses relatively more important. The simplest weapon in our tuning arsenal is the `boost` parameter. To From 689252283e1d9006683691bb142adef7e7958ef8 Mon Sep 17 00:00:00 2001 From: Natthakit Susanthitanon Date: Wed, 1 Jun 2016 03:38:05 +0700 Subject: [PATCH 010/107] Fix typo in 40_bitsets.asciidoc (#542) --- 080_Structured_Search/40_bitsets.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/080_Structured_Search/40_bitsets.asciidoc b/080_Structured_Search/40_bitsets.asciidoc index 1522a9efd..38a690bac 100644 --- a/080_Structured_Search/40_bitsets.asciidoc +++ b/080_Structured_Search/40_bitsets.asciidoc @@ -81,7 +81,7 @@ that was cacheable. This often meant the system cached bitsets too aggressively and performance suffered due to thrashing the cache. In addition, many filters are very fast to evaluate, but substantially slower to cache (and reuse from cache). These filters don't make sense to cache, since you'd be better off just re-executing -the fitler again. +the filter again. Inspecting the inverted index is very fast and most query components are rare. Consider a `term` filter on a `"user_id"` field: if you have millions of users, From 70087210ee09ec6995cd6c93331b2f0c9010638a Mon Sep 17 00:00:00 2001 From: ericamick Date: Tue, 31 May 2016 16:40:21 -0400 Subject: [PATCH 011/107] Update 20_hardware.asciidoc (#523) --- 510_Deployment/20_hardware.asciidoc | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index acc588466..ec3954462 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -2,7 +2,7 @@ === Hardware If you've been following the normal development path, you've probably been playing((("deployment", "hardware")))((("hardware"))) -with Elasticsearch on your laptop or on a small cluster of machines laying around. +with Elasticsearch on your laptop or on a small cluster of machines lying around. 
But when it comes time to deploy Elasticsearch to production, there are a few recommendations that you should consider. Nothing is a hard-and-fast rule; Elasticsearch is used for a wide range of tasks and on a bewildering array of @@ -27,8 +27,7 @@ discuss in <>. Most Elasticsearch deployments tend to be rather light on CPU requirements. As such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should -choose a modern processor with multiple cores. Common clusters utilize two to eight -core machines. +choose a modern processor with multiple cores. Common clusters utilize two- to eight-core machines. If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster From 97f43dee6886d11b8f2bcaed2bebc3babf9d8851 Mon Sep 17 00:00:00 2001 From: ericamick Date: Tue, 31 May 2016 16:40:45 -0400 Subject: [PATCH 012/107] Update 40_other_stats.asciidoc (#522) --- 500_Cluster_Admin/40_other_stats.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/500_Cluster_Admin/40_other_stats.asciidoc b/500_Cluster_Admin/40_other_stats.asciidoc index 6224aee64..4d2a120c4 100644 --- a/500_Cluster_Admin/40_other_stats.asciidoc +++ b/500_Cluster_Admin/40_other_stats.asciidoc @@ -42,7 +42,7 @@ GET _all/_stats <3> ---- <1> Stats for `my_index`. <2> Stats for multiple indices can be requested by separating their names with a comma. -<3> Stats indices can be requested using the special `_all` index name. +<3> Stats for all indices can be requested using the special `_all` index name. The stats returned will be familar to the `node-stats` output: `search` `fetch` `get` `index` `bulk` `segment counts` and so forth From d384748f892dd54ff61e9fd31e691e51ae558634 Mon Sep 17 00:00:00 2001 From: ericamick Date: Tue, 31 May 2016 16:43:50 -0400 Subject: [PATCH 013/107] Update 20_health.asciidoc (#521) --- 500_Cluster_Admin/20_health.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/500_Cluster_Admin/20_health.asciidoc b/500_Cluster_Admin/20_health.asciidoc index 1adf814f0..2b5e636b6 100644 --- a/500_Cluster_Admin/20_health.asciidoc +++ b/500_Cluster_Admin/20_health.asciidoc @@ -50,7 +50,7 @@ high availability is compromised to some degree. If _more_ shards disappear, yo might lose data. Think of `yellow` as a warning that should prompt investigation. `red`:: - At least one primary shard (and all of its replicas) are missing. This means + At least one primary shard (and all of its replicas) is missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception. @@ -205,7 +205,7 @@ This is important for automated scripts and tests. If you create an index, Elasticsearch must broadcast the change in cluster state to all nodes. Those nodes must initialize those new shards, and then respond to the -master that the shards are `Started`. This process is fast, but because network +master that the shards are `Started`. This process is fast, but because of network latency may take 10–20ms. 
If you have an automated script that (a) creates an index and then (b) immediately From 98221d5a17619908a7e3ddd040b0d06079467a8f Mon Sep 17 00:00:00 2001 From: romainsalles Date: Tue, 31 May 2016 22:44:54 +0200 Subject: [PATCH 014/107] Add missing "inside" word (#515) --- 070_Index_Mgmt/25_Mappings.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/070_Index_Mgmt/25_Mappings.asciidoc b/070_Index_Mgmt/25_Mappings.asciidoc index d9c6464d2..f9897072f 100644 --- a/070_Index_Mgmt/25_Mappings.asciidoc +++ b/070_Index_Mgmt/25_Mappings.asciidoc @@ -154,5 +154,5 @@ In summary: - **Good:** `kitchen` and `lawn-care` types inside the `products` index, because the two types are essentially the same schema -- **Bad:** `products` and `logs` types the `data` index, because the two types are +- **Bad:** `products` and `logs` types inside the `data` index, because the two types are mutually exclusive. Separate these into their own indices. From 69862644885f1c6fb865db860f71029485fe3939 Mon Sep 17 00:00:00 2001 From: "Md.Abdulla-Al-Sun" Date: Wed, 1 Jun 2016 02:46:19 +0600 Subject: [PATCH 015/107] Update the misplacement of Comma (#524) In my sense, the comma should be after the closing inverted comma. --- 120_Proximity_Matching/05_Phrase_matching.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/120_Proximity_Matching/05_Phrase_matching.asciidoc b/120_Proximity_Matching/05_Phrase_matching.asciidoc index 645f8aedb..2d0af4695 100644 --- a/120_Proximity_Matching/05_Phrase_matching.asciidoc +++ b/120_Proximity_Matching/05_Phrase_matching.asciidoc @@ -95,7 +95,7 @@ all the words in exactly the order specified, with no words in-between. ==== What Is a Phrase -For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase ``quick brown fox,'' the following must be true: +For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase ``quick brown fox'', the following must be true: * `quick`, `brown`, and `fox` must all appear in the field. From 0510fc41111e1e12432c8aae5b6bc684143b32ca Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Rafa=C5=82=20Bigaj?= <4rafalbigaj@gmail.com> Date: Tue, 31 May 2016 22:46:36 +0200 Subject: [PATCH 016/107] Colon added before code snippet (#516) Hope it helps :) --- 056_Sorting/88_String_sorting.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index f35a59058..db220ea1b 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -22,7 +22,7 @@ and one that is `not_analyzed` for sorting. But storing the same string twice in the `_source` field is waste of space. What we really want to do is to pass in a _single field_ but to _index it in two different ways_. 
All of the _core_ field types (strings, numbers, Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a -simple mapping like +simple mapping like: [source,js] -------------------------------------------------- From 0ead255bc3ccc77e65b38c9179538a71f96f2496 Mon Sep 17 00:00:00 2001 From: Prashant Tiwari Date: Wed, 1 Jun 2016 02:17:50 +0530 Subject: [PATCH 017/107] Fix token positions (#513) --- 240_Stopwords/20_Using_stopwords.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/240_Stopwords/20_Using_stopwords.asciidoc b/240_Stopwords/20_Using_stopwords.asciidoc index 4fa4e438a..3c3fd47f2 100644 --- a/240_Stopwords/20_Using_stopwords.asciidoc +++ b/240_Stopwords/20_Using_stopwords.asciidoc @@ -70,14 +70,14 @@ The quick and the dead "start_offset": 4, "end_offset": 9, "type": "", - "position": 2 <1> + "position": 1 <1> }, { "token": "dead", "start_offset": 18, "end_offset": 22, "type": "", - "position": 5 <1> + "position": 4 <1> } ] } From ff611155d23ae901f0b20768d413a157a4cb0cf0 Mon Sep 17 00:00:00 2001 From: gopimanikandan Date: Wed, 1 Jun 2016 02:28:30 +0530 Subject: [PATCH 018/107] Update 60_restore.asciidoc (#496) The command in the documentation not working as expected. It's showing error. I have the updated the command in the documentation after checking the command --- 520_Post_Deployment/60_restore.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/520_Post_Deployment/60_restore.asciidoc b/520_Post_Deployment/60_restore.asciidoc index f0ef88b6d..a4dd37f45 100644 --- a/520_Post_Deployment/60_restore.asciidoc +++ b/520_Post_Deployment/60_restore.asciidoc @@ -66,7 +66,7 @@ The API can be invoked for the specific indices that you are recovering: [source,js] ---- -GET /_recovery/restored_index_3 +GET restored_index_3/_recovery ---- Or for all indices in your cluster, which may include other shards moving around, From 035ce3c71dfe693e5432400b33ff62a39244d529 Mon Sep 17 00:00:00 2001 From: ericamick Date: Tue, 31 May 2016 17:02:55 -0400 Subject: [PATCH 019/107] Update 10_Intro.asciidoc (#507) --- 400_Relationships/10_Intro.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/400_Relationships/10_Intro.asciidoc b/400_Relationships/10_Intro.asciidoc index e9461a449..438c2d8ac 100644 --- a/400_Relationships/10_Intro.asciidoc +++ b/400_Relationships/10_Intro.asciidoc @@ -25,7 +25,7 @@ surprise to you--to manage((("relational databases", "managing relationships"))) entities. But relational ((("ACID transactions")))databases do have their limitations, besides their poor support -for full-text search. Joining entities at query time is expensive--more +for full-text search. Joining entities at query time is expensive--the more joins that are required, the more expensive the query. Performing joins between entities that live on different hardware is so expensive that it is just not practical. This places a limit on the amount of data that can be From 235cb476d9dc3852b4eebb6883b5f5ddb8b32fa3 Mon Sep 17 00:00:00 2001 From: Aaron Johnson Date: Tue, 31 May 2016 17:03:54 -0400 Subject: [PATCH 020/107] Fix drunken indentation. (#497) * Fix drunken indentation. * Fix more drunken indentation. 
--- 402_Nested/32_Nested_query.asciidoc | 52 +++++++++++++++++++++++------ 1 file changed, 42 insertions(+), 10 deletions(-) diff --git a/402_Nested/32_Nested_query.asciidoc b/402_Nested/32_Nested_query.asciidoc index 54c7d7702..d680ceb7a 100644 --- a/402_Nested/32_Nested_query.asciidoc +++ b/402_Nested/32_Nested_query.asciidoc @@ -12,17 +12,32 @@ GET /my_index/blogpost/_search "query": { "bool": { "must": [ - { "match": { "title": "eggs" }}, <1> + { + "match": { + "title": "eggs" <1> + } + }, { "nested": { "path": "comments", <2> "query": { "bool": { "must": [ <3> - { "match": { "comments.name": "john" }}, - { "match": { "comments.age": 28 }} + { + "match": { + "comments.name": "john" + } + }, + { + "match": { + "comments.age": 28 + } + } ] - }}}} + } + } + } + } ] }}} -------------------------- @@ -58,20 +73,37 @@ GET /my_index/blogpost/_search "query": { "bool": { "must": [ - { "match": { "title": "eggs" }}, + { + "match": { + "title": "eggs" + } + }, { "nested": { - "path": "comments", + "path": "comments", "score_mode": "max", <1> "query": { "bool": { "must": [ - { "match": { "comments.name": "john" }}, - { "match": { "comments.age": 28 }} + { + "match": { + "comments.name": "john" + } + }, + { + "match": { + "comments.age": 28 + } + } ] - }}}} + } + } + } + } ] -}}} + } + } +} -------------------------- <1> Give the root document the `_score` from the best-matching nested document. From e45138a50d5d0018c8a9fad426244017dcb3229e Mon Sep 17 00:00:00 2001 From: fabiohlc Date: Tue, 31 May 2016 18:16:54 -0300 Subject: [PATCH 021/107] Update 50_Sorting_by_distance.asciidoc (#389) A simple error a , instead of a . ;-) --- 310_Geopoints/50_Sorting_by_distance.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/310_Geopoints/50_Sorting_by_distance.asciidoc b/310_Geopoints/50_Sorting_by_distance.asciidoc index 8d37804f5..ce09e8b84 100644 --- a/310_Geopoints/50_Sorting_by_distance.asciidoc +++ b/310_Geopoints/50_Sorting_by_distance.asciidoc @@ -17,7 +17,7 @@ GET /attractions/restaurant/_search "type": "indexed", "location": { "top_left": { - "lat": 40,8, + "lat": 40.8, "lon": -74.0 }, "bottom_right": { From 1eb6b0c18583124b349da62a298a7d9b2cb05ead Mon Sep 17 00:00:00 2001 From: sallyx Date: Tue, 31 May 2016 23:17:15 +0200 Subject: [PATCH 022/107] Update 25_ranges.asciidoc (#376) This change reflects real BETWEEN behavior. --- 080_Structured_Search/25_ranges.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/080_Structured_Search/25_ranges.asciidoc b/080_Structured_Search/25_ranges.asciidoc index 9bdc8b64c..6fc15bfe3 100644 --- a/080_Structured_Search/25_ranges.asciidoc +++ b/080_Structured_Search/25_ranges.asciidoc @@ -20,8 +20,8 @@ can be used to find documents falling inside a range: -------------------------------------------------- "range" : { "price" : { - "gt" : 20, - "lt" : 40 + "gte" : 20, + "lte" : 40 } } -------------------------------------------------- From 3a9ca12a9c74563813cb1b7fbfc1e88f4380f0ab Mon Sep 17 00:00:00 2001 From: oyiadom Date: Tue, 31 May 2016 17:18:49 -0400 Subject: [PATCH 023/107] Update 45_Popularity.asciidoc (#368) If I am understanding example and how the "missing" property works, then I think my suggestion helps to make more people aware of their development options. In this example, I think the value of the "missing" property would have to be strictly between 0 and 1 in order to get the desired boosting. 
--- 170_Relevance/45_Popularity.asciidoc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/170_Relevance/45_Popularity.asciidoc b/170_Relevance/45_Popularity.asciidoc index 4df86238e..de79cda09 100644 --- a/170_Relevance/45_Popularity.asciidoc +++ b/170_Relevance/45_Popularity.asciidoc @@ -45,7 +45,8 @@ GET /blogposts/post/_search <3> The `field_value_factor` function is applied to every document matching the main `query`. <4> Every document _must_ have a number in the `votes` field for - the `function_score` to work. + the `function_score` to work. If every document does _not_ have a number in the `votes` field, then you _must_ use the + http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor[`missing` property] to provide a default value for the score calculation. In the preceding example, the final `_score` for each document has been altered as follows: From 098da37d151cbbc4dd2d0e04e0cf3a41796b4383 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Wed, 1 Jun 2016 09:32:48 +0200 Subject: [PATCH 024/107] Ref docs should point to master instead of current --- book.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book.asciidoc b/book.asciidoc index 42993718c..e3c62671a 100644 --- a/book.asciidoc +++ b/book.asciidoc @@ -1,6 +1,6 @@ :bookseries: animal :es_build: 1 -:ref: https://www.elastic.co/guide/en/elasticsearch/reference/current +:ref: https://www.elastic.co/guide/en/elasticsearch/reference/master = Elasticsearch: The Definitive Guide From 62eda78747ffed975cc5d3b96053012104a77e02 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Wed, 1 Jun 2016 09:32:54 +0200 Subject: [PATCH 025/107] Fixed bad link --- 170_Relevance/45_Popularity.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/170_Relevance/45_Popularity.asciidoc b/170_Relevance/45_Popularity.asciidoc index de79cda09..be7e07adb 100644 --- a/170_Relevance/45_Popularity.asciidoc +++ b/170_Relevance/45_Popularity.asciidoc @@ -46,7 +46,7 @@ GET /blogposts/post/_search the main `query`. <4> Every document _must_ have a number in the `votes` field for the `function_score` to work. If every document does _not_ have a number in the `votes` field, then you _must_ use the - http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor[`missing` property] to provide a default value for the score calculation. + {ref}/query-dsl-function-score-query.html#function-field-value-factor[`missing` property] to provide a default value for the score calculation. In the preceding example, the final `_score` for each document has been altered as follows: From 50be03dfd1bfeeb9dff0ccbd45359c6aff58dba0 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Wed, 1 Jun 2016 11:24:54 +0200 Subject: [PATCH 026/107] Fixed bad links --- 075_Inside_a_shard/50_Persistent_changes.asciidoc | 2 +- 170_Relevance/65_Script_score.asciidoc | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/075_Inside_a_shard/50_Persistent_changes.asciidoc b/075_Inside_a_shard/50_Persistent_changes.asciidoc index 55dcdf015..9b67237b8 100644 --- a/075_Inside_a_shard/50_Persistent_changes.asciidoc +++ b/075_Inside_a_shard/50_Persistent_changes.asciidoc @@ -83,7 +83,7 @@ image::images/elas_1109.png["After a flush, the segments are fully commited and The action of performing a commit and truncating the translog is known in Elasticsearch as a _flush_. 
((("flushes"))) Shards are flushed automatically every 30 minutes, or when the translog becomes too big. See the -{ref}/index-modules-translog.html#_translog_settings[`translog` documentation] for settings +{ref}/index-modules-translog.html[`translog` documentation] for settings that can be used((("translog (transaction log)", "flushes and"))) to control these thresholds: The {ref}/indices-flush.html[`flush` API] can ((("indices", "flushing")))((("flush API")))be used to perform a manual flush: diff --git a/170_Relevance/65_Script_score.asciidoc b/170_Relevance/65_Script_score.asciidoc index c6ca3babf..ca914ea49 100644 --- a/170_Relevance/65_Script_score.asciidoc +++ b/170_Relevance/65_Script_score.asciidoc @@ -80,7 +80,7 @@ GET /_search "discount": 0.1, "target": 10 }, - "script": "price = doc['price'].value; margin = doc['margin'].value; + "script": "price = doc['price'].value; margin = doc['margin'].value; if (price < threshold) { return price * margin / target }; return price * (1 - discount) * margin / target;" <3> } @@ -115,7 +115,7 @@ scripts are not quite fast enough, you have three options: document. * Groovy is fast, but not quite as fast as Java.((("Java", "scripting in"))) You could reimplement your script as a native Java script. (See - {ref}/modules-scripting.html#native-java-scripts[Native Java Scripts]). + {ref}/modules-scripting-native.html[Native Java Scripts]). * Use the `rescore` functionality((("rescoring"))) described in <> to apply your script to only the best-scoring documents. From 7f8db850a36fc681f2d1fca5fcc4a91c52a34d6a Mon Sep 17 00:00:00 2001 From: Guillermo Mansilla Date: Fri, 3 Jun 2016 10:58:29 -0400 Subject: [PATCH 027/107] Update 15_API.asciidoc (#466) --- 010_Intro/15_API.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/010_Intro/15_API.asciidoc b/010_Intro/15_API.asciidoc index 42ab6ee02..f6c1baa8f 100644 --- a/010_Intro/15_API.asciidoc +++ b/010_Intro/15_API.asciidoc @@ -1,6 +1,6 @@ === Talking to Elasticsearch -How you talk to Elasticsearch depends on((("Elasticsearch", "talking to"))) whether you are using Java. +How you talk to Elasticsearch depends on((("Elasticsearch", "talking to"))) whether you are using Java or not. ==== Java API From e43540851920a1ee26d9ffdca3adb6ea92d3e353 Mon Sep 17 00:00:00 2001 From: Lonre Wang Date: Fri, 3 Jun 2016 22:59:07 +0800 Subject: [PATCH 028/107] Typo fix (#550) --- 210_Identifying_words/50_Tidying_text.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/210_Identifying_words/50_Tidying_text.asciidoc b/210_Identifying_words/50_Tidying_text.asciidoc index 30d3c0365..f8acbf26d 100644 --- a/210_Identifying_words/50_Tidying_text.asciidoc +++ b/210_Identifying_words/50_Tidying_text.asciidoc @@ -15,7 +15,7 @@ For example: [source,js] -------------------------------------------------- -GET /_analyzer?tokenizer=standard +GET /_analyze?tokenizer=standard

Some déjà vu website -------------------------------------------------- @@ -34,7 +34,7 @@ in the query string: [source,js] -------------------------------------------------- -GET /_analyzer?tokenizer=standard&char_filters=html_strip +GET /_analyze?tokenizer=standard&char_filters=html_strip

Some déjà vu website -------------------------------------------------- @@ -62,7 +62,7 @@ Once created, our new `my_html_analyzer` can be tested with the `analyze` API: [source,js] -------------------------------------------------- -GET /my_index/_analyzer?analyzer=my_html_analyzer +GET /my_index/_analyze?analyzer=my_html_analyzer

Some déjà vu website -------------------------------------------------- From 89dc347778ce679fab2183c9ebe4ebbf77cab55c Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 13:37:10 -0400 Subject: [PATCH 029/107] Use new body specification for analyze API Closes #510 Closes #473 Closes #433 Closes #511 --- 052_Mapping_Analysis/40_Analysis.asciidoc | 7 +++++-- 052_Mapping_Analysis/45_Mapping.asciidoc | 12 +++++++----- 080_Structured_Search/05_term.asciidoc | 9 +++++++-- snippets/052_Mapping_Analysis/40_Analyze.json | 6 +++++- snippets/052_Mapping_Analysis/45_Mapping.json | 12 ++++++++++-- snippets/080_Structured_Search/05_Term_number.json | 7 +++++++ 6 files changed, 41 insertions(+), 12 deletions(-) diff --git a/052_Mapping_Analysis/40_Analysis.asciidoc b/052_Mapping_Analysis/40_Analysis.asciidoc index 2fd738a3c..3244139bf 100644 --- a/052_Mapping_Analysis/40_Analysis.asciidoc +++ b/052_Mapping_Analysis/40_Analysis.asciidoc @@ -159,8 +159,11 @@ parameters, and the text to analyze in the body: [source,js] -------------------------------------------------- -GET /_analyze?analyzer=standard -Text to analyze +GET /_analyze +{ + "analyzer": "standard", + "text": "Text to analyze" +} -------------------------------------------------- // SENSE: 052_Mapping_Analysis/40_Analyze.json diff --git a/052_Mapping_Analysis/45_Mapping.asciidoc b/052_Mapping_Analysis/45_Mapping.asciidoc index 3d0dbaa75..8408ce78b 100644 --- a/052_Mapping_Analysis/45_Mapping.asciidoc +++ b/052_Mapping_Analysis/45_Mapping.asciidoc @@ -144,10 +144,10 @@ can contain one of three values: `analyzed`:: First analyze the string and then index it. In other words, index this field as full text. - `not_analyzed`:: + `not_analyzed`:: Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it. - `no`:: + `no`:: Don't index this field at all. This field will not be searchable. The default value of `index` for a `string` field is `analyzed`. If we @@ -204,7 +204,7 @@ for an existing type) later, using the `/_mapping` endpoint. ================================================ Although you can _add_ to an existing mapping, you can't _change_ existing field mappings. If a mapping already exists for a field, data from that -field has probably been indexed. If you were to change the field mapping, +field has probably been indexed. If you were to change the field mapping, the indexed data would be wrong and would not be properly searchable. ================================================ @@ -278,13 +278,15 @@ name. 
Compare the output of these two requests: [source,js] -------------------------------------------------- -GET /gb/_analyze?field=tweet +GET /gb/_analyze { + "field": "tweet" "text": "Black-cats" <1> } -GET /gb/_analyze?field=tag +GET /gb/_analyze { + "field": "tag", "text": "Black-cats" <1> } -------------------------------------------------- diff --git a/080_Structured_Search/05_term.asciidoc b/080_Structured_Search/05_term.asciidoc index 170ed3181..b65350536 100644 --- a/080_Structured_Search/05_term.asciidoc +++ b/080_Structured_Search/05_term.asciidoc @@ -147,9 +147,14 @@ can see that our UPC has been tokenized into smaller tokens: [source,js] -------------------------------------------------- -GET /my_store/_analyze?field=productID -XHDK-A-1293-#fJ3 +GET /my_store/_analyze +{ + "field": "productID", + "text": "XHDK-A-1293-#fJ3" +} -------------------------------------------------- +// SENSE: 080_Structured_Search/05_Term_text.json + [source,js] -------------------------------------------------- { diff --git a/snippets/052_Mapping_Analysis/40_Analyze.json b/snippets/052_Mapping_Analysis/40_Analyze.json index e2043871d..1e48df8d5 100644 --- a/snippets/052_Mapping_Analysis/40_Analyze.json +++ b/snippets/052_Mapping_Analysis/40_Analyze.json @@ -1,2 +1,6 @@ # Analyze the `text` with the `standard` analyzer -GET /_analyze?analyzer=standard&text=Text to analyze +GET /_analyze +{ + "analyzer": "standard", + "text": "Text to analyze" +} diff --git a/snippets/052_Mapping_Analysis/45_Mapping.json b/snippets/052_Mapping_Analysis/45_Mapping.json index 683c73403..6e1ac8b3c 100644 --- a/snippets/052_Mapping_Analysis/45_Mapping.json +++ b/snippets/052_Mapping_Analysis/45_Mapping.json @@ -40,7 +40,15 @@ PUT /gb/_mapping/tweet GET /gb/_mapping/tweet # Test the analyzer for the `tweet` field -GET /gb/_analyze?field=tweet&text=Black-cats +GET /gb/_analyze +{ + "field": "tweet", + "text": "Black-cats" +} # Test the analyzer for the `tag` field -GET /gb/_analyze?field=tag&text=Black-cats \ No newline at end of file +GET /gb/_analyze +{ + "field": "tag", + "text": "Black-cats" +} diff --git a/snippets/080_Structured_Search/05_Term_number.json b/snippets/080_Structured_Search/05_Term_number.json index 25a3b7f99..d718770e2 100644 --- a/snippets/080_Structured_Search/05_Term_number.json +++ b/snippets/080_Structured_Search/05_Term_number.json @@ -26,6 +26,13 @@ GET /my_store/products/_search } } +# Check the analyzed tokens +GET /my_store/_analyze +{ + "field": "productID", + "text": "XHDK-A-1293-#fJ3" +} + # Same as above, without the `match_all` query GET /my_store/products/_search { From da9d10d8404d4de9d67ea6e5f60815052b06e9b2 Mon Sep 17 00:00:00 2001 From: rabu3082 Date: Fri, 3 Jun 2016 19:50:02 +0200 Subject: [PATCH 030/107] corrects syntax for testing an analyzer (#505) TODO: the json for Sense has to be corrected as well (it says: "child \"uri\" fails because [\"uri\" must be a valid uri]"") --- 130_Partial_Matching/35_Search_as_you_type.asciidoc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/130_Partial_Matching/35_Search_as_you_type.asciidoc b/130_Partial_Matching/35_Search_as_you_type.asciidoc index aa110867b..96485ebc5 100644 --- a/130_Partial_Matching/35_Search_as_you_type.asciidoc +++ b/130_Partial_Matching/35_Search_as_you_type.asciidoc @@ -93,7 +93,9 @@ the `analyze` API: [source,js] -------------------------------------------------- GET /my_index/_analyze?analyzer=autocomplete -quick brown +{ + "text": "quick brown" +} -------------------------------------------------- // 
SENSE: 130_Partial_Matching/35_Search_as_you_type.json From daf8ba6f7083bb8d1520322b1d588b6d8d476089 Mon Sep 17 00:00:00 2001 From: rabu3082 Date: Fri, 3 Jun 2016 19:51:15 +0200 Subject: [PATCH 031/107] corrects syntax for testing an analyzer in json for Sense (#506) at the moment, you are confronted with { "statusCode": 400, "error": "Bad Request", "message": "child \"uri\" fails because [\"uri\" must be a valid uri]", "validation": { "source": "query", "keys": [ "uri" ] } } when sending the request. --- snippets/130_Partial_Matching/35_Search_as_you_type.json | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/snippets/130_Partial_Matching/35_Search_as_you_type.json b/snippets/130_Partial_Matching/35_Search_as_you_type.json index ee8c86f11..6a39e14ed 100644 --- a/snippets/130_Partial_Matching/35_Search_as_you_type.json +++ b/snippets/130_Partial_Matching/35_Search_as_you_type.json @@ -31,7 +31,10 @@ PUT /my_index } # Test the autocomplete analyzer -GET /my_index/_analyze?analyzer=autocomplete&text=quick brown +GET /my_index/_analyze?analyzer=autocomplete +{ + "text": "quick brown" +} # Map the `name` field to use the `autocomplete` analyzer PUT /my_index/_mapping/mytype From 5a427863888604fb7474ab01038f6fccc8cdc47f Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 13:52:35 -0400 Subject: [PATCH 032/107] Move analyzer into request body --- 130_Partial_Matching/35_Search_as_you_type.asciidoc | 6 +++--- snippets/130_Partial_Matching/35_Search_as_you_type.json | 3 ++- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/130_Partial_Matching/35_Search_as_you_type.asciidoc b/130_Partial_Matching/35_Search_as_you_type.asciidoc index 96485ebc5..027cfb460 100644 --- a/130_Partial_Matching/35_Search_as_you_type.asciidoc +++ b/130_Partial_Matching/35_Search_as_you_type.asciidoc @@ -92,9 +92,10 @@ the `analyze` API: [source,js] -------------------------------------------------- -GET /my_index/_analyze?analyzer=autocomplete +GET /my_index/_analyze { - "text": "quick brown" + "analyzer": "autocomplete", + "text": "quick brown" } -------------------------------------------------- // SENSE: 130_Partial_Matching/35_Search_as_you_type.json @@ -358,4 +359,3 @@ This example uses the `keyword` tokenizer to convert the postcode string into a to turn postcodes into edge n-grams. <2> The `postcode_search` analyzer would treat search terms as if they were `not_analyzed`. - diff --git a/snippets/130_Partial_Matching/35_Search_as_you_type.json b/snippets/130_Partial_Matching/35_Search_as_you_type.json index 6a39e14ed..8af6e7f06 100644 --- a/snippets/130_Partial_Matching/35_Search_as_you_type.json +++ b/snippets/130_Partial_Matching/35_Search_as_you_type.json @@ -31,8 +31,9 @@ PUT /my_index } # Test the autocomplete analyzer -GET /my_index/_analyze?analyzer=autocomplete +GET /my_index/_analyze { + "analyzer": "autocomplete", "text": "quick brown" } From 2ad9bda249be7e3495db828c45e570b8df867a07 Mon Sep 17 00:00:00 2001 From: Sumit Gupta Date: Fri, 3 Jun 2016 23:25:10 +0530 Subject: [PATCH 033/107] Update 62_Geo_distance_agg.asciidoc (#546) Query Breaking due to comma in gwo_bounding_box lat value. I added point instead of comma in geo_bounding_box lat value. 
--- 330_Geo_aggs/62_Geo_distance_agg.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/330_Geo_aggs/62_Geo_distance_agg.asciidoc b/330_Geo_aggs/62_Geo_distance_agg.asciidoc index c9e838ef9..f5c50ea29 100644 --- a/330_Geo_aggs/62_Geo_distance_agg.asciidoc +++ b/330_Geo_aggs/62_Geo_distance_agg.asciidoc @@ -21,7 +21,7 @@ GET /attractions/restaurant/_search "geo_bounding_box": { "location": { <2> "top_left": { - "lat": 40,8, + "lat": 40.8, "lon": -74.1 }, "bottom_right": { From 815434a32a07aa0bdc08e08d69ba83cc8e3b7c06 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 13:59:32 -0400 Subject: [PATCH 034/107] Use new body specification for analyze API Related to #509 --- .../30_Controlling_analysis.asciidoc | 14 ++++++++++---- snippets/100_Full_Text_Search/30_Analysis.json | 12 ++++++++++-- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/100_Full_Text_Search/30_Controlling_analysis.asciidoc b/100_Full_Text_Search/30_Controlling_analysis.asciidoc index d5a2091d1..fffd6bb93 100644 --- a/100_Full_Text_Search/30_Controlling_analysis.asciidoc +++ b/100_Full_Text_Search/30_Controlling_analysis.asciidoc @@ -34,11 +34,17 @@ analyzed at index time by using the `analyze` API to analyze the word `Foxes`: [source,js] -------------------------------------------------- -GET /my_index/_analyze?field=my_type.title <1> -Foxes +GET /my_index/_analyze +{ + "field": "my_type.title", <1> + "text": "Foxes" +} -GET /my_index/_analyze?field=my_type.english_title <2> -Foxes +GET /my_index/_analyze +{ + "field": "my_type.english_title", <2> + "text": "Foxes" +} -------------------------------------------------- // SENSE: 100_Full_Text_Search/30_Analysis.json diff --git a/snippets/100_Full_Text_Search/30_Analysis.json b/snippets/100_Full_Text_Search/30_Analysis.json index 76e316e2f..2a692c217 100644 --- a/snippets/100_Full_Text_Search/30_Analysis.json +++ b/snippets/100_Full_Text_Search/30_Analysis.json @@ -22,10 +22,18 @@ PUT /my_index } # Test the analysis of the `title` field -GET /my_index/_analyze?field=my_type.title&text=Foxes +GET /my_index/_analyze +{ + "field": "my_type.title", <1> + "text": "Foxes" +} # Test the analysis of the `english_title` field -GET /my_index/_analyze?field=my_type.english_title&text=Foxes +GET /my_index/_analyze +{ + "field": "my_type.english_title", <2> + "text": "Foxes" +} # Get query explanation for `title` vs `english_title` GET /my_index/my_type/_validate/query?explain From 775f935928f8a17bfb9765eccac5b8a2ce4fd280 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 14:01:49 -0400 Subject: [PATCH 035/107] Remove unnecessary braces Closes #440 --- 080_Structured_Search/30_existsmissing.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/080_Structured_Search/30_existsmissing.asciidoc b/080_Structured_Search/30_existsmissing.asciidoc index 9a9a1a8da..f3ad31570 100644 --- a/080_Structured_Search/30_existsmissing.asciidoc +++ b/080_Structured_Search/30_existsmissing.asciidoc @@ -247,8 +247,8 @@ is really executed as { "bool": { "should": [ - { "exists": { "field": { "name.first" }}}, - { "exists": { "field": { "name.last" }}} + { "exists": { "field": "name.first" }}, + { "exists": { "field": "name.last" }} ] } } From d953fb56bb03fc4af215f04351fd06666ad1819e Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 14:03:42 -0400 Subject: [PATCH 036/107] Remove stray comma Closes #438 --- 100_Full_Text_Search/15_Combining_queries.asciidoc | 3 +-- 1 file changed, 1 insertion(+), 2 
deletions(-) diff --git a/100_Full_Text_Search/15_Combining_queries.asciidoc b/100_Full_Text_Search/15_Combining_queries.asciidoc index 20f8b2fc4..ee02fadba 100644 --- a/100_Full_Text_Search/15_Combining_queries.asciidoc +++ b/100_Full_Text_Search/15_Combining_queries.asciidoc @@ -1,7 +1,7 @@ [[bool-query]] === Combining Queries -In <> we discussed how to((("full text search", "combining queries"))), use the `bool` filter to combine +In <> we discussed how to((("full text search", "combining queries"))) use the `bool` filter to combine multiple filter clauses with `and`, `or`, and `not` logic. In query land, the `bool` query does a similar job but with one important difference. @@ -107,4 +107,3 @@ The results would include only documents whose `title` field contains `"brown" AND "fox"`, `"brown" AND "dog"`, or `"fox" AND "dog"`. If a document contains all three, it would be considered more relevant than those that contain just two of the three. - From 6f3772843b8e2b986a55f287fde45528bd2a4bea Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 14:17:41 -0400 Subject: [PATCH 037/107] "found" not "exists" Closes #361 --- 030_Data/15_Get.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/030_Data/15_Get.asciidoc b/030_Data/15_Get.asciidoc index 26b7ded68..3258046e9 100644 --- a/030_Data/15_Get.asciidoc +++ b/030_Data/15_Get.asciidoc @@ -93,7 +93,7 @@ filtered out the `date` field: "_type" : "blog", "_id" : "123", "_version" : 1, - "exists" : true, + "found" : true, "_source" : { "title": "My first blog entry" , "text": "Just trying this out..." From 19a5bfc481fd3188afd4ceb724f452b091fe10ab Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 3 Jun 2016 14:22:25 -0400 Subject: [PATCH 038/107] Add semi-colon Closes #329 --- 230_Stemming/00_Intro.asciidoc | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/230_Stemming/00_Intro.asciidoc b/230_Stemming/00_Intro.asciidoc index c4f06931f..78be00e07 100644 --- a/230_Stemming/00_Intro.asciidoc +++ b/230_Stemming/00_Intro.asciidoc @@ -36,7 +36,7 @@ and overstemming. _Understemming_ is the failure to reduce words with the same meaning to the same root. For example, `jumped` and `jumps` may be reduced to `jump`, while -`jumping` may be reduced to `jumpi`. Understemming reduces retrieval +`jumping` may be reduced to `jumpi`. Understemming reduces retrieval; relevant documents are not returned. _Overstemming_ is the failure to keep two words with distinct meanings separate. @@ -69,6 +69,3 @@ First we will discuss the two classes of stemmers available in Elasticsearch choose the right stemmer for your needs in <>. Finally, we will discuss options for tailoring stemming in <> and <>. - - - From 78448552d4768cef5ba0d07414de686c83bf3818 Mon Sep 17 00:00:00 2001 From: Olim Saidov Date: Fri, 3 Jun 2016 23:33:12 +0500 Subject: [PATCH 039/107] Fixed type in Empty Fields example (#311) From bcb0a63d5e7b4e661e390a50c5c5f082960da0be Mon Sep 17 00:00:00 2001 From: jasiustasiu Date: Fri, 3 Jun 2016 20:36:42 +0200 Subject: [PATCH 040/107] SQL example for finding distinct counts corrected (#306) --- 300_Aggregations/60_cardinality.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/300_Aggregations/60_cardinality.asciidoc b/300_Aggregations/60_cardinality.asciidoc index e92aef310..0ec11c9b1 100644 --- a/300_Aggregations/60_cardinality.asciidoc +++ b/300_Aggregations/60_cardinality.asciidoc @@ -7,7 +7,7 @@ _unique_ count. 
((("unique counts"))) You may be familiar with the SQL version: [source, sql] -------- -SELECT DISTINCT(color) +SELECT COUNT(DISTINCT color) FROM cars -------- From f5d833778f2266c9122209ab41c8d961cc64a992 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Thu, 30 Jun 2016 11:44:08 +0200 Subject: [PATCH 041/107] Use custom page_header.html --- page_header.html | 1 + 1 file changed, 1 insertion(+) create mode 100644 page_header.html diff --git a/page_header.html b/page_header.html new file mode 100644 index 000000000..10f6e4b59 --- /dev/null +++ b/page_header.html @@ -0,0 +1 @@ +PLEASE NOTE:
We are working on updating this book for the latest version. Some content might be out of date. \ No newline at end of file From 5af1d06a4111918a524a70bc28310aeeca1716b6 Mon Sep 17 00:00:00 2001 From: Sohrab Date: Tue, 19 Jul 2016 02:18:20 +1000 Subject: [PATCH 042/107] Add missing comma (#570) --- 052_Mapping_Analysis/45_Mapping.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/052_Mapping_Analysis/45_Mapping.asciidoc b/052_Mapping_Analysis/45_Mapping.asciidoc index 8408ce78b..81c781f4a 100644 --- a/052_Mapping_Analysis/45_Mapping.asciidoc +++ b/052_Mapping_Analysis/45_Mapping.asciidoc @@ -280,7 +280,7 @@ name. Compare the output of these two requests: -------------------------------------------------- GET /gb/_analyze { - "field": "tweet" + "field": "tweet", "text": "Black-cats" <1> } From f9bcb5d1c259cb23c8b41f8947abd27733c42946 Mon Sep 17 00:00:00 2001 From: Igor Dubinskiy Date: Mon, 18 Jul 2016 09:19:05 -0700 Subject: [PATCH 043/107] Fix typo (#567) --- 120_Proximity_Matching/35_Shingles.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/120_Proximity_Matching/35_Shingles.asciidoc b/120_Proximity_Matching/35_Shingles.asciidoc index f533b604a..e15ccb8d4 100644 --- a/120_Proximity_Matching/35_Shingles.asciidoc +++ b/120_Proximity_Matching/35_Shingles.asciidoc @@ -12,7 +12,7 @@ you can't tell whether _Sue ate_ or the _alligator ate_. When words are used in conjunction with each other, they express an idea that is bigger or more meaningful than each word in isolation. The two clauses -_I'm not happy I'm working_ and _I'm happy I'm not working_ contain the sames words, in +_I'm not happy I'm working_ and _I'm happy I'm not working_ contain the same words, in close proximity, but have quite different meanings. If, instead of indexing each word independently, we were to index pairs of From 89519cb98bcc227b315198ace200037cee8228cb Mon Sep 17 00:00:00 2001 From: Peter Dyson Date: Tue, 19 Jul 2016 02:19:31 +1000 Subject: [PATCH 044/107] Updates to java recommendations (#566) --- 510_Deployment/30_other.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/510_Deployment/30_other.asciidoc b/510_Deployment/30_other.asciidoc index 5e0ab281e..dda8ae758 100644 --- a/510_Deployment/30_other.asciidoc +++ b/510_Deployment/30_other.asciidoc @@ -8,7 +8,7 @@ tests from Lucene often expose bugs in the JVM itself. These bugs range from mild annoyances to serious segfaults, so it is best to use the latest version of the JVM where possible. -Java 7 is strongly preferred over Java 6. Either Oracle or OpenJDK are acceptable. They are comparable in performance and stability. +Java 8 is preferred over Java 7. Java 6 is no longer supported. Either Oracle or OpenJDK are acceptable. They are comparable in performance and stability. If your application is written in Java and you are using the transport client or node client, make sure the JVM running your application is identical to the From d4b06c56ea57de61bbd483c974a040ffa82ef203 Mon Sep 17 00:00:00 2001 From: Jakob Reiter Date: Mon, 18 Jul 2016 18:34:04 +0200 Subject: [PATCH 045/107] Changed "field": "employee.hobby" to "field": "hobby" (#558) Changed "field": "employee.hobby" to "field": "hobby", otherwise the hobbies are not returned. 
--- 404_Parent_Child/60_Children_agg.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/404_Parent_Child/60_Children_agg.asciidoc b/404_Parent_Child/60_Children_agg.asciidoc index 1d9accc59..6af80f0ec 100644 --- a/404_Parent_Child/60_Children_agg.asciidoc +++ b/404_Parent_Child/60_Children_agg.asciidoc @@ -27,7 +27,7 @@ GET /company/branch/_search "aggs": { "hobby": { "terms": { <3> - "field": "employee.hobby" + "field": "hobby" } } } From 1d14d2b21a62d5492fcd3fc1d8c85120d005dd2f Mon Sep 17 00:00:00 2001 From: Shubham Aggarwal Date: Mon, 18 Jul 2016 22:27:14 +0530 Subject: [PATCH 046/107] Fix missing space (#552) --- 054_Query_DSL/70_Important_clauses.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/054_Query_DSL/70_Important_clauses.asciidoc b/054_Query_DSL/70_Important_clauses.asciidoc index 0cf6ea486..efa7def97 100644 --- a/054_Query_DSL/70_Important_clauses.asciidoc +++ b/054_Query_DSL/70_Important_clauses.asciidoc @@ -145,7 +145,7 @@ for exact matches (including differences in case, accents, spaces, etc). The `exists` and `missing` queries are ((("exists query")))((("missing query")))used to find documents in which the specified field either has one or more values (`exists`) or doesn't have any values (`missing`). It is similar in nature to `IS_NULL` (`missing`) and `NOT -IS_NULL` (`exists`)in SQL: +IS_NULL` (`exists`) in SQL: [source,js] -------------------------------------------------- From f8a87b01857c84b7f248eec2c9f97c90a1f4a779 Mon Sep 17 00:00:00 2001 From: Adrien Grand Date: Mon, 25 Jul 2016 21:24:31 +0200 Subject: [PATCH 047/107] Fix description of the `timeout` parameter. (#574) Closes #536 --- .../15_Search_options.asciidoc | 28 ++++++++----------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/060_Distributed_Search/15_Search_options.asciidoc b/060_Distributed_Search/15_Search_options.asciidoc index af8237a20..2d9e4ec37 100644 --- a/060_Distributed_Search/15_Search_options.asciidoc +++ b/060_Distributed_Search/15_Search_options.asciidoc @@ -33,34 +33,30 @@ like the user's session ID. ==== timeout -By default, the coordinating node waits((("search options", "timeout"))) to receive a response from all shards. +By default, shards process all the data they have before returning a response to +the coordinating node, which will in turn merge these responses to build the +final response. + +This means that the time it takes to run a search request is the sum of the time +it takes to process the slowest shard and the time it takes to merge responses. If one node is having trouble, it could slow down the response to all search requests. -The `timeout` parameter tells((("timeout parameter"))) the coordinating node how long it should wait -before giving up and just returning the results that it already has. It can be -better to return some results than none at all. +The `timeout` parameter tells((("timeout parameter"))) shards how long they +are allowed to process data before returning a response to the coordinating +node. If there was not enough time to process all data, results for this shard +will be partial, even possibly empty. -The response to a search request will indicate whether the search timed out and -how many shards responded successfully: +The response to a search request will indicate whether any shards returned a +partial response with the `timed_out` property: [source,js] -------------------------------------------------- ... 
"timed_out": true, <1> - "_shards": { - "total": 5, - "successful": 4, - "failed": 1 <2> - }, ... -------------------------------------------------- <1> The search request timed out. -<2> One shard out of five failed to respond in time. - -If all copies of a shard fail for other reasons--perhaps because of a -hardware failure--this will also be reflected in the `_shards` section of -the response. [[search-routing]] ==== routing From 37c514a6f7a22aec8ce50a65ff3a2a8296035cac Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Mon, 25 Jul 2016 15:33:48 -0400 Subject: [PATCH 048/107] Add brief warning about timeout best-effort --- .../15_Search_options.asciidoc | 20 +++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/060_Distributed_Search/15_Search_options.asciidoc b/060_Distributed_Search/15_Search_options.asciidoc index 2d9e4ec37..bd2a4d8b9 100644 --- a/060_Distributed_Search/15_Search_options.asciidoc +++ b/060_Distributed_Search/15_Search_options.asciidoc @@ -18,7 +18,7 @@ the _bouncing results_ problem.((("bouncing results problem"))) .Bouncing Results **** -Imagine that you are sorting your results by a `timestamp` field, and +Imagine that you are sorting your results by a `timestamp` field, and two documents have the same timestamp. Because search requests are round-robined between all available shard copies, these two documents may be returned in one order when the request is served by the primary, and in @@ -58,6 +58,22 @@ partial response with the `timed_out` property: -------------------------------------------------- <1> The search request timed out. +[WARNING] +==== +It's important to know that the timeout is still a best-effort operation; it's +possible for the query to surpass the allotted timeout. There are two reasons for +this behavior: + +1. Timeout checks are performed on a per-document basis. However, some query types +have a significant amount of work that must be performed *before* documents are evaluated. +This "setup" phase does not consult the timeout, and so very long setup times can cause +the overall latency to shoot past the timeout. +2. Because the time is once per document, a very long query can execute on a single +document and it won't timeout until the next document is evaluated. This also means +poorly written scripts (e.g. ones with infinite loops) will be allowed to execute +forever. +==== + [[search-routing]] ==== routing @@ -79,7 +95,7 @@ discuss it in detail in <>. ==== search_type The default search type is `query_then_fetch` ((("query_then_fetch search type")))((("search options", "search_type")))((("search_type"))). 
In some cases, you might want to explicitly set the `search_type` -to `dfs_query_then_fetch` to improve the accuracy of relevance scoring: +to `dfs_query_then_fetch` to improve the accuracy of relevance scoring: [source,js] -------------------------------------------------- From 8d02545060a16fac518ea1589d9256a834081b8c Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Thu, 18 Aug 2016 12:46:01 +0200 Subject: [PATCH 049/107] Update 60_file_descriptors.asciidoc max_file_descriptors is now found in node stats, not nodes info --- 510_Deployment/60_file_descriptors.asciidoc | 43 ++++++++++++--------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/510_Deployment/60_file_descriptors.asciidoc b/510_Deployment/60_file_descriptors.asciidoc index 51b2a7c6f..41a675086 100644 --- a/510_Deployment/60_file_descriptors.asciidoc +++ b/510_Deployment/60_file_descriptors.asciidoc @@ -19,27 +19,32 @@ have enough file descriptors: [source,js] ---- -GET /_nodes/process - { - "cluster_name": "elasticsearch__zach", - "nodes": { - "TGn9iO2_QQKb0kavcLbnDw": { - "name": "Zach", - "transport_address": "inet[/192.168.1.131:9300]", - "host": "zacharys-air", - "ip": "192.168.1.131", - "version": "2.0.0-SNAPSHOT", - "build": "612f461", - "http_address": "inet[/192.168.1.131:9200]", - "process": { - "refresh_interval_in_millis": 1000, - "id": 19808, - "max_file_descriptors": 64000, <1> - "mlockall": true - } + "cluster_name": "elasticsearch", + "nodes": { + "nLd81iLsRcqmah-cuHAbaQ": { + "timestamp": 1471516160318, + "name": "Marsha Rosenberg", + "transport_address": "127.0.0.1:9300", + "host": "127.0.0.1", + "ip": [ + "127.0.0.1:9300", + "NONE" + ], + "process": { + "timestamp": 1471516160318, + "open_file_descriptors": 155, + "max_file_descriptors": 10240, <1> + "cpu": { + "percent": 0, + "total_in_millis": 25084 + }, + "mem": { + "total_virtual_in_bytes": 5221900288 + } } - } + } + } } ---- <1> The `max_file_descriptors` field shows the number of available descriptors that From f1aa6bcf16aeb6ca847346fd626da65c1f3c8ca7 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Fri, 2 Sep 2016 14:08:33 +0200 Subject: [PATCH 050/107] Removed "PLEASE NOTE" from the page header --- page_header.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/page_header.html b/page_header.html index 10f6e4b59..e9ec2bc89 100644 --- a/page_header.html +++ b/page_header.html @@ -1 +1 @@ -PLEASE NOTE:
We are working on updating this book for the latest version. Some content might be out of date. \ No newline at end of file +We are working on updating this book for the latest version. Some content might be out of date. \ No newline at end of file From 28f3b93ed394b77a5ca9f8d4dc75441d9eb85ff2 Mon Sep 17 00:00:00 2001 From: debadair Date: Tue, 13 Sep 2016 17:33:18 -0700 Subject: [PATCH 051/107] Removed link to geohash_cell query in the reference, and added a note that it has been removed. --- 320_Geohashes/40_Geohashes.asciidoc | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/320_Geohashes/40_Geohashes.asciidoc b/320_Geohashes/40_Geohashes.asciidoc index e756ab090..93b76bc40 100644 --- a/320_Geohashes/40_Geohashes.asciidoc +++ b/320_Geohashes/40_Geohashes.asciidoc @@ -1,6 +1,9 @@ [[geohashes]] == Geohashes +NOTE: 5.0 introduces a `LatLonPoint` type and support for `geohash_cell` queries +has been removed. + http://en.wikipedia.org/wiki/Geohash[Geohashes] are a way of encoding `lat/lon` points as strings.((("geohashes")))((("latitude/longitude pairs", "encoding lat/lon points as strings with geohashes")))((("strings", "geohash"))) The original intention was to have a URL-friendly way of specifying geolocations, but geohashes have turned out to @@ -47,6 +50,6 @@ along with the approximate dimensions of each geohash cell: |gcpuuz94kkp5 |12 | ~ 3.7cm x 1.8cm |============================================= -The {ref}/query-dsl-geohash-cell-query.html[`geohash_cell` filter] can use -these geohash prefixes((("geohash_cell filter")))((("filters", "geohash_cell"))) to find locations near a specified `lat/lon` point. +The `geohash_cell` filter can use these geohash prefixes((("geohash_cell filter"))) +((("filters", "geohash_cell"))) to find locations near a specified `lat/lon` point. From 65a9fd0654c61ccb3443b18a2cb21a46a6a5be7b Mon Sep 17 00:00:00 2001 From: Austin Chu Date: Thu, 6 Oct 2016 11:47:22 -0400 Subject: [PATCH 052/107] Fix minor typo in Marvel for Monitoring page (#612) --- 500_Cluster_Admin/15_marvel.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/500_Cluster_Admin/15_marvel.asciidoc b/500_Cluster_Admin/15_marvel.asciidoc index bca6bfab0..399f41f64 100644 --- a/500_Cluster_Admin/15_marvel.asciidoc +++ b/500_Cluster_Admin/15_marvel.asciidoc @@ -15,7 +15,7 @@ behavior over time, which makes it easy to spot trends. As your cluster grows, the output from the stats APIs can get truly hairy. Once you have a dozen nodes, let alone a hundred, reading through stacks of JSON -becomes very tedious. Marvel lets your explore the data interactively and +becomes very tedious. Marvel lets you explore the data interactively and makes it easy to zero in on what's going on with particular nodes or indices. Marvel uses the same stats APIs that are available to you--it does not expose From daec1112ea7a49d8747c11d0c3af38c45bc3d0a4 Mon Sep 17 00:00:00 2001 From: Austin Chu Date: Thu, 6 Oct 2016 11:47:49 -0400 Subject: [PATCH 053/107] Add missing "you" in Most Important Queries page (#611) In Getting Started >> Full-Body Search. --- 054_Query_DSL/70_Important_clauses.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/054_Query_DSL/70_Important_clauses.asciidoc b/054_Query_DSL/70_Important_clauses.asciidoc index efa7def97..e0ac93639 100644 --- a/054_Query_DSL/70_Important_clauses.asciidoc +++ b/054_Query_DSL/70_Important_clauses.asciidoc @@ -60,7 +60,7 @@ it is not prone to throwing syntax errors. 
==== multi_match Query -The `multi_match` query allows((("multi_match queries"))) to run the same `match` query on multiple +The `multi_match` query allows((("multi_match queries"))) you to run the same `match` query on multiple fields: [source,js] From f68aba2a2322d5941d3fa3143d12c08ec0c3a7c5 Mon Sep 17 00:00:00 2001 From: Austin Chu Date: Thu, 6 Oct 2016 11:48:33 -0400 Subject: [PATCH 054/107] Update text describing how to call the analyze API (#610) Update the body text to reflect the changes made in 89dc347778ce679fab2183c9ebe4ebbf77cab55c. --- 052_Mapping_Analysis/40_Analysis.asciidoc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/052_Mapping_Analysis/40_Analysis.asciidoc b/052_Mapping_Analysis/40_Analysis.asciidoc index 3244139bf..43cb20b04 100644 --- a/052_Mapping_Analysis/40_Analysis.asciidoc +++ b/052_Mapping_Analysis/40_Analysis.asciidoc @@ -154,8 +154,7 @@ GET /_search?q=date:2014 # 0 results ! Especially when you are new ((("analyzers", "testing")))to Elasticsearch, it is sometimes difficult to understand what is actually being tokenized and stored into your index. To better understand what is going on, you can use the `analyze` API to see how -text is analyzed. Specify which analyzer to use in the query-string -parameters, and the text to analyze in the body: +text is analyzed: [source,js] -------------------------------------------------- From 939665d3d07ee5240504ac127b1e15f66954e18d Mon Sep 17 00:00:00 2001 From: Hsu Chen-Wei Date: Thu, 6 Oct 2016 10:58:47 -0500 Subject: [PATCH 055/107] Fix text is missing (#593) Fix text is missing error --- 070_Index_Mgmt/20_Custom_Analyzers.asciidoc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc index e930833c2..5e9d7e486 100644 --- a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc +++ b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc @@ -171,7 +171,9 @@ After creating the index, use the `analyze` API to((("analyzers", "testing using [source,js] -------------------------------------------------- GET /my_index/_analyze?analyzer=my_analyzer -The quick & brown fox +{ + "text": "The quick & brown fox" +} -------------------------------------------------- // SENSE: 070_Index_Mgmt/20_Custom_analyzer.json From 293f2a9aed19aca6d6b401b03f30e767424a6777 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Thu, 6 Oct 2016 12:02:13 -0400 Subject: [PATCH 056/107] Remove analyzer in URL, update snippet --- 070_Index_Mgmt/20_Custom_Analyzers.asciidoc | 5 +++-- snippets/070_Index_Mgmt/20_Custom_analyzer.json | 6 +++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc index 5e9d7e486..bf7c6ee11 100644 --- a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc +++ b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc @@ -170,9 +170,10 @@ After creating the index, use the `analyze` API to((("analyzers", "testing using [source,js] -------------------------------------------------- -GET /my_index/_analyze?analyzer=my_analyzer +GET /my_index/_analyze { - "text": "The quick & brown fox" + "text": "The quick & brown fox", + "analyzer": "my_analyzer" } -------------------------------------------------- // SENSE: 070_Index_Mgmt/20_Custom_analyzer.json diff --git a/snippets/070_Index_Mgmt/20_Custom_analyzer.json b/snippets/070_Index_Mgmt/20_Custom_analyzer.json index 2f11e6ba6..04202c1d3 100644 --- a/snippets/070_Index_Mgmt/20_Custom_analyzer.json +++ 
b/snippets/070_Index_Mgmt/20_Custom_analyzer.json @@ -42,7 +42,11 @@ PUT /my_index } # Test out the new analyzer -GET /my_index/_analyze?analyzer=my_analyzer&text=The quick %26 brown fox +GET /my_index/_analyze +{ + "text": "The quick & brown fox", + "analyzer": "my_analyzer" +} # Apply "my_analyzer" to the `title` field PUT /my_index/_mapping/my_type From 4f6b6a8b54a129fbca665aa11b50b756d888f813 Mon Sep 17 00:00:00 2001 From: Hsu Chen-Wei Date: Thu, 6 Oct 2016 11:04:00 -0500 Subject: [PATCH 057/107] Fix text is missing (#592) Fix text is missing error --- 070_Index_Mgmt/15_Configure_Analyzer.asciidoc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc index eb6617435..51293db15 100644 --- a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc +++ b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc @@ -54,7 +54,9 @@ specify the index name: [source,js] -------------------------------------------------- GET /spanish_docs/_analyze?analyzer=es_std -El veloz zorro marrón +{ + "text":"El veloz zorro marrón" +} -------------------------------------------------- // SENSE: 070_Index_Mgmt/15_Configure_Analyzer.json From 09d72654c5fc519a26bc6593985ab9ebc2b40dd1 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Thu, 6 Oct 2016 12:05:04 -0400 Subject: [PATCH 058/107] Remove analyzer in URL, update snippet --- 070_Index_Mgmt/15_Configure_Analyzer.asciidoc | 4 ++-- snippets/070_Index_Mgmt/15_Configure_Analyzer.json | 7 +++++-- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc index 51293db15..ad89d0b20 100644 --- a/070_Index_Mgmt/15_Configure_Analyzer.asciidoc +++ b/070_Index_Mgmt/15_Configure_Analyzer.asciidoc @@ -53,8 +53,9 @@ specify the index name: [source,js] -------------------------------------------------- -GET /spanish_docs/_analyze?analyzer=es_std +GET /spanish_docs/_analyze { + "analyzer": "es_std", "text":"El veloz zorro marrón" } -------------------------------------------------- @@ -73,4 +74,3 @@ removed correctly: ] } -------------------------------------------------- - diff --git a/snippets/070_Index_Mgmt/15_Configure_Analyzer.json b/snippets/070_Index_Mgmt/15_Configure_Analyzer.json index 40aa2b996..6af3cd3c1 100644 --- a/snippets/070_Index_Mgmt/15_Configure_Analyzer.json +++ b/snippets/070_Index_Mgmt/15_Configure_Analyzer.json @@ -17,5 +17,8 @@ PUT /spanish_docs } # Test out the new analyzer -GET /spanish_docs/_analyze?analyzer=es_std&text=El veloz zorro marrón - +GET /spanish_docs/_analyze +{ + "analyzer": "es_std", + "text":"El veloz zorro marrón" +} From 03c3ba420b7d25e8442051e386fd6fe80c3bc3ea Mon Sep 17 00:00:00 2001 From: kingrhoton Date: Thu, 6 Oct 2016 09:16:12 -0700 Subject: [PATCH 059/107] make text consistent with example (#575) --- 030_Data/15_Get.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/030_Data/15_Get.asciidoc b/030_Data/15_Get.asciidoc index 3258046e9..eee8d283f 100644 --- a/030_Data/15_Get.asciidoc +++ b/030_Data/15_Get.asciidoc @@ -72,7 +72,7 @@ Content-Length: 83 ==== Retrieving Part of a Document By default, a `GET` request((("documents", "retrieving part of"))) will return the whole document, as stored in the -`_source` field. But perhaps all you are interested in is the `title` field. +`_source` field. But perhaps all you are interested in are the `title` and `text` fields. 
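A sketch of requesting just those two fields with the `_source` parameter described next (the index, type, and ID are placeholders):

[source,js]
----
GET /website/blog/123?_source=title,text
----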
Individual fields can be ((("fields", "returning individual document fields")))((("_source field", sortas="source field")))requested by using the `_source` parameter. Multiple fields can be specified in a comma-separated list: From 294895b4a932b20489a8f82bc70a55a32a68a9b8 Mon Sep 17 00:00:00 2001 From: Oliver Veits Date: Thu, 24 Nov 2016 11:27:19 +0100 Subject: [PATCH 060/107] Append '.keyword' to field Same as in https://github.com/elastic/elasticsearch/pull/17942/files: it seems like you always need to append '.keyword' to the field for aggregation... I am new to elasticsearch. For me, the error message I got without .keyword, namely ``` "reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [color] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory." ``` seems to be misleading. At least, the proposed workaround to follow https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#_enabling_fielddata_on_literal_text_literal_fields did not work for me. I tried ``` curl -XPUT 'localhost:9200/cars/transactions/color?pretty' -d' { "properties": { "my_field": { "type": "text", "fielddata": false } } }' ``` I was not sure on the my_type, so I tried with ```PUT 'localhost:9200/cars/transactions/popular_colors``` as well. --- 300_Aggregations/20_basic_example.asciidoc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/300_Aggregations/20_basic_example.asciidoc b/300_Aggregations/20_basic_example.asciidoc index 2c3281000..a8759a384 100644 --- a/300_Aggregations/20_basic_example.asciidoc +++ b/300_Aggregations/20_basic_example.asciidoc @@ -52,7 +52,7 @@ GET /cars/transactions/_search "aggs" : { <1> "popular_colors" : { <2> "terms" : { <3> - "field" : "color" + "field" : "color.keyword" } } } @@ -96,10 +96,12 @@ Let's execute that aggregation and take a look at the results: { ... "hits": { + ... "hits": [] <1> }, "aggregations": { "popular_colors": { <2> + ... "buckets": [ { "key": "red", <3> From e98d0881fcb21a69c9ddd9a0427b1919892510a7 Mon Sep 17 00:00:00 2001 From: Glen Smith Date: Mon, 2 Jan 2017 18:08:42 +0100 Subject: [PATCH 061/107] Add synced flush resolves https://github.com/elastic/elasticsearch-definitive-guide/issues/408 --- 520_Post_Deployment/40_rolling_restart.asciidoc | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/520_Post_Deployment/40_rolling_restart.asciidoc b/520_Post_Deployment/40_rolling_restart.asciidoc index 1aa93dc4f..77076b0b0 100644 --- a/520_Post_Deployment/40_rolling_restart.asciidoc +++ b/520_Post_Deployment/40_rolling_restart.asciidoc @@ -20,9 +20,14 @@ What we want to do is tell Elasticsearch to hold off on rebalancing, because we have more knowledge about the state of the cluster due to external factors. The procedure is as follows: -1. If possible, stop indexing new data. This is not always possible, but will +1. If possible, stop indexing new data and perform a synced flush. This is not always possible, but will help speed up recovery time. - +A synced flush request is a “best effort” operation. It will fail if there are any pending indexing operations, but it is safe to reissue the request multiple times if necessary. ++ +[source,js] +---- +POST /_flush/synced +---- 2. Disable shard allocation. This prevents Elasticsearch from rebalancing missing shards until you tell it otherwise. If you know the maintenance window will be short, this is a good idea. 
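A sketch of the setting change that the following sentence introduces, using the cluster settings API with the `transient` scope (setting the value back to `all` afterwards re-enables allocation):

[source,js]
----
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
----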
You can disable allocation as follows: From 3dfb077831d00319aabbc6cc56ff008295bfea2f Mon Sep 17 00:00:00 2001 From: orhiee Date: Wed, 4 Jan 2017 18:29:04 +0000 Subject: [PATCH 062/107] update to fix issues in documentation update the documentation links --- 010_Intro/10_Installing_ES.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/010_Intro/10_Installing_ES.asciidoc b/010_Intro/10_Installing_ES.asciidoc index bcc69e3b5..0bb101cab 100644 --- a/010_Intro/10_Installing_ES.asciidoc +++ b/010_Intro/10_Installing_ES.asciidoc @@ -100,7 +100,7 @@ Elasticsearch cluster. + [source,sh] -------------------------------------------------- -./bin/kibana plugin --install elastic/sense <1> +./bin/kibana-plugin install elastic/sense <1> -------------------------------------------------- <1> Windows: `bin\kibana.bat plugin --install elastic/sense`. + From 912e887c6fc2822a67005a079ba946a58b1901e4 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 11 Apr 2017 09:39:15 +0200 Subject: [PATCH 063/107] Ignore the .idea directory --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index 5a114f0b9..1c363e9f3 100644 --- a/.gitignore +++ b/.gitignore @@ -6,3 +6,5 @@ book.html .settings .DS_Store + +.idea \ No newline at end of file From 12af5e9a36bc5e531781e3cb9e8615ad72e447c7 Mon Sep 17 00:00:00 2001 From: Oliver Veits Date: Tue, 11 Apr 2017 10:20:05 +0200 Subject: [PATCH 064/107] Append .keyword to field in add metric aggs example Relates elastic/elasticsearch#17188 --- 300_Aggregations/21_add_metric.asciidoc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/300_Aggregations/21_add_metric.asciidoc b/300_Aggregations/21_add_metric.asciidoc index 44b80127a..cfc1395f3 100644 --- a/300_Aggregations/21_add_metric.asciidoc +++ b/300_Aggregations/21_add_metric.asciidoc @@ -20,7 +20,7 @@ GET /cars/transactions/_search "aggs": { "colors": { "terms": { - "field": "color" + "field": "color.keyword" }, "aggs": { <1> "avg_price": { <2> @@ -53,6 +53,7 @@ and what field we want the average to be calculated on (`price`): ... "aggregations": { "colors": { + ... "buckets": [ { "key": "red", From c72cabf5160fb45b2f23d43064b2e1cc168741a2 Mon Sep 17 00:00:00 2001 From: dibbdob Date: Tue, 11 Apr 2017 11:53:26 +0100 Subject: [PATCH 065/107] Complicated -> complex in aggregations tutorial 'Complicated' suggests something is difficult to understand - contradicting the rest of the sentence. In this case, 'Complex' seems more appropriate. 
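The `.keyword` sub-field used in the aggregation fixes above comes from the default dynamic mapping in 5.x, which indexes new string fields both as analyzed `text` and as a `keyword` sub-field. An explicit version of that mapping, sketched with the `cars`/`transactions` names from the example being patched, would look roughly like this:

[source,js]
----
PUT /cars
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
----

Aggregating on `color.keyword` then runs on doc values rather than requiring fielddata to be enabled on the analyzed `text` field.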
--- 010_Intro/35_Tutorial_Aggregations.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/010_Intro/35_Tutorial_Aggregations.asciidoc b/010_Intro/35_Tutorial_Aggregations.asciidoc index 47429874c..4890b231f 100644 --- a/010_Intro/35_Tutorial_Aggregations.asciidoc +++ b/010_Intro/35_Tutorial_Aggregations.asciidoc @@ -114,7 +114,7 @@ GET /megacorp/employee/_search -------------------------------------------------- // SENSE: 010_Intro/35_Aggregations.json -The aggregations that we get back are a bit more complicated, but still fairly +The aggregations that we get back are a bit more complex, but still fairly easy to understand: [source,js] From 7424556788deab6a55933bae1bb16731873ad9c1 Mon Sep 17 00:00:00 2001 From: Rob Moore Date: Tue, 11 Apr 2017 12:02:50 +0100 Subject: [PATCH 066/107] Use possessive its --- 080_Structured_Search/40_bitsets.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/080_Structured_Search/40_bitsets.asciidoc b/080_Structured_Search/40_bitsets.asciidoc index 38a690bac..a193e4d2e 100644 --- a/080_Structured_Search/40_bitsets.asciidoc +++ b/080_Structured_Search/40_bitsets.asciidoc @@ -24,7 +24,7 @@ search requests. It is not dependent on the "context" of the surrounding query. This allows caching to accelerate the most frequently used portions of your queries, without wasting overhead on the less frequent / more volatile portions. -Similarly, if a single search request reuses the same non-scoring query, it's +Similarly, if a single search request reuses the same non-scoring query, its cached bitset can be reused for all instances inside the single search request. Let's look at this example query, which looks for emails that are either of the following: From 478b25a14df3a8a08359cb3e880ffc922ae0a70a Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 11 Apr 2017 13:50:24 +0200 Subject: [PATCH 067/107] Complete sentence about size parameter in 'Aggregation Test-Drive' --- 300_Aggregations/20_basic_example.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/300_Aggregations/20_basic_example.asciidoc b/300_Aggregations/20_basic_example.asciidoc index a8759a384..e9fc8c2e3 100644 --- a/300_Aggregations/20_basic_example.asciidoc +++ b/300_Aggregations/20_basic_example.asciidoc @@ -120,7 +120,7 @@ Let's execute that aggregation and take a look at the results: } } -------------------------------------------------- -<1> No search hits are returned because we set the `size` parameter +<1> No search hits are returned because we set the `size` parameter to zero. <2> Our `popular_colors` aggregation is returned as part of the `aggregations` field. <3> The `key` to each bucket corresponds to a unique term found in the `color` field. It also always includes `doc_count`, which tells us the number of docs containing the term. From ada9278dd53ce8e8b16aef0cdfe72b7ef187c2db Mon Sep 17 00:00:00 2001 From: Catherine Snow Date: Tue, 11 Apr 2017 08:27:40 -0400 Subject: [PATCH 068/107] Update text for grammar (#662) It's not that I'm an anti-descriptivist, but we have many ways to express this sentiment while actually begging the question really only has the one. 
--- 300_Aggregations/95_analyzed_vs_not.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/300_Aggregations/95_analyzed_vs_not.asciidoc b/300_Aggregations/95_analyzed_vs_not.asciidoc index 088a15d5b..9278b6faf 100644 --- a/300_Aggregations/95_analyzed_vs_not.asciidoc +++ b/300_Aggregations/95_analyzed_vs_not.asciidoc @@ -3,7 +3,7 @@ === Aggregations and Analysis Some aggregations, such as the `terms` bucket, operate((("analysis", "aggregations and")))((("aggregations", "and analysis"))) on string fields. And -string fields may be either `analyzed` or `not_analyzed`, which begs the question: +string fields may be either `analyzed` or `not_analyzed`, which raises the question: how does analysis affect aggregations?((("strings", "analyzed or not_analyzed string fields")))((("not_analyzed fields")))((("analyzed fields"))) The answer is "a lot," for two reasons: analysis affects the tokens used in the aggregation, From cbaed4c2ab34a38a4d1de452a0e7de5405e0e02a Mon Sep 17 00:00:00 2001 From: Edgar Post Date: Tue, 11 Apr 2017 14:35:15 +0200 Subject: [PATCH 069/107] Correct wrong use of possessive 'its' (#646) --- 080_Structured_Search/05_term.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/080_Structured_Search/05_term.asciidoc b/080_Structured_Search/05_term.asciidoc index b65350536..0ed13335d 100644 --- a/080_Structured_Search/05_term.asciidoc +++ b/080_Structured_Search/05_term.asciidoc @@ -302,7 +302,7 @@ bitset is iterated on first (since it excludes the largest number of documents). 4. _Increment the usage counter_. + -Elasticsearch can cache non-scoring queries for faster access, but its silly to +Elasticsearch can cache non-scoring queries for faster access, but it's silly to cache something that is used only rarely. Non-scoring queries are already quite fast due to the inverted index, so we only want to cache queries we _know_ will be used again in the future to prevent resource wastage. From 55d9a0024db15142e49ea357ade877fe3f80cdc5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Paulius=20Aleksi=C5=ABnas?= Date: Tue, 11 Apr 2017 15:38:32 +0300 Subject: [PATCH 070/107] Fix typo (#630) --- 520_Post_Deployment/50_backup.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/520_Post_Deployment/50_backup.asciidoc b/520_Post_Deployment/50_backup.asciidoc index 7f7d63f69..b7b92efb7 100644 --- a/520_Post_Deployment/50_backup.asciidoc +++ b/520_Post_Deployment/50_backup.asciidoc @@ -137,7 +137,7 @@ Once you start accumulating snapshots in your repository, you may forget the det relating to each--particularly when the snapshots are named based on time demarcations (for example, `backup_2014_10_28`). 
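The request that the corrected sentence below describes is a plain GET on the repository and snapshot name; a sketch, with `my_backup` standing in for the repository:

[source,js]
----
GET /_snapshot/my_backup/backup_2014_10_28
----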
-To obtain information about a single snapshot, simply issue a `GET` reguest against +To obtain information about a single snapshot, simply issue a `GET` request against the repo and snapshot name: [source,js] From 93f14acb4035721a395e0056e3f8ea9622b4a20e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Paulius=20Aleksi=C5=ABnas?= Date: Tue, 11 Apr 2017 15:41:36 +0300 Subject: [PATCH 071/107] Fix typo (#628) --- 320_Geohashes/40_Geohashes.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/320_Geohashes/40_Geohashes.asciidoc b/320_Geohashes/40_Geohashes.asciidoc index 93b76bc40..0c154e184 100644 --- a/320_Geohashes/40_Geohashes.asciidoc +++ b/320_Geohashes/40_Geohashes.asciidoc @@ -10,7 +10,7 @@ URL-friendly way of specifying geolocations, but geohashes have turned out to be a useful way of indexing geo-points and geo-shapes in databases. Geohashes divide the world into a grid of 32 cells--4 rows and 8 columns--each represented by a letter or number. The `g` cell covers half of -Greenland, all of Iceland, and most of Great Britian. Each cell can be further +Greenland, all of Iceland, and most of Great Britain. Each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on. The `gc` cell covers Ireland and England, `gcp` covers most of London and part of Southern England, and `gcpuuz94k` is the entrance to From 2e1085dee491e52ba5a6528bff26448794ea45a1 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 11 Apr 2017 15:09:13 +0200 Subject: [PATCH 072/107] Remove duplicate double-colon Closes #357 --- 270_Fuzzy_matching/20_Fuzziness.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/270_Fuzzy_matching/20_Fuzziness.asciidoc b/270_Fuzzy_matching/20_Fuzziness.asciidoc index 4a6048493..5a7051bfe 100644 --- a/270_Fuzzy_matching/20_Fuzziness.asciidoc +++ b/270_Fuzzy_matching/20_Fuzziness.asciidoc @@ -13,7 +13,7 @@ one word into the other. He proposed three types of one-character edits: * _Insertion_ of a new character: sic -> sic_k_ -* _Deletion_ of a character:: b_l_ack -> back +* _Deletion_ of a character: b_l_ack -> back http://en.wikipedia.org/wiki/Frederick_J._Damerau[Frederick Damerau] later expanded these operations ((("Damerau, Frederick J.")))to include one more: From 7fad216f895b3e69dd8312bf401b634b7ba2c4ce Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 11 Apr 2017 16:04:52 +0200 Subject: [PATCH 073/107] Add initial readme and contribution guide --- CONTRIBUTING.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 41 +++++++++++++++++++++++++++++ 2 files changed, 109 insertions(+) create mode 100644 CONTRIBUTING.md create mode 100644 README.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 000000000..b494673e5 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,68 @@ +## Contributing to the Definitive Guide + +### Contributing documentation changes + +If you have a change that you would like to contribute, please find or open an +issue about it first. Talk about what you would like to do. It may be that +somebody is already working on it, or that there are particular issues that +you should know about before doing the change. + +The process for contributing to any of the [Elastic repositories](https://github.com/elastic/) +is similar. Details can be found below. + +### Fork and clone the repository + +You will need to fork the main repository and clone it to your local machine. 
+See the respective [Github help page](https://help.github.com/articles/fork-a-repo) +for help. + +### Submitting your changes + +Once your changes and tests are ready to submit for review: + +1. Test your changes + + [Build the complete book locally](https://github.com/elastic/elasticsearch-definitive-guide) + and check and correct any errors that you encounter. + +2. Sign the Contributor License Agreement + + Please make sure you have signed our [Contributor License Agreement](https://www.elastic.co/contributor-agreement/). + We are not asking you to assign copyright to us, but to give us the right + to distribute your code without restriction. We ask this of all + contributors in order to assure our users of the origin and continuing + existence of the code. You only need to sign the CLA once. + +3. Rebase your changes + + Update your local repository with the most recent code from the main + repository, and rebase your branch on top of the latest `master` branch. + We prefer your initial changes to be squashed into a single commit. Later, + if we ask you to make changes, add them as separate commits. This makes + them easier to review. As a final step before merging we will either ask + you to squash all commits yourself or we'll do it for you. + + +4. Submit a pull request + + Push your local changes to your forked copy of the repository and + [submit a pull request](https://help.github.com/articles/using-pull-requests). + In the pull request, choose a title which sums up the changes that you + have made, and in the body provide more details about what your changes do. + Also mention the number of the issue where discussion has taken place, + e.g. "Closes #123". + +Then sit back and wait. There will probably be discussion about the pull +request and, if any changes are needed, we would love to work with you to get +your pull request merged. + +Please adhere to the general guideline that you should never force push +to a publicly shared branch. Once you have opened your pull request, you +should consider your branch publicly shared. Instead of force pushing +you can just add incremental commits; this is generally easier on your +reviewers. If you need to pick up changes from master, you can merge +master into your branch. A reviewer might ask you to rebase a +long-running pull request in which case force pushing is okay for that +request. Note that squashing at the end of the review process should +also not be done, that can be done when the pull request is [integrated +via GitHub](https://github.com/blog/2141-squash-your-commits). \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 000000000..ca4944704 --- /dev/null +++ b/README.md @@ -0,0 +1,41 @@ +# The Definitive Guide to Elasticsearch + +This repository contains the sources to the "Definitive Guide to Elasticsearch" which you can [read online](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html). + +## Building the Definitive Guide + +In order to build this project, we rely on our [docs infrastructure](https://github.com/elastic/docs). + +To build the HTML of the complete project, run the following commands: + +``` +# clone this repo +git clone git@github.com:elastic/elasticsearch-definitive-guide.git +# clone the docs build infrastructure +git clone git@github.com:elastic/docs.git +# Build HTML and open a browser +cd elasticsearch-definitive-guide +../docs/build_docs.pl --doc book.asciidoc --open +``` + +This assumes that you have all necessary prerequisites installed. 
For a more complete reference, please see refer to the [README in the docs repo](https://github.com/elastic/docs). + +The Definitive Guide is written in Asciidoc and the docs repo also contains a [short Asciidoc guide](https://github.com/elastic/docs#asciidoc-guide). + +## Supported versions + +The Definitive Guide is available for multiple versions of Elasticsearch: + +* The [branch `1.x`](https://github.com/elastic/elasticsearch-definitive-guide/tree/1.x) applies to Elasticsearch 1.x +* The [branch `2.x`](https://github.com/elastic/elasticsearch-definitive-guide/tree/2.x) applies to Elasticsearch 2.x +* The [branch `master`](https://github.com/elastic/elasticsearch-definitive-guide/tree/2.x) applies to master branch of Elasticsearch (the current development version) + +## Contributing + +Before contributing a change please read our [contribution guide](CONTRIBUTING.md). + +## License + +This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. + +See http://creativecommons.org/licenses/by-nc-nd/3.0/ for the full text of the License. \ No newline at end of file From bea41bd3f72f120d8c2c92af4ccff1c845a55068 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 11 Apr 2017 16:14:44 +0200 Subject: [PATCH 074/107] Correct links and small typo in README/CONTRIBUTING --- CONTRIBUTING.md | 2 +- README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index b494673e5..827c03b97 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -22,7 +22,7 @@ Once your changes and tests are ready to submit for review: 1. Test your changes - [Build the complete book locally](https://github.com/elastic/elasticsearch-definitive-guide) + [Build the complete book locally](https://github.com/elastic/elasticsearch-definitive-guide#building-the-definitive-guide) and check and correct any errors that you encounter. 2. Sign the Contributor License Agreement diff --git a/README.md b/README.md index ca4944704..0880c5dae 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ cd elasticsearch-definitive-guide ../docs/build_docs.pl --doc book.asciidoc --open ``` -This assumes that you have all necessary prerequisites installed. For a more complete reference, please see refer to the [README in the docs repo](https://github.com/elastic/docs). +This assumes that you have all necessary prerequisites installed. For a more complete reference, please refer to the [README in the docs repo](https://github.com/elastic/docs). The Definitive Guide is written in Asciidoc and the docs repo also contains a [short Asciidoc guide](https://github.com/elastic/docs#asciidoc-guide). From c513749bed732a8e7be9c7a910e2cd3684e77187 Mon Sep 17 00:00:00 2001 From: debadair Date: Tue, 11 Apr 2017 08:44:46 -0700 Subject: [PATCH 075/107] Added line length & code indent style notes to the contributor's guide. --- CONTRIBUTING.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 827c03b97..1682f9f3a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -3,9 +3,12 @@ ### Contributing documentation changes If you have a change that you would like to contribute, please find or open an -issue about it first. Talk about what you would like to do. It may be that +issue about it first. Talk about what you would like to do. It might be that somebody is already working on it, or that there are particular issues that -you should know about before doing the change. 
+you should know about before making the change. + +Where possible, stick to an 80 character line length in the asciidoc source +files. Do not exceed 120 characters. Use 2 space indents in code examples. The process for contributing to any of the [Elastic repositories](https://github.com/elastic/) is similar. Details can be found below. From e2067323029c3807269df26cbad5c21cc5ca815f Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Mon, 17 Apr 2017 23:18:17 -0700 Subject: [PATCH 076/107] Don't explain 1.x count request parameter (#670) --- 300_Aggregations/20_basic_example.asciidoc | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/300_Aggregations/20_basic_example.asciidoc b/300_Aggregations/20_basic_example.asciidoc index e9fc8c2e3..4314ef6da 100644 --- a/300_Aggregations/20_basic_example.asciidoc +++ b/300_Aggregations/20_basic_example.asciidoc @@ -74,9 +74,7 @@ in <<_scoping_aggregations>>. ========================= You'll notice that we set the `size` to zero. We don't care about the search results themselves and -returning zero hits speeds up the query. Setting -`size: 0` is the equivalent of using the `count` -search type in Elasticsearch 1.x. +returning zero hits speeds up the query. ========================= Next we define a name for our aggregation. Naming is up to you; From 891f78cfb7891b67ad8cca56387685a1e7556777 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Mon, 17 Apr 2017 23:19:31 -0700 Subject: [PATCH 077/107] Use keyword subfield for aggregations in intro chapter (#668) With this commit we correct use `keyword` subfields in all aggregation-related examples in the intro chapter. They are needed since Elasticsearch 5.0 as text-fields do not have fielddata enabled. Relates elastic/elasticsearch#17188 --- 010_Intro/35_Tutorial_Aggregations.asciidoc | 17 +++++++++++------ snippets/010_Intro/35_Aggregations.json | 8 +++++--- 2 files changed, 16 insertions(+), 9 deletions(-) diff --git a/010_Intro/35_Tutorial_Aggregations.asciidoc b/010_Intro/35_Tutorial_Aggregations.asciidoc index 4890b231f..8932f4729 100644 --- a/010_Intro/35_Tutorial_Aggregations.asciidoc +++ b/010_Intro/35_Tutorial_Aggregations.asciidoc @@ -13,7 +13,7 @@ GET /megacorp/employee/_search { "aggs": { "all_interests": { - "terms": { "field": "interests" } + "terms": { "field": "interests.keyword" } } } } @@ -29,17 +29,18 @@ Ignore the syntax for now and just look at the results: "hits": { ... }, "aggregations": { "all_interests": { + ... "buckets": [ { - "key": "music", + "key": "music", "doc_count": 2 }, { - "key": "forestry", + "key": "forestry", "doc_count": 1 }, { - "key": "sports", + "key": "sports", "doc_count": 1 } ] @@ -66,7 +67,7 @@ GET /megacorp/employee/_search "aggs": { "all_interests": { "terms": { - "field": "interests" + "field": "interests.keyword" } } } @@ -80,6 +81,7 @@ The `all_interests` aggregation has changed to include only documents matching o -------------------------------------------------- ... "all_interests": { + ... "buckets": [ { "key": "music", @@ -102,7 +104,9 @@ GET /megacorp/employee/_search { "aggs" : { "all_interests" : { - "terms" : { "field" : "interests" }, + "terms" : { + "field" : "interests.keyword" + }, "aggs" : { "avg_age" : { "avg" : { "field" : "age" } @@ -121,6 +125,7 @@ easy to understand: -------------------------------------------------- ... "all_interests": { + ... 
"buckets": [ { "key": "music", diff --git a/snippets/010_Intro/35_Aggregations.json b/snippets/010_Intro/35_Aggregations.json index d4bd9e62d..46ed211a8 100644 --- a/snippets/010_Intro/35_Aggregations.json +++ b/snippets/010_Intro/35_Aggregations.json @@ -35,7 +35,7 @@ GET /megacorp/employee/_search "aggs": { "all_interests": { "terms": { - "field": "interests" + "field": "interests.keyword" } } } @@ -53,7 +53,7 @@ GET /megacorp/employee/_search "aggs": { "all_interests": { "terms": { - "field": "interests" + "field": "interests.keyword" } } } @@ -64,7 +64,9 @@ GET /megacorp/employee/_search { "aggs" : { "all_interests" : { - "terms" : { "field" : "interests" }, + "terms" : { + "field" : "interests.keyword" + }, "aggs" : { "avg_age" : { "avg" : { "field" : "age" } From aa4f759ac23e30595726d9ba821994fc5df82765 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Fri, 21 Apr 2017 15:49:02 +0200 Subject: [PATCH 078/107] Add script to generate taxi aggregation sample (WIP) --- scripts/300_Aggregations/generate.py | 165 +++++++++++++++++++++++++ scripts/300_Aggregations/import.py | 49 ++++++++ scripts/300_Aggregations/mappings.json | 46 +++++++ 3 files changed, 260 insertions(+) create mode 100755 scripts/300_Aggregations/generate.py create mode 100755 scripts/300_Aggregations/import.py create mode 100644 scripts/300_Aggregations/mappings.json diff --git a/scripts/300_Aggregations/generate.py b/scripts/300_Aggregations/generate.py new file mode 100755 index 000000000..0866415ac --- /dev/null +++ b/scripts/300_Aggregations/generate.py @@ -0,0 +1,165 @@ +#!/usr/bin/env python3 + +import json +import sys +import random + +vendors = [ + "Yellow", + "Green", + "Blue", + "Red", + "Black" +] + +all_zones = [ + "Castro District", + "Chinatown", + "Cole Valley", + "Financial District", + "Fisherman's Wharf", + "Haight-Ashbury", + "Hayes Valley", + "Japantown", + "Lower Haight", + "Marina", + "Mission District", + "Nob Hill", + "Noe Valley", + "North Beach", + "Pacific Heights", + "Panhandle", + "Potrero Hill", + "Presidio", + "Richmond", + "Russian Hill", + "Sea Cliff", + "Sixth Street", + "SOMA", + "Sunset", + "Tenderloin", + "Union Square", + "Upper Market" +] + +zones = [ + "Chinatown", + "Financial District", + "Haight-Ashbury", + "Presidio", + "Sunset" +] + +minutes_per_mile = 5 + +distances_in_miles = { + "Chinatown": { + "Chinatown": 0, + "Financial District": 1, + "Haight-Ashbury": 4, + "Presidio": 4, + "Sunset": 9 + }, + "Financial District": { + "Chinatown": 1, + "Financial District": 0, + "Haight-Ashbury": 4, + "Presidio": 4, + "Sunset": 7 + }, + "Haight-Ashbury": { + "Chinatown": 4, + "Financial District": 4, + "Haight-Ashbury": 0, + "Presidio": 3, + "Sunset": 4 + }, + "Presidio": { + "Chinatown": 4, + "Financial District": 4, + "Haight-Ashbury": 3, + "Presidio": 0, + "Sunset": 5 + }, + "Sunset": { + "Chinatown": 9, + "Financial District": 7, + "Haight-Ashbury": 4, + "Presidio": 5, + "Sunset": 0 + } +} + + +def payment_type(): + v = random.uniform(0, 10) + if v < 7: + return "Credit card" + elif v < 9.5: + return "Cash" + else: + return "No charge" + + +def vendor(): + return random.choice(vendors) + + +def passengers(): + return min(6, max(1, round(random.lognormvariate(mu=0, sigma=1)))) + + +def distance(start, end): + base = distances_in_miles[start][end] + return base + 0.2 * random.randint(0, max(base, 1)) + + +def fare(trip_distance): + # loosely based on https://www.sfmta.com/getting-around/taxi/taxi-rates + # assume a random waiting time up to 10% of the distance + 
waiting_time_factor = 0.55 * 0.1 * random.randint(0, round(trip_distance)) + units = max(0, round(trip_distance / 0.125) - 1) + return 3.5 + units * 0.55 + + +def tip(fare_amount): + # up to 20% tip + return 0.2 * random.randint(0, round(fare_amount)) + +def round_f(v): + return float("{0:.2f}".format(v)) + + +def main(): + if len(sys.argv) != 2: + print("usage: %s number_of_records_to_generate" % sys.argv[0]) + exit(1) + + num_records = int(sys.argv[1]) + for i in range(num_records): + record = {} + record["vendor"] = vendor() + # TODO: Find a simple but somewhat realistic model for daily / weekly patterns + # record["pickup_datetime"] = pickup_datetime + # record["dropoff_datetime"] = dropoff_datetime + record["passenger_count"] = passengers() + + start = random.choice(zones) + end = random.choice(zones) + trip_distance = distance(start, end) + + record["pickup_zone"] = start + record["dropoff_zone"] = end + record["payment_type"] = payment_type() + record["trip_distance"] = round_f(trip_distance) + fare_amount = round_f(fare(trip_distance)) + tip_amount = round_f(tip(fare_amount)) + record["fare_amount"] = fare_amount + record["tip_amount"] = tip_amount + record["total_amount"] = round_f(fare_amount + tip_amount) + + print(json.dumps(record)) + + +if __name__ == '__main__': + main() diff --git a/scripts/300_Aggregations/import.py b/scripts/300_Aggregations/import.py new file mode 100755 index 000000000..13d0bb8c1 --- /dev/null +++ b/scripts/300_Aggregations/import.py @@ -0,0 +1,49 @@ +#!/usr/bin/env python3 + +import elasticsearch +import elasticsearch.helpers +import json +import logging +import sys +import itertools + +logger = logging.getLogger("import") + +index_name = "taxis" +type_name = "rides" + + +def create_index(client, mapping_file): + if client.indices.exists(index=index_name): + logger.info("Index [%s] already exists. Deleting it." 
% index_name) + client.indices.delete(index=index_name) + logger.info("Creating index [%s]" % index_name) + client.indices.create(index=index_name, body='{"index.number_of_replicas": 0}') + with open(mapping_file, "rt") as f: + mappings = f.read() + client.indices.put_mapping(index=index_name, + doc_type=type_name, + body=json.loads(mappings)) + + +def import_data(client, data_file): + meta_data = '{"_op_type": "index", "_index": "%s", "_type": "%s"}' % (index_name, type_name) + with open(data_file, "rt") as f: + elasticsearch.helpers.bulk(client, f, index=index_name, doc_type=type_name) + + +def main(): + if len(sys.argv) != 3: + print("usage %s mapping_file_path data_file_path" % sys.argv[0]) + exit(1) + + es = elasticsearch.Elasticsearch() + mapping_file = sys.argv[1] + data_file = sys.argv[2] + + create_index(es, mapping_file) + import_data(es, data_file) + + +if __name__ == '__main__': + main() diff --git a/scripts/300_Aggregations/mappings.json b/scripts/300_Aggregations/mappings.json new file mode 100644 index 000000000..72604e691 --- /dev/null +++ b/scripts/300_Aggregations/mappings.json @@ -0,0 +1,46 @@ +{ + "rides": { + "properties": { + "vendor": { + "type": "keyword" + }, + "pickup_datetime": { + "type": "date", + "format": "yyyy-MM-dd HH:mm:ss" + }, + "dropoff_datetime": { + "type": "date", + "format": "yyyy-MM-dd HH:mm:ss" + }, + "passenger_count": { + "type": "integer" + }, + "trip_distance": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "pickup_zone": { + "type": "keyword" + }, + "dropoff_zone": { + "type": "keyword" + }, + "payment_type": { + "type": "keyword" + }, + "fare_amount": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "tip_amount": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "total_amount": { + "scaling_factor": 100, + "type": "scaled_float" + } + }, + "dynamic": "strict" + } +} From 5a8ba3188f72d4a66158bf47d8430ae9ba6286d4 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Fri, 21 Apr 2017 15:53:12 +0200 Subject: [PATCH 079/107] Add a simple README that explains how to use the aggregation scripts --- scripts/300_Aggregations/README.md | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 scripts/300_Aggregations/README.md diff --git a/scripts/300_Aggregations/README.md b/scripts/300_Aggregations/README.md new file mode 100644 index 000000000..bd339c868 --- /dev/null +++ b/scripts/300_Aggregations/README.md @@ -0,0 +1,3 @@ +This directory contains two scripts that can be used to generate a taxi example data set. They require Python 3 and for `import.py` you must also have the Elasticsearch Python client installed (`pip3 install elasticsearch`). + +Run `./generate.py 100 > documents.json` to generate 100 random taxi rides. You can import them into a local Elasticsearch cluster (5.x or 6.0) by running `./import.py mappings.json documents.json`. From dc1a89b5e27644971c764f57a028d4c311a8b207 Mon Sep 17 00:00:00 2001 From: Mike Baamonde Date: Tue, 25 Apr 2017 10:46:29 -0400 Subject: [PATCH 080/107] Formatting fixes for the Cluster Administration chapter. This commit removes glossary indexing annotations and extraneous whitespace. It also enforces 80-character line length. 
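As an illustration of what the taxi sample data introduced a few patches above is meant to support, an aggregation over the `taxis` index might look like the following sketch (the field names come from `mappings.json`; the aggregation names are arbitrary):

[source,js]
----
GET /taxis/rides/_search
{
  "size": 0,
  "aggs": {
    "per_vendor": {
      "terms": { "field": "vendor" },
      "aggs": {
        "avg_total": {
          "avg": { "field": "total_amount" }
        }
      }
    }
  }
}
----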
--- 500_Cluster_Admin/10_intro.asciidoc | 19 +- 500_Cluster_Admin/20_health.asciidoc | 74 ++-- 500_Cluster_Admin/30_node_stats.asciidoc | 391 +++++++++--------- 500_Cluster_Admin/40_other_stats.asciidoc | 300 +++++++------- 510_Deployment/10_intro.asciidoc | 13 +- 510_Deployment/20_hardware.asciidoc | 132 +++--- 510_Deployment/30_other.asciidoc | 94 ++--- 510_Deployment/40_config.asciidoc | 226 +++++----- 510_Deployment/45_dont_touch.asciidoc | 96 +++-- 510_Deployment/50_heap.asciidoc | 191 ++++----- 510_Deployment/60_file_descriptors.asciidoc | 39 +- 510_Deployment/70_conclusion.asciidoc | 20 +- .../10_dynamic_settings.asciidoc | 20 +- 520_Post_Deployment/20_logging.asciidoc | 35 +- 520_Post_Deployment/30_indexing_perf.asciidoc | 165 ++++---- .../35_delayed_shard_allocation.asciidoc | 95 ++--- .../40_rolling_restart.asciidoc | 46 +-- 520_Post_Deployment/50_backup.asciidoc | 137 +++--- 520_Post_Deployment/60_restore.asciidoc | 57 ++- 520_Post_Deployment/70_conclusion.asciidoc | 33 +- 20 files changed, 1106 insertions(+), 1077 deletions(-) diff --git a/500_Cluster_Admin/10_intro.asciidoc b/500_Cluster_Admin/10_intro.asciidoc index e9517685d..22a1d1cca 100644 --- a/500_Cluster_Admin/10_intro.asciidoc +++ b/500_Cluster_Admin/10_intro.asciidoc @@ -1,15 +1,16 @@ -Elasticsearch is often deployed as a cluster of nodes.((("clusters", "administration"))) A variety of -APIs let you manage and monitor the cluster itself, rather than interact -with the data stored within the cluster. +Elasticsearch is often deployed as a cluster of nodes. A variety of APIs let you +manage and monitor the cluster itself, rather than interact with the data stored +within the cluster. As with most functionality in Elasticsearch, there is an overarching design goal that tasks should be performed through an API rather than by modifying static -configuration files. This becomes especially important as your cluster scales. -Even with a provisioning system (such as Puppet, Chef, and Ansible), a single HTTP API call -is often simpler than pushing new configurations to hundreds of physical machines. +configuration files. This becomes especially important as your cluster scales. +Even with a provisioning system (such as Puppet, Chef, and Ansible), a single +HTTP API call is often simpler than pushing new configurations to hundreds of +physical machines. To that end, this chapter presents the various APIs that allow you to -dynamically tweak, tune, and configure your cluster. It also covers a -host of APIs that provide statistics about the cluster itself so you can -monitor for health and performance. +dynamically tweak, tune, and configure your cluster. It also covers a host of +APIs that provide statistics about the cluster itself so you can monitor for +health and performance. diff --git a/500_Cluster_Admin/20_health.asciidoc b/500_Cluster_Admin/20_health.asciidoc index 2b5e636b6..50be427da 100644 --- a/500_Cluster_Admin/20_health.asciidoc +++ b/500_Cluster_Admin/20_health.asciidoc @@ -1,13 +1,14 @@ === Cluster Health -An Elasticsearch cluster may consist of a single node with a single index. Or it((("cluster health")))((("clusters", "administration", "Cluster Health API"))) -may have a hundred data nodes, three dedicated masters, a few dozen client nodes--all operating on a thousand indices (and tens of thousands of shards). +An Elasticsearch cluster may consist of a single node with a single index. 
Or it +may have a hundred data nodes, three dedicated masters, a few dozen client +nodes--all operating on a thousand indices (and tens of thousands of shards). No matter the scale of the cluster, you'll want a quick way to assess the status -of your cluster. The `Cluster Health` API fills that role. You can think of it -as a 10,000-foot view of your cluster. It can reassure you that everything -is all right, or alert you to a problem somewhere in your cluster. +of your cluster. The `Cluster Health` API fills that role. You can think of it +as a 10,000-foot view of your cluster. It can reassure you that everything is +all right, or alert you to a problem somewhere in your cluster. Let's execute a `cluster-health` API and see what the response looks like: @@ -45,7 +46,7 @@ operational. `yellow`:: All primary shards are allocated, but at least one replica is missing. -No data is missing, so search results will still be complete. However, your +No data is missing, so search results will still be complete. However, your high availability is compromised to some degree. If _more_ shards disappear, you might lose data. Think of `yellow` as a warning that should prompt investigation. @@ -66,10 +67,10 @@ includes replica shards. one node to another node. This number is often zero, but can increase when Elasticsearch decides a cluster is not properly balanced, a new node is added, or a node is taken down, for example. -- `initializing_shards` is a count of shards that are being freshly created. For +- `initializing_shards` is a count of shards that are being freshly created. For example, when you first create an index, the shards will all briefly reside in `initializing` state. This is typically a transient event, and shards shouldn't -linger in `initializing` too long. You may also see initializing shards when a +linger in `initializing` too long. You may also see initializing shards when a node is first restarted: as shards are loaded from disk, they start as `initializing`. - `unassigned_shards` are shards that exist in the cluster state, but cannot be found in the cluster itself. A common source of unassigned shards are unassigned @@ -79,7 +80,7 @@ cluster is `red` (since primaries are missing). ==== Drilling Deeper: Finding Problematic Indices -Imagine something goes wrong one day,((("indices", "problematic, finding"))) and you notice that your cluster health +Imagine something goes wrong one day, and you notice that your cluster health looks like this: [source,js] @@ -98,15 +99,15 @@ looks like this: } ---- -OK, so what can we deduce from this health status? Well, our cluster is `red`, -which means we are missing data (primary + replicas). We know our cluster has -10 nodes, but see only 8 data nodes listed in the health. Two of our nodes -have gone missing. We see that there are 20 unassigned shards. +OK, so what can we deduce from this health status? Well, our cluster is `red`, +which means we are missing data (primary + replicas). We know our cluster has 10 +nodes, but see only 8 data nodes listed in the health. Two of our nodes have +gone missing. We see that there are 20 unassigned shards. That's about all the information we can glean. The nature of those missing shards are still a mystery. Are we missing 20 indices with 1 primary shard each? Or 1 index with 20 primary shards? Or 10 indices with 1 primary + 1 replica? -Which index? +Which index? 
To answer these questions, we need to ask `cluster-health` for a little more information by using the `level` parameter: @@ -183,40 +184,43 @@ The `level` parameter accepts one more option: GET _cluster/health?level=shards ---- -The `shards` option will provide a very verbose output, which lists the status +The `shards` option will provide a very verbose output, which lists the status and location of every shard inside every index. This output is sometimes useful, but because of the verbosity can be difficult to work with. Once you know the index -that is having problems, other APIs that we discuss in this chapter will tend +that is having problems, other APIs that we discuss in this chapter will tend to be more helpful. ==== Blocking for Status Changes The `cluster-health` API has another neat trick that is useful when building unit and integration tests, or automated scripts that work with Elasticsearch. -You can specify a `wait_for_status` parameter, which will only return after the status is satisfied. For example: +You can specify a `wait_for_status` parameter, which will only return after the +status is satisfied. For example: [source,bash] ---- GET _cluster/health?wait_for_status=green ---- -This call will _block_ (not return control to your program) until the `cluster-health` has turned `green`, meaning all primary and replica shards have been allocated. -This is important for automated scripts and tests. +This call will _block_ (not return control to your program) until the +`cluster-health` has turned `green`, meaning all primary and replica shards have +been allocated. This is important for automated scripts and tests. If you create an index, Elasticsearch must broadcast the change in cluster state -to all nodes. Those nodes must initialize those new shards, and then respond to the -master that the shards are `Started`. This process is fast, but because of network -latency may take 10–20ms. - -If you have an automated script that (a) creates an index and then (b) immediately -attempts to index a document, this operation may fail, because the index has not -been fully initialized yet. The time between (a) and (b) will likely be less than 1ms--not nearly enough time to account for network latency. - -Rather than sleeping, just have your script/test call `cluster-health` with -a `wait_for_status` parameter. As soon as the index is fully created, the `cluster-health` will change to `green`, the call will return control to your script, and you may -begin indexing. - -Valid options are `green`, `yellow`, and `red`. The call will return when the -requested status (or one "higher") is reached. For example, if you request `yellow`, -a status change to `yellow` or `green` will unblock the call. - +to all nodes. Those nodes must initialize those new shards, and then respond to +the master that the shards are `Started`. This process is fast, but because of +network latency may take 10–20ms. + +If you have an automated script that (a) creates an index and then (b) +immediately attempts to index a document, this operation may fail, because the +index has not been fully initialized yet. The time between (a) and (b) will +likely be less than 1ms--not nearly enough time to account for network latency. + +Rather than sleeping, just have your script/test call `cluster-health` with a +`wait_for_status` parameter. As soon as the index is fully created, the +`cluster-health` will change to `green`, the call will return control to your +script, and you may begin indexing. 
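
For example, a minimal shell sketch of this pattern (assuming a node reachable
on `localhost:9200` and a hypothetical index named `my_index`) might look like
the following:

[source,bash]
----
# Create the index; the cluster state change must now propagate to all nodes.
curl -XPUT 'localhost:9200/my_index'

# Block until the requested health is reached, or give up after 30 seconds.
# On a single-node development box, `yellow` is the realistic target, since
# replica shards can never be assigned on the same node as their primaries.
curl 'localhost:9200/_cluster/health?wait_for_status=yellow&timeout=30s'

# Only now send the first document.
curl -XPUT 'localhost:9200/my_index/test/1' -d '{ "title": "hello" }'
----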
+ +Valid options are `green`, `yellow`, and `red`. The call will return when the +requested status (or one "higher") is reached. For example, if you request +`yellow`, a status change to `yellow` or `green` will unblock the call. diff --git a/500_Cluster_Admin/30_node_stats.asciidoc b/500_Cluster_Admin/30_node_stats.asciidoc index a77745c8e..250a14e64 100644 --- a/500_Cluster_Admin/30_node_stats.asciidoc +++ b/500_Cluster_Admin/30_node_stats.asciidoc @@ -2,14 +2,14 @@ === Monitoring Individual Nodes `Cluster-health` is at one end of the spectrum--a very high-level overview of -everything in your cluster. ((("clusters", "administration", "monitoring individual nodes")))((("nodes", "monitoring individual nodes"))) The `node-stats` API is at the other end. ((("Node Stats API", id="ix_NodeStats", range="startofrange"))) It provides -a bewildering array of statistics about each node in your cluster. +everything in your cluster. The `node-stats` API is at the other end. It +provides a bewildering array of statistics about each node in your cluster. -`Node-stats` provides so many stats that, until you are accustomed to the output, -you may be unsure which metrics are most important to keep an eye on. We'll -highlight the most important metrics to monitor (but we encourage you to -log all the metrics provided--or use Marvel--because you'll never know when -you need one stat or another). +`Node-stats` provides so many stats that, until you are accustomed to the +output, you may be unsure which metrics are most important to keep an eye on. +We'll highlight the most important metrics to monitor (but we encourage you to +log all the metrics provided--or use Marvel--because you'll never know when you +need one stat or another.) The `node-stats` API can be executed with the following: @@ -37,15 +37,15 @@ Starting at the top of the output, we see the cluster name and our first node: ... ---- -The nodes are listed in a hash, with the key being the UUID of the node. Some -information about the node's network properties are displayed (such as transport address, -and host). These values are useful for debugging discovery problems, where -nodes won't join the cluster. Often you'll see that the port being used is wrong, -or the node is binding to the wrong IP address/interface. +The nodes are listed in a hash, with the key being the UUID of the node. Some +information about the node's network properties are displayed (such as transport +address, and host). These values are useful for debugging discovery problems, +where nodes won't join the cluster. Often you'll see that the port being used is +wrong, or the node is binding to the wrong IP address/interface. ==== indices Section -The `indices` section lists aggregate statistics((("indices", "indices section in Node Stats API"))) for all the indices that reside +The `indices` section lists aggregate statistics for all the indices that reside on this particular node: [source,js] @@ -63,13 +63,12 @@ on this particular node: The returned statistics are grouped into the following sections: -- `docs` shows how many documents reside on -this node, as well as the number of deleted docs that haven't been purged -from segments yet. +- `docs` shows how many documents reside on this node, as well as the number of +deleted docs that haven't been purged from segments yet. -- The `store` portion indicates how much physical storage is consumed by the node. -This metric includes both primary and replica shards. 
If the throttle time is -large, it may be an indicator that your disk throttling is set too low +- The `store` portion indicates how much physical storage is consumed by the +node. This metric includes both primary and replica shards. If the throttle time +is large, it may be an indicator that your disk throttling is set too low (discussed in <>). [source,js] @@ -111,35 +110,37 @@ large, it may be an indicator that your disk throttling is set too low }, ---- -- `indexing` shows the number of docs that have been indexed. This value is a monotonically -increasing counter; it doesn't decrease when docs are deleted. Also note that it -is incremented anytime an _index_ operation happens internally, which includes -things like updates. +- `indexing` shows the number of docs that have been indexed. This value is a +monotonically increasing counter; it doesn't decrease when docs are deleted. +Also note that it is incremented anytime an _index_ operation happens +internally, which includes things like updates. + Also listed are times for indexing, the number of docs currently being indexed, and similar statistics for deletes. -- `get` shows statistics about get-by-ID statistics. This includes `GET` and +- `get` shows statistics about get-by-ID statistics. This includes `GET` and `HEAD` requests for a single document. - `search` describes the number of active searches (`open_contexts`), number of queries total, and the amount of time spent on queries since the node was -started. The ratio between `query_time_in_millis / query_total` can be used as a -rough indicator for how efficient your queries are. The larger the ratio, -the more time each query is taking, and you should consider tuning or optimization. +started. The ratio between `query_time_in_millis / query_total` can be used as a +rough indicator for how efficient your queries are. The larger the ratio, the +more time each query is taking, and you should consider tuning or optimization. + The fetch statistics detail the second half of the query process (the _fetch_ in -query-then-fetch). If more time is spent in fetch than query, this can be an -indicator of slow disks or very large documents being fetched, or -potentially search requests with paginations that are too large (for example, `size: 10000`). - -- `merges` contains information about Lucene segment merges. It will tell you -the number of merges that are currently active, the number of docs involved, the cumulative -size of segments being merged, and the amount of time spent on merges in total. +query-then-fetch). If more time is spent in fetch than query, this can be an +indicator of slow disks or very large documents being fetched, or potentially +search requests with paginations that are too large (for example, `size: +10000`). + +- `merges` contains information about Lucene segment merges. It will tell you +the number of merges that are currently active, the number of docs involved, the +cumulative size of segments being merged, and the amount of time spent on merges +in total. + -Merge statistics can be important if your cluster is write heavy. Merging consumes -a large amount of disk I/O and CPU resources. If your index is write heavy and -you see large merge numbers, be sure to read <>. +Merge statistics can be important if your cluster is write heavy. Merging +consumes a large amount of disk I/O and CPU resources. If your index is write +heavy and you see large merge numbers, be sure to read <>. 
+ Note: updates and deletes will contribute to large merge numbers too, since they cause segment _fragmentation_ that needs to be merged out eventually. @@ -161,50 +162,53 @@ cause segment _fragmentation_ that needs to be merged out eventually. ... ---- -- `filter_cache` indicates the amount of memory used by the cached filter bitsets, -and the number of times a filter has been evicted. A large number of evictions -_could_ indicate that you need to increase the filter cache size, or that -your filters are not caching well (for example, they are churning heavily because of high cardinality, -such as caching `now` date expressions). +- `filter_cache` indicates the amount of memory used by the cached filter +bitsets, and the number of times a filter has been evicted. A large number of +evictions _could_ indicate that you need to increase the filter cache size, or +that your filters are not caching well (for example, they are churning heavily +because of high cardinality, such as caching `now` date expressions). + -However, evictions are a difficult metric to evaluate. Filters are cached on a +However, evictions are a difficult metric to evaluate. Filters are cached on a per-segment basis, and evicting a filter from a small segment is much less -expensive than evicting a filter from a large segment. It's possible that you have many evictions, but they all occur on small segments, which means they have +expensive than evicting a filter from a large segment. It's possible that you +have many evictions, but they all occur on small segments, which means they have little impact on query performance. + -Use the eviction metric as a rough guideline. If you see a large number, investigate -your filters to make sure they are caching well. Filters that constantly evict, -even on small segments, will be much less effective than properly cached filters. - -- `field_data` displays the memory used by fielddata,((("fielddata", "statistics on"))) which is used for aggregations, -sorting, and more. There is also an eviction count. Unlike `filter_cache`, the eviction -count here is useful: it should be zero or very close. Since field data -is not a cache, any eviction is costly and should be avoided. If you see -evictions here, you need to reevaluate your memory situation, fielddata limits, -queries, or all three. - -- `segments` will tell you the number of Lucene segments this node currently serves.((("segments", "number served by a node"))) -This can be an important number. Most indices should have around 50–150 segments, -even if they are terabytes in size with billions of documents. Large numbers -of segments can indicate a problem with merging (for example, merging is not keeping up -with segment creation). Note that this statistic is the aggregate total of all -indices on the node, so keep that in mind. +Use the eviction metric as a rough guideline. If you see a large number, +investigate your filters to make sure they are caching well. Filters that +constantly evict, even on small segments, will be much less effective than +properly cached filters. + +- `field_data` displays the memory used by fielddata, which is used for +aggregations, sorting, and more. There is also an eviction count. Unlike +`filter_cache`, the eviction count here is useful: it should be zero or very +close. Since field data is not a cache, any eviction is costly and should be +avoided. If you see evictions here, you need to reevaluate your memory +situation, fielddata limits, queries, or all three. 
+ +- `segments` will tell you the number of Lucene segments this node currently +serves. This can be an important number. Most indices should have around +50–150 segments, even if they are terabytes in size with billions of +documents. Large numbers of segments can indicate a problem with merging (for +example, merging is not keeping up with segment creation). Note that this +statistic is the aggregate total of all indices on the node, so keep that in +mind. + -The `memory` statistic gives you an idea of the amount of memory being used by the -Lucene segments themselves.((("memory", "statistics on"))) This includes low-level data structures such as -posting lists, dictionaries, and bloom filters. A very large number of segments -will increase the amount of overhead lost to these data structures, and the memory -usage can be a handy metric to gauge that overhead. +The `memory` statistic gives you an idea of the amount of memory being used by +the Lucene segments themselves. This includes low-level data structures such as +posting lists, dictionaries, and bloom filters. A very large number of segments +will increase the amount of overhead lost to these data structures, and the +memory usage can be a handy metric to gauge that overhead. ==== OS and Process Sections The `OS` and `Process` sections are fairly self-explanatory and won't be covered -in great detail.((("operating system (OS), statistics on"))) They list basic resource statistics such as CPU and load.((("process (Elasticsearch JVM), statistics on"))) The -`OS` section describes it for the entire `OS`, while the `Process` section shows just -what the Elasticsearch JVM process is using. +in great detail. They list basic resource statistics such as CPU and load. The +`OS` section describes it for the entire `OS`, while the `Process` section shows +just what the Elasticsearch JVM process is using. -These are obviously useful metrics, but are often being measured elsewhere in your -monitoring stack. Some stats include the following: +These are obviously useful metrics, but are often being measured elsewhere in +your monitoring stack. Some stats include the following: - CPU - Load @@ -215,73 +219,75 @@ monitoring stack. Some stats include the following: ==== JVM Section The `jvm` section contains some critical information about the JVM process that -is running Elasticsearch.((("JVM (Java Virtual Machine)", "statistics on"))) Most important, it contains garbage collection details, -which have a large impact on the stability of your Elasticsearch cluster. +is running Elasticsearch. Most important, it contains garbage collection +details, which have a large impact on the stability of your Elasticsearch +cluster. [[garbage_collector_primer]] .Garbage Collection Primer ********************************** Before we describe the stats, it is useful to give a crash course in garbage -collection and its impact on Elasticsearch.((("garbage collection"))) If you are familar with garbage +collection and its impact on Elasticsearch. If you are familar with garbage collection in the JVM, feel free to skip down. -Java is a _garbage-collected_ language, which means that the programmer does -not manually manage memory allocation and deallocation. The programmer simply -writes code, and the Java Virtual Machine (JVM) manages the process of allocating +Java is a _garbage-collected_ language, which means that the programmer does not +manually manage memory allocation and deallocation. 
The programmer simply writes +code, and the Java Virtual Machine (JVM) manages the process of allocating memory as needed, and then later cleaning up that memory when no longer needed. When memory is allocated to a JVM process, it is allocated in a big chunk called -the _heap_. The JVM then breaks the heap into two groups, referred to as +the _heap_. The JVM then breaks the heap into two groups, referred to as _generations_: Young (or Eden):: - The space where newly instantiated objects are allocated. The -young generation space is often quite small, usually 100 MB–500 MB. The young-gen -also contains two _survivor_ spaces. + The space where newly instantiated objects are allocated. The young +generation space is often quite small, usually 100 MB–500 MB. The +young-gen also contains two _survivor_ spaces. Old:: - The space where older objects are stored. These objects are expected to be long-lived -and persist for a long time. The old-gen is often much larger than the young-gen, -and Elasticsearch nodes can see old-gens as large as 30 GB. + The space where older objects are stored. These objects are expected to be +long-lived and persist for a long time. The old-gen is often much larger than +the young-gen, and Elasticsearch nodes can see old-gens as large as 30 GB. -When an object is instantiated, it is placed into young-gen. When the young -generation space is full, a young-gen garbage collection (GC) is started. Objects that are still -"alive" are moved into one of the survivor spaces, and "dead" objects are removed. -If an object has survived several young-gen GCs, it will be "tenured" into the -old generation. +When an object is instantiated, it is placed into young-gen. When the young +generation space is full, a young-gen garbage collection (GC) is started. +Objects that are still "alive" are moved into one of the survivor spaces, and +"dead" objects are removed. If an object has survived several young-gen GCs, it +will be "tenured" into the old generation. -A similar process happens in the old generation: when the space becomes full, a +A similar process happens in the old generation: when the space becomes full, a garbage collection is started and dead objects are removed. -Nothing comes for free, however. Both the young- and old-generation garbage collectors -have phases that "stop the world." During this time, the JVM literally halts -execution of the program so it can trace the object graph and collect dead -objects. During this stop-the-world phase, nothing happens. Requests are not serviced, -pings are not responded to, shards are not relocated. The world quite literally -stops. +Nothing comes for free, however. Both the young- and old-generation garbage +collectors have phases that "stop the world." During this time, the JVM +literally halts execution of the program so it can trace the object graph and +collect dead objects. During this stop-the-world phase, nothing happens. +Requests are not serviced, pings are not responded to, shards are not relocated. +The world quite literally stops. This isn't a big deal for the young generation; its small size means GCs execute -quickly. But the old-gen is quite a bit larger, and a slow GC here could mean -1s or even 15s of pausing--which is unacceptable for server software. +quickly. But the old-gen is quite a bit larger, and a slow GC here could mean 1s +or even 15s of pausing--which is unacceptable for server software. -The garbage collectors in the JVM are _very_ sophisticated algorithms and do -a great job minimizing pauses. 
And Elasticsearch tries very hard to be _garbage-collection friendly_, by intelligently reusing objects internally, reusing network -buffers, and enabling <> by default. But ultimately, +The garbage collectors in the JVM are _very_ sophisticated algorithms and do a +great job minimizing pauses. And Elasticsearch tries very hard to be +_garbage-collection friendly_, by intelligently reusing objects internally, +reusing network buffers, and enabling <> by default. But ultimately, GC frequency and duration is a metric that needs to be watched by you, since it is the number one culprit for cluster instability. -A cluster that is frequently experiencing long GC will be a cluster that is under -heavy load with not enough memory. These long GCs will make nodes drop off the -cluster for brief periods. This instability causes shards to relocate frequently -as Elasticsearch tries to keep the cluster balanced and enough replicas available. This in -turn increases network traffic and disk I/O, all while your cluster is attempting -to service the normal indexing and query load. +A cluster that is frequently experiencing long GC will be a cluster that is +under heavy load with not enough memory. These long GCs will make nodes drop off +the cluster for brief periods. This instability causes shards to relocate +frequently as Elasticsearch tries to keep the cluster balanced and enough +replicas available. This in turn increases network traffic and disk I/O, all +while your cluster is attempting to service the normal indexing and query load. In short, long GCs are bad and need to be minimized as much as possible. ********************************** -Because garbage collection is so critical to Elasticsearch, you should become intimately -familiar with this section of the `node-stats` API: +Because garbage collection is so critical to Elasticsearch, you should become +intimately familiar with this section of the `node-stats` API: [source,js] ---- @@ -298,21 +304,22 @@ familiar with this section of the `node-stats` API: ---- -- The `jvm` section first lists some general stats about heap memory usage. You -can see how much of the heap is being used, how much is committed (actually allocated -to the process), and the max size the heap is allowed to grow to. Ideally, -`heap_committed_in_bytes` should be identical to `heap_max_in_bytes`. If the -committed size is smaller, the JVM will have to resize the heap eventually--and this is a very expensive process. If your numbers are not identical, see -<> for how to configure it correctly. +- The `jvm` section first lists some general stats about heap memory usage. You +can see how much of the heap is being used, how much is committed (actually +allocated to the process), and the max size the heap is allowed to grow to. +Ideally, `heap_committed_in_bytes` should be identical to `heap_max_in_bytes`. +If the committed size is smaller, the JVM will have to resize the heap +eventually--and this is a very expensive process. If your numbers are not +identical, see <> for how to configure it correctly. + -The `heap_used_percent` metric is a useful number to keep an eye on. Elasticsearch -is configured to initiate GCs when the heap reaches 75% full. If your node is -consistently >= 75%, your node is experiencing _memory pressure_. +The `heap_used_percent` metric is a useful number to keep an eye on. +Elasticsearch is configured to initiate GCs when the heap reaches 75% full. If +your node is consistently >= 75%, your node is experiencing _memory pressure_. 
This is a warning sign that slow GCs may be in your near future. + -If the heap usage is consistently >=85%, you are in trouble. Heaps over 90–95% -are in risk of horrible performance with long 10–30s GCs at best, and out-of-memory -(OOM) exceptions at worst. +If the heap usage is consistently >=85%, you are in trouble. Heaps over +90–95% are in risk of horrible performance with long 10–30s GCs at +best, and out-of-memory (OOM) exceptions at worst. [source,js] ---- @@ -339,9 +346,10 @@ are in risk of horrible performance with long 10–30s GCs at best, and out }, ---- -- The `young`, `survivor`, and `old` sections will give you a breakdown of memory -usage of each generation in the GC. These stats are handy for keeping an eye on -relative sizes, but are often not overly important when debugging problems. +- The `young`, `survivor`, and `old` sections will give you a breakdown of +memory usage of each generation in the GC. These stats are handy for keeping an +eye on relative sizes, but are often not overly important when debugging +problems. [source,js] ---- @@ -360,33 +368,32 @@ relative sizes, but are often not overly important when debugging problems. ---- - `gc` section shows the garbage collection counts and cumulative time for both -young and old generations. You can safely ignore the young generation counts -for the most part: this number will usually be large. That is perfectly -normal. +young and old generations. You can safely ignore the young generation counts for +the most part: this number will usually be large. That is perfectly normal. + -In contrast, the old generation collection count should remain small, and -have a small `collection_time_in_millis`. These are cumulative counts, so it is -hard to give an exact number when you should start worrying (for example, a node with a -one-year uptime will have a large count even if it is healthy). This is one of the -reasons that tools such as Marvel are so helpful. GC counts _over time_ are the -important consideration. +In contrast, the old generation collection count should remain small, and have a +small `collection_time_in_millis`. These are cumulative counts, so it is hard to +give an exact number when you should start worrying (for example, a node with a +one-year uptime will have a large count even if it is healthy). This is one of +the reasons that tools such as Monitoring are so helpful. GC counts _over time_ +are the important consideration. + -Time spent GC'ing is also important. For example, a certain amount of garbage -is generated while indexing documents. This is normal and causes a GC every -now and then. These GCs are almost always fast and have little effect on the -node: young generation takes a millisecond or two, and old generation takes -a few hundred milliseconds. This is much different from 10-second GCs. +Time spent GC'ing is also important. For example, a certain amount of garbage is +generated while indexing documents. This is normal and causes a GC every now and +then. These GCs are almost always fast and have little effect on the node: young +generation takes a millisecond or two, and old generation takes a few hundred +milliseconds. This is much different from 10-second GCs. + -Our best advice is to collect collection counts and duration periodically (or use Marvel) -and keep an eye out for frequent GCs. You can also enable slow-GC logging, -discussed in <>. +Our best advice is to collect collection counts and duration periodically (or +use Monitoring) and keep an eye out for frequent GCs. 
You can also enable +slow-GC logging, discussed in <>. ==== Threadpool Section -Elasticsearch maintains threadpools internally. ((("threadpools", "statistics on"))) These threadpools -cooperate to get work done, passing work between each other as necessary. In -general, you don't need to configure or tune the threadpools, but it is sometimes -useful to see their stats so you can gain insight into how your cluster is behaving. +Elasticsearch maintains threadpools internally. These threadpools cooperate to +get work done, passing work between each other as necessary. In general, you +don't need to configure or tune the threadpools, but it is sometimes useful to +see their stats so you can gain insight into how your cluster is behaving. There are about a dozen threadpools, but they all share the same format: @@ -402,45 +409,46 @@ There are about a dozen threadpools, but they all share the same format: } ---- -Each threadpool lists the number of threads that are configured (`threads`), -how many of those threads are actively processing some work (`active`), and how -many work units are sitting in a queue (`queue`). +Each threadpool lists the number of threads that are configured (`threads`), how +many of those threads are actively processing some work (`active`), and how many +work units are sitting in a queue (`queue`). -If the queue fills up to its limit, new work units will begin to be rejected, and -you will see that reflected in the `rejected` statistic. This is often a sign -that your cluster is starting to bottleneck on some resources, since a full +If the queue fills up to its limit, new work units will begin to be rejected, +and you will see that reflected in the `rejected` statistic. This is often a +sign that your cluster is starting to bottleneck on some resources, since a full queue means your node/cluster is processing at maximum speed but unable to keep up with the influx of work. .Bulk Rejections **** -If you are going to encounter queue rejections, it will most likely be caused -by bulk indexing requests.((("bulk API", "rejections of bulk requests"))) It is easy to send many bulk requests to Elasticsearch -by using concurrent import processes. More is better, right? +If you are going to encounter queue rejections, it will most likely be caused by +bulk indexing requests. It is easy to send many bulk requests to Elasticsearch +by using concurrent import processes. More is better, right? In reality, each cluster has a certain limit at which it can not keep up with -ingestion. Once this threshold is crossed, the queue will quickly fill up, and +ingestion. Once this threshold is crossed, the queue will quickly fill up, and new bulks will be rejected. -This is a _good thing_. Queue rejections are a useful form of back pressure. They -let you know that your cluster is at maximum capacity, which is much better than -sticking data into an in-memory queue. Increasing the queue size doesn't increase -performance; it just hides the problem. If your cluster can process only 10,000 -docs per second, it doesn't matter whether the queue is 100 or 10,000,000--your cluster can -still process only 10,000 docs per second. +This is a _good thing_. Queue rejections are a useful form of back pressure. +They let you know that your cluster is at maximum capacity, which is much better +than sticking data into an in-memory queue. Increasing the queue size doesn't +increase performance; it just hides the problem. 
If your cluster can process +only 10,000 docs per second, it doesn't matter whether the queue is 100 or +10,000,000--your cluster can still process only 10,000 docs per second. -The queue simply hides the performance problem and carries a real risk of data-loss. -Anything sitting in a queue is by definition not processed yet. If the node -goes down, all those requests are lost forever. Furthermore, the queue eats -up a lot of memory, which is not ideal. +The queue simply hides the performance problem and carries a real risk of +data-loss. Anything sitting in a queue is by definition not processed yet. If +the node goes down, all those requests are lost forever. Furthermore, the queue +eats up a lot of memory, which is not ideal. It is much better to handle queuing in your application by gracefully handling -the back pressure from a full queue. When you receive bulk rejections, you should take these steps: +the back pressure from a full queue. When you receive bulk rejections, you +should take these steps: 1. Pause the import thread for 3–5 seconds. -2. Extract the rejected actions from the bulk response, since it is probable that -many of the actions were successful. The bulk response will tell you which succeeded -and which were rejected. +2. Extract the rejected actions from the bulk response, since it is probable +that many of the actions were successful. The bulk response will tell you which +succeeded and which were rejected. 3. Send a new bulk request with just the rejected actions. 4. Repeat from step 1 if rejections are encountered again. @@ -450,8 +458,8 @@ naturally backs off. Rejections are not errors: they just mean you should try again later. **** -There are a dozen threadpools. Most you can safely ignore, but a few -are good to keep an eye on: +There are a dozen threadpools. Most you can safely ignore, but a few are good to +keep an eye on: `indexing`:: Threadpool for normal indexing requests @@ -470,16 +478,16 @@ are good to keep an eye on: ==== FS and Network Sections -Continuing down the `node-stats` API, you'll see a((("filesystem, statistics on"))) bunch of statistics about your -filesystem: free space, data directory paths, disk I/O stats, and more. If you are -not monitoring free disk space, you can get those stats here. The disk I/O stats -are also handy, but often more specialized command-line tools (`iostat`, for example) -are more useful. +Continuing down the `node-stats` API, you'll see a bunch of statistics about +your filesystem: free space, data directory paths, disk I/O stats, and more. If +you are not monitoring free disk space, you can get those stats here. The disk +I/O stats are also handy, but often more specialized command-line tools +(`iostat`, for example) are more useful. Obviously, Elasticsearch has a difficult time functioning if you run out of disk space--so make sure you don't. -There are also two sections on ((("network", "statistics on")))network statistics: +There are also two sections on network statistics: [source,js] ---- @@ -496,21 +504,21 @@ There are also two sections on ((("network", "statistics on")))network statistic }, ---- -- `transport` shows some basic stats about the _transport address_. This -relates to inter-node communication (often on port 9300) and any transport client -or node client connections. Don't worry if you see many connections here; +- `transport` shows some basic stats about the _transport address_. This relates +to inter-node communication (often on port 9300) and any transport client or +node client connections. 
Don't worry if you see many connections here; Elasticsearch maintains a large number of connections between nodes. -- `http` represents stats about the HTTP port (often 9200). If you see a very +- `http` represents stats about the HTTP port (often 9200). If you see a very large `total_opened` number that is constantly increasing, that is a sure sign -that one of your HTTP clients is not using keep-alive connections. Persistent, -keep-alive connections are important for performance, since building up and tearing -down sockets is expensive (and wastes file descriptors). Make sure your clients -are configured appropriately. +that one of your HTTP clients is not using keep-alive connections. Persistent, +keep-alive connections are important for performance, since building up and +tearing down sockets is expensive (and wastes file descriptors). Make sure your +clients are configured appropriately. ==== Circuit Breaker -Finally, we come to the last section: stats about the((("fielddata circuit breaker"))) fielddata circuit breaker +Finally, we come to the last section: stats about the fielddata circuit breaker (introduced in <>): [role="pagebreak-before"] @@ -527,11 +535,12 @@ Finally, we come to the last section: stats about the((("fielddata circuit break ---- Here, you can determine the maximum circuit-breaker size (for example, at what -size the circuit breaker will trip if a query attempts to use more memory). This section -will also let you know the number of times the circuit breaker has been tripped, and -the currently configured overhead. The overhead is used to pad estimates, because some queries are more difficult to estimate than others. +size the circuit breaker will trip if a query attempts to use more memory). This +section will also let you know the number of times the circuit breaker has been +tripped, and the currently configured overhead. The overhead is used to pad +estimates, because some queries are more difficult to estimate than others. -The main thing to watch is the `tripped` metric. If this number is large or +The main thing to watch is the `tripped` metric. If this number is large or consistently increasing, it's a sign that your queries may need to be optimized or that you may need to obtain more memory (either per box or by adding more -nodes).((("Node Stats API", range="endofrange", startref="ix_NodeStats"))) +nodes). diff --git a/500_Cluster_Admin/40_other_stats.asciidoc b/500_Cluster_Admin/40_other_stats.asciidoc index 4d2a120c4..e345ff65d 100644 --- a/500_Cluster_Admin/40_other_stats.asciidoc +++ b/500_Cluster_Admin/40_other_stats.asciidoc @@ -1,16 +1,16 @@ === Cluster Stats -The `cluster-stats` API provides similar output to the `node-stats`.((("clusters", "administration", "Cluster Stats API"))) There -is one crucial difference: Node Stats shows you statistics per node, while +The `cluster-stats` API provides similar output to the `node-stats`. There is +one crucial difference: Node Stats shows you statistics per node, while `cluster-stats` shows you the sum total of all nodes in a single metric. -This provides some useful stats to glance at. You can see for example, that your entire cluster -is using 50% of the available heap or that filter cache is not evicting heavily. Its -main use is to provide a quick summary that is more extensive than -the `cluster-health`, but less detailed than `node-stats`. It is also useful for -clusters that are very large, which makes `node-stats` output difficult -to read. +This provides some useful stats to glance at. 
You can see for example, that your +entire cluster is using 50% of the available heap or that filter cache is not +evicting heavily. Its main use is to provide a quick summary that is more +extensive than the `cluster-health`, but less detailed than `node-stats`. It is +also useful for clusters that are very large, which makes `node-stats` output +difficult to read. The API may be invoked as follows: @@ -21,16 +21,16 @@ GET _cluster/stats === Index Stats -So far, we have been looking at _node-centric_ statistics:((("indices", "index statistics")))((("clusters", "administration", "index stats"))) How much memory does -this node have? How much CPU is being used? How many searches is this node +So far, we have been looking at _node-centric_ statistics: How much memory does +this node have? How much CPU is being used? How many searches is this node servicing? Sometimes it is useful to look at statistics from an _index-centric_ perspective: How many search requests is _this index_ receiving? How much time is spent fetching docs in _that index_? -To do this, select the index (or indices) that you are interested in and -execute an Index `stats` API: +To do this, select the index (or indices) that you are interested in and execute +an Index `stats` API: [source,js] ---- @@ -44,36 +44,36 @@ GET _all/_stats <3> <2> Stats for multiple indices can be requested by separating their names with a comma. <3> Stats for all indices can be requested using the special `_all` index name. -The stats returned will be familar to the `node-stats` output: `search` `fetch` `get` -`index` `bulk` `segment counts` and so forth +The stats returned will be familar to the `node-stats` output: `search` `fetch` +`get` `index` `bulk` `segment counts` and so forth Index-centric stats can be useful for identifying or verifying _hot_ indices inside your cluster, or trying to determine why some indices are faster/slower than others. -In practice, however, node-centric statistics tend to be more useful. Entire -nodes tend to bottleneck, not individual indices. And because indices -are usually spread across multiple nodes, index-centric statistics -are usually not very helpful because they aggregate data from different physical machines +In practice, however, node-centric statistics tend to be more useful. Entire +nodes tend to bottleneck, not individual indices. And because indices are +usually spread across multiple nodes, index-centric statistics are usually not +very helpful because they aggregate data from different physical machines operating in different environments. -Index-centric stats are a useful tool to keep in your repertoire, but are not usually -the first tool to reach for. +Index-centric stats are a useful tool to keep in your repertoire, but are not +usually the first tool to reach for. === Pending Tasks -There are certain tasks that only the master can perform, such as creating a new ((("clusters", "administration", "Pending Tasks API"))) -index or moving shards around the cluster. Since a cluster can have only one -master, only one node can ever process cluster-level metadata changes. For -99.9999% of the time, this is never a problem. The queue of metadata changes +There are certain tasks that only the master can perform, such as creating a new +index or moving shards around the cluster. Since a cluster can have only one +master, only one node can ever process cluster-level metadata changes. For +99.9999% of the time, this is never a problem. The queue of metadata changes remains essentially zero. 
-In some _rare_ clusters, the number of metadata changes occurs faster than -the master can process them. This leads to a buildup of pending actions that -are queued. +In some _rare_ clusters, the number of metadata changes occurs faster than the +master can process them. This leads to a buildup of pending actions that are +queued. -The `pending-tasks` API ((("Pending Tasks API")))will show you what (if any) cluster-level metadata changes -are pending in the queue: +The `pending-tasks` API will show you what (if any) cluster-level metadata +changes are pending in the queue: [source,js] ---- @@ -89,7 +89,7 @@ Usually, the response will look like this: } ---- -This means there are no pending tasks. If you have one of the rare clusters that +This means there are no pending tasks. If you have one of the rare clusters that bottlenecks on the master node, your pending task list may look like this: [source,js] @@ -106,7 +106,7 @@ bottlenecks on the master node, your pending task list may look like this: { "insert_order": 46, "priority": "HIGH", - "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], + "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]", "time_in_queue_millis": 842, "time_in_queue": "842ms" @@ -114,7 +114,7 @@ bottlenecks on the master node, your pending task list may look like this: { "insert_order": 45, "priority": "HIGH", - "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], + "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]", "time_in_queue_millis": 858, "time_in_queue": "858ms" @@ -123,51 +123,51 @@ bottlenecks on the master node, your pending task list may look like this: } ---- -You can see that tasks are assigned a priority (`URGENT` is processed before `HIGH`, -for example), the order it was inserted, how long the action has been queued and -what the action is trying to perform. In the preceding list, there is a `create-index` -action and two `shard-started` actions pending. +You can see that tasks are assigned a priority (`URGENT` is processed before +`HIGH`, for example), the order it was inserted, how long the action has been +queued and what the action is trying to perform. In the preceding list, there is +a `create-index` action and two `shard-started` actions pending. .When Should I Worry About Pending Tasks? **** -As mentioned, the master node is rarely the bottleneck for clusters. The only -time it could bottleneck is if the cluster state is both very large -_and_ updated frequently. +As mentioned, the master node is rarely the bottleneck for clusters. The only +time it could bottleneck is if the cluster state is both very large _and_ +updated frequently. -For example, if you allow customers to create as many dynamic fields as they wish, -and have a unique index for each customer every day, your cluster state will grow -very large. The cluster state includes (among other things) a list of all indices, -their types, and the fields for each index. +For example, if you allow customers to create as many dynamic fields as they +wish, and have a unique index for each customer every day, your cluster state +will grow very large. The cluster state includes (among other things) a list of +all indices, their types, and the fields for each index. 
So if you have 100,000 customers, and each customer averages 1,000 fields and 90 days of retention--that's nine billion fields to keep in the cluster state. -Whenever this changes, the nodes must be notified. +Whenever this changes, the nodes must be notified. The master must process these changes, which requires nontrivial CPU overhead, plus the network overhead of pushing the updated cluster state to all nodes. It is these clusters that may begin to see cluster-state actions queuing up. -There is no easy solution to this problem, however. You have three options: +There is no easy solution to this problem, however. You have three options: -- Obtain a beefier master node. Vertical scaling just delays the inevitable, -unfortunately. -- Restrict the dynamic nature of the documents in some way, so as to limit the -cluster-state size. +- Obtain a beefier master node. Vertical scaling just delays the inevitable, +unfortunately. +- Restrict the dynamic nature of the documents in some way, so as to limit the +cluster-state size. - Spin up another cluster after a certain threshold has been crossed. **** === cat API -If you work from the command line often, the `cat` APIs will be helpful -to you.((("Cat API")))((("clusters", "administration", "Cat API"))) Named after the linux `cat` command, these APIs are designed to -work like *nix command-line tools. +If you work from the command line often, the `cat` APIs will be helpful to +you. Named after the linux `cat` command, these APIs are designed to work like +*nix command-line tools. They provide statistics that are identical to all the previously discussed APIs -(Health, `node-stats`, and so forth), but present the output in tabular form instead of -JSON. This is _very_ convenient for a system administrator, and you just want -to glance over your cluster or find nodes with high memory usage. +(Health, `node-stats`, and so forth), but present the output in tabular form +instead of JSON. This is _very_ convenient for a system administrator, and you +just want to glance over your cluster or find nodes with high memory usage. -Executing a plain `GET` against the `cat` endpoint will show you all available +Executing a plain `GET` against the `cat` endpoint will show you all available APIs: [source,bash] @@ -198,20 +198,20 @@ GET /_cat /_cat/fielddata/{fields} ---- -Many of these APIs should look familiar to you (and yes, that's a cat at the top -:) ). Let's take a look at the Cat Health API: +Many of these APIs should look familiar to you (and yes, that's a cat at the top +:) ). Let's take a look at the Cat Health API: [source,bash] ---- GET /_cat/health -1408723713 12:08:33 elasticsearch_zach yellow 1 1 114 114 0 0 114 +1408723713 12:08:33 elasticsearch_zach yellow 1 1 114 114 0 0 114 ---- -The first thing you'll notice is that the response is plain text in tabular form, -not JSON. The second thing you'll notice is that there are no column headers -enabled by default. This is designed to emulate *nix tools, since it is assumed -that once you become familiar with the output, you no longer want to see +The first thing you'll notice is that the response is plain text in tabular +form, not JSON. The second thing you'll notice is that there are no column +headers enabled by default. This is designed to emulate *nix tools, since it is +assumed that once you become familiar with the output, you no longer want to see the headers. 
To enable headers, add the `?v` parameter: @@ -220,12 +220,12 @@ To enable headers, add the `?v` parameter: ---- GET /_cat/health?v -epoch time cluster status node.total node.data shards pri relo init -1408[..] 12[..] el[..] 1 1 114 114 0 0 114 +epoch time cluster status node.total node.data shards pri relo init +1408[..] 12[..] el[..] 1 1 114 114 0 0 114 unassign ---- -Ah, much better. We now see the timestamp, cluster name, status, the number of +Ah, much better. We now see the timestamp, cluster name, status, the number of nodes in the cluster, and more--all the same information as the `cluster-health` API. @@ -235,13 +235,13 @@ Let's look at `node-stats` in the `cat` API: ---- GET /_cat/nodes?v -host ip heap.percent ram.percent load node.role master name -zacharys-air 192.168.1.131 45 72 1.85 d * Zach +host ip heap.percent ram.percent load node.role master name +zacharys-air 192.168.1.131 45 72 1.85 d * Zach ---- -We see some stats about the nodes in our cluster, but the output is basic compared -to the full `node-stats` output. You can -include many additional metrics, but rather than consulting the documentation, let's just ask the `cat` +We see some stats about the nodes in our cluster, but the output is basic +compared to the full `node-stats` output. You can include many additional +metrics, but rather than consulting the documentation, let's just ask the `cat` API what is available. You can do this by adding `?help` to any API: @@ -250,118 +250,114 @@ You can do this by adding `?help` to any API: ---- GET /_cat/nodes?help -id | id,nodeId | unique node id -pid | p | process id -host | h | host name -ip | i | ip address -port | po | bound transport port -version | v | es version -build | b | es build hash -jdk | j | jdk version -disk.avail | d,disk,diskAvail | available disk space -heap.percent | hp,heapPercent | used heap ratio -heap.max | hm,heapMax | max configured heap -ram.percent | rp,ramPercent | used machine memory ratio -ram.max | rm,ramMax | total machine memory -load | l | most recent load avg -uptime | u | node uptime -node.role | r,role,dc,nodeRole | d:data node, c:client node -master | m | m:master-eligible, *:current master +id | id,nodeId | unique node id +pid | p | process id +host | h | host name +ip | i | ip address +port | po | bound transport port +version | v | es version +build | b | es build hash +jdk | j | jdk version +disk.avail | d,disk,diskAvail | available disk space +heap.percent | hp,heapPercent | used heap ratio +heap.max | hm,heapMax | max configured heap +ram.percent | rp,ramPercent | used machine memory ratio +ram.max | rm,ramMax | total machine memory +load | l | most recent load avg +uptime | u | node uptime +node.role | r,role,dc,nodeRole | d:data node, c:client node +master | m | m:master-eligible, *:current master ... ... ---- (Note that the output has been truncated for brevity). The first column shows the full name, the second column shows the short name, -and the third column offers a brief description about the parameter. Now that -we know some column names, we can ask for those explicitly by using the `?h` +and the third column offers a brief description about the parameter. 
Now that we +know some column names, we can ask for those explicitly by using the `?h` parameter: [source,bash] ---- GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax -ip port heapPercent heapMax -192.168.1.131 9300 53 990.7mb +ip port heapPercent heapMax +192.168.1.131 9300 53 990.7mb ---- -Because the `cat` API tries to behave like *nix utilities, you can pipe the output -to other tools such as `sort` `grep` or `awk`. For example, we can find the largest -index in our cluster by using the following: +Because the `cat` API tries to behave like *nix utilities, you can pipe the +output to other tools such as `sort` `grep` or `awk`. For example, we can find +the largest index in our cluster by using the following: [source,bash] ---- % curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8 -yellow test_names 5 1 3476004 0 376324705 376324705 -yellow .marvel-2014.08.19 1 1 263878 0 160777194 160777194 -yellow .marvel-2014.08.15 1 1 234482 0 143020770 143020770 -yellow .marvel-2014.08.09 1 1 222532 0 138177271 138177271 -yellow .marvel-2014.08.18 1 1 225921 0 138116185 138116185 -yellow .marvel-2014.07.26 1 1 173423 0 132031505 132031505 -yellow .marvel-2014.08.21 1 1 219857 0 128414798 128414798 -yellow .marvel-2014.07.27 1 1 75202 0 56320862 56320862 -yellow wavelet 5 1 5979 0 54815185 54815185 -yellow .marvel-2014.07.28 1 1 57483 0 43006141 43006141 -yellow .marvel-2014.07.21 1 1 31134 0 27558507 27558507 -yellow .marvel-2014.08.01 1 1 41100 0 27000476 27000476 -yellow kibana-int 5 1 2 0 17791 17791 -yellow t 5 1 7 0 15280 15280 -yellow website 5 1 12 0 12631 12631 -yellow agg_analysis 5 1 5 0 5804 5804 -yellow v2 5 1 2 0 5410 5410 -yellow v1 5 1 2 0 5367 5367 -yellow bank 1 1 16 0 4303 4303 -yellow v 5 1 1 0 2954 2954 -yellow p 5 1 2 0 2939 2939 -yellow b0001_072320141238 5 1 1 0 2923 2923 -yellow ipaddr 5 1 1 0 2917 2917 -yellow v2a 5 1 1 0 2895 2895 -yellow movies 5 1 1 0 2738 2738 -yellow cars 5 1 0 0 1249 1249 -yellow wavelet2 5 1 0 0 615 615 +yellow test_names 5 1 3476004 0 376324705 376324705 +yellow .marvel-2014.08.19 1 1 263878 0 160777194 160777194 +yellow .marvel-2014.08.15 1 1 234482 0 143020770 143020770 +yellow .marvel-2014.08.09 1 1 222532 0 138177271 138177271 +yellow .marvel-2014.08.18 1 1 225921 0 138116185 138116185 +yellow .marvel-2014.07.26 1 1 173423 0 132031505 132031505 +yellow .marvel-2014.08.21 1 1 219857 0 128414798 128414798 +yellow .marvel-2014.07.27 1 1 75202 0 56320862 56320862 +yellow wavelet 5 1 5979 0 54815185 54815185 +yellow .marvel-2014.07.28 1 1 57483 0 43006141 43006141 +yellow .marvel-2014.07.21 1 1 31134 0 27558507 27558507 +yellow .marvel-2014.08.01 1 1 41100 0 27000476 27000476 +yellow kibana-int 5 1 2 0 17791 17791 +yellow t 5 1 7 0 15280 15280 +yellow website 5 1 12 0 12631 12631 +yellow agg_analysis 5 1 5 0 5804 5804 +yellow v2 5 1 2 0 5410 5410 +yellow v1 5 1 2 0 5367 5367 +yellow bank 1 1 16 0 4303 4303 +yellow v 5 1 1 0 2954 2954 +yellow p 5 1 2 0 2939 2939 +yellow b0001_072320141238 5 1 1 0 2923 2923 +yellow ipaddr 5 1 1 0 2917 2917 +yellow v2a 5 1 1 0 2895 2895 +yellow movies 5 1 1 0 2738 2738 +yellow cars 5 1 0 0 1249 1249 +yellow wavelet2 5 1 0 0 615 615 ---- By adding `?bytes=b`, we disable the human-readable formatting on numbers and -force them to be listed as bytes. This output is then piped into `sort` so that +force them to be listed as bytes. This output is then piped into `sort` so that our indices are ranked according to size (the eighth column). 
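
The same raw-byte output lends itself to other quick calculations. For example,
a rough total of the index sizes listed above (a sketch that simply sums the
same eighth column used for sorting) could be obtained with `awk`:

[source,bash]
----
% curl 'localhost:9200/_cat/indices?bytes=b' | awk '{ sum += $8 } END { print sum }'
----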
-Unfortunately, you'll notice that the Marvel indices are clogging up the results, -and we don't really care about those indices right now. Let's pipe the output -through `grep` and remove anything mentioning Marvel: +Unfortunately, you'll notice that the Marvel indices are clogging up the +results, and we don't really care about those indices right now. Let's pipe the +output through `grep` and remove anything mentioning Marvel: [source,bash] ---- % curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8 | grep -v marvel -yellow test_names 5 1 3476004 0 376324705 376324705 -yellow wavelet 5 1 5979 0 54815185 54815185 -yellow kibana-int 5 1 2 0 17791 17791 -yellow t 5 1 7 0 15280 15280 -yellow website 5 1 12 0 12631 12631 -yellow agg_analysis 5 1 5 0 5804 5804 -yellow v2 5 1 2 0 5410 5410 -yellow v1 5 1 2 0 5367 5367 -yellow bank 1 1 16 0 4303 4303 -yellow v 5 1 1 0 2954 2954 -yellow p 5 1 2 0 2939 2939 -yellow b0001_072320141238 5 1 1 0 2923 2923 -yellow ipaddr 5 1 1 0 2917 2917 -yellow v2a 5 1 1 0 2895 2895 -yellow movies 5 1 1 0 2738 2738 -yellow cars 5 1 0 0 1249 1249 -yellow wavelet2 5 1 0 0 615 615 +yellow test_names 5 1 3476004 0 376324705 376324705 +yellow wavelet 5 1 5979 0 54815185 54815185 +yellow kibana-int 5 1 2 0 17791 17791 +yellow t 5 1 7 0 15280 15280 +yellow website 5 1 12 0 12631 12631 +yellow agg_analysis 5 1 5 0 5804 5804 +yellow v2 5 1 2 0 5410 5410 +yellow v1 5 1 2 0 5367 5367 +yellow bank 1 1 16 0 4303 4303 +yellow v 5 1 1 0 2954 2954 +yellow p 5 1 2 0 2939 2939 +yellow b0001_072320141238 5 1 1 0 2923 2923 +yellow ipaddr 5 1 1 0 2917 2917 +yellow v2a 5 1 1 0 2895 2895 +yellow movies 5 1 1 0 2738 2738 +yellow cars 5 1 0 0 1249 1249 +yellow wavelet2 5 1 0 0 615 615 ---- -Voila! After piping through `grep` (with `-v` to invert the matches), we get -a sorted list of indices without Marvel cluttering it up. +Voila! After piping through `grep` (with `-v` to invert the matches), we get a +sorted list of indices without Marvel cluttering it up. This is just a simple example of the flexibility of `cat` at the command line. -Once you get used to using `cat`, you'll see it like any other *nix tool and start -going crazy with piping, sorting, and grepping. If you are a system admin and spend -any time SSH'd into boxes, definitely spend some time getting familiar +Once you get used to using `cat`, you'll see it like any other *nix tool and +start going crazy with piping, sorting, and grepping. If you are a system admin +and spend any time SSH'd into boxes, definitely spend some time getting familiar with the `cat` API. - - - - diff --git a/510_Deployment/10_intro.asciidoc b/510_Deployment/10_intro.asciidoc index d5bf9be88..e67ace0d8 100644 --- a/510_Deployment/10_intro.asciidoc +++ b/510_Deployment/10_intro.asciidoc @@ -1,14 +1,13 @@ If you have made it this far in the book, hopefully you've learned a thing or -two about Elasticsearch and are ready to((("deployment"))) deploy your cluster to production.((("clusters", "deployment", see="deployment"))) -This chapter is not meant to be an exhaustive guide to running your cluster -in production, but it covers the key things to consider before putting -your cluster live. +two about Elasticsearch and are ready to deploy your cluster to production. This +chapter is not meant to be an exhaustive guide to running your cluster in +production, but it covers the key things to consider before putting your cluster +live. 
Three main areas are covered: - Logistical considerations, such as hardware recommendations and deployment strategies - Configuration changes that are more suited to a production environment -- Post-deployment considerations, such as security, maximizing indexing performance, -and backups - +- Post-deployment considerations, such as security, maximizing indexing +performance, and backups diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index ec3954462..b45280166 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,116 +1,118 @@ [[hardware]] === Hardware -If you've been following the normal development path, you've probably been playing((("deployment", "hardware")))((("hardware"))) -with Elasticsearch on your laptop or on a small cluster of machines lying around. -But when it comes time to deploy Elasticsearch to production, there are a few -recommendations that you should consider. Nothing is a hard-and-fast rule; -Elasticsearch is used for a wide range of tasks and on a bewildering array of -machines. But these recommendations provide good starting points based on our experience with -production clusters. +If you've been following the normal development path, you've probably been +playing with Elasticsearch on your laptop or on a small cluster of machines +lying around. But when it comes time to deploy Elasticsearch to production, +there are a few recommendations that you should consider. Nothing is a +hard-and-fast rule; Elasticsearch is used for a wide range of tasks and on a +bewildering array of machines. But these recommendations provide good starting +points based on our experience with production clusters. ==== Memory -If there is one resource that you will run out of first, it will likely be memory.((("hardware", "memory")))((("memory"))) -Sorting and aggregations can both be memory hungry, so enough heap space to -accommodate these is important.((("heap"))) Even when the heap is comparatively small, -extra memory can be given to the OS filesystem cache. Because many data structures -used by Lucene are disk-based formats, Elasticsearch leverages the OS cache to -great effect. +If there is one resource that you will run out of first, it will likely be +memory. Sorting and aggregations can both be memory hungry, so enough heap space +to accommodate these is important. Even when the heap is comparatively small, +extra memory can be given to the OS filesystem cache. Because many data +structures used by Lucene are disk-based formats, Elasticsearch leverages the OS +cache to great effect. -A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines -are also common. Less than 8 GB tends to be counterproductive (you end up -needing many, many small machines), and greater than 64 GB has problems that we will -discuss in <>. +A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB +machines are also common. Less than 8 GB tends to be counterproductive (you end +up needing many, many small machines), and greater than 64 GB has problems that +we will discuss in <>. ==== CPUs -Most Elasticsearch deployments tend to be rather light on CPU requirements. As -such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should -choose a modern processor with multiple cores. Common clusters utilize two- to eight-core machines. +Most Elasticsearch deployments tend to be rather light on CPU requirements. 
As +such, the exact processor setup matters less than the other resources. You +should choose a modern processor with multiple cores. Common clusters utilize +two- to eight-core machines. -If you need to choose between faster CPUs or more cores, choose more cores. The +If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed. ==== Disks -Disks are important for all clusters,((("disks")))((("hardware", "disks"))) and doubly so for indexing-heavy clusters -(such as those that ingest log data). Disks are the slowest subsystem in a server, -which means that write-heavy clusters can easily saturate their disks, which in -turn become the bottleneck of the cluster. +Disks are important for all clusters, and doubly so for indexing-heavy clusters +(such as those that ingest log data). Disks are the slowest subsystem in a +server, which means that write-heavy clusters can easily saturate their disks, +which in turn become the bottleneck of the cluster. -If you can afford SSDs, they are by far superior to any spinning media. SSD-backed -nodes see boosts in both query and indexing performance. If you can afford it, -SSDs are the way to go. +If you can afford SSDs, they are by far superior to any spinning media. +SSD-backed nodes see boosts in both query and indexing performance. If you can +afford it, SSDs are the way to go. .Check Your I/O Scheduler **** -If you are using SSDs, make sure your OS I/O scheduler is((("I/O scheduler"))) configured correctly. +If you are using SSDs, make sure your OS I/O scheduler is configured correctly. When you write data to disk, the I/O scheduler decides when that data is -_actually_ sent to the disk. The default under most *nix distributions is a +_actually_ sent to the disk. The default under most *nix distributions is a scheduler called `cfq` (Completely Fair Queuing). This scheduler allocates _time slices_ to each process, and then optimizes the -delivery of these various queues to the disk. It is optimized for spinning media: -the nature of rotating platters means it is more efficient to write data to disk -based on physical layout. +delivery of these various queues to the disk. It is optimized for spinning +media: the nature of rotating platters means it is more efficient to write data +to disk based on physical layout. This is inefficient for SSD, however, since there are no spinning platters -involved. Instead, `deadline` or `noop` should be used instead. The deadline -scheduler optimizes based on how long writes have been pending, while `noop` -is just a simple FIFO queue. +involved. Instead, `deadline` or `noop` should be used instead. The deadline +scheduler optimizes based on how long writes have been pending, while `noop` is +just a simple FIFO queue. -This simple change can have dramatic impacts. We've seen a 500-fold improvement +This simple change can have dramatic impacts. We've seen a 500-fold improvement to write throughput just by using the correct scheduler. **** -If you use spinning media, try to obtain the fastest disks possible (high-performance server disks, 15k RPM drives). +If you use spinning media, try to obtain the fastest disks possible +(high-performance server disks, 15k RPM drives). Using RAID 0 is an effective way to increase disk speed, for both spinning disks -and SSD. There is no need to use mirroring or parity variants of RAID, since +and SSD. 
There is no need to use mirroring or parity variants of RAID, since high availability is built into Elasticsearch via replicas. -Finally, avoid network-attached storage (NAS). People routinely claim their -NAS solution is faster and more reliable than local drives. Despite these claims, -we have never seen NAS live up to its hype. NAS is often slower, displays -larger latencies with a wider deviation in average latency, and is a single -point of failure. +Finally, avoid network-attached storage (NAS). People routinely claim their NAS +solution is faster and more reliable than local drives. Despite these claims, we +have never seen NAS live up to its hype. NAS is often slower, displays larger +latencies with a wider deviation in average latency, and is a single point of +failure. ==== Network -A fast and reliable network is obviously important to performance in a distributed((("hardware", "network")))((("network"))) -system. Low latency helps ensure that nodes can communicate easily, while -high bandwidth helps shard movement and recovery. Modern data-center networking -(1 GbE, 10 GbE) is sufficient for the vast majority of clusters. +A fast and reliable network is obviously important to performance in a +distributed system. Low latency helps ensure that nodes can communicate easily, +while high bandwidth helps shard movement and recovery. Modern data-center +networking (1 GbE, 10 GbE) is sufficient for the vast majority of clusters. Avoid clusters that span multiple data centers, even if the data centers are -colocated in close proximity. Definitely avoid clusters that span large geographic -distances. +colocated in close proximity. Definitely avoid clusters that span large +geographic distances. Elasticsearch clusters assume that all nodes are equal--not that half the nodes are actually 150ms distant in another data center. Larger latencies tend to exacerbate problems in distributed systems and make debugging and resolution more difficult. -Similar to the NAS argument, everyone claims that their pipe between data centers is -robust and low latency. This is true--until it isn't (a network failure will -happen eventually; you can count on it). From our experience, the hassle of -managing cross–data center clusters is simply not worth the cost. +Similar to the NAS argument, everyone claims that their pipe between data +centers is robust and low latency. This is true--until it isn't (a network +failure will happen eventually; you can count on it). From our experience, the +hassle of managing cross–data center clusters is simply not worth the +cost. ==== General Considerations -It is possible nowadays to obtain truly enormous machines:((("hardware", "general considerations"))) hundreds of gigabytes -of RAM with dozens of CPU cores. Conversely, it is also possible to spin up -thousands of small virtual machines in cloud platforms such as EC2. Which +It is possible nowadays to obtain truly enormous machines: hundreds of gigabytes +of RAM with dozens of CPU cores. Conversely, it is also possible to spin up +thousands of small virtual machines in cloud platforms such as EC2. Which approach is best? -In general, it is better to prefer medium-to-large boxes. Avoid small machines, -because you don't want to manage a cluster with a thousand nodes, and the overhead -of simply running Elasticsearch is more apparent on such small boxes. - -At the same time, avoid the truly enormous machines. 
They often lead to imbalanced -resource usage (for example, all the memory is being used, but none of the CPU) and can -add logistical complexity if you have to run multiple nodes per machine. - +In general, it is better to prefer medium-to-large boxes. Avoid small machines, +because you don't want to manage a cluster with a thousand nodes, and the +overhead of simply running Elasticsearch is more apparent on such small boxes. +At the same time, avoid the truly enormous machines. They often lead to +imbalanced resource usage (for example, all the memory is being used, but none +of the CPU) and can add logistical complexity if you have to run multiple nodes +per machine. diff --git a/510_Deployment/30_other.asciidoc b/510_Deployment/30_other.asciidoc index dda8ae758..defcf43ec 100644 --- a/510_Deployment/30_other.asciidoc +++ b/510_Deployment/30_other.asciidoc @@ -2,77 +2,77 @@ === Java Virtual Machine You should always run the most recent version of the Java Virtual Machine (JVM), -unless otherwise stated on the Elasticsearch website.((("deployment", "Java Virtual Machine (JVM)")))((("JVM (Java Virtual Machine)")))((("Java Virtual Machine", see="JVM"))) Elasticsearch, and in -particular Lucene, is a demanding piece of software. The unit and integration -tests from Lucene often expose bugs in the JVM itself. These bugs range from -mild annoyances to serious segfaults, so it is best to use the latest version -of the JVM where possible. - -Java 8 is preferred over Java 7. Java 6 is no longer supported. Either Oracle or OpenJDK are acceptable. They are comparable in performance and stability. - -If your application is written in Java and you are using the transport client -or node client, make sure the JVM running your application is identical to the -server JVM. In few locations in Elasticsearch, Java's native serialization -is used (IP addresses, exceptions, and so forth). Unfortunately, Oracle has been known to -change the serialization format between minor releases, leading to strange errors. -This happens rarely, but it is best practice to keep the JVM versions identical -between client and server. +unless otherwise stated on the Elasticsearch website. Elasticsearch, and in +particular Lucene, is a demanding piece of software. The unit and integration +tests from Lucene often expose bugs in the JVM itself. These bugs range from +mild annoyances to serious segfaults, so it is best to use the latest version of +the JVM where possible. + +Java 8 is preferred over Java 7. Java 6 is no longer supported. Either Oracle or +OpenJDK are acceptable. They are comparable in performance and stability. + +If your application is written in Java and you are using the transport client or +node client, make sure the JVM running your application is identical to the +server JVM. In few locations in Elasticsearch, Java's native serialization is +used (IP addresses, exceptions, and so forth). Unfortunately, Oracle has been +known to change the serialization format between minor releases, leading to +strange errors. This happens rarely, but it is best practice to keep the JVM +versions identical between client and server. .Please Do Not Tweak JVM Settings **** -The JVM exposes dozens (hundreds even!) of settings, parameters, and configurations.((("JVM (Java Virtual Machine)", "avoiding custom configuration"))) -They allow you to tweak and tune almost every aspect of the JVM. - -When a knob is encountered, it is human nature to want to turn it. 
We implore -you to squash this desire and _not_ use custom JVM settings. Elasticsearch is -a complex piece of software, and the current JVM settings have been tuned -over years of real-world usage. - -It is easy to start turning knobs, producing opaque effects that are hard to measure, -and eventually detune your cluster into a slow, unstable mess. When debugging -clusters, the first step is often to remove all custom configurations. About -half the time, this alone restores stability and performance. +The JVM exposes dozens (hundreds even!) of settings, parameters, and +configurations. They allow you to tweak and tune almost every aspect of the JVM. + +When a knob is encountered, it is human nature to want to turn it. We implore +you to squash this desire and _not_ use custom JVM settings. Elasticsearch is a +complex piece of software, and the current JVM settings have been tuned over +years of real-world usage. + +It is easy to start turning knobs, producing opaque effects that are hard to +measure, and eventually detune your cluster into a slow, unstable mess. When +debugging clusters, the first step is often to remove all custom configurations. +About half the time, this alone restores stability and performance. **** === Transport Client Versus Node Client -If you are using Java, you may wonder when to use the transport client versus the -node client.((("Java", "clients for Elasticsearch")))((("clients")))((("node client", "versus transport client")))((("transport client", "versus node client"))) As discussed at the beginning of the book, the transport client -acts as a communication layer between the cluster and your application. It knows -the API and can automatically round-robin between nodes, sniff the cluster for you, -and more. But it is _external_ to the cluster, similar to the REST clients. +If you are using Java, you may wonder when to use the transport client versus +the node client. As discussed at the beginning of the book, the transport client +acts as a communication layer between the cluster and your application. It knows +the API and can automatically round-robin between nodes, sniff the cluster for +you, and more. But it is _external_ to the cluster, similar to the REST clients. The node client, on the other hand, is actually a node within the cluster (but -does not hold data, and cannot become master). Because it is a node, it knows +does not hold data, and cannot become master). Because it is a node, it knows the entire cluster state (where all the nodes reside, which shards live in which nodes, and so forth). This means it can execute APIs with one less network hop. There are uses-cases for both clients: -- The transport client is ideal if you want to decouple your application from the -cluster. For example, if your application quickly creates and destroys -connections to the cluster, a transport client is much "lighter" than a node client, -since it is not part of a cluster. +- The transport client is ideal if you want to decouple your application from +the cluster. For example, if your application quickly creates and destroys +connections to the cluster, a transport client is much "lighter" than a node +client, since it is not part of a cluster. + Similarly, if you need to create thousands of connections, you don't want to -have thousands of node clients join the cluster. The TC will be a better choice. +have thousands of node clients join the cluster. The TC will be a better choice. 
- On the flipside, if you need only a few long-lived, persistent connection objects to the cluster, a node client can be a bit more efficient since it knows -the cluster layout. But it ties your application into the cluster, so it may +the cluster layout. But it ties your application into the cluster, so it may pose problems from a firewall perspective. === Configuration Management -If you use configuration management already (Puppet, Chef, Ansible), you can skip this tip.((("deployment", "configuration management")))((("configuration management"))) +If you use configuration management already (Puppet, Chef, Ansible), you can +skip this tip. -If you don't use configuration management tools yet, you should! Managing -a handful of servers by `parallel-ssh` may work now, but it will become a nightmare -as you grow your cluster. It is almost impossible to edit 30 configuration files -by hand without making a mistake. +If you don't use configuration management tools yet, you should! Managing a +handful of servers by `parallel-ssh` may work now, but it will become a +nightmare as you grow your cluster. It is almost impossible to edit 30 +configuration files by hand without making a mistake. Configuration management tools help make your cluster consistent by automating -the process of config changes. It may take a little time to set up and learn, +the process of config changes. It may take a little time to set up and learn, but it will pay itself off handsomely over time. - - diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 94ea5404b..1ec70c372 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -1,34 +1,35 @@ [[important-configuration-changes]] === Important Configuration Changes -Elasticsearch ships with _very good_ defaults,((("deployment", "configuration changes, important")))((("configuration changes, important"))) especially when it comes to performance- -related settings and options. When in doubt, just leave -the settings alone. We have witnessed countless dozens of clusters ruined -by errant settings because the administrator thought he could turn a knob -and gain 100-fold improvement. +Elasticsearch ships with _very good_ defaults, especially when it comes to +performance- related settings and options. When in doubt, just leave the +settings alone. We have witnessed countless dozens of clusters ruined by errant +settings because the administrator thought he could turn a knob and gain +100-fold improvement. [NOTE] ==== -Please read this entire section! All configurations presented are equally -important, and are not listed in any particular order. Please read -through all configuration options and apply them to your cluster. +Please read this entire section! All configurations presented are equally +important, and are not listed in any particular order. Please read through all +configuration options and apply them to your cluster. ==== -Other databases may require tuning, but by and large, Elasticsearch does not. -If you are hitting performance problems, the solution is usually better data -layout or more nodes. There are very few "magic knobs" in Elasticsearch. -If there were, we'd have turned them already! +Other databases may require tuning, but by and large, Elasticsearch does not. If +you are hitting performance problems, the solution is usually better data layout +or more nodes. There are very few "magic knobs" in Elasticsearch. If there were, +we'd have turned them already! 
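
Before changing anything, it can be useful to see whether any settings have
already been overridden on a running cluster. This is a small illustrative
check, not part of the original text; it assumes a node reachable on
`localhost:9200`:

[source,bash]
----
# Show any persistent or transient overrides currently applied to the cluster.
# Empty "persistent" and "transient" objects mean you are running on defaults.
curl -s 'localhost:9200/_cluster/settings?pretty'
----
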
-With that said, there are some _logistical_ configurations that should be changed -for production. These changes are necessary either to make your life easier, or because -there is no way to set a good default (because it depends on your cluster layout). +With that said, there are some _logistical_ configurations that should be +changed for production. These changes are necessary either to make your life +easier, or because there is no way to set a good default (because it depends on +your cluster layout). ==== Assign Names -Elasticseach by default starts a cluster named `elasticsearch`. ((("configuration changes, important", "assigning names"))) It is wise -to rename your production cluster to something else, simply to prevent accidents -whereby someone's laptop joins the cluster. A simple change to `elasticsearch_production` -can save a lot of heartache. +Elasticseach by default starts a cluster named `elasticsearch`. It is wise to +rename your production cluster to something else, simply to prevent accidents +whereby someone's laptop joins the cluster. A simple change to +`elasticsearch_production` can save a lot of heartache. This can be changed in your `elasticsearch.yml` file: @@ -38,16 +39,18 @@ cluster.name: elasticsearch_production ---- Similarly, it is wise to change the names of your nodes. As you've probably -noticed by now, Elasticsearch assigns a random Marvel superhero name -to your nodes at startup. This is cute in development--but less cute when it is -3a.m. and you are trying to remember which physical machine was Tagak the Leopard Lord. +noticed by now, Elasticsearch assigns a random Marvel superhero name to your +nodes at startup. This is cute in development--but less cute when it is 3a.m. +and you are trying to remember which physical machine was Tagak the Leopard +Lord. More important, since these names are generated on startup, each time you -restart your node, it will get a new name. This can make logs confusing, -since the names of all the nodes are constantly changing. +restart your node, it will get a new name. This can make logs confusing, since +the names of all the nodes are constantly changing. Boring as it might be, we recommend you give each node a name that makes sense -to you--a plain, descriptive name. This is also configured in your `elasticsearch.yml`: +to you--a plain, descriptive name. This is also configured in your +`elasticsearch.yml`: [source,yaml] ---- @@ -57,15 +60,16 @@ node.name: elasticsearch_005_data ==== Paths -By default, Elasticsearch will place the plug-ins,((("configuration changes, important", "paths"))) -((("paths"))) logs, and--most important--your data in the installation directory. This can lead to -unfortunate accidents, whereby the installation directory is accidentally overwritten -by a new installation of Elasticsearch. If you aren't careful, you can erase all your data. +By default, Elasticsearch will place the plug-ins, logs, and--most +important--your data in the installation directory. This can lead to unfortunate +accidents, whereby the installation directory is accidentally overwritten by a +new installation of Elasticsearch. If you aren't careful, you can erase all your +data. Don't laugh--we've seen it happen more than a few times. The best thing to do is relocate your data directory outside the installation -location. You can optionally move your plug-in and log directories as well. +location. You can optionally move your plug-in and log directories as well. 
This can be changed as follows: @@ -79,62 +83,66 @@ path.logs: /path/to/logs # Path to where plugins are installed: path.plugins: /path/to/plugins ---- -<1> Notice that you can specify more than one directory for data by using comma-separated lists. +<1> Notice that you can specify more than one directory for data by using +comma-separated lists. -Data can be saved to multiple directories, and if each directory -is mounted on a different hard drive, this is a simple and effective way to -set up a software RAID 0. Elasticsearch will automatically stripe -data between the different directories, boosting performance. +Data can be saved to multiple directories, and if each directory is mounted on a +different hard drive, this is a simple and effective way to set up a software +RAID 0. Elasticsearch will automatically stripe data between the different +directories, boosting performance. .Multiple data path safety and performance [WARNING] ==================== Like any RAID 0 configuration, only a single copy of your data is saved to the -hard drives. If you lose a hard drive, you are _guaranteed_ to lose a portion -of your data on that machine. With luck you'll have replicas elsewhere in the -cluster which can recover the data, and/or a recent <>. +hard drives. If you lose a hard drive, you are _guaranteed_ to lose a portion of +your data on that machine. With luck you'll have replicas elsewhere in the +cluster which can recover the data, and/or a recent <>. Elasticsearch attempts to minimize the extent of data loss by striping entire -shards to a drive. That means that `Shard 0` will be placed entirely on a single +shards to a drive. That means that `Shard 0` will be placed entirely on a single drive. Elasticsearch will not stripe a shard across multiple drives, since the loss of one drive would corrupt the entire shard. -This has ramifications for performance: if you are adding multiple drives -to improve the performance of a single index, it is unlikely to help since -most nodes will only have one shard, and thus one active drive. Multiple data -paths only helps if you have many indices/shards on a single node. +This has ramifications for performance: if you are adding multiple drives to +improve the performance of a single index, it is unlikely to help since most +nodes will only have one shard, and thus one active drive. Multiple data paths +only helps if you have many indices/shards on a single node. Multiple data paths is a nice convenience feature, but at the end of the day, -Elasticsearch is not a software RAID package. If you need more advanced configuration, -robustness and flexibility, we encourage you to use actual software RAID packages -instead of the multiple data path feature. +Elasticsearch is not a software RAID package. If you need more advanced +configuration, robustness and flexibility, we encourage you to use actual +software RAID packages instead of the multiple data path feature. ==================== ==== Minimum Master Nodes -The `minimum_master_nodes` setting is _extremely_ important to the -stability of your cluster.((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting"))) This setting helps prevent _split brains_, the existence of two masters in a single cluster. +The `minimum_master_nodes` setting is _extremely_ important to the stability of +your cluster. This setting helps prevent _split brains_, the existence of two +masters in a single cluster. 
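
A quick way to spot this condition is to ask several nodes directly who they
each believe the master is. The following is only a sketch (the hostnames are
placeholders and the check itself is not from the original text), using the
`_cat/master` endpoint:

[source,bash]
----
# Ask two different nodes which node they consider the current master.
# If the answers differ, the cluster has most likely split.
curl -s 'node1.example.com:9200/_cat/master?v'
curl -s 'node2.example.com:9200/_cat/master?v'
----
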
-When you have a split brain, your cluster is at danger of losing data. Because -the master is considered the supreme ruler of the cluster, it decides -when new indices can be created, how shards are moved, and so forth. If you have _two_ -masters, data integrity becomes perilous, since you have two nodes -that think they are in charge. +When you have a split brain, your cluster is at danger of losing data. Because +the master is considered the supreme ruler of the cluster, it decides when new +indices can be created, how shards are moved, and so forth. If you have _two_ +masters, data integrity becomes perilous, since you have two nodes that think +they are in charge. This setting tells Elasticsearch to not elect a master unless there are enough -master-eligible nodes available. Only then will an election take place. +master-eligible nodes available. Only then will an election take place. -This setting should always be configured to a quorum (majority) of your master-eligible nodes.((("quorum"))) A quorum is `(number of master-eligible nodes / 2) + 1`. +This setting should always be configured to a quorum (majority) of your +master-eligible nodes. A quorum is `(number of master-eligible nodes / 2) + 1`. Here are some examples: - If you have ten regular nodes (can hold data, can become master), a quorum is `6`. -- If you have three dedicated master nodes and a hundred data nodes, the quorum is `2`, -since you need to count only nodes that are master eligible. -- If you have two regular nodes, you are in a conundrum. A quorum would be -`2`, but this means a loss of one node will make your cluster inoperable. A -setting of `1` will allow your cluster to function, but doesn't protect against -split brain. It is best to have a minimum of three nodes in situations like this. +- If you have three dedicated master nodes and a hundred data nodes, the quorum +is `2`, since you need to count only nodes that are master eligible. +- If you have two regular nodes, you are in a conundrum. A quorum would be `2`, +but this means a loss of one node will make your cluster inoperable. A setting +of `1` will allow your cluster to function, but doesn't protect against split +brain. It is best to have a minimum of three nodes in situations like this. This setting can be configured in your `elasticsearch.yml` file: @@ -144,12 +152,12 @@ discovery.zen.minimum_master_nodes: 2 ---- But because Elasticsearch clusters are dynamic, you could easily add or remove -nodes that will change the quorum. It would be extremely irritating if you had +nodes that will change the quorum. It would be extremely irritating if you had to push new configurations to each node and restart your whole cluster just to change the setting. For this reason, `minimum_master_nodes` (and other settings) can be configured -via a dynamic API call. You can change the setting while your cluster is online: +via a dynamic API call. You can change the setting while your cluster is online: [source,js] ---- @@ -161,39 +169,38 @@ PUT /_cluster/settings } ---- -This will become a persistent setting that takes precedence over whatever is -in the static configuration. You should modify this setting whenever you add or +This will become a persistent setting that takes precedence over whatever is in +the static configuration. You should modify this setting whenever you add or remove master-eligible nodes. 
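
As a convenience, you can derive the quorum from a live cluster by counting the
master-eligible nodes. This is only a back-of-the-envelope sketch, not from the
original text; it assumes a node on `localhost:9200` and the `master` column of
`_cat/nodes`, where `-` marks nodes that cannot become master:

[source,bash]
----
# Count master-eligible nodes (anything other than "-" in the master column)
# and print the quorum: (number of master-eligible nodes / 2) + 1
eligible=$(curl -s 'localhost:9200/_cat/nodes?h=master' | awk '$1 != "-"' | wc -l)
echo "discovery.zen.minimum_master_nodes should be $(( eligible / 2 + 1 ))"
----

The formula above remains the authoritative rule; the snippet simply saves you
from counting nodes by hand.
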
==== Recovery Settings -Several settings affect the behavior of shard recovery when -your cluster restarts.((("recovery settings")))((("configuration changes, important", "recovery settings"))) First, we need to understand what happens if nothing is -configured. +Several settings affect the behavior of shard recovery when your cluster +restarts. First, we need to understand what happens if nothing is configured. Imagine you have ten nodes, and each node holds a single shard--either a primary -or a replica--in a 5 primary / 1 replica index. You take your -entire cluster offline for maintenance (installing new drives, for example). When you -restart your cluster, it just so happens that five nodes come online before -the other five. - -Maybe the switch to the other five is being flaky, and they didn't -receive the restart command right away. Whatever the reason, you have five nodes -online. These five nodes will gossip with each other, elect a master, and form a -cluster. They notice that data is no longer evenly distributed, since five -nodes are missing from the cluster, and immediately start replicating new -shards between each other. - -Finally, your other five nodes turn on and join the cluster. These nodes see +or a replica--in a 5 primary / 1 replica index. You take your entire cluster +offline for maintenance (installing new drives, for example). When you restart +your cluster, it just so happens that five nodes come online before the other +five. + +Maybe the switch to the other five is being flaky, and they didn't receive the +restart command right away. Whatever the reason, you have five nodes online. +These five nodes will gossip with each other, elect a master, and form a +cluster. They notice that data is no longer evenly distributed, since five nodes +are missing from the cluster, and immediately start replicating new shards +between each other. + +Finally, your other five nodes turn on and join the cluster. These nodes see that _their_ data is being replicated to other nodes, so they delete their local -data (since it is now redundant, and may be outdated). Then the cluster starts +data (since it is now redundant, and may be outdated). Then the cluster starts to rebalance even more, since the cluster size just went from five to ten. During this whole process, your nodes are thrashing the disk and network, moving -data around--for no good reason. For large clusters with terabytes of data, -this useless shuffling of data can take a _really long time_. If all the nodes -had simply waited for the cluster to come online, all the data would have been -local and nothing would need to move. +data around--for no good reason. For large clusters with terabytes of data, this +useless shuffling of data can take a _really long time_. If all the nodes had +simply waited for the cluster to come online, all the data would have been local +and nothing would need to move. Now that we know the problem, we can configure a few settings to alleviate it. First, we need to give Elasticsearch a hard limit: @@ -203,11 +210,11 @@ First, we need to give Elasticsearch a hard limit: gateway.recover_after_nodes: 8 ---- -This will prevent Elasticsearch from starting a recovery until at least eight (data or master) nodes -are present. The value for this setting is a matter of personal preference: how -many nodes do you want present before you consider your cluster functional? -In this case, we are setting it to `8`, which means the cluster is inoperable -unless there are at least eight nodes. 
+This will prevent Elasticsearch from starting a recovery until at least eight +(data or master) nodes are present. The value for this setting is a matter of +personal preference: how many nodes do you want present before you consider your +cluster functional? In this case, we are setting it to `8`, which means the +cluster is inoperable unless there are at least eight nodes. Then we tell Elasticsearch how many nodes _should_ be in the cluster, and how long we want to wait for all those nodes: @@ -225,12 +232,12 @@ What this means is that Elasticsearch will do the following: whichever comes first. These three settings allow you to avoid the excessive shard swapping that can -occur on cluster restarts. It can literally make recovery take seconds instead +occur on cluster restarts. It can literally make recovery take seconds instead of hours. -NOTE: These settings can only be set in the `config/elasticsearch.yml` file or on -the command line (they are not dynamically updatable) and they are only relevant -during a full cluster restart. +NOTE: These settings can only be set in the `config/elasticsearch.yml` file or +on the command line (they are not dynamically updatable) and they are only +relevant during a full cluster restart. [[unicast]] ==== Prefer Unicast over Multicast @@ -239,23 +246,24 @@ Elasticsearch is configured to use unicast discovery out of the box to prevent nodes from accidentally joining a cluster. Only nodes running on the same machine will automatically form cluster. -While multicast is still https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[provided -as a plugin], it should never be used in production. The -last thing you want is for nodes to accidentally join your production network, simply -because they received an errant multicast ping. There is nothing wrong with -multicast _per se_. Multicast simply leads to silly problems, and can be a bit -more fragile (for example, a network engineer fiddles with the network without telling +While multicast is still +https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[provided +as a plugin], it should never be used in production. The last thing you want is +for nodes to accidentally join your production network, simply because they +received an errant multicast ping. There is nothing wrong with multicast _per +se_. Multicast simply leads to silly problems, and can be a bit more fragile +(for example, a network engineer fiddles with the network without telling you--and all of a sudden nodes can't find each other anymore). -To use unicast, you provide Elasticsearch a list of nodes that it should try to contact. -When a node contacts a member of the unicast list, it receives a full cluster -state that lists all of the nodes in the cluster. It then contacts -the master and joins the cluster. +To use unicast, you provide Elasticsearch a list of nodes that it should try to +contact. When a node contacts a member of the unicast list, it receives a full +cluster state that lists all of the nodes in the cluster. It then contacts the +master and joins the cluster. -This means your unicast list does not need to include all of the nodes in your cluster. -It just needs enough nodes that a new node can find someone to talk to. If you -use dedicated masters, just list your three dedicated masters and call it a day. -This setting is configured in `elasticsearch.yml`: +This means your unicast list does not need to include all of the nodes in your +cluster. 
It just needs enough nodes that a new node can find someone to talk to. +If you use dedicated masters, just list your three dedicated masters and call it +a day. This setting is configured in `elasticsearch.yml`: [source,yaml] ---- diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc index 37506390f..1ca28a39c 100644 --- a/510_Deployment/45_dont_touch.asciidoc +++ b/510_Deployment/45_dont_touch.asciidoc @@ -2,83 +2,79 @@ === Don't Touch These Settings! There are a few hotspots in Elasticsearch that people just can't seem to avoid -tweaking. ((("deployment", "settings to leave unaltered"))) We understand: knobs just beg to be turned. But of all the knobs to turn, these you should _really_ leave alone. They are -often abused and will contribute to terrible stability or terrible performance. -Or both. +tweaking. We understand: knobs just beg to be turned. But of all the knobs to +turn, these you should _really_ leave alone. They are often abused and will +contribute to terrible stability or terrible performance. Or both. ==== Garbage Collector As briefly introduced in <>, the JVM uses a garbage -collector to free unused memory.((("garbage collector"))) This tip is really an extension of the last tip, -but deserves its own section for emphasis: +collector to free unused memory. This tip is really an extension of the last +tip, but deserves its own section for emphasis: Do not change the default garbage collector! -The default GC for Elasticsearch is Concurrent-Mark and Sweep (CMS).((("Concurrent-Mark and Sweep (CMS) garbage collector"))) This GC +The default GC for Elasticsearch is Concurrent-Mark and Sweep (CMS). This GC runs concurrently with the execution of the application so that it can minimize -pauses. It does, however, have two stop-the-world phases. It also has trouble +pauses. It does, however, have two stop-the-world phases. It also has trouble collecting large heaps. -Despite these downsides, it is currently the best GC for low-latency server software -like Elasticsearch. The official recommendation is to use CMS. +Despite these downsides, it is currently the best GC for low-latency server +software like Elasticsearch. The official recommendation is to use CMS. -There is a newer GC called the Garbage First GC (G1GC). ((("Garbage First GC (G1GC)"))) This newer GC is designed -to minimize pausing even more than CMS, and operate on large heaps. It works -by dividing the heap into regions and predicting which regions contain the most -reclaimable space. By collecting those regions first (_garbage first_), it can -minimize pauses and operate on very large heaps. +There is a newer GC called the Garbage First GC (G1GC). This newer GC is +designed to minimize pausing even more than CMS, and operate on large heaps. It +works by dividing the heap into regions and predicting which regions contain the +most reclaimable space. By collecting those regions first (_garbage first_), it +can minimize pauses and operate on very large heaps. -Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found routinely. -These bugs are usually of the segfault variety, and will cause hard crashes. -The Lucene test suite is brutal on GC algorithms, and it seems that G1GC hasn't -had the kinks worked out yet. +Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found +routinely. These bugs are usually of the segfault variety, and will cause hard +crashes. 
The Lucene test suite is brutal on GC algorithms, and it seems that +G1GC hasn't had the kinks worked out yet. We would like to recommend G1GC someday, but for now, it is simply not stable enough to meet the demands of Elasticsearch and Lucene. ==== Threadpools -Everyone _loves_ to tweak threadpools.((("threadpools"))) For whatever reason, it seems people -cannot resist increasing thread counts. Indexing a lot? More threads! Searching -a lot? More threads! Node idling 95% of the time? More threads! +Everyone _loves_ to tweak threadpools. For whatever reason, it seems people +cannot resist increasing thread counts. Indexing a lot? More threads! Searching +a lot? More threads! Node idling 95% of the time? More threads! -The default threadpool settings in Elasticsearch are very sensible. For all +The default threadpool settings in Elasticsearch are very sensible. For all threadpools (except `search`) the threadcount is set to the number of CPU cores. -If you have eight cores, you can be running only eight threads simultaneously. It makes -sense to assign only eight threads to any particular threadpool. +If you have eight cores, you can be running only eight threads simultaneously. +It makes sense to assign only eight threads to any particular threadpool. -Search gets a larger threadpool, and is configured to `int((# of cores * 3) / 2) + 1`. +Search gets a larger threadpool, and is configured to `int((# of cores * 3) / 2) + 1`. -You might argue that some threads can block (such as on a disk I/O operation), -which is why you need more threads. This is not a problem in Elasticsearch: -much of the disk I/O is handled by threads managed by Lucene, not Elasticsearch. +You might argue that some threads can block (such as on a disk I/O operation), +which is why you need more threads. This is not a problem in Elasticsearch: much +of the disk I/O is handled by threads managed by Lucene, not Elasticsearch. -Furthermore, threadpools cooperate by passing work between each other. You don't +Furthermore, threadpools cooperate by passing work between each other. You don't need to worry about a networking thread blocking because it is waiting on a disk -write. The networking thread will have long since handed off that work unit to +write. The networking thread will have long since handed off that work unit to another threadpool and gotten back to networking. -Finally, the compute capacity of your process is finite. Having more threads just forces -the processor to switch thread contexts. A processor can run only one thread -at a time, so when it needs to switch to a different thread, it stores the current -state (registers, and so forth) and loads another thread. If you are lucky, the switch -will happen on the same core. If you are unlucky, the switch may migrate to a -different core and require transport on an inter-core communication bus. +Finally, the compute capacity of your process is finite. Having more threads +just forces the processor to switch thread contexts. A processor can run only +one thread at a time, so when it needs to switch to a different thread, it +stores the current state (registers, and so forth) and loads another thread. If +you are lucky, the switch will happen on the same core. If you are unlucky, the +switch may migrate to a different core and require transport on an inter-core +communication bus. -This context switching eats up cycles simply by doing administrative housekeeping; estimates can peg it as high as 30μs on modern CPUs. 
So unless the thread -will be blocked for longer than 30μs, it is highly likely that that time would -have been better spent just processing and finishing early. +This context switching eats up cycles simply by doing administrative +housekeeping; estimates can peg it as high as 30μs on modern CPUs. So unless the +thread will be blocked for longer than 30μs, it is highly likely that that time +would have been better spent just processing and finishing early. -People routinely set threadpools to silly values. On eight core machines, we have -run across configs with 60, 100, or even 1000 threads. These settings will simply -thrash the CPU more than getting real work done. +People routinely set threadpools to silly values. On eight core machines, we +have run across configs with 60, 100, or even 1000 threads. These settings will +simply thrash the CPU more than getting real work done. -So. Next time you want to tweak a threadpool, please don't. And if you +So. Next time you want to tweak a threadpool, please don't. And if you _absolutely cannot resist_, please keep your core count in mind and perhaps set -the count to double. More than that is just a waste. - - - - - - +the count to double. More than that is just a waste. diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index 7bac00cb6..976a86e2c 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -1,22 +1,22 @@ [[heap-sizing]] === Heap: Sizing and Swapping -The default installation of Elasticsearch is configured with a 1 GB heap. ((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For -just about every deployment, this number is usually too small. If you are using the -default heap values, your cluster is probably configured incorrectly. +The default installation of Elasticsearch is configured with a 1 GB heap. For +just about every deployment, this number is usually too small. If you are using +the default heap values, your cluster is probably configured incorrectly. -There are two ways to change the heap size in Elasticsearch. The easiest is to -set an environment variable called `ES_HEAP_SIZE`.((("ES_HEAP_SIZE environment variable"))) When the server process -starts, it will read this environment variable and set the heap accordingly. -As an example, you can set it via the command line as follows: +There are two ways to change the heap size in Elasticsearch. The easiest is to +set an environment variable called `ES_HEAP_SIZE`. When the server process +starts, it will read this environment variable and set the heap accordingly. As +an example, you can set it via the command line as follows: [source,bash] ---- export ES_HEAP_SIZE=10g ---- -Alternatively, you can pass in the heap size via a command-line argument when starting -the process, if that is easier for your setup: +Alternatively, you can pass in the heap size via a command-line argument when +starting the process, if that is easier for your setup: [source,bash] ---- @@ -25,76 +25,78 @@ the process, if that is easier for your setup: <1> Ensure that the min (`Xms`) and max (`Xmx`) sizes are the same to prevent the heap from resizing at runtime, a very costly process. -Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over setting -explicit `-Xmx` and `-Xms` values. +Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over +setting explicit `-Xmx` and `-Xms` values. 
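
As a minimal sketch (assuming a node reachable on `localhost:9200` and the
`heapMax` column of the `cat` API shown earlier), you could set a 10 GB heap and
then confirm what the JVM actually received after a restart:

[source,bash]
----
# Applies to Elasticsearch processes started from this shell; restart the node afterwards
export ES_HEAP_SIZE=10g

# Once the node is back up, check the heap it is really running with
curl -s 'localhost:9200/_cat/nodes?v&h=name,heapMax'
----
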
==== Give (less than) Half Your Memory to Lucene -A common problem is configuring a heap that is _too_ large. ((("heap", "sizing and setting", "giving half your memory to Lucene"))) You have a 64 GB -machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More +A common problem is configuring a heap that is _too_ large. You have a 64 GB +machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More is better! -Heap is definitely important to Elasticsearch. It is used by many in-memory data -structures to provide fast operation. But with that said, there is another major +Heap is definitely important to Elasticsearch. It is used by many in-memory data +structures to provide fast operation. But with that said, there is another major user of memory that is _off heap_: Lucene. -Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for"))) -Lucene segments are stored in individual files. Because segments are immutable, -these files never change. This makes them very cache friendly, and the underlying -OS will happily keep hot segments resident in memory for faster access. These segments -include both the inverted index (for fulltext search) and doc values (for aggregations). +Lucene is designed to leverage the underlying OS for caching in-memory data +structures. Lucene segments are stored in individual files. Because segments are +immutable, these files never change. This makes them very cache friendly, and +the underlying OS will happily keep hot segments resident in memory for faster +access. These segments include both the inverted index (for fulltext search) and +doc values (for aggregations). -Lucene's performance relies on this interaction with the OS. But if you give all -available memory to Elasticsearch's heap, there won't be any left over for Lucene. -This can seriously impact the performance. +Lucene's performance relies on this interaction with the OS. But if you give all +available memory to Elasticsearch's heap, there won't be any left over for +Lucene. This can seriously impact the performance. -The standard recommendation is to give 50% of the available memory to Elasticsearch -heap, while leaving the other 50% free. It won't go unused; Lucene will happily -gobble up whatever is left over. +The standard recommendation is to give 50% of the available memory to +Elasticsearch heap, while leaving the other 50% free. It won't go unused; Lucene +will happily gobble up whatever is left over. If you are not aggregating on analyzed string fields (e.g. you won't be needing <>) you can consider lowering the heap even -more. The smaller you can make the heap, the better performance you can expect +more. The smaller you can make the heap, the better performance you can expect from both Elasticsearch (faster GCs) and Lucene (more memory for caching). [[compressed_oops]] ==== Don't Cross 32 GB! -There is another reason to not allocate enormous heaps to Elasticsearch. As it turns((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))) -out, the HotSpot JVM uses a trick to compress object pointers when heaps are less -than around 32 GB. +There is another reason to not allocate enormous heaps to Elasticsearch. As it +turns out, the HotSpot JVM uses a trick to compress object pointers when heaps +are less than around 32 GB. In Java, all objects are allocated on the heap and referenced by a pointer. 
-Ordinary object pointers (OOP) point at these objects, and are traditionally -the size of the CPU's native _word_: either 32 bits or 64 bits, depending on the -processor. The pointer references the exact byte location of the value. - -For 32-bit systems, this means the maximum heap size is 4 GB. For 64-bit systems, -the heap size can get much larger, but the overhead of 64-bit pointers means there -is more wasted space simply because the pointer is larger. And worse than wasted -space, the larger pointers eat up more bandwidth when moving values between -main memory and various caches (LLC, L1, and so forth). - -Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed object pointers"))) -to get around this problem. Instead of pointing at exact byte locations in -memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can -reference four billion _objects_, rather than four billion bytes. Ultimately, this -means the heap can grow to around 32 GB of physical size while still using a 32-bit -pointer. +Ordinary object pointers (OOP) point at these objects, and are traditionally the +size of the CPU's native _word_: either 32 bits or 64 bits, depending on the +processor. The pointer references the exact byte location of the value. + +For 32-bit systems, this means the maximum heap size is 4 GB. For 64-bit +systems, the heap size can get much larger, but the overhead of 64-bit pointers +means there is more wasted space simply because the pointer is larger. And worse +than wasted space, the larger pointers eat up more bandwidth when moving values +between main memory and various caches (LLC, L1, and so forth). + +Java uses a trick called +https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed +oops] to get around this problem. Instead of pointing at exact byte locations in +memory, the pointers reference _object offsets_. This means a 32-bit pointer can +reference four billion _objects_, rather than four billion bytes. Ultimately, +this means the heap can grow to around 32 GB of physical size while still using +a 32-bit pointer. Once you cross that magical ~32 GB boundary, the pointers switch back to -ordinary object pointers. The size of each pointer grows, more CPU-memory -bandwidth is used, and you effectively lose memory. In fact, it takes until around -40–50 GB of allocated heap before you have the same _effective_ memory of a -heap just under 32 GB using compressed oops. +ordinary object pointers. The size of each pointer grows, more CPU-memory +bandwidth is used, and you effectively lose memory. In fact, it takes until +around 40–50 GB of allocated heap before you have the same _effective_ +memory of a heap just under 32 GB using compressed oops. The moral of the story is this: even when you have memory to spare, try to avoid -crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and +crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and makes the GC struggle with large heaps. ==== Just how far under 32gb should I set the JVM? -Unfortunately, that depends. The exact cutoff varies by JVMs and platforms. -If you want to play it safe, setting the heap to `31gb` is likely safe. +Unfortunately, that depends. The exact cutoff varies by JVMs and platforms. If +you want to play it safe, setting the heap to `31gb` is likely safe. 
Alternatively, you can verify the cutoff point for the HotSpot JVM by adding `-XX:+PrintFlagsFinal` to your JVM options and checking that the value of the UseCompressedOops flag is true. This will let you find the exact cutoff for your @@ -126,62 +128,67 @@ The moral of the story is that the exact cutoff to leverage compressed oops varies from JVM to JVM, so take caution when taking examples from elsewhere and be sure to check your system with your configuration and JVM. -Beginning with Elasticsearch v2.2.0, the startup log will actually tell you if your -JVM is using compressed OOPs or not. You'll see a log message like: +Beginning with Elasticsearch v2.2.0, the startup log will actually tell you if +your JVM is using compressed OOPs or not. You'll see a log message like: [source, bash] ---- [2015-12-16 13:53:33,417][INFO ][env] [Illyana Rasputin] heap size [989.8mb], compressed ordinary object pointers [true] ---- -Which indicates that compressed object pointers are being used. If they are not, +Which indicates that compressed object pointers are being used. If they are not, the message will say `[false]`. [role="pagebreak-before"] .I Have a Machine with 1 TB RAM! **** -The 32 GB line is fairly important. So what do you do when your machine has a lot -of memory? It is becoming increasingly common to see super-servers with 512–768 GB -of RAM. +The 32 GB line is fairly important. So what do you do when your machine has a +lot of memory? It is becoming increasingly common to see super-servers with +512–768 GB of RAM. First, we would recommend avoiding such large machines (see <>). But if you already have the machines, you have three practical options: -- Are you doing mostly full-text search? Consider giving 4-32 GB to Elasticsearch -and letting Lucene use the rest of memory via the OS filesystem cache. All that -memory will cache segments and lead to blisteringly fast full-text search. - -- Are you doing a lot of sorting/aggregations? Are most of your aggregations on numerics, -dates, geo_points and `not_analyzed` strings? You're in luck, your aggregations will be done on -memory-friendly doc values! Give Elasticsearch somewhere from 4-32 GB of memory and leave the -rest for the OS to cache doc values in memory. - -- Are you doing a lot of sorting/aggregations on analyzed strings (e.g. for word-tags, -or SigTerms, etc)? Unfortunately that means you'll need fielddata, which means you -need heap space. Instead of one node with a huge amount of RAM, consider running two or -more nodes on a single machine. Still adhere to the 50% rule, though. +- Are you doing mostly full-text search? Consider giving 4-32 GB to +Elasticsearch and letting Lucene use the rest of memory via the OS filesystem +cache. All that memory will cache segments and lead to blisteringly fast +full-text search. + +- Are you doing a lot of sorting/aggregations? Are most of your aggregations on +numerics, dates, geo_points and `not_analyzed` strings? You're in luck, your +aggregations will be done on memory-friendly doc values! Give Elasticsearch +somewhere from 4-32 GB of memory and leave the rest for the OS to cache doc +values in memory. + +- Are you doing a lot of sorting/aggregations on analyzed strings (e.g. for +word-tags, or SigTerms, etc)? Unfortunately that means you'll need fielddata, +which means you need heap space. Instead of one node with a huge amount of RAM, +consider running two or more nodes on a single machine. Still adhere to the 50% +rule, though. 
+ -So if your machine has 128 GB of RAM, run two nodes each with just under 32 GB. This means that less -than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene. +So if your machine has 128 GB of RAM, run two nodes each with just under 32 GB. +This means that less than 64 GB will be used for heaps, and more than 64 GB will +be left over for Lucene. + -If you choose this option, set `cluster.routing.allocation.same_shard.host: true` -in your config. This will prevent a primary and a replica shard from colocating -to the same physical machine (since this would remove the benefits of replica high availability). +If you choose this option, set `cluster.routing.allocation.same_shard.host: +true` in your config. This will prevent a primary and a replica shard from +colocating to the same physical machine (since this would remove the benefits of +replica high availability). **** ==== Swapping Is the Death of Performance -It should be obvious,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance"))) but it bears spelling out clearly: swapping main memory -to disk will _crush_ server performance. Think about it: an in-memory operation -is one that needs to execute quickly. +It should be obvious, but it bears spelling out clearly: swapping main memory to +disk will _crush_ server performance. Think about it: an in-memory operation is +one that needs to execute quickly. If memory swaps to disk, a 100-microsecond operation becomes one that take 10 -milliseconds. Now repeat that increase in latency for all other 10us operations. +milliseconds. Now repeat that increase in latency for all other 10us operations. It isn't difficult to see why swapping is terrible for performance. -The best thing to do is disable swap completely on your system. This can be done +The best thing to do is disable swap completely on your system. This can be done temporarily: [source,bash] @@ -189,13 +196,13 @@ temporarily: sudo swapoff -a ---- -To disable it permanently, you'll likely need to edit your `/etc/fstab`. Consult +To disable it permanently, you'll likely need to edit your `/etc/fstab`. Consult the documentation for your OS. -If disabling swap completely is not an option, you can try to lower `swappiness`. -This value controls how aggressively the OS tries to swap memory. -This prevents swapping under normal circumstances, but still allows the OS to swap -under emergency memory situations. +If disabling swap completely is not an option, you can try to lower +`swappiness`. This value controls how aggressively the OS tries to swap memory. +This prevents swapping under normal circumstances, but still allows the OS to +swap under emergency memory situations. For most Linux systems, this is configured using the `sysctl` value: @@ -203,12 +210,12 @@ For most Linux systems, this is configured using the `sysctl` value: ---- vm.swappiness = 1 <1> ---- -<1> A `swappiness` of `1` is better than `0`, since on some kernel versions a `swappiness` -of `0` can invoke the OOM-killer. +<1> A `swappiness` of `1` is better than `0`, since on some kernel versions a +`swappiness` of `0` can invoke the OOM-killer. -Finally, if neither approach is possible, you should enable `mlockall`. - file. This allows the JVM to lock its memory and prevent -it from being swapped by the OS. In your `elasticsearch.yml`, set this: +Finally, if neither approach is possible, you should enable `mlockall`. file. 
+ This allows the JVM to lock its memory and prevent it from being swapped by + the OS. In your `elasticsearch.yml`, set this: [source,yaml] ---- diff --git a/510_Deployment/60_file_descriptors.asciidoc b/510_Deployment/60_file_descriptors.asciidoc index 41a675086..72570eaea 100644 --- a/510_Deployment/60_file_descriptors.asciidoc +++ b/510_Deployment/60_file_descriptors.asciidoc @@ -1,21 +1,21 @@ -=== File Descriptors and MMap +=== File Descriptors and MMap -Lucene uses a _very_ large number of files. ((("deployment", "file descriptors and MMap"))) At the same time, Elasticsearch -uses a large number of sockets to communicate between nodes and HTTP clients. -All of this requires available file descriptors.((("file descriptors"))) +Lucene uses a _very_ large number of files. At the same time, Elasticsearch uses +a large number of sockets to communicate between nodes and HTTP clients. All of +this requires available file descriptors. Sadly, many modern Linux distributions ship with a paltry 1,024 file descriptors -allowed per process. This is _far_ too low for even a small Elasticsearch -node, let alone one that is handling hundreds of indices. +allowed per process. This is _far_ too low for even a small Elasticsearch node, +let alone one that is handling hundreds of indices. You should increase your file descriptor count to something very large, such as -64,000. This process is irritatingly difficult and highly dependent on your -particular OS and distribution. Consult the documentation for your OS to determine -how best to change the allowed file descriptor count. +64,000. This process is irritatingly difficult and highly dependent on your +particular OS and distribution. Consult the documentation for your OS to +determine how best to change the allowed file descriptor count. -Once you think you've changed it, check Elasticsearch to make sure it really does -have enough file descriptors: +Once you think you've changed it, check Elasticsearch to make sure it really +does have enough file descriptors: [source,js] ---- @@ -47,20 +47,17 @@ have enough file descriptors: } } ---- -<1> The `max_file_descriptors` field shows the number of available descriptors that -the Elasticsearch process can access. +<1> The `max_file_descriptors` field shows the number of available descriptors +that the Elasticsearch process can access. -Elasticsearch also uses a mix of NioFS and MMapFS ((("MMapFS")))for the various files. Ensure -that you configure the maximum map count so that there is ample virtual memory available for -mmapped files. This can be set temporarily: +Elasticsearch also uses a mix of NioFS and MMapFS ((("MMapFS")))for the various +files. Ensure that you configure the maximum map count so that there is ample +virtual memory available for mmapped files. This can be set temporarily: [source,js] ---- sysctl -w vm.max_map_count=262144 ---- -Or you can set it permanently by modifying `vm.max_map_count` setting in your `/etc/sysctl.conf`. - - - - +Or you can set it permanently by modifying `vm.max_map_count` setting in your +`/etc/sysctl.conf`. diff --git a/510_Deployment/70_conclusion.asciidoc b/510_Deployment/70_conclusion.asciidoc index b217c993c..c96c78d4a 100644 --- a/510_Deployment/70_conclusion.asciidoc +++ b/510_Deployment/70_conclusion.asciidoc @@ -1,17 +1,17 @@ === Revisit This List Before Production -You are likely reading this section before you go into production. 
-The details covered in this chapter are good to be generally aware of, but it is -critical to revisit this entire list right before deploying to production. +You are likely reading this section before you go into production. The details +covered in this chapter are good to be generally aware of, but it is critical to +revisit this entire list right before deploying to production. Some of the topics will simply stop you cold (such as too few available file -descriptors). These are easy enough to debug because they are quickly apparent. +descriptors). These are easy enough to debug because they are quickly apparent. Other issues, such as split brains and memory settings, are visible only after -something bad happens. At that point, the resolution is often messy and tedious. +something bad happens. At that point, the resolution is often messy and tedious. -It is much better to proactively prevent these situations from occurring by configuring -your cluster appropriately _before_ disaster strikes. So if you are going to -dog-ear (or bookmark) one section from the entire book, this chapter would be -a good candidate. The week before deploying to production, simply flip through -the list presented here and check off all the recommendations. +It is much better to proactively prevent these situations from occurring by +configuring your cluster appropriately _before_ disaster strikes. So if you are +going to dog-ear (or bookmark) one section from the entire book, this chapter +would be a good candidate. The week before deploying to production, simply flip +through the list presented here and check off all the recommendations. diff --git a/520_Post_Deployment/10_dynamic_settings.asciidoc b/520_Post_Deployment/10_dynamic_settings.asciidoc index caffdf012..9a381db8c 100644 --- a/520_Post_Deployment/10_dynamic_settings.asciidoc +++ b/520_Post_Deployment/10_dynamic_settings.asciidoc @@ -2,19 +2,19 @@ === Changing Settings Dynamically Many settings in Elasticsearch are dynamic and can be modified through the API. -Configuration changes that force a node (or cluster) restart are strenuously avoided.((("post-deployment", "changing settings dynamically"))) -And while it's possible to make the changes through the static configs, we -recommend that you use the API instead. +Configuration changes that force a node (or cluster) restart are strenuously +avoided. And while it's possible to make the changes through the static configs, +we recommend that you use the API instead. The `cluster-update` API operates((("Cluster Update API"))) in two modes: -Transient:: - These changes are in effect until the cluster restarts. Once -a full cluster restart takes place, these settings are erased. +Transient:: + These changes are in effect until the cluster restarts. Once a full cluster +restart takes place, these settings are erased. Persistent:: - These changes are permanently in place unless explicitly changed. -They will survive full cluster restarts and override the static configuration files. + These changes are permanently in place unless explicitly changed. They will +survive full cluster restarts and override the static configuration files. Transient versus persistent settings are supplied in the JSON body: @@ -31,9 +31,7 @@ PUT /_cluster/settings } ---- <1> This persistent setting will survive full cluster restarts. -<2> This transient setting will be removed after the first full cluster -restart. +<2> This transient setting will be removed after the first full cluster restart. 
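If you ever need to confirm what is currently applied, the same endpoint also
answers a plain `GET`, returning the `persistent` and `transient` sections side
by side so you can see which of the two a given setting ended up in (an empty
object simply means nothing has been overridden there yet):

[source,js]
----
GET /_cluster/settings
----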
A complete list of settings that can be updated dynamically can be found in the {ref}/cluster-update-settings.html[online reference docs]. - diff --git a/520_Post_Deployment/20_logging.asciidoc b/520_Post_Deployment/20_logging.asciidoc index 18cd2ce03..dd0ef7c2c 100644 --- a/520_Post_Deployment/20_logging.asciidoc +++ b/520_Post_Deployment/20_logging.asciidoc @@ -1,19 +1,20 @@ [[logging]] === Logging -Elasticsearch emits a number of logs, which are placed in `ES_HOME/logs`. -The default logging level is `INFO`. ((("post-deployment", "logging")))((("logging", "Elasticsearch logging"))) It provides a moderate amount of information, +Elasticsearch emits a number of logs, which are placed in `ES_HOME/logs`. The +default logging level is `INFO`. It provides a moderate amount of information, but is designed to be rather light so that your logs are not enormous. When debugging problems, particularly problems with node discovery (since this -often depends on finicky network configurations), it can be helpful to bump -up the logging level to `DEBUG`. +often depends on finicky network configurations), it can be helpful to bump up +the logging level to `DEBUG`. You _could_ modify the `logging.yml` file and restart your nodes--but that is -both tedious and leads to unnecessary downtime. Instead, you can update logging -levels through the `cluster-settings` API((("Cluster Settings API, updating logging levels"))) that we just learned about. +both tedious and leads to unnecessary downtime. Instead, you can update logging +levels through the `cluster-settings` API that we just learned about. -To do so, take the logger you are interested in and prepend `logger.` to it. You can refer to the root logger as `logger._root`. +To do so, take the logger you are interested in and prepend `logger.` to it. You +can refer to the root logger as `logger._root`. Let's turn up the discovery logging: @@ -30,19 +31,19 @@ PUT /_cluster/settings While this setting is in effect, Elasticsearch will begin to emit `DEBUG`-level logs for the `discovery` module. -TIP: Avoid `TRACE`. It is extremely verbose, to the point where the logs -are no longer useful. +TIP: Avoid `TRACE`. It is extremely verbose, to the point where the logs are no +longer useful. [[slowlog]] ==== Slowlog -There is another log called the _slowlog_. The purpose of((("Slowlog"))) this log is to catch -queries and indexing requests that take over a certain threshold of time. -It is useful for hunting down user-generated queries that are particularly slow. +There is another log called the _slowlog_. The purpose of this log is to catch +queries and indexing requests that take over a certain threshold of time. It is +useful for hunting down user-generated queries that are particularly slow. -By default, the slowlog is not enabled. It can be enabled by defining the action -(query, fetch, or index), the level that you want the event logged at (`WARN`, `DEBUG`, -and so forth) and a time threshold. +By default, the slowlog is not enabled. It can be enabled by defining the action +(query, fetch, or index), the level that you want the event logged at (`WARN`, +`DEBUG`, and so forth) and a time threshold. This is an index-level setting, which means it is applied to individual indices: @@ -59,7 +60,7 @@ PUT /my_index/_settings <2> Emit a `DEBUG` log when fetches are slower than 500ms. <3> Emit an `INFO` log when indexing takes longer than 5s. -You can also define these thresholds in your `elasticsearch.yml` file. 
Indices +You can also define these thresholds in your `elasticsearch.yml` file. Indices that do not have a threshold set will inherit whatever is configured in the static config. @@ -78,5 +79,3 @@ PUT /_cluster/settings ---- <1> Set the search slowlog to `DEBUG` level. <2> Set the indexing slowlog to `WARN` level. - - diff --git a/520_Post_Deployment/30_indexing_perf.asciidoc b/520_Post_Deployment/30_indexing_perf.asciidoc index 7a15fb459..4cd6e85f7 100644 --- a/520_Post_Deployment/30_indexing_perf.asciidoc +++ b/520_Post_Deployment/30_indexing_perf.asciidoc @@ -1,15 +1,15 @@ [[indexing-performance]] === Indexing Performance Tips -If you are in an indexing-heavy environment,((("indexing", "performance tips")))((("post-deployment", "indexing performance tips"))) such as indexing infrastructure -logs, you may be willing to sacrifice some search performance for faster indexing -rates. In these scenarios, searches tend to be relatively rare and performed -by people internal to your organization. They are willing to wait several -seconds for a search, as opposed to a consumer facing a search that must +If you are in an indexing-heavy environment, such as indexing infrastructure +logs, you may be willing to sacrifice some search performance for faster +indexing rates. In these scenarios, searches tend to be relatively rare and +performed by people internal to your organization. They are willing to wait +several seconds for a search, as opposed to a consumer facing a search that must return in milliseconds. -Because of this unique position, certain trade-offs can be made -that will increase your indexing performance. +Because of this unique position, certain trade-offs can be made that will +increase your indexing performance. .These Tips Apply Only to Elasticsearch 1.3+ **** @@ -25,87 +25,95 @@ older versions because of the presence of bugs or performance defects. ==== Test Performance Scientifically Performance testing is always difficult, so try to be as scientific as possible -in your approach.((("performance testing")))((("indexing", "performance tips", "performance testing"))) Randomly fiddling with knobs and turning on ingestion is not -a good way to tune performance. If there are too many _causes_, it is impossible -to determine which one had the best _effect_. A reasonable approach to testing is as follows: +in your approach. Randomly fiddling with knobs and turning on ingestion is not a +good way to tune performance. If there are too many _causes_, it is impossible +to determine which one had the best _effect_. A reasonable approach to testing +is as follows: 1. Test performance on a single node, with a single shard and no replicas. 2. Record performance under 100% default settings so that you have a baseline to measure against. 3. Make sure performance tests run for a long time (30+ minutes) so you can -evaluate long-term performance, not short-term spikes or latencies. Some events +evaluate long-term performance, not short-term spikes or latencies. Some events (such as segment merging, and GCs) won't happen right away, so the performance profile can change over time. -4. Begin making single changes to the baseline defaults. Test these rigorously, -and if performance improvement is acceptable, keep the setting and move on to the -next one. +4. Begin making single changes to the baseline defaults. Test these rigorously, +and if performance improvement is acceptable, keep the setting and move on to +the next one. 
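For step 1 above, a throwaway baseline index with exactly one shard and no
replicas is easy to set up; the index name used here (`perf_test`) is just a
placeholder for whatever your ingestion tool writes to during the test:

[source,js]
----
PUT /perf_test
{
  "settings": {
    "number_of_shards":   1, <1>
    "number_of_replicas": 0  <2>
  }
}
----
<1> A single shard keeps the measurement confined to one node.
<2> No replicas, so the numbers reflect raw indexing work rather than
replication.

Deleting and recreating this index between runs keeps segment counts and merge
state starting from the same point each time.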
==== Using and Sizing Bulk Requests -This should be fairly obvious, but use bulk indexing requests for optimal performance.((("indexing", "performance tips", "bulk requests, using and sizing")))((("bulk API", "using and sizing bulk requests"))) -Bulk sizing is dependent on your data, analysis, and cluster configuration, but -a good starting point is 5–15 MB per bulk. Note that this is physical size. -Document count is not a good metric for bulk size. For example, if you are -indexing 1,000 documents per bulk, keep the following in mind: +This should be fairly obvious, but use bulk indexing requests for optimal +performance. Bulk sizing is dependent on your data, analysis, and cluster +configuration, but a good starting point is 5–15 MB per bulk. Note that +this is physical size. Document count is not a good metric for bulk size. For +example, if you are indexing 1,000 documents per bulk, keep the following in +mind: - 1,000 documents at 1 KB each is 1 MB. - 1,000 documents at 100 KB each is 100 MB. -Those are drastically different bulk sizes. Bulks need to be loaded into memory +Those are drastically different bulk sizes. Bulks need to be loaded into memory at the coordinating node, so it is the physical size of the bulk that is more important than the document count. -Start with a bulk size around 5–15 MB and slowly increase it until you do not -see performance gains anymore. Then start increasing the concurrency of your +Start with a bulk size around 5–15 MB and slowly increase it until you do +not see performance gains anymore. Then start increasing the concurrency of your bulk ingestion (multiple threads, and so forth). -Monitor your nodes with Marvel and/or tools such as `iostat`, `top`, and `ps` to see -when resources start to bottleneck. If you start to receive `EsRejectedExecutionException`, -your cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes. +Monitor your nodes with Marvel and/or tools such as `iostat`, `top`, and `ps` to +see when resources start to bottleneck. If you start to receive +`EsRejectedExecutionException`, your cluster can no longer keep up: at least one +resource has reached capacity. Either reduce concurrency, provide more of the +limited resource (such as switching from spinning disks to SSDs), or add more +nodes. [NOTE] ==== When ingesting data, make sure bulk requests are round-robined across all your -data nodes. Do not send all requests to a single node, since that single node +data nodes. Do not send all requests to a single node, since that single node will need to store all the bulks in memory while processing. ==== ==== Storage -Disks are usually the bottleneck of any modern server. Elasticsearch heavily uses disks, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O: +Disks are usually the bottleneck of any modern server. Elasticsearch heavily +uses disks, and the more throughput your disks can handle, the more stable your +nodes will be. Here are some tips for optimizing disk I/O: -- Use SSDs. As mentioned elsewhere, ((("storage")))((("indexing", "performance tips", "storage")))they are superior to spinning media. -- Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of -potential failure if a drive dies. Don't use mirrored or parity RAIDS since +- Use SSDs. 
As mentioned elsewhere, they are superior to spinning media. +- Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of +potential failure if a drive dies. Don't use mirrored or parity RAIDS since replicas provide that functionality. -- Alternatively, use multiple drives and allow Elasticsearch to stripe data across -them via multiple `path.data` directories. -- Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced -here is antithetical to performance. -- If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower -than local instance storage. +- Alternatively, use multiple drives and allow Elasticsearch to stripe data +across them via multiple `path.data` directories. +- Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency +introduced here is antithetical to performance. +- If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often +slower than local instance storage. [[segments-and-merging]] ==== Segments and Merging -Segment merging is computationally expensive,((("indexing", "performance tips", "segments and merging")))((("merging segments")))((("segments", "merging"))) and can eat up a lot of disk I/O. +Segment merging is computationally expensive, and can eat up a lot of disk I/O. Merges are scheduled to operate in the background because they can take a long -time to finish, especially large segments. This is normally fine, because the +time to finish, especially large segments. This is normally fine, because the rate of large segment merges is relatively rare. -But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch -will automatically throttle indexing requests to a single thread. This prevents -a _segment explosion_ problem, in which hundreds of segments are generated before -they can be merged. Elasticsearch will log `INFO`-level messages stating `now -throttling indexing` when it detects merging falling behind indexing. +But sometimes merging falls behind the ingestion rate. If this happens, +Elasticsearch will automatically throttle indexing requests to a single thread. +This prevents a _segment explosion_ problem, in which hundreds of segments are +generated before they can be merged. Elasticsearch will log `INFO`-level +messages stating `now throttling indexing` when it detects merging falling +behind indexing. Elasticsearch defaults here are conservative: you don't want search performance -to be impacted by background merging. But sometimes (especially on SSD, or logging -scenarios), the throttle limit is too low. +to be impacted by background merging. But sometimes (especially on SSD, or +logging scenarios), the throttle limit is too low. -The default is 20 MB/s, which is a good setting for spinning disks. If you have -SSDs, you might consider increasing this to 100–200 MB/s. Test to see what works -for your system: +The default is 20 MB/s, which is a good setting for spinning disks. If you have +SSDs, you might consider increasing this to 100–200 MB/s. Test to see +what works for your system: [source,js] ---- @@ -117,9 +125,9 @@ PUT /_cluster/settings } ---- -If you are doing a bulk import and don't care about search at all, you can disable -merge throttling entirely. This will allow indexing to run as fast as your -disks will allow: +If you are doing a bulk import and don't care about search at all, you can +disable merge throttling entirely. 
This will allow indexing to run as fast as +your disks will allow: [source,js] ---- @@ -130,7 +138,7 @@ PUT /_cluster/settings } } ---- -<1> Setting the throttle type to `none` disables merge throttling entirely. When +<1> Setting the throttle type to `none` disables merge throttling entirely. When you are done importing, set it back to `merge` to reenable throttling. If you are using spinning media instead of SSD, you need to add this to your @@ -141,47 +149,50 @@ If you are using spinning media instead of SSD, you need to add this to your index.merge.scheduler.max_thread_count: 1 ---- -Spinning media has a harder time with concurrent I/O, so we need to decrease -the number of threads that can concurrently access the disk per index. This setting -will allow `max_thread_count + 2` threads to operate on the disk at one time, -so a setting of `1` will allow three threads. +Spinning media has a harder time with concurrent I/O, so we need to decrease the +number of threads that can concurrently access the disk per index. This setting +will allow `max_thread_count + 2` threads to operate on the disk at one time, so +a setting of `1` will allow three threads. For SSDs, you can ignore this setting. The default is `Math.min(3, Runtime.getRuntime().availableProcessors() / 2)`, which works well for SSD. Finally, you can increase `index.translog.flush_threshold_size` from the default -512 MB to something larger, such as 1 GB. This allows larger segments to accumulate -in the translog before a flush occurs. By letting larger segments build, you -flush less often, and the larger segments merge less often. All of this adds up -to less disk I/O overhead and better indexing rates. Of course, you will need -the corresponding amount of heap memory free to accumulate the extra buffering -space, so keep that in mind when adjusting this setting. +512 MB to something larger, such as 1 GB. This allows larger segments to +accumulate in the translog before a flush occurs. By letting larger segments +build, you flush less often, and the larger segments merge less often. All of +this adds up to less disk I/O overhead and better indexing rates. Of course, you +will need the corresponding amount of heap memory free to accumulate the extra +buffering space, so keep that in mind when adjusting this setting. ==== Other Finally, there are some other considerations to keep in mind: - If you don't need near real-time accuracy on your search results, consider -dropping the `index.refresh_interval` of((("indexing", "performance tips", "other considerations")))((("refresh_interval setting"))) each index to `30s`. If you are doing -a large import, you can disable refreshes by setting this value to `-1` for the -duration of the import. Don't forget to reenable it when you are finished! +dropping the `index.refresh_interval` of each index to `30s`. If you are doing a +large import, you can disable refreshes by setting this value to `-1` for the +duration of the import. Don't forget to reenable it when you are finished! - If you are doing a large bulk import, consider disabling replicas by setting -`index.number_of_replicas: 0`.((("replicas, disabling during large bulk imports"))) When documents are replicated, the entire document -is sent to the replica node and the indexing process is repeated verbatim. This -means each replica will perform the analysis, indexing, and potentially merging -process. +`index.number_of_replicas: 0`. 
When documents are replicated, the entire +document is sent to the replica node and the indexing process is repeated +verbatim. This means each replica will perform the analysis, indexing, and +potentially merging process. + -In contrast, if you index with zero replicas and then enable replicas when ingestion -is finished, the recovery process is essentially a byte-for-byte network transfer. -This is much more efficient than duplicating the indexing process. +In contrast, if you index with zero replicas and then enable replicas when +ingestion is finished, the recovery process is essentially a byte-for-byte +network transfer. This is much more efficient than duplicating the indexing +process. - If you don't have a natural ID for each document, use Elasticsearch's auto-ID -functionality.((("id", "auto-ID functionality of Elasticsearch"))) It is optimized to avoid version lookups, since the autogenerated +functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique. -- If you are using your own ID, try to pick an ID that is http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[friendly to Lucene]. ((("UUIDs (universally unique identifiers)"))) Examples include zero-padded -sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential -patterns that compress well. In contrast, IDs such as UUID-4 are essentially -random, which offer poor compression and slow down Lucene. +- If you are using your own ID, try to pick an ID that is +http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[friendly +to Lucene]. ((("UUIDs (universally unique identifiers)"))) Examples include +zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, +sequential patterns that compress well. In contrast, IDs such as UUID-4 are +essentially random, which offer poor compression and slow down Lucene. diff --git a/520_Post_Deployment/35_delayed_shard_allocation.asciidoc b/520_Post_Deployment/35_delayed_shard_allocation.asciidoc index c59bda365..de7f9b98a 100644 --- a/520_Post_Deployment/35_delayed_shard_allocation.asciidoc +++ b/520_Post_Deployment/35_delayed_shard_allocation.asciidoc @@ -1,48 +1,49 @@ === Delaying Shard Allocation -As discussed way back in <<_scale_horizontally>>, Elasticsearch will automatically -balance shards between your available nodes, both when new nodes are added and -when existing nodes leave. +As discussed way back in <<_scale_horizontally>>, Elasticsearch will +automatically balance shards between your available nodes, both when new nodes +are added and when existing nodes leave. -Theoretically, this is the best thing to do. We want to recover missing primaries -by promoting replicas as soon as possible. We also want to make sure resources -are balanced evenly across the cluster to prevent hotspots. +Theoretically, this is the best thing to do. We want to recover missing +primaries by promoting replicas as soon as possible. We also want to make sure +resources are balanced evenly across the cluster to prevent hotspots. -In practice, however, immediately re-balancing can cause more problems than it solves. -For example, consider this situation: +In practice, however, immediately re-balancing can cause more problems than it +solves. For example, consider this situation: -1. Node 19 loses connectivity to your network (someone tripped on the power cable) -2. Immediately, the master notices the node departure. 
It determines -what primary shards were on Node 19 and promotes the corresponding replicas around +1. Node 19 loses connectivity to your network (someone tripped on the power +cable) +2. Immediately, the master notices the node departure. It determines what +primary shards were on Node 19 and promotes the corresponding replicas around the cluster -3. After replicas have been promoted to primary, the master begins issuing recovery -commands to rebuild the now-missing replicas. Nodes around the cluster fire up -their NICs and start pumping shard data to each other in an attempt to get back -to green health status +3. After replicas have been promoted to primary, the master begins issuing +recovery commands to rebuild the now-missing replicas. Nodes around the cluster +fire up their NICs and start pumping shard data to each other in an attempt to +get back to green health status 4. This process will likely trigger a small cascade of shard movement, since the -cluster is now unbalanced. Unrelated shards will be moved between hosts to accomplish -better balancing +cluster is now unbalanced. Unrelated shards will be moved between hosts to +accomplish better balancing Meanwhile, the hapless admin who kicked out the power cable plugs it back in. -Node 19 reboots and rejoins the cluster. Unfortunately, the node is informed that -its existing data is now useless; the data being re-allocated elsewhere. -So Node 19 deletes its local data and begins recovering a different -set of shards from the cluster (which then causes a new minor re-balancing dance). +Node 19 reboots and rejoins the cluster. Unfortunately, the node is informed +that its existing data is now useless; the data being re-allocated elsewhere. So +Node 19 deletes its local data and begins recovering a different set of shards +from the cluster (which then causes a new minor re-balancing dance). -If this all sounds needless and expensive, you're right. It is, but _only when -you know the node will be back soon_. If Node 19 was truly gone, the above procedure -is exactly what we want to happen. +If this all sounds needless and expensive, you're right. It is, but _only when +you know the node will be back soon_. If Node 19 was truly gone, the above +procedure is exactly what we want to happen. To help address these transient outages, Elasticsearch has the ability to delay -shard allocation. This gives your cluster time to see if nodes will rejoin before -starting the re-balancing dance. +shard allocation. This gives your cluster time to see if nodes will rejoin +before starting the re-balancing dance. ==== Changing the default delay -By default, the cluster will wait one minute to see if the node will rejoin. If -the node rejoins before the timer expires, the rejoining node will use its existing -shards and no shard allocation occurs. +By default, the cluster will wait one minute to see if the node will rejoin. If +the node rejoins before the timer expires, the rejoining node will use its +existing shards and no shard allocation occurs. This default time can be changed either globally, or on a per-index basis, by configuring the `delayed_timeout` setting: @@ -56,31 +57,31 @@ PUT /_all/_settings <1> } } ---- -<1> By using the `_all` index name, we can apply this setting to all indices -in the cluster +<1> By using the `_all` index name, we can apply this setting to all indices in +the cluster <2> The default time is changed to 5 minutes -The setting is dynamic and can be changed at runtime. 
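Because this is an ordinary index-level setting, it can also be scoped to a
single index instead of `_all`. The sketch below assumes the abbreviated
`delayed_timeout` name expands to the full
`index.unassigned.node_left.delayed_timeout` key, as it does in the 1.7+ and 2.x
releases this chapter targets; `my_index` and the `10m` delay are placeholder
values:

[source,js]
----
PUT /my_index/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
----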
If you would like shards to -allocate immediately instead of waiting, you can set `delayed_timeout: 0`. +The setting is dynamic and can be changed at runtime. If you would like shards +to allocate immediately instead of waiting, you can set `delayed_timeout: 0`. -NOTE: Delayed allocation won't prevent replicas from being promoted to primaries. -The cluster will still perform promotions as necessary to get the cluster back to -`yellow` status. The allocation of the now-missing replicas will be the only process -that is delayed +NOTE: Delayed allocation won't prevent replicas from being promoted to +primaries. The cluster will still perform promotions as necessary to get the +cluster back to `yellow` status. The allocation of the now-missing replicas will +be the only process that is delayed ==== Auto-cancellation of shard relocation -What happens if the node comes back _after_ the timeout expires, but before -the cluster has finished moving shards around? In this case, Elasticsearch will -check to see if the on-disk data matches the current "live" data in the primary shard. -If the two shards are identical -- meaning there have been no new documents, updates -or deletes -- the master will cancel the on-going rebalancing and restore the -on-disk data. +What happens if the node comes back _after_ the timeout expires, but before the +cluster has finished moving shards around? In this case, Elasticsearch will +check to see if the on-disk data matches the current "live" data in the primary +shard. If the two shards are identical -- meaning there have been no new +documents, updates or deletes -- the master will cancel the on-going rebalancing +and restore the on-disk data. -This is done since recovery of on-disk data will always be faster -than transferring over the network, and since we can guarantee the shards are identical, -the process is a win-win. +This is done since recovery of on-disk data will always be faster than +transferring over the network, and since we can guarantee the shards are +identical, the process is a win-win. If the shards have diverged (e.g. new documents have been indexed since the node -went down), the recovery process will continue as normal. The rejoining node +went down), the recovery process will continue as normal. The rejoining node will delete it's local, out-dated shards and obtain a new set. diff --git a/520_Post_Deployment/40_rolling_restart.asciidoc b/520_Post_Deployment/40_rolling_restart.asciidoc index 77076b0b0..775fbda80 100644 --- a/520_Post_Deployment/40_rolling_restart.asciidoc +++ b/520_Post_Deployment/40_rolling_restart.asciidoc @@ -3,34 +3,35 @@ There will come a time when you need to perform a rolling restart of your cluster--keeping the cluster online and operational, but taking nodes offline -one at a time.((("rolling restart of your cluster")))((("clusters", "rolling restarts")))((("post-deployment", "rolling restarts"))) +one at a time. The common reason is either an Elasticsearch version upgrade, or some kind of -maintenance on the server itself (such as an OS update, or hardware). Whatever the case, -there is a particular method to perform a rolling restart. +maintenance on the server itself (such as an OS update, or hardware). Whatever +the case, there is a particular method to perform a rolling restart. -By nature, Elasticsearch wants your data to be fully replicated and evenly balanced. -If you shut down a single node for maintenance, the cluster will -immediately recognize the loss of a node and begin rebalancing. 
This can be irritating -if you know the node maintenance is short term, since the rebalancing of -very large shards can take some time (think of trying to replicate 1TB--even +By nature, Elasticsearch wants your data to be fully replicated and evenly +balanced. If you shut down a single node for maintenance, the cluster will +immediately recognize the loss of a node and begin rebalancing. This can be +irritating if you know the node maintenance is short term, since the rebalancing +of very large shards can take some time (think of trying to replicate 1TB--even on fast networks this is nontrivial). -What we want to do is tell Elasticsearch to hold off on rebalancing, because -we have more knowledge about the state of the cluster due to external factors. -The procedure is as follows: +What we want to do is tell Elasticsearch to hold off on rebalancing, because we +have more knowledge about the state of the cluster due to external factors. The +procedure is as follows: -1. If possible, stop indexing new data and perform a synced flush. This is not always possible, but will -help speed up recovery time. -A synced flush request is a “best effort” operation. It will fail if there are any pending indexing operations, but it is safe to reissue the request multiple times if necessary. +1. If possible, stop indexing new data and perform a synced flush. This is not +always possible, but will help speed up recovery time. A synced flush request is +a “best effort” operation. It will fail if there are any pending indexing +operations, but it is safe to reissue the request multiple times if necessary. + [source,js] ---- POST /_flush/synced ---- -2. Disable shard allocation. This prevents Elasticsearch from rebalancing -missing shards until you tell it otherwise. If you know the maintenance window will be -short, this is a good idea. You can disable allocation as follows: +2. Disable shard allocation. This prevents Elasticsearch from rebalancing +missing shards until you tell it otherwise. If you know the maintenance window +will be short, this is a good idea. You can disable allocation as follows: + [source,js] ---- @@ -57,12 +58,11 @@ PUT /_cluster/settings } ---- + -Shard rebalancing may take some time. Wait until the cluster has returned -to status `green` before continuing. +Shard rebalancing may take some time. Wait until the cluster has returned to +status `green` before continuing. 7. Repeat steps 2 through 6 for the rest of your nodes. -8. At this point you are safe to resume indexing (if you had previously stopped), -but waiting until the cluster is fully balanced before resuming indexing will help -to speed up the process. - +8. At this point you are safe to resume indexing (if you had previously +stopped), but waiting until the cluster is fully balanced before resuming +indexing will help to speed up the process. diff --git a/520_Post_Deployment/50_backup.asciidoc b/520_Post_Deployment/50_backup.asciidoc index b7b92efb7..709c5a7af 100644 --- a/520_Post_Deployment/50_backup.asciidoc +++ b/520_Post_Deployment/50_backup.asciidoc @@ -2,19 +2,19 @@ === Backing Up Your Cluster As with any software that stores data, it is important to routinely back up your -data. ((("clusters", "backing up")))((("post-deployment", "backing up your cluster")))((("backing up your cluster"))) Elasticsearch replicas provide high availability during runtime; they allow -you to tolerate sporadic node loss without an interruption of service. +data. 
Elasticsearch replicas provide high availability during runtime; they +allow you to tolerate sporadic node loss without an interruption of service. -Replicas do not provide protection from catastrophic failure, however. For that, +Replicas do not provide protection from catastrophic failure, however. For that, you need a real backup of your cluster--a complete copy in case something goes wrong. -To back up your cluster, you can use the `snapshot` API.((("snapshot-restore API"))) This will take the current -state and data in your cluster and save it to a shared repository. This -backup process is "smart." Your first snapshot will be a complete copy of data, +To back up your cluster, you can use the `snapshot` API. This will take the +current state and data in your cluster and save it to a shared repository. This +backup process is "smart." Your first snapshot will be a complete copy of data, but all subsequent snapshots will save the _delta_ between the existing -snapshots and the new data. Data is incrementally added and deleted as you snapshot -data over time. This means subsequent backups will be substantially +snapshots and the new data. Data is incrementally added and deleted as you +snapshot data over time. This means subsequent backups will be substantially faster since they are transmitting far less data. To use this functionality, you must first create a repository to save data. @@ -27,7 +27,7 @@ There are several repository types that you may choose from: ==== Creating the Repository -Let's set up a shared ((("backing up your cluster", "creating the repository")))((("filesystem repository")))filesystem repository: +Let's set up a shared filesystem repository: [source,js] ---- @@ -46,18 +46,18 @@ PUT _snapshot/my_backup <1> NOTE: The shared filesystem path must be accessible from all nodes in your cluster! -This will create the repository and required metadata at the mount point. There +This will create the repository and required metadata at the mount point. There are also some other options that you may want to configure, depending on the performance profile of your nodes, network, and repository location: `max_snapshot_bytes_per_sec`:: - When snapshotting data into the repo, this controls -the throttling of that process. The default is `20mb` per second. + When snapshotting data into the repo, this controls the throttling of that +process. The default is `20mb` per second. `max_restore_bytes_per_sec`:: -When restoring data from the repo, this controls -how much the restore is throttled so that your network is not saturated. The -default is `20mb` per second. + When restoring data from the repo, this controls how much the restore is +throttled so that your network is not saturated. The default is `20mb` per +second. Let's assume we have a very fast network and are OK with extra traffic, so we can increase the defaults: @@ -74,16 +74,16 @@ POST _snapshot/my_backup/ <1> } } ---- -<1> Note that we are using a `POST` instead of `PUT`. This will update the settings -of the existing repository. +<1> Note that we are using a `POST` instead of `PUT`. This will update the +settings of the existing repository. <2> Then add our new settings. ==== Snapshotting All Open Indices -A repository can contain multiple snapshots.((("indices", "open, snapshots on")))((("backing up your cluster", "snapshots on all open indexes"))) Each snapshot is associated with a -certain set of indices (for example, all indices, some subset, or a single index). 
When -creating a snapshot, you specify which indices you are interested in and -give the snapshot a unique name. +A repository can contain multiple snapshots. Each snapshot is associated with a +certain set of indices (for example, all indices, some subset, or a single +index). When creating a snapshot, you specify which indices you are interested +in and give the snapshot a unique name. Let's start with the most basic snapshot command: @@ -93,33 +93,34 @@ PUT _snapshot/my_backup/snapshot_1 ---- This will back up all open indices into a snapshot named `snapshot_1`, under the -`my_backup` repository. This call will return immediately, and the snapshot will +`my_backup` repository. This call will return immediately, and the snapshot will proceed in the background. [TIP] ================================================== -Usually you'll want your snapshots to proceed as a background process, but occasionally -you may want to wait for completion in your script. This can be accomplished by -adding a `wait_for_completion` flag: +Usually you'll want your snapshots to proceed as a background process, but +occasionally you may want to wait for completion in your script. This can be +accomplished by adding a `wait_for_completion` flag: [source,js] ---- PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true ---- -This will block the call until the snapshot has completed. Note that large snapshots -may take a long time to return! +This will block the call until the snapshot has completed. Note that large +snapshots may take a long time to return! ================================================== ==== Snapshotting Particular Indices -The default behavior is to back up all open indices.((("indices", "snapshotting particular")))((("backing up your cluster", "snapshotting particular indices"))) But say you are using Marvel, -and don't really want to back up all the diagnostic `.marvel` indices. You -just don't have enough space to back up everything. +The default behavior is to back up all open indices. But say you are using +Marvel, and don't really want to back up all the diagnostic `.marvel` indices. +You just don't have enough space to back up everything. -In that case, you can specify which indices to back up when snapshotting your cluster: +In that case, you can specify which indices to back up when snapshotting your +cluster: [source,js] ---- @@ -133,12 +134,12 @@ This snapshot command will now back up only `index1` and `index2`. ==== Listing Information About Snapshots -Once you start accumulating snapshots in your repository, you may forget the details((("backing up your cluster", "listing information about snapshots"))) -relating to each--particularly when the snapshots are named based on time -demarcations (for example, `backup_2014_10_28`). +Once you start accumulating snapshots in your repository, you may forget the +details relating to each--particularly when the snapshots are named based on +time demarcations (for example, `backup_2014_10_28`). 
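One way to reorient yourself is to start at the repository level: a `GET` on the
repository name (here the `my_backup` repository registered earlier) simply
echoes back its type and settings, and the snapshot-level queries described next
then drill into the contents:

[source,js]
----
GET _snapshot/my_backup
----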
-To obtain information about a single snapshot, simply issue a `GET` request against -the repo and snapshot name: +To obtain information about a single snapshot, simply issue a `GET` request +against the repo and snapshot name: [source,js] ---- @@ -176,8 +177,8 @@ the snapshot: } ---- -For a complete listing of all snapshots in a repository, use the `_all` placeholder -instead of a snapshot name: +For a complete listing of all snapshots in a repository, use the `_all` +placeholder instead of a snapshot name: [source,js] ---- @@ -186,7 +187,7 @@ GET _snapshot/my_backup/_all ==== Deleting Snapshots -Finally, we need a command to delete old snapshots that ((("backing up your cluster", "deleting old snapshots")))are no longer useful. +Finally, we need a command to delete old snapshots that are no longer useful. This is simply a `DELETE` HTTP call to the repo/snapshot name: [source,js] @@ -195,10 +196,10 @@ DELETE _snapshot/my_backup/snapshot_2 ---- It is important to use the API to delete snapshots, and not some other mechanism -(such as deleting by hand, or using automated cleanup tools on S3). Because snapshots are -incremental, it is possible that many snapshots are relying on old segments. -The `delete` API understands what data is still in use by more recent snapshots, -and will delete only unused segments. +(such as deleting by hand, or using automated cleanup tools on S3). Because +snapshots are incremental, it is possible that many snapshots are relying on old +segments. The `delete` API understands what data is still in use by more recent +snapshots, and will delete only unused segments. If you do a manual file delete, however, you are at risk of seriously corrupting your backups because you are deleting data that is still in use. @@ -207,11 +208,12 @@ your backups because you are deleting data that is still in use. ==== Monitoring Snapshot Progress The `wait_for_completion` flag provides a rudimentary form of monitoring, but -really isn't sufficient when snapshotting or restoring even moderately sized clusters. +really isn't sufficient when snapshotting or restoring even moderately sized +clusters. -Two other APIs will give you more-detailed status about the -state of the snapshotting. First you can execute a `GET` to the snapshot ID, -just as we did earlier get information about a particular snapshot: +Two other APIs will give you more-detailed status about the state of the +snapshotting. First you can execute a `GET` to the snapshot ID, just as we did +earlier get information about a particular snapshot: [source,js] ---- @@ -219,10 +221,10 @@ GET _snapshot/my_backup/snapshot_3 ---- If the snapshot is still in progress when you call this, you'll see information -about when it was started, how long it has been running, and so forth. Note, however, -that this API uses the same threadpool as the snapshot mechanism. If you are -snapshotting very large shards, the time between status updates can be quite large, -since the API is competing for the same threadpool resources. +about when it was started, how long it has been running, and so forth. Note, +however, that this API uses the same threadpool as the snapshot mechanism. If +you are snapshotting very large shards, the time between status updates can be +quite large, since the API is competing for the same threadpool resources. A better option is to poll the `_status` API: @@ -291,35 +293,38 @@ statistics: ... ---- <1> A snapshot that is currently running will show `IN_PROGRESS` as its status. 
-<2> This particular snapshot has one shard still transferring (the other four have already completed). +<2> This particular snapshot has one shard still transferring (the other four +have already completed). -The response includes the overall status of the snapshot, but also drills down into -per-index and per-shard statistics. This gives you an incredibly detailed view -of how the snapshot is progressing. Shards can be in various states of completion: +The response includes the overall status of the snapshot, but also drills down +into per-index and per-shard statistics. This gives you an incredibly detailed +view of how the snapshot is progressing. Shards can be in various states of +completion: `INITIALIZING`:: - The shard is checking with the cluster state to see whether it can -be snapshotted. This is usually very fast. + The shard is checking with the cluster state to see whether it can be +snapshotted. This is usually very fast. `STARTED`:: Data is being transferred to the repository. - + `FINALIZING`:: Data transfer is complete; the shard is now sending snapshot metadata. - + `DONE`:: Snapshot complete! - + `FAILED`:: - An error was encountered during the snapshot process, and this shard/index/snapshot -could not be completed. Check your logs for more information. + An error was encountered during the snapshot process, and this +shard/index/snapshot could not be completed. Check your logs for more +information. ==== Canceling a Snapshot -Finally, you may want to cancel a snapshot or restore.((("backing up your cluster", "canceling a snapshot"))) Since these are long-running -processes, a typo or mistake when executing the operation could take a long time to -resolve--and use up valuable resources at the same time. +Finally, you may want to cancel a snapshot or restore. Since these are +long-running processes, a typo or mistake when executing the operation could +take a long time to resolve--and use up valuable resources at the same time. To cancel a snapshot, simply delete the snapshot while it is in progress: @@ -330,5 +335,3 @@ DELETE _snapshot/my_backup/snapshot_3 This will halt the snapshot process. Then proceed to delete the half-completed snapshot from the repository. - - diff --git a/520_Post_Deployment/60_restore.asciidoc b/520_Post_Deployment/60_restore.asciidoc index a4dd37f45..5082adef3 100644 --- a/520_Post_Deployment/60_restore.asciidoc +++ b/520_Post_Deployment/60_restore.asciidoc @@ -1,23 +1,23 @@ === Restoring from a Snapshot -Once you've backed up some data, restoring it is easy: simply add `_restore` -to the ID of((("post-deployment", "restoring from a snapshot")))((("restoring from a snapshot"))) the snapshot you wish to restore into your cluster: +Once you've backed up some data, restoring it is easy: simply add `_restore` to +the ID of the snapshot you wish to restore into your cluster: [source,js] ---- POST _snapshot/my_backup/snapshot_1/_restore ---- -The default behavior is to restore all indices that exist in that snapshot. -If `snapshot_1` contains five indices, all five will be restored into -our cluster. ((("indices", "restoring from a snapshot"))) As with the `snapshot` API, it is possible to select which indices -we want to restore. +The default behavior is to restore all indices that exist in that snapshot. If +`snapshot_1` contains five indices, all five will be restored into our cluster. +As with the `snapshot` API, it is possible to select which indices we want to +restore. -There are also additional options for renaming indices. 
This allows you to -match index names with a pattern, and then provide a new name during the restore process. -This is useful if you want to restore old data to verify its contents, or perform -some other processing, without replacing existing data. Let's restore +There are also additional options for renaming indices. This allows you to match +index names with a pattern, and then provide a new name during the restore +process. This is useful if you want to restore old data to verify its contents, +or perform some other processing, without replacing existing data. Let's restore a single index from the snapshot and provide a replacement name: [source,js] @@ -34,15 +34,16 @@ snapshot. <2> Find any indices being restored that match the provided pattern. <3> Then rename them with the replacement pattern. -This will restore `index_1` into your cluster, but rename it to `restored_index_1`. +This will restore `index_1` into your cluster, but rename it to +`restored_index_1`. [TIP] ================================================== Similar to snapshotting, the `restore` command will return immediately, and the -restoration process will happen in the background. If you would prefer your HTTP -call to block until the restore is finished, simply add the `wait_for_completion` -flag: +restoration process will happen in the background. If you would prefer your HTTP +call to block until the restore is finished, simply add the +`wait_for_completion` flag: [source,js] ---- @@ -55,8 +56,8 @@ POST _snapshot/my_backup/snapshot_1/_restore?wait_for_completion=true ==== Monitoring Restore Operations The restoration of data from a repository piggybacks on the existing recovery -mechanisms already in place in Elasticsearch.((("restoring from a snapshot", "monitoring restore operations"))) Internally, recovering shards -from a repository is identical to recovering from another node. +mechanisms already in place in Elasticsearch. Internally, recovering shards from +a repository is identical to recovering from another node. If you wish to monitor the progress of a restore, you can use the `recovery` API. This is a general-purpose API that shows the status of shards moving around @@ -69,8 +70,8 @@ The API can be invoked for the specific indices that you are recovering: GET restored_index_3/_recovery ---- -Or for all indices in your cluster, which may include other shards moving around, -unrelated to your restore process: +Or for all indices in your cluster, which may include other shards moving +around, unrelated to your restore process: [source,js] ---- @@ -134,18 +135,18 @@ depending on the activity of your cluster!): recovered from a snapshot. <2> The `source` hash describes the particular snapshot and repository that is being recovered from. -<3> The `percent` field gives you an idea about the status of the recovery. -This particular shard has recovered 94% of the files so far; it is almost complete. +<3> The `percent` field gives you an idea about the status of the recovery. This +particular shard has recovered 94% of the files so far; it is almost complete. -The output will list all indices currently undergoing a recovery, and then -list all shards in each of those indices. Each shard will have stats -about start/stop time, duration, recover percentage, bytes transferred, and more. +The output will list all indices currently undergoing a recovery, and then list +all shards in each of those indices. Each shard will have stats about start/stop +time, duration, recover percentage, bytes transferred, and more. 
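When you only need a quick glance rather than the full JSON, the `cat` interface
exposes the same recovery information as a table; the optional `v` parameter
adds column headers:

[source,js]
----
GET _cat/recovery?v
----

Each row shows the index, shard, recovery type, and percentage complete, which
is usually enough to tell whether a restore is still moving along.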
==== Canceling a Restore -To cancel a restore, you need to delete the indices being restored.((("restoring from a snapshot", "canceling a restore"))) Because -a restore process is really just shard recovery, issuing a `delete-index` API -alters the cluster state, which will in turn halt recovery. For example: +To cancel a restore, you need to delete the indices being restored. Because a +restore process is really just shard recovery, issuing a `delete-index` API +alters the cluster state, which will in turn halt recovery. For example: [source,js] ---- @@ -155,7 +156,3 @@ DELETE /restored_index_3 If `restored_index_3` was actively being restored, this delete command would halt the restoration as well as deleting any data that had already been restored into the cluster. - - - - diff --git a/520_Post_Deployment/70_conclusion.asciidoc b/520_Post_Deployment/70_conclusion.asciidoc index e51a7065c..993076e34 100644 --- a/520_Post_Deployment/70_conclusion.asciidoc +++ b/520_Post_Deployment/70_conclusion.asciidoc @@ -2,23 +2,24 @@ === Clusters Are Living, Breathing Creatures Once you get a cluster into production, you'll find that it takes on a life of its -own. ((("clusters", "maintaining")))((("post-deployment", "clusters, rolling restarts and upgrades")))Elasticsearch works hard to make clusters self-sufficient and _just work_. -But a cluster still requires routine care and feeding, such as routine backups +own. Elasticsearch works hard to make clusters self-sufficient and _just work_. +But a cluster still requires routine care and feeding, such as routine backups and upgrades. -Elasticsearch releases new versions with bug fixes and performance enhancements at -a very fast pace, and it is always a good idea to keep your cluster current. -Similarly, Lucene continues to find new and exciting bugs in the JVM itself, which -means you should always try to keep your JVM up-to-date. +Elasticsearch releases new versions with bug fixes and performance enhancements +at a very fast pace, and it is always a good idea to keep your cluster current. +Similarly, Lucene continues to find new and exciting bugs in the JVM itself, +which means you should always try to keep your JVM up-to-date. -This means it is a good idea to have a standardized, routine way to perform rolling -restarts and upgrades in your cluster. Upgrading should be a routine process, -rather than a once-yearly fiasco that requires countless hours of precise planning. +This means it is a good idea to have a standardized, routine way to perform +rolling restarts and upgrades in your cluster. Upgrading should be a routine +process, rather than a once-yearly fiasco that requires countless hours of +precise planning. -Similarly, it is important to have disaster recovery plans in place. Take frequent -snapshots of your cluster--and periodically _test_ those snapshots by performing -a real recovery! It is all too common for organizations to make routine backups but -never test their recovery strategy. Often you'll find a glaring deficiency -the first time you perform a real recovery (such as users being unaware of which -drive to mount). It's better to work these bugs out of your process with -routine testing, rather than at 3 a.m. when there is a crisis. +Similarly, it is important to have disaster recovery plans in place. Take +frequent snapshots of your cluster--and periodically _test_ those snapshots by +performing a real recovery! It is all too common for organizations to make +routine backups but never test their recovery strategy. 
Often you'll find a +glaring deficiency the first time you perform a real recovery (such as users +being unaware of which drive to mount). It's better to work these bugs out of +your process with routine testing, rather than at 3 a.m. when there is a crisis. From 2aecf882e0ffda05196a494bce72ae9a6907e9ce Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Fri, 28 Apr 2017 07:58:40 +0200 Subject: [PATCH 081/107] Implement basic timestamp generator --- scripts/300_Aggregations/generate.py | 48 ++++++++++++++++++++++++++-- 1 file changed, 45 insertions(+), 3 deletions(-) diff --git a/scripts/300_Aggregations/generate.py b/scripts/300_Aggregations/generate.py index 0866415ac..5f99765f3 100755 --- a/scripts/300_Aggregations/generate.py +++ b/scripts/300_Aggregations/generate.py @@ -3,6 +3,8 @@ import json import sys import random +import datetime +import math vendors = [ "Yellow", @@ -114,6 +116,11 @@ def distance(start, end): return base + 0.2 * random.randint(0, max(base, 1)) +def duration(dist_miles): + # 15 - 25 miles per hour on average + return datetime.timedelta(hours=dist_miles / random.randint(15, 25)) + + def fare(trip_distance): # loosely based on https://www.sfmta.com/getting-around/taxi/taxi-rates # assume a random waiting time up to 10% of the distance @@ -126,27 +133,62 @@ def tip(fare_amount): # up to 20% tip return 0.2 * random.randint(0, round(fare_amount)) + def round_f(v): return float("{0:.2f}".format(v)) +def generate_timestamp(current): + h = current.hour + week_day = current.isoweekday() + + hours_per_day = 24 + + peak_hour = 12 + max_difference_hours = hours_per_day - peak_hour + + if week_day < 6: + max_rides_per_hour = 1000 + min_rides_per_hour = 100 + elif week_day == 6: + max_rides_per_hour = 800 + min_rides_per_hour = 200 + else: + max_rides_per_hour = 600 + min_rides_per_hour = 50 + + diff_from_peak_hour = peak_hour - h if h <= peak_hour else h - peak_hour + # vary the targeted rides per hour between [min_rides_per_hour; max_rides_per_hour] depending on difference to peak hour according to + # a sine function to smooth it a bit. 
+ traffic_scale_factor = math.sin(0.5 * math.pi * (max_difference_hours - diff_from_peak_hour) / max_difference_hours) + target_rides_this_hour = min_rides_per_hour + (max_rides_per_hour - min_rides_per_hour) * traffic_scale_factor + + increment = random.expovariate(target_rides_this_hour) * 3600 + return current + datetime.timedelta(seconds=increment) + + +def format_ts(ts): + return ts.strftime("%Y-%m-%d %H:%M:%S") + + def main(): if len(sys.argv) != 2: print("usage: %s number_of_records_to_generate" % sys.argv[0]) exit(1) + current = datetime.datetime(year=2017, month=4, day=1) num_records = int(sys.argv[1]) for i in range(num_records): record = {} + current = generate_timestamp(current) record["vendor"] = vendor() - # TODO: Find a simple but somewhat realistic model for daily / weekly patterns - # record["pickup_datetime"] = pickup_datetime - # record["dropoff_datetime"] = dropoff_datetime + record["pickup_datetime"] = format_ts(current) record["passenger_count"] = passengers() start = random.choice(zones) end = random.choice(zones) trip_distance = distance(start, end) + record["dropoff_datetime"] = format_ts(current + duration(trip_distance)) record["pickup_zone"] = start record["dropoff_zone"] = end From b8b900d432535e2a449a3ed90d6156e5d42faf54 Mon Sep 17 00:00:00 2001 From: Josh Rich Date: Mon, 1 May 2017 15:30:10 +1000 Subject: [PATCH 082/107] Update Getting Started with Languages: * Remove O'Reilly index cruft * CONSOLE-ify code snippets * Fix deprecated Elasticsearch DSL syntax * Where possible, add example documents so mapping changes and searches can be properly observed * Fix line wraps --- 200_Language_intro/00_Intro.asciidoc | 21 ++--- 200_Language_intro/10_Using.asciidoc | 42 ++++++---- 200_Language_intro/20_Configuring.asciidoc | 18 +++-- .../30_Language_pitfalls.asciidoc | 30 +++---- .../40_One_language_per_doc.asciidoc | 71 ++++++++-------- .../50_One_language_per_field.asciidoc | 55 ++++++++----- .../60_Mixed_language_fields.asciidoc | 81 +++++++++++++------ 7 files changed, 189 insertions(+), 129 deletions(-) diff --git a/200_Language_intro/00_Intro.asciidoc b/200_Language_intro/00_Intro.asciidoc index 6f8e105fc..0d0c34440 100644 --- a/200_Language_intro/00_Intro.asciidoc +++ b/200_Language_intro/00_Intro.asciidoc @@ -2,16 +2,10 @@ == Getting Started with Languages Elasticsearch ships with a collection of language analyzers that provide -good, basic, out-of-the-box ((("language analyzers")))((("languages", "getting started with")))support for many of the world's most common -languages: +good, basic, https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html[out-of-the-box support] +for many of the world's most common languages. -Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, -Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, -Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, -Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, -Turkish, and Thai. 
- -These analyzers typically((("language analyzers", "roles performed by"))) perform four roles: +These analyzers typically perform four roles: * Tokenize text into individual words: + @@ -30,19 +24,18 @@ These analyzers typically((("language analyzers", "roles performed by"))) perfor `foxes` -> `fox` Each analyzer may also apply other transformations specific to its language in -order to make words from that((("language analyzers", "other transformations specific to the language"))) language more searchable: +order to make words from that language more searchable: -* The `english` analyzer ((("english analyzer")))removes the possessive `'s`: +* The `english` analyzer removes the possessive `'s`: + `John's` -> `john` -* The `french` analyzer ((("french analyzer")))removes _elisions_ like `l'` and `qu'` and +* The `french` analyzer removes _elisions_ like `l'` and `qu'` and _diacritics_ like `¨` or `^`: + `l'église` -> `eglis` -* The `german` analyzer normalizes((("german analyzer"))) terms, replacing `ä` and `ae` with `a`, or +* The `german` analyzer normalizes terms, replacing `ä` and `ae` with `a`, or `ß` with `ss`, among others: + `äußerst` -> `ausserst` - diff --git a/200_Language_intro/10_Using.asciidoc b/200_Language_intro/10_Using.asciidoc index 005c0b195..459b25cd6 100644 --- a/200_Language_intro/10_Using.asciidoc +++ b/200_Language_intro/10_Using.asciidoc @@ -2,7 +2,7 @@ === Using Language Analyzers The built-in language analyzers are available globally and don't need to be -configured before being used.((("language analyzers", "using"))) They can be specified directly in the field +configured before being used. They can be specified directly in the field mapping: [source,js] @@ -13,7 +13,7 @@ PUT /my_index "blog": { "properties": { "title": { - "type": "string", + "type": "text", "analyzer": "english" <1> } } @@ -21,18 +21,25 @@ PUT /my_index } } -------------------------------------------------- +// CONSOLE + <1> The `title` field will use the `english` analyzer instead of the default `standard` analyzer. -Of course, by passing ((("english analyzer", "information lost with")))text through the `english` analyzer, we lose -information: +Of course, by passing text through the `english` analyzer, we lose information: [source,js] -------------------------------------------------- -GET /my_index/_analyze?field=title <1> -I'm not happy about the foxes +GET /my_index/_analyze +{ + "field": "title" + "text": "I'm not happy about the foxes" <1> +} -------------------------------------------------- -<1> Emits token: `i'm`, `happi`, `about`, `fox` +// CONSOLE +// TEST[continued] + +<1> Emits the tokens: `i'm`, `happi`, `about`, `fox` We can't tell if the document mentions one `fox` or many `foxes`; the word `not` is a stopword and is removed, so we can't tell whether the document is @@ -41,7 +48,7 @@ recall as we can match more loosely, but we have reduced our ability to rank documents accurately. 
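One quick way to see this trade-off in practice is to run the same sentence through both analyzers and compare the emitted tokens. The sketch below is illustrative only; it assumes a local node at `http://localhost:9200` and the third-party Python `requests` library, and the expected output in the comments follows the token lists discussed above.

[source,python]
----
import requests  # third-party HTTP client, assumed to be installed

ES = "http://localhost:9200"  # assumption: local test node

def analyze(text, analyzer):
    """Return the tokens the named analyzer produces for the given text."""
    body = {"analyzer": analyzer, "text": text}
    resp = requests.get("{}/_analyze".format(ES), json=body)
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

sentence = "I'm not happy about the foxes"
print(analyze(sentence, "standard"))
# expected: ["i'm", "not", "happy", "about", "the", "foxes"]
print(analyze(sentence, "english"))
# expected: ["i'm", "happi", "about", "fox"]
----
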
To get the best of both worlds, we can use <> to -index the `title` field twice: once((("multifields", "using to index a field with two different analyzers"))) with the `english` analyzer and once with +index the `title` field twice: once with the `english` analyzer and once with the `standard` analyzer: [source,js] @@ -52,10 +59,10 @@ PUT /my_index "blog": { "properties": { "title": { <1> - "type": "string", + "type": "text", "fields": { "english": { <2> - "type": "string", + "type": "text", "analyzer": "english" } } @@ -65,6 +72,8 @@ PUT /my_index } } -------------------------------------------------- +// CONSOLE + <1> The main `title` field uses the `standard` analyzer. <2> The `title.english` subfield uses the `english` analyzer. @@ -90,12 +99,13 @@ GET /_search } } -------------------------------------------------- +// CONSOLE +// TEST[continued] + <1> Use the <> query type to match the same text in as many fields as possible. -Even ((("most fields queries")))though neither of our documents contain the word `foxes`, both documents -are returned as results thanks to the word stemming on the `title.english` -field. The second document is ranked as more relevant, because the word `not` -matches on the `title` field. - - +Even though neither of our documents contain the +word `foxes`, both documents are returned as results thanks to the word +stemming on the `title.english` field. The second document is ranked as more +relevant, because the word `not` matches on the `title` field. diff --git a/200_Language_intro/20_Configuring.asciidoc b/200_Language_intro/20_Configuring.asciidoc index 6034d2674..0f713a927 100644 --- a/200_Language_intro/20_Configuring.asciidoc +++ b/200_Language_intro/20_Configuring.asciidoc @@ -2,13 +2,13 @@ === Configuring Language Analyzers While the language analyzers can be used out of the box without any -configuration, most of them ((("english analyzer", "configuring")))((("language analyzers", "configuring")))do allow you to control aspects of their +configuration, most of them do allow you to control aspects of their behavior, specifically: [[stem-exclusion]] Stem-word exclusion:: + -Imagine, for instance, that users searching for((("language analyzers", "configuring", "stem word exclusion")))((("stemming words", "stem word exclusion, configuring"))) the ``World Health +Imagine, for instance, that users searching for the ``World Health Organization'' are instead getting results for ``organ health.'' The reason for this confusion is that both ``organ'' and ``organization'' are stemmed to the same root word: `organ`. Often this isn't a problem, but in this @@ -18,7 +18,7 @@ stemmed. Custom stopwords:: -The default list of stopwords((("stopwords", "configuring for language analyzers"))) used in English are as follows: +The default list of stopwords used in English are as follows: + a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, @@ -54,13 +54,17 @@ PUT /my_index } } -GET /my_index/_analyze?analyzer=my_english <3> -The World Health Organization does not sell organs. +GET /my_index/_analyze +{ + "analyzer": "my_english", <3> + "text": "The World Health Organization does not sell organs." 
+} -------------------------------------------------- +// CONSOLE + <1> Prevents `organization` and `organizations` from being stemmed <2> Specifies a custom list of stopwords -<3> Emits tokens `world`, `health`, `organization`, `does`, `not`, `sell`, `organ` +<3> Emits tokens `world`, `health`, `organization`, `doe`, `not`, `sell`, `organ` We discuss stemming and stopwords in much more detail in <> and <>, respectively. - diff --git a/200_Language_intro/30_Language_pitfalls.asciidoc b/200_Language_intro/30_Language_pitfalls.asciidoc index 27713bcff..d56780c47 100644 --- a/200_Language_intro/30_Language_pitfalls.asciidoc +++ b/200_Language_intro/30_Language_pitfalls.asciidoc @@ -1,9 +1,9 @@ [[language-pitfalls]] === Pitfalls of Mixing Languages -If you have to deal with only a single language,((("languages", "mixing, pitfalls of"))) count yourself lucky. +If you have to deal with only a single language, count yourself lucky. Finding the right strategy for handling documents written in several languages -can be challenging.((("indexing", "mixed languages, pitfalls of"))) +can be challenging. ==== At Index Time @@ -21,11 +21,11 @@ separate. Mixing languages in the same inverted index can be problematic. ===== Incorrect stemming The stemming rules for German are different from those for English, French, -Swedish, and so on.((("stemming words", "incorrect stemming in multilingual documents"))) Applying the same stemming rules to different languages +Swedish, and so on. Applying the same stemming rules to different languages will result in some words being stemmed correctly, some incorrectly, and some -not being stemmed at all. It may even result in words from different languages with different meanings -being stemmed to the same root word, conflating their meanings and producing -confusing search results for the user. +not being stemmed at all. It may even result in words from different languages +with different meanings being stemmed to the same root word, conflating their +meanings and producing confusing search results for the user. Applying multiple stemmers in turn to the same text is likely to result in rubbish, as the next stemmer may try to stem an already stemmed word, @@ -49,7 +49,7 @@ text. ===== Incorrect inverse document frequencies In <>, we explained that the more frequently a term appears -in a collection of documents, the less weight that term has.((("inverse document frequency", "incorrect, in multilingual documents"))) For accurate +in a collection of documents, the less weight that term has. For accurate relevance calculations, you need accurate term-frequency statistics. A short snippet of German appearing in predominantly English text would give @@ -59,11 +59,11 @@ snippets now have much less weight. ==== At Query Time -It is not sufficient just to think about your documents, though.((("queries", "mixed languages and"))) You also need -to think about how your users will query those documents. Often you will be able -to identify the main language of the user either from the language of that user's chosen -interface (for example, `mysite.de` versus `mysite.fr`) or from the -http://www.w3.org/International/questions/qa-lang-priorities.en.php[`accept-language`] +It is not sufficient just to think about your documents, though. You also need +to think about how your users will query those documents. 
Often you will be +able to identify the main language of the user either from the language of that +user's chosen interface (for example, `mysite.de` versus `mysite.fr`) or from +the http://www.w3.org/International/questions/qa-lang-priorities.en.php[`accept-language`] HTTP header from the user's browser. User searches also come in three main varieties: @@ -72,7 +72,8 @@ User searches also come in three main varieties: * Users search for words in a different language, but expect results in their main language. * Users search for words in a different language, and expect results in - that language (for example, a bilingual person, or a foreign visitor in a web cafe). + that language (for example, a bilingual person, or a foreign visitor in a web + cafe). Depending on the type of data that you are searching, it may be appropriate to return results in a single language (for example, a user searching for products on @@ -102,7 +103,7 @@ library from http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html[Mike McCandless], which uses the open source (http://www.apache.org/licenses/LICENSE-2.0[Apache License 2.0]) https://code.google.com/p/cld2/[Compact Language Detector] (CLD) from Google. It is -small, fast, ((("Compact Language Detector (CLD)")))and accurate, and can detect 160+ languages from as little as two +small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R. @@ -113,4 +114,3 @@ Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account such as the country of origin, the user's selected language, and the HTTP `accept-language` headers. - diff --git a/200_Language_intro/40_One_language_per_doc.asciidoc b/200_Language_intro/40_One_language_per_doc.asciidoc index e62021a4e..e81492e1c 100644 --- a/200_Language_intro/40_One_language_per_doc.asciidoc +++ b/200_Language_intro/40_One_language_per_doc.asciidoc @@ -1,10 +1,10 @@ [[one-lang-docs]] === One Language per Document -A single predominant language per document ((("languages", "one language per document")))((("indices", "documents in different languages")))requires a relatively simple setup. +A single predominant language per document requires a relatively simple setup. Documents from different languages can be stored in separate indices—`blogs-en`, -`blogs-fr`, and so forth—that use the same type and the same fields for each index, -just with different analyzers: +`blogs-fr`, and so forth—that use the same fields for each index, just +with different analyzers: [source,js] -------------------------------------------------- @@ -14,13 +14,18 @@ PUT /blogs-en "post": { "properties": { "title": { - "type": "string", <1> + "type": "text", <1> "fields": { "stemmed": { - "type": "string", + "type": "string", "analyzer": "english" <2> } -}}}}}} + } + } + } + } + } +} PUT /blogs-fr { @@ -28,14 +33,21 @@ PUT /blogs-fr "post": { "properties": { "title": { - "type": "string", <1> + "type": "text", <1> "fields": { "stemmed": { - "type": "string", + "type": "text", "analyzer": "french" <2> } -}}}}}} + } + } + } + } + } +} -------------------------------------------------- +//CONSOLE + <1> Both `blogs-en` and `blogs-fr` have a type called `post` that contains the field `title`. 
<2> The `title.stemmed` subfield uses a language-specific analyzer. @@ -48,25 +60,34 @@ don't suffer from the term-frequency and stemming problems described in The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple indices. We can even -specify a preference((("indices_boost parameter", "specifying preference for a specific language"))) for particular languages with the `indices_boost` parameter: +specify a preference for particular languages with the `indices_boost` parameter: [source,js] -------------------------------------------------- +PUT /blogs-en/post/1 +{ "title": "That feeling of déjà vu" } + +PUT /blogs-fr/post/1 +{ "title": "Ce sentiment de déjà vu" } + GET /blogs-*/post/_search <1> { "query": { "multi_match": { "query": "deja vu", - "fields": [ "title", "title.stemmed" ] <2> + "fields": [ "title", "title.stemmed" ], <2> "type": "most_fields" } }, - "indices_boost": { <3> - "blogs-en": 3, - "blogs-fr": 2 - } + "indices_boost": [ <3> + { "blogs-en": 3 }, + { "blogs-fr": 2 } + ] } -------------------------------------------------- +// CONSOLE +// TEST[continued] + <1> This search is performed on any index beginning with `blogs-`. <2> The `title.stemmed` fields are queried using the analyzer specified in each index. @@ -78,27 +99,11 @@ GET /blogs-*/post/_search <1> Of course, these documents may contain words or sentences in other languages, and these words are unlikely to be stemmed correctly. With -predominant-language documents, this is not usually a major problem. The user will -often search for the exact words--for instance, of a quotation from another +predominant-language documents, this is not usually a major problem. The user +will often search for the exact words--for instance, of a quotation from another language--rather than for inflections of a word. Recall can be improved by using techniques explained in <>. Perhaps some words like place names should be queryable in the predominant language and in the original language, such as _Munich_ and _München_. These words are effectively synonyms, which we discuss in <>. - -.Don't Use Types for Languages -************************************************* - -You may be tempted to use a separate type for each language,((("types", "not using for languages")))((("languages", "not using types for"))) instead of a -separate index. For best results, you should avoid using types for this -purpose. As explained in <>, fields from different types but with -the same field name are indexed into the _same inverted index_. This means -that the term frequencies from each type (and thus each language) are mixed -together. - -To ensure that the term frequencies of one language don't pollute those of -another, either use a separate index for each language, or a separate field, -as explained in the next section. - -************************************************* diff --git a/200_Language_intro/50_One_language_per_field.asciidoc b/200_Language_intro/50_One_language_per_field.asciidoc index ba4ebf9ec..23ada51d4 100644 --- a/200_Language_intro/50_One_language_per_field.asciidoc +++ b/200_Language_intro/50_One_language_per_field.asciidoc @@ -1,10 +1,11 @@ [[one-lang-fields]] === One Language per Field -For documents that represent entities like products, movies, or legal notices, it is common((("fields", "one language per field")))((("languages", "one language per field"))) -for the same text to be translated into several languages. 
Although each translation -could be represented in a single document in an index per language, another -reasonable approach is to keep all translations in the same document: +For documents that represent entities like products, movies, or legal notices, +it is common for the same text to be translated into several languages. Although +each translation could be represented in a single document in an index per +language, another reasonable approach is to keep all translations in the same +document: [source,js] -------------------------------------------------- @@ -29,29 +30,31 @@ PUT /movies "movie": { "properties": { "title": { <1> - "type": "string" + "type": "text" }, "title_br": { <2> - "type": "string", - "analyzer": "brazilian" + "type": "text", + "analyzer": "brazilian" }, "title_cz": { <2> - "type": "string", - "analyzer": "czech" + "type": "text", + "analyzer": "czech" }, "title_en": { <2> - "type": "string", - "analyzer": "english" + "type": "text", + "analyzer": "english" }, "title_es": { <2> - "type": "string", - "analyzer": "spanish" + "type": "text", + "analyzer": "spanish" } } } } } -------------------------------------------------- +// CONSOLE + <1> The `title` field contains the original title and uses the `standard` analyzer. <2> Each of the other fields uses the appropriate analyzer for @@ -59,19 +62,31 @@ PUT /movies Like the _index-per-language_ approach, the _field-per-language_ approach maintains clean term frequencies. It is not quite as flexible as having -separate indices. Although it is easy to add a new field by using the <>, those new fields may require new -custom analyzers, which can only be set up at index creation time. As a -workaround, you can {ref}/indices-open-close.html[close] the index, add the new -analyzers with the {ref}/indices-update-settings.html[`update-settings` API], +separate indices. Although it is easy to add a new field by using the <>, +those new fields may require new custom analyzers, which can only be set up at +index creation time. As a workaround, you can {ref}/indices-open-close.html[close] +the index, add the new analyzers with the {ref}/indices-update-settings.html[`update-settings` API], then reopen the index, but closing the index means that it will require some downtime. -The documents of a((("boosting", "query-time", "boosting a field"))) single language can be queried independently, or queries +The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple fields. We can even specify a preference for particular languages by boosting that field: [source,js] -------------------------------------------------- +PUT /movies/movie/1 +{ + "title": "Fight club", + "title_br": "Clube de Luta", + "title_cz": "Klub rváčů", + "title_en": "Fight club", + "title_es": "El club de la lucha" +} + +PUT /movies/movie/2 +{ "title": "Superhero Fight Club" } + GET /movies/movie/_search { "query": { @@ -83,7 +98,9 @@ GET /movies/movie/_search } } -------------------------------------------------- +// CONSOLE +// TEST[continued] + <1> This search queries any field beginning with `title` but boosts the `title_es` field by `2`. All other fields have a neutral boost of `1`. 
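A small harness can make the effect of that boost easier to compare side by side. The following sketch is illustrative rather than part of the chapter; it assumes the `movies` index and sample documents created above, a local node at `http://localhost:9200`, and the third-party Python `requests` library.

[source,python]
----
import requests  # third-party HTTP client, assumed to be installed

ES = "http://localhost:9200"  # assumption: local test node

def search_titles(text, boosted_field=None):
    """Run a most_fields query over all title* fields, optionally boosting one."""
    fields = ["title*"]
    if boosted_field:
        fields.append("{}^2".format(boosted_field))  # e.g. "title_es^2"
    body = {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "most_fields"
            }
        }
    }
    resp = requests.get("{}/movies/movie/_search".format(ES), json=body)
    resp.raise_for_status()
    return [(hit["_id"], hit["_score"]) for hit in resp.json()["hits"]["hits"]]

print(search_titles("club de la lucha"))              # neutral boosts on every field
print(search_titles("club de la lucha", "title_es"))  # Spanish title boosted by 2
----
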
- diff --git a/200_Language_intro/60_Mixed_language_fields.asciidoc b/200_Language_intro/60_Mixed_language_fields.asciidoc index 1b8bd1057..e5897a43b 100644 --- a/200_Language_intro/60_Mixed_language_fields.asciidoc +++ b/200_Language_intro/60_Mixed_language_fields.asciidoc @@ -2,7 +2,7 @@ === Mixed-Language Fields Usually, documents that mix multiple languages in a single field come from -sources beyond your control, such as((("languages", "mixed language fields")))((("fields", "mixed language"))) pages scraped from the Web: +sources beyond your control, such as pages scraped from the Web: [source,js] -------------------------------------------------- @@ -17,7 +17,8 @@ Or rather, stemmers are language and script specific. As discussed in <>, if every language uses a different script, then stemmers can be combined. -Assuming that your mix of languages uses the same script such as Latin, you have three choices available to you: +Assuming that your mix of languages uses the same script such as Latin, you have +three choices available to you: * Split into separate fields * Analyze multiple times @@ -25,14 +26,14 @@ Assuming that your mix of languages uses the same script such as Latin, you have ==== Split into Separate Fields -The Compact Language Detector ((("languages", "mixed language fields", "splitting into separate fields")))((("Compact Language Detector (CLD)")))mentioned in <> can tell +The Compact Language Detector mentioned in <> can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in <>. ==== Analyze Multiple Times -If you primarily deal with a limited number of languages, ((("languages", "mixed language fields", "analyzing multiple times")))((("analyzers", "for mixed language fields")))((("multifields", "analying mixed language fields")))you could use +If you primarily deal with a limited number of languages, you could use multi-fields to analyze the text once per language: [source,js] @@ -43,22 +44,22 @@ PUT /movies "title": { "properties": { "title": { <1> - "type": "string", + "type": "text", "fields": { "de": { <2> - "type": "string", + "type": "text", "analyzer": "german" }, "en": { <2> - "type": "string", + "type": "text", "analyzer": "english" }, "fr": { <2> - "type": "string", + "type": "text", "analyzer": "french" }, "es": { <2> - "type": "string", + "type": "text", "analyzer": "spanish" } } @@ -68,15 +69,18 @@ PUT /movies } } -------------------------------------------------- +// CONSOLE + <1> The main `title` field uses the `standard` analyzer. <2> Each subfield applies a different language analyzer to the text in the `title` field. ==== Use n-grams -You could index all words as n-grams, using the ((("n-grams", "for mixed language fields")))((("languages", "mixed language fields", "n-grams, indexing words as")))same approach as +You could index all words as n-grams, using the same approach as described in <>. Most inflections involve adding a -suffix (or in some languages, a prefix) to a word, so by breaking each word into n-grams, you have a good chance of matching words that are similar +suffix (or in some languages, a prefix) to a word, so by breaking each word into +n-grams, you have a good chance of matching words that are similar but not exactly the same. 
This can be combined with the _analyze-multiple times_ approach to provide a catchall field for unsupported languages: @@ -85,32 +89,50 @@ times_ approach to provide a catchall field for unsupported languages: PUT /movies { "settings": { - "analysis": {...} <1> + "analysis": { + "filter": { + "trigrams_filter": { + "type": "ngram", + "min_gram": 3, + "max_gram": 3 + } + }, + "analyzer": { + "trigrams": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "trigrams_filter" + ] + } + } + } }, "mappings": { "title": { "properties": { "title": { - "type": "string", + "type": "text", "fields": { "de": { - "type": "string", + "type": "text", "analyzer": "german" }, "en": { - "type": "string", + "type": "text", "analyzer": "english" }, "fr": { - "type": "string", + "type": "text", "analyzer": "french" }, "es": { - "type": "string", + "type": "text", "analyzer": "spanish" }, - "general": { <2> - "type": "string", + "general": { <1> + "type": "text", "analyzer": "trigrams" } } @@ -120,9 +142,9 @@ PUT /movies } } -------------------------------------------------- -<1> In the `analysis` section, we define the same `trigrams` - analyzer as described in <>. -<2> The `title.general` field uses the `trigrams` analyzer +// CONSOLE + +<1> The `title.general` field uses the `trigrams` custom analyzer to index any language. When querying the catchall `general` field, you can use @@ -133,6 +155,13 @@ than those on the `general` field: [source,js] -------------------------------------------------- +PUT /movies/movie/1 +{ "title": "club de la lucha" } + + +PUT /movies/movie/2 +{ "title": "Superhero Fight Club" } + GET /movies/movie/_search { "query": { @@ -145,8 +174,10 @@ GET /movies/movie/_search } } -------------------------------------------------- +// CONSOLE +// TEST[continued] + <1> All `title` or `title.*` fields are given a slight boost over the `title.general` field. -<2> The `minimum_should_match` parameter reduces the number of low-quality matches returned, especially important for the `title.general` field. - - +<2> The `minimum_should_match` parameter reduces the number of low-quality + matches returned, especially important for the `title.general` field. From e926454b2ebcbcff5f25f741420a40c39340d706 Mon Sep 17 00:00:00 2001 From: Josh Rich Date: Tue, 2 May 2017 14:55:58 +1000 Subject: [PATCH 083/107] Fix em-dash usage. --- 200_Language_intro/40_One_language_per_doc.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/200_Language_intro/40_One_language_per_doc.asciidoc b/200_Language_intro/40_One_language_per_doc.asciidoc index e81492e1c..e9bb00f9c 100644 --- a/200_Language_intro/40_One_language_per_doc.asciidoc +++ b/200_Language_intro/40_One_language_per_doc.asciidoc @@ -2,8 +2,8 @@ === One Language per Document A single predominant language per document requires a relatively simple setup. -Documents from different languages can be stored in separate indices—`blogs-en`, -`blogs-fr`, and so forth—that use the same fields for each index, just +Documents from different languages can be stored in separate indices -- `blogs-en`, +`blogs-fr`, and so forth -- that use the same fields for each index, just with different analyzers: [source,js] From 871bf63c8343d207cd6c984034db18757f1b7a51 Mon Sep 17 00:00:00 2001 From: Daniel Mitterdorfer Date: Tue, 16 May 2017 08:06:17 -0700 Subject: [PATCH 084/107] Don't compare with SQL in aggs chapters (#669) With this commit we remove all SQL references in the aggs chapters that don't contribute any value. 
--- 300_Aggregations/60_cardinality.asciidoc | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/300_Aggregations/60_cardinality.asciidoc b/300_Aggregations/60_cardinality.asciidoc index 0ec11c9b1..38fcf245f 100644 --- a/300_Aggregations/60_cardinality.asciidoc +++ b/300_Aggregations/60_cardinality.asciidoc @@ -2,14 +2,7 @@ === Finding Distinct Counts The first approximate aggregation provided by Elasticsearch is the `cardinality` -metric.((("cardinality", "finding distinct counts")))((("aggregations", "approximate", "cardinality")))((("approximate algorithms", "cardinality")))((("distinct counts"))) This provides the cardinality of a field, also called a _distinct_ or -_unique_ count. ((("unique counts"))) You may be familiar with the SQL version: - -[source, sql] --------- -SELECT COUNT(DISTINCT color) -FROM cars --------- +metric.((("cardinality", "finding distinct counts")))((("aggregations", "approximate", "cardinality")))((("approximate algorithms", "cardinality")))((("distinct counts"))) Distinct counts are a common operation, and answer many fundamental business questions: From e7ae568781d2ddc560dcaa8623943ce934e82316 Mon Sep 17 00:00:00 2001 From: lcawley Date: Fri, 19 May 2017 16:27:08 -0700 Subject: [PATCH 085/107] [DOCS] Fix callout in definitive guide --- 030_Data/45_Partial_update.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/030_Data/45_Partial_update.asciidoc b/030_Data/45_Partial_update.asciidoc index dc3cf6d0f..8dee35f3f 100644 --- a/030_Data/45_Partial_update.asciidoc +++ b/030_Data/45_Partial_update.asciidoc @@ -240,7 +240,8 @@ POST /website/pageviews/1/_update?retry_on_conflict=5 <1> } -------------------------------------------------- // SENSE: 030_Data/45_Upsert.json -<1> Retry this update five times before failing. + +\<1> Retry this update five times before failing. This works well for operations such as incrementing a counter, where the order of increments does not matter, but in other situations the order of @@ -249,4 +250,3 @@ adopts a _last-write-wins_ approach by default, but it also accepts a `version` parameter that allows you to use <> to specify which version of the document you intend to update. - From a5415ad1de47fd320a4c91ed6f248432a6392a6b Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Fri, 19 May 2017 16:35:34 -0700 Subject: [PATCH 086/107] Revert "[DOCS] Fix callout in definitive guide" --- 030_Data/45_Partial_update.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/030_Data/45_Partial_update.asciidoc b/030_Data/45_Partial_update.asciidoc index 8dee35f3f..dc3cf6d0f 100644 --- a/030_Data/45_Partial_update.asciidoc +++ b/030_Data/45_Partial_update.asciidoc @@ -240,8 +240,7 @@ POST /website/pageviews/1/_update?retry_on_conflict=5 <1> } -------------------------------------------------- // SENSE: 030_Data/45_Upsert.json - -\<1> Retry this update five times before failing. +<1> Retry this update five times before failing. This works well for operations such as incrementing a counter, where the order of increments does not matter, but in other situations the order of @@ -250,3 +249,4 @@ adopts a _last-write-wins_ approach by default, but it also accepts a `version` parameter that allows you to use <> to specify which version of the document you intend to update. 
+ From 9d5348a36442ccef944f9dd452e6297392f19a2e Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Tue, 14 Nov 2017 15:57:16 +0100 Subject: [PATCH 087/107] Changed link from postings highlighter to unified highlighter --- 240_Stopwords/50_Phrase_queries.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/240_Stopwords/50_Phrase_queries.asciidoc b/240_Stopwords/50_Phrase_queries.asciidoc index 47e4a1065..8d89005d4 100644 --- a/240_Stopwords/50_Phrase_queries.asciidoc +++ b/240_Stopwords/50_Phrase_queries.asciidoc @@ -98,7 +98,7 @@ in the index for each field.((("fields", "index options"))) Valid values are as Store `docs`, `freqs`, `positions`, and the start and end character offsets of each term in the original string. This information is used by the - http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#postings-highlighter[`postings` highlighter] + https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#_unified_highlighter[`unified` highlighter] but is disabled by default. You can set `index_options` on fields added at index creation time, or when From 29c6557ab459a8aa4fada913b005989bc43a5ebd Mon Sep 17 00:00:00 2001 From: Deb Adair Date: Tue, 14 Nov 2017 08:06:59 -0800 Subject: [PATCH 088/107] Fixed cross doc link. --- 240_Stopwords/50_Phrase_queries.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/240_Stopwords/50_Phrase_queries.asciidoc b/240_Stopwords/50_Phrase_queries.asciidoc index 8d89005d4..ed73f3cb0 100644 --- a/240_Stopwords/50_Phrase_queries.asciidoc +++ b/240_Stopwords/50_Phrase_queries.asciidoc @@ -98,7 +98,7 @@ in the index for each field.((("fields", "index options"))) Valid values are as Store `docs`, `freqs`, `positions`, and the start and end character offsets of each term in the original string. This information is used by the - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#_unified_highlighter[`unified` highlighter] + https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-highlighting.html#_unified_highlighter[`unified` highlighter] but is disabled by default. You can set `index_options` on fields added at index creation time, or when From 07e372023224c202ed105a7bc0b34b8370578eed Mon Sep 17 00:00:00 2001 From: Deb Adair Date: Tue, 14 Nov 2017 08:38:07 -0800 Subject: [PATCH 089/107] Cross doc link fix --- 240_Stopwords/50_Phrase_queries.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/240_Stopwords/50_Phrase_queries.asciidoc b/240_Stopwords/50_Phrase_queries.asciidoc index ed73f3cb0..2b256985c 100644 --- a/240_Stopwords/50_Phrase_queries.asciidoc +++ b/240_Stopwords/50_Phrase_queries.asciidoc @@ -98,7 +98,7 @@ in the index for each field.((("fields", "index options"))) Valid values are as Store `docs`, `freqs`, `positions`, and the start and end character offsets of each term in the original string. This information is used by the - https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-highlighting.html#_unified_highlighter[`unified` highlighter] + https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-request-highlighting.html#_unified_highlighter[`unified` highlighter] but is disabled by default. 
You can set `index_options` on fields added at index creation time, or when From 5d531555c54e7ff8898680ab97d72e2bd9b5a916 Mon Sep 17 00:00:00 2001 From: Simon Willnauer Date: Tue, 30 Jan 2018 23:14:00 +0100 Subject: [PATCH 090/107] Replace tribe reference with cross cluster search tribe is not considered the right solution for cross cluster search access. It's been deprecated since 5.4 --- 410_Scaling/80_Scale_is_not_infinite.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/410_Scaling/80_Scale_is_not_infinite.asciidoc b/410_Scaling/80_Scale_is_not_infinite.asciidoc index 2e0cfdfdf..adcf6472a 100644 --- a/410_Scaling/80_Scale_is_not_infinite.asciidoc +++ b/410_Scaling/80_Scale_is_not_infinite.asciidoc @@ -83,5 +83,5 @@ small and agile. Eventually, despite your best intentions, you may find that the number of nodes and indices and mappings that you have is just too much for one cluster. At this stage, it is probably worth dividing the problem into multiple -clusters. Thanks to {ref}/modules-tribe.html[`tribe` nodes], you can even run +clusters. Thanks to {ref}/cross-cluster-search.html[cross cluster search], you can even run searches across multiple clusters, as if they were one big cluster. From 47e1e2c2cd41ba2d32f0f95da1daf394b5e00adb Mon Sep 17 00:00:00 2001 From: debadair Date: Tue, 30 Jan 2018 15:36:32 -0800 Subject: [PATCH 091/107] Fixed link to CCS topic in ES ref. --- 410_Scaling/80_Scale_is_not_infinite.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/410_Scaling/80_Scale_is_not_infinite.asciidoc b/410_Scaling/80_Scale_is_not_infinite.asciidoc index adcf6472a..1d5f7a2ea 100644 --- a/410_Scaling/80_Scale_is_not_infinite.asciidoc +++ b/410_Scaling/80_Scale_is_not_infinite.asciidoc @@ -83,5 +83,5 @@ small and agile. Eventually, despite your best intentions, you may find that the number of nodes and indices and mappings that you have is just too much for one cluster. At this stage, it is probably worth dividing the problem into multiple -clusters. Thanks to {ref}/cross-cluster-search.html[cross cluster search], you can even run +clusters. Thanks to {ref}/modules-cross-cluster-search.html[cross cluster search], you can even run searches across multiple clusters, as if they were one big cluster. From eb0004640922da772be5ccb61060642a23b67e6b Mon Sep 17 00:00:00 2001 From: Deb Adair Date: Mon, 25 Jun 2018 15:51:08 -0700 Subject: [PATCH 092/107] Updated header and added link to the ES Ref. --- page_header.html | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/page_header.html b/page_header.html index e9ec2bc89..bf090c627 100644 --- a/page_header.html +++ b/page_header.html @@ -1 +1,4 @@ -We are working on updating this book for the latest version. Some content might be out of date. \ No newline at end of file +This information may not apply to the latest version of Elasticsearch. +For the most up to date information, see the current version of the + +Elasticsearch Reference. 
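On the cross-cluster search change a couple of commits above: once a remote cluster has been registered in the cluster settings under an alias, searching it is just a matter of prefixing index names with that alias. The sketch below is an illustration only; the alias `cluster_two`, the `logs-*` index pattern, the `message` field, and the local endpoint are all assumptions.

[source,python]
----
import requests  # third-party HTTP client, assumed to be installed

ES = "http://localhost:9200"  # assumption: a coordinating node in the local cluster

# Query a local index pattern and the same pattern on a remote cluster that was
# registered under the alias "cluster_two"; the "alias:index" prefix is what
# routes part of the search to the remote cluster.
body = {"query": {"match": {"message": "error"}}, "size": 5}
resp = requests.get("{}/logs-*,cluster_two:logs-*/_search".format(ES), json=body)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    # Hits coming from the remote cluster report their index as "cluster_two:..."
    print(hit["_index"], hit["_score"])
----
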
From 8b22133e33bcbbfd2302c83ad73b362e539fb103 Mon Sep 17 00:00:00 2001 From: lcawl Date: Thu, 6 Dec 2018 10:37:49 -0800 Subject: [PATCH 093/107] [DOCS] Fixes link to Elasticsearch Reference --- 010_Intro/10_Installing_ES.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/010_Intro/10_Installing_ES.asciidoc b/010_Intro/10_Installing_ES.asciidoc index 0bb101cab..9860bd9cf 100644 --- a/010_Intro/10_Installing_ES.asciidoc +++ b/010_Intro/10_Installing_ES.asciidoc @@ -12,8 +12,8 @@ You can get the latest version of Elasticsearch from https://www.elastic.co/downloads/elasticsearch[_elastic.co/downloads/elasticsearch_]. To install Elasticsearch, download and extract the archive file for your platform. For -more information, see the {ref}/_installation.html[Installation] topic in the Elasticsearch -Reference. +more information, see the {ref}/install-elasticsearch.html[Installation] topic +in the Elasticsearch Reference. [TIP] ==== From a8b480e16c992175231c1b3243f600e5b7ab8fbb Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Tue, 9 Apr 2019 17:24:48 -0400 Subject: [PATCH 094/107] [DOCS] Fix broken link for 7.0 release --- 510_Deployment/40_config.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 1ec70c372..a15d56212 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -270,6 +270,6 @@ a day. This setting is configured in `elasticsearch.yml`: discovery.zen.ping.unicast.hosts: ["host1", "host2:port"] ---- -For more information about how Elasticsearch nodes find eachother, see -https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] +For more information about how Elasticsearch nodes find each other, see +{ref}/modules-discovery.html[Discovery and cluster formation] in the Elasticsearch Reference. From af9a210824be4814abf75d2508646cff74cf7dbf Mon Sep 17 00:00:00 2001 From: Nik Everett Date: Tue, 23 Apr 2019 17:34:59 -0400 Subject: [PATCH 095/107] Cleanup list in preparation for moving to asciidoctor Asciidoctor like the list shaped like this. --- .../15_Create_index_delete.asciidoc | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/040_Distributed_CRUD/15_Create_index_delete.asciidoc b/040_Distributed_CRUD/15_Create_index_delete.asciidoc index 954be723d..ab35e8616 100644 --- a/040_Distributed_CRUD/15_Create_index_delete.asciidoc +++ b/040_Distributed_CRUD/15_Create_index_delete.asciidoc @@ -32,44 +32,37 @@ this process, possibly increasing performance at the cost of data security. These options are seldom used because Elasticsearch is already fast, but they are explained here for the sake of completeness: --- - `consistency`:: + --- By default, the primary shard((("consistency request parameter")))((("quorum"))) requires a _quorum_, or majority, of shard copies (where a shard copy can be a primary or a replica shard) to be available before even attempting a write operation. This is to prevent writing data to the ``wrong side'' of a network partition. A quorum is defined as follows: - ++ int( (primary + number_of_replicas) / 2 ) + 1 - ++ The allowed values for `consistency` are `one` (just the primary shard), `all` (the primary and all replicas), or the default `quorum`, or majority, of shard copies. 
- ++ Note that the `number_of_replicas` is the number of replicas _specified_ in the index settings, not the number of replicas that are currently active. If you have specified that an index should have three replicas, a quorum would be as follows: - ++ int( (primary + 3 replicas) / 2 ) + 1 = 3 - ++ But if you start only two nodes, there will be insufficient active shard copies to satisfy the quorum, and you will be unable to index or delete any documents. --- - `timeout`:: - ++ What happens if insufficient shard copies are available? Elasticsearch waits, in the hope that more shards will appear. By default, it will wait up to 1 minute. If you need to, you can use the `timeout` parameter((("timeout parameter"))) to make it abort sooner: `100` is 100 milliseconds, and `30s` is 30 seconds. --- - [NOTE] =================================================== A new index has `1` replica by default, which means that two active shard From f424aaf79a66a2912f7dd61f9b424115c4f9f14a Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 31 May 2019 11:50:50 -0400 Subject: [PATCH 096/107] [DOCS] Fixes broken link to `common` terms query --- 240_Stopwords/40_Divide_and_conquer.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/240_Stopwords/40_Divide_and_conquer.asciidoc b/240_Stopwords/40_Divide_and_conquer.asciidoc index e7a3c524b..0e1ca429c 100644 --- a/240_Stopwords/40_Divide_and_conquer.asciidoc +++ b/240_Stopwords/40_Divide_and_conquer.asciidoc @@ -188,5 +188,5 @@ documents that have 75% of all high-frequency terms with a query like this: } --------------------------------- -See the {ref}/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. +See the https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. From 9e8bf88bcf119e595ee961aaf20ffe6e516f76c3 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 31 May 2019 11:54:30 -0400 Subject: [PATCH 097/107] [DOCS] Fixes broken link to `common` terms query --- 240_Stopwords/40_Divide_and_conquer.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/240_Stopwords/40_Divide_and_conquer.asciidoc b/240_Stopwords/40_Divide_and_conquer.asciidoc index 0e1ca429c..bc47b7ca3 100644 --- a/240_Stopwords/40_Divide_and_conquer.asciidoc +++ b/240_Stopwords/40_Divide_and_conquer.asciidoc @@ -188,5 +188,5 @@ documents that have 75% of all high-frequency terms with a query like this: } --------------------------------- -See the https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. +See the https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. From cf9b1f699dbbf447666d4b235f668cb3489db666 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 31 May 2019 13:35:40 -0400 Subject: [PATCH 098/107] [DOCS] Remove relative `/current` links to Elasticsearch Reference Guide The `/current` branch of the Elasticsearch Reference Guide changes frequently. Links to that branch of the documentation often break as pages are removed. This changes most `/current` links to use the `2.4` branch. 
--- 070_Index_Mgmt/20_Custom_Analyzers.asciidoc | 2 +- 200_Language_intro/00_Intro.asciidoc | 2 +- 230_Stemming/60_Stemming_in_situ.asciidoc | 2 +- 240_Stopwords/40_Divide_and_conquer.asciidoc | 2 +- 300_Aggregations/100_circuit_breaker_fd_settings.asciidoc | 2 +- 404_Parent_Child/60_Children_agg.asciidoc | 2 +- book.asciidoc | 2 +- 7 files changed, 7 insertions(+), 7 deletions(-) diff --git a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc index bf7c6ee11..8209051ab 100644 --- a/070_Index_Mgmt/20_Custom_Analyzers.asciidoc +++ b/070_Index_Mgmt/20_Custom_Analyzers.asciidoc @@ -48,7 +48,7 @@ After tokenization, the resulting _token stream_ is passed through any specified token filters,((("token filters"))) in the order in which they are specified. Token filters may change, add, or remove tokens. We have already mentioned the -http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenizer.html[`lowercase`] and +{ref}/analysis-lowercase-tokenizer.html[`lowercase`] and {ref}/analysis-stop-tokenfilter.html[`stop` token filters], but there are many more available in Elasticsearch. {ref}/analysis-stemmer-tokenfilter.html[Stemming token filters] diff --git a/200_Language_intro/00_Intro.asciidoc b/200_Language_intro/00_Intro.asciidoc index 0d0c34440..8b85f501c 100644 --- a/200_Language_intro/00_Intro.asciidoc +++ b/200_Language_intro/00_Intro.asciidoc @@ -2,7 +2,7 @@ == Getting Started with Languages Elasticsearch ships with a collection of language analyzers that provide -good, basic, https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html[out-of-the-box support] +good, basic, {ref}/analysis-lang-analyzer.html[out-of-the-box support] for many of the world's most common languages. These analyzers typically perform four roles: diff --git a/230_Stemming/60_Stemming_in_situ.asciidoc b/230_Stemming/60_Stemming_in_situ.asciidoc index 8670c7a6d..cb32a53e0 100644 --- a/230_Stemming/60_Stemming_in_situ.asciidoc +++ b/230_Stemming/60_Stemming_in_situ.asciidoc @@ -19,7 +19,7 @@ Pos 4: (jumped,jump) <1> WARNING: Read <> before using this approach. To achieve stemming _in situ_, we will use the -http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html[`keyword_repeat`] +{ref}/analysis-keyword-repeat-tokenfilter.html[`keyword_repeat`] token filter,((("keyword_repeat token filter"))) which, like the `keyword_marker` token filter (see <>), marks each term as a keyword to prevent the subsequent stemmer from touching it. However, it also repeats the term in the same diff --git a/240_Stopwords/40_Divide_and_conquer.asciidoc b/240_Stopwords/40_Divide_and_conquer.asciidoc index bc47b7ca3..e7a3c524b 100644 --- a/240_Stopwords/40_Divide_and_conquer.asciidoc +++ b/240_Stopwords/40_Divide_and_conquer.asciidoc @@ -188,5 +188,5 @@ documents that have 75% of all high-frequency terms with a query like this: } --------------------------------- -See the https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. +See the {ref}/query-dsl-common-terms-query.html[`common` terms query] reference page for more options. 
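For reference, a `common` terms query of the kind referred to above has roughly the following shape; the index name, the `text` field, the query string, and the cutoff value are assumptions, and the 2.4 reference page linked in the diff covers the remaining options.

[source,python]
----
import requests  # third-party HTTP client, assumed to be installed

ES = "http://localhost:9200"  # assumption: local test node

# Terms above cutoff_frequency are treated as high-frequency ("stopword-like");
# a document must match at least 75% of those high-frequency terms to qualify.
body = {
    "query": {
        "common": {
            "text": {                               # assumption: field is named "text"
                "query": "the quick and the dead",  # assumption: sample query string
                "cutoff_frequency": 0.01,           # assumption: sample cutoff
                "minimum_should_match": {
                    "low_freq": 1,
                    "high_freq": "75%"
                }
            }
        }
    }
}
resp = requests.get("{}/my_index/_search".format(ES), json=body)  # assumption: index name
resp.raise_for_status()
print(resp.json()["hits"]["total"])
----
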
diff --git a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc index 57cfb87da..69cc8c968 100644 --- a/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc +++ b/300_Aggregations/100_circuit_breaker_fd_settings.asciidoc @@ -135,7 +135,7 @@ indicate a serious resource issue and a reason for poor performance. Fielddata usage can be monitored: -* per-index using the http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html[`indices-stats` API]: +* per-index using the {ref}/indices-stats.html[`indices-stats` API]: + [source,json] ------------------------------- diff --git a/404_Parent_Child/60_Children_agg.asciidoc b/404_Parent_Child/60_Children_agg.asciidoc index 6af80f0ec..e8b47c62e 100644 --- a/404_Parent_Child/60_Children_agg.asciidoc +++ b/404_Parent_Child/60_Children_agg.asciidoc @@ -2,7 +2,7 @@ === Children Aggregation Parent-child supports a -http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html[`children` aggregation] as ((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation")))a direct analog to the `nested` aggregation discussed in +{ref}/search-aggregations-bucket-children-aggregation.html[`children` aggregation] as ((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation")))a direct analog to the `nested` aggregation discussed in <>. A parent aggregation (the equivalent of `reverse_nested`) is not supported. diff --git a/book.asciidoc b/book.asciidoc index e3c62671a..9a0083add 100644 --- a/book.asciidoc +++ b/book.asciidoc @@ -1,6 +1,6 @@ :bookseries: animal :es_build: 1 -:ref: https://www.elastic.co/guide/en/elasticsearch/reference/master +:ref: https://www.elastic.co/guide/en/elasticsearch/reference/2.4 = Elasticsearch: The Definitive Guide From 10a61334e4f61eae7aed19472fd625de2b77681c Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Fri, 31 May 2019 15:56:31 -0400 Subject: [PATCH 099/107] [DOCS] Remove "we're updating this book" note. As of writing, there are no plans to update the Definitive Guide for Elasticsearch versions past 2.4. --- book-docinfo.xml | 4 ---- 1 file changed, 4 deletions(-) diff --git a/book-docinfo.xml b/book-docinfo.xml index 07834aade..713e5cfd1 100644 --- a/book-docinfo.xml +++ b/book-docinfo.xml @@ -1,4 +1,3 @@ -PLEASE NOTE:
We are working on updating this book for the latest version. Some content might be out of date.?> @@ -15,9 +14,6 @@ Elasticsearch: The Definitive Guide, Second Edition - - We are working on updating this book for the latest version. Some content might be out of date. - If you would like to purchase an eBook or printed version of this book once it is complete, you can do so from O'Reilly Media: From 53ec34aeecd3441b91e19185f9a644c8bf7c3432 Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Mon, 3 Jun 2019 12:04:12 -0400 Subject: [PATCH 100/107] [DOCS] Fix broken 2.4 links to Elasticsearch Reference Guide --- 010_Intro/10_Installing_ES.asciidoc | 2 +- 170_Relevance/65_Script_score.asciidoc | 2 +- 410_Scaling/80_Scale_is_not_infinite.asciidoc | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/010_Intro/10_Installing_ES.asciidoc b/010_Intro/10_Installing_ES.asciidoc index 9860bd9cf..fff1acc1f 100644 --- a/010_Intro/10_Installing_ES.asciidoc +++ b/010_Intro/10_Installing_ES.asciidoc @@ -12,7 +12,7 @@ You can get the latest version of Elasticsearch from https://www.elastic.co/downloads/elasticsearch[_elastic.co/downloads/elasticsearch_]. To install Elasticsearch, download and extract the archive file for your platform. For -more information, see the {ref}/install-elasticsearch.html[Installation] topic +more information, see the https://www.elastic.co/guide/en/elasticsearch/reference/5.6/install-elasticsearch.html[Installation] topic in the Elasticsearch Reference. [TIP] diff --git a/170_Relevance/65_Script_score.asciidoc b/170_Relevance/65_Script_score.asciidoc index ca914ea49..e2358f9d4 100644 --- a/170_Relevance/65_Script_score.asciidoc +++ b/170_Relevance/65_Script_score.asciidoc @@ -115,7 +115,7 @@ scripts are not quite fast enough, you have three options: document. * Groovy is fast, but not quite as fast as Java.((("Java", "scripting in"))) You could reimplement your script as a native Java script. (See - {ref}/modules-scripting-native.html[Native Java Scripts]). + https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-scripting-native.html[Native Java Scripts]). * Use the `rescore` functionality((("rescoring"))) described in <> to apply your script to only the best-scoring documents. diff --git a/410_Scaling/80_Scale_is_not_infinite.asciidoc b/410_Scaling/80_Scale_is_not_infinite.asciidoc index 1d5f7a2ea..71032070d 100644 --- a/410_Scaling/80_Scale_is_not_infinite.asciidoc +++ b/410_Scaling/80_Scale_is_not_infinite.asciidoc @@ -83,5 +83,5 @@ small and agile. Eventually, despite your best intentions, you may find that the number of nodes and indices and mappings that you have is just too much for one cluster. At this stage, it is probably worth dividing the problem into multiple -clusters. Thanks to {ref}/modules-cross-cluster-search.html[cross cluster search], you can even run +clusters. Thanks to https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-cross-cluster-search.html[cross cluster search], you can even run searches across multiple clusters, as if they were one big cluster. From 5ccd7a399ca07114a2a90e611772925eaec09b3f Mon Sep 17 00:00:00 2001 From: Nik Everett Date: Thu, 10 Oct 2019 12:06:41 -0400 Subject: [PATCH 101/107] Fix snippets One bad path and one missing file. This fixes the bad path and replaces the missing file with `// AUTOSENSE` rather than overriding the file. 
--- 054_Query_DSL/75_Combining_queries_together.asciidoc | 2 +- 300_Aggregations/65_percentiles.asciidoc | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/054_Query_DSL/75_Combining_queries_together.asciidoc b/054_Query_DSL/75_Combining_queries_together.asciidoc index bbba76b25..f784965e3 100644 --- a/054_Query_DSL/75_Combining_queries_together.asciidoc +++ b/054_Query_DSL/75_Combining_queries_together.asciidoc @@ -145,7 +145,7 @@ will be identical, but it may aid in query simplicity/clarity. } } -------------------------------------------------- -// SENSE: 054_Query_DSL/70_bool_query.json +// SENSE: 054_Query_DSL/70_Bool_query.json <1> A `term` query is placed inside the `constant_score`, converting it to a non-scoring filter. This method can be used in place of a `bool` query which only diff --git a/300_Aggregations/65_percentiles.asciidoc b/300_Aggregations/65_percentiles.asciidoc index 5ad642b21..fd3cd216e 100644 --- a/300_Aggregations/65_percentiles.asciidoc +++ b/300_Aggregations/65_percentiles.asciidoc @@ -76,7 +76,7 @@ POST /website/logs/_bulk { "index": {}} { "latency" : 319, "zone" : "EU", "timestamp" : "2014-10-29" } ---- -// SENSE: 300_Aggregations/65_percentiles.json +// AUTOSENSE This data contains three values: a latency, a data center zone, and a date timestamp. Let's run +percentiles+ over the whole dataset to get a feel for @@ -101,7 +101,7 @@ GET /website/logs/_search } } ---- -// SENSE: 300_Aggregations/65_percentiles.json +// AUTOSENSE <1> The `percentiles` metric is applied to the +latency+ field. <2> For comparison, we also execute an `avg` metric on the same field. @@ -163,7 +163,7 @@ GET /website/logs/_search } } ---- -// SENSE: 300_Aggregations/65_percentiles.json +// AUTOSENSE <1> First we separate our latencies into buckets, depending on their zone. <2> Then we calculate the percentiles per zone. <3> The +percents+ parameter accepts an array of percentiles that we want returned, @@ -254,7 +254,7 @@ GET /website/logs/_search } } ---- -// SENSE: 300_Aggregations/65_percentiles.json +// AUTOSENSE <1> The `percentile_ranks` metric accepts an array of values that you want ranks for. After running this aggregation, we get two values back: From 73ae13dd54d8d0ec189449e019209c08618cb6e5 Mon Sep 17 00:00:00 2001 From: Nik Everett Date: Thu, 17 Oct 2019 15:51:54 -0400 Subject: [PATCH 102/107] Add title-separator It prevents `The Definitive Guide` from becoming the subtitle in Asciidoctor. --- book.asciidoc | 1 + 1 file changed, 1 insertion(+) diff --git a/book.asciidoc b/book.asciidoc index 9a0083add..042d36b10 100644 --- a/book.asciidoc +++ b/book.asciidoc @@ -1,3 +1,4 @@ +:title-separator: | :bookseries: animal :es_build: 1 :ref: https://www.elastic.co/guide/en/elasticsearch/reference/2.4 From e8b7435f3d904fafc5e729733fe67a31ecca7b65 Mon Sep 17 00:00:00 2001 From: Nik Everett Date: Thu, 17 Oct 2019 16:52:02 -0400 Subject: [PATCH 103/107] Lock page names Asciidoctor derrives these page names differently and we'd prefer not to move them. So this locks them to the name that AsciiDoc gave them. 
--- 300_Aggregations/35_date_histogram.asciidoc | 1 + 510_Deployment/45_dont_touch.asciidoc | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/300_Aggregations/35_date_histogram.asciidoc b/300_Aggregations/35_date_histogram.asciidoc index e0271acc9..641f2cd42 100644 --- a/300_Aggregations/35_date_histogram.asciidoc +++ b/300_Aggregations/35_date_histogram.asciidoc @@ -275,6 +275,7 @@ total sale price, and a bar chart for each individual make (per quarter), as sho .Sales per quarter, with distribution per make image::images/elas_29in02.png["Sales per quarter, with distribution per make"] +[[_the_sky_8217_s_the_limit]] === The Sky's the Limit These were obviously simple examples, but the sky really is the limit diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc index 1ca28a39c..4806caba5 100644 --- a/510_Deployment/45_dont_touch.asciidoc +++ b/510_Deployment/45_dont_touch.asciidoc @@ -1,4 +1,4 @@ - +[[_don_8217_t_touch_these_settings]] === Don't Touch These Settings! There are a few hotspots in Elasticsearch that people just can't seem to avoid From 3feada4a22d83e274a58cb6c635cd91ba7729cf4 Mon Sep 17 00:00:00 2001 From: Nik Everett Date: Mon, 16 Dec 2019 16:08:33 -0500 Subject: [PATCH 104/107] Add extra title page This makes it compatible with `--direct_html`. --- book-extra-title-page.html | 62 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) create mode 100644 book-extra-title-page.html diff --git a/book-extra-title-page.html b/book-extra-title-page.html new file mode 100644 index 000000000..22aaf3511 --- /dev/null +++ b/book-extra-title-page.html @@ -0,0 +1,62 @@ +

[book-extra-title-page.html: the HTML markup of the added 62-line file is not recoverable in this copy of the patch; the recovered text of the new title page follows.]
+ Clinton Gormley
+ Zachary Tong
+ Abstract
+ If you would like to purchase an eBook or printed version of this book once it is complete, you can do so from O'Reilly Media: Buy this book from O'Reilly Media
+ We welcome feedback – if you spot any errors or would like to suggest improvements, please open an issue on the GitHub repo.
\ No newline at end of file From 363018b5da4f1ae3af017436b836a7d5031b6994 Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Fri, 17 Sep 2021 16:00:28 -0400 Subject: [PATCH 105/107] The Definitive Guide is no longer maintained We no longer maintain the Definitive Guide or this repo. These docs only cover the 1.x and 2.x versions of Elasticsearch, which have passed their EOL dates. Those interested in the latest info should use the [current Elasticsearch docs][0] instead. Changes: * Updates the page header and README to clearly state the docs are no longer maintained. * Updates the contribution guidelines to discourage pull request and issues. * Removes a section of the title page for contributions. [0]: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html --- CONTRIBUTING.md | 71 -------------------------------------- README.md | 22 ++++++++---- book-extra-title-page.html | 11 ++---- page_header.html | 16 ++++++--- 4 files changed, 29 insertions(+), 91 deletions(-) delete mode 100644 CONTRIBUTING.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index 1682f9f3a..000000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,71 +0,0 @@ -## Contributing to the Definitive Guide - -### Contributing documentation changes - -If you have a change that you would like to contribute, please find or open an -issue about it first. Talk about what you would like to do. It might be that -somebody is already working on it, or that there are particular issues that -you should know about before making the change. - -Where possible, stick to an 80 character line length in the asciidoc source -files. Do not exceed 120 characters. Use 2 space indents in code examples. - -The process for contributing to any of the [Elastic repositories](https://github.com/elastic/) -is similar. Details can be found below. - -### Fork and clone the repository - -You will need to fork the main repository and clone it to your local machine. -See the respective [Github help page](https://help.github.com/articles/fork-a-repo) -for help. - -### Submitting your changes - -Once your changes and tests are ready to submit for review: - -1. Test your changes - - [Build the complete book locally](https://github.com/elastic/elasticsearch-definitive-guide#building-the-definitive-guide) - and check and correct any errors that you encounter. - -2. Sign the Contributor License Agreement - - Please make sure you have signed our [Contributor License Agreement](https://www.elastic.co/contributor-agreement/). - We are not asking you to assign copyright to us, but to give us the right - to distribute your code without restriction. We ask this of all - contributors in order to assure our users of the origin and continuing - existence of the code. You only need to sign the CLA once. - -3. Rebase your changes - - Update your local repository with the most recent code from the main - repository, and rebase your branch on top of the latest `master` branch. - We prefer your initial changes to be squashed into a single commit. Later, - if we ask you to make changes, add them as separate commits. This makes - them easier to review. As a final step before merging we will either ask - you to squash all commits yourself or we'll do it for you. - - -4. Submit a pull request - - Push your local changes to your forked copy of the repository and - [submit a pull request](https://help.github.com/articles/using-pull-requests). 
- In the pull request, choose a title which sums up the changes that you - have made, and in the body provide more details about what your changes do. - Also mention the number of the issue where discussion has taken place, - e.g. "Closes #123". - -Then sit back and wait. There will probably be discussion about the pull -request and, if any changes are needed, we would love to work with you to get -your pull request merged. - -Please adhere to the general guideline that you should never force push -to a publicly shared branch. Once you have opened your pull request, you -should consider your branch publicly shared. Instead of force pushing -you can just add incremental commits; this is generally easier on your -reviewers. If you need to pick up changes from master, you can merge -master into your branch. A reviewer might ask you to rebase a -long-running pull request in which case force pushing is okay for that -request. Note that squashing at the end of the review process should -also not be done, that can be done when the pull request is [integrated -via GitHub](https://github.com/blog/2141-squash-your-commits). \ No newline at end of file diff --git a/README.md b/README.md index 0880c5dae..93ec8064a 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,12 @@ -# The Definitive Guide to Elasticsearch +# The Definitive Guide to Elasticsearch + +This repository contains the source for the legacy [Definitive Guide to +Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html) +documentation and is no longer maintained. For the latest information, see the +current +Elasticsearch documentation. -This repository contains the sources to the "Definitive Guide to Elasticsearch" which you can [read online](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html). - ## Building the Definitive Guide In order to build this project, we rely on our [docs infrastructure](https://github.com/elastic/docs). @@ -26,13 +31,16 @@ The Definitive Guide is written in Asciidoc and the docs repo also contains a [s The Definitive Guide is available for multiple versions of Elasticsearch: -* The [branch `1.x`](https://github.com/elastic/elasticsearch-definitive-guide/tree/1.x) applies to Elasticsearch 1.x -* The [branch `2.x`](https://github.com/elastic/elasticsearch-definitive-guide/tree/2.x) applies to Elasticsearch 2.x -* The [branch `master`](https://github.com/elastic/elasticsearch-definitive-guide/tree/2.x) applies to master branch of Elasticsearch (the current development version) +* The [`1.x` branch](https://github.com/elastic/elasticsearch-definitive-guide/tree/1.x) applies to Elasticsearch 1.x +* The [`2.x` and `master` branches](https://github.com/elastic/elasticsearch-definitive-guide/tree/2.x) apply to Elasticsearch 2.x ## Contributing -Before contributing a change please read our [contribution guide](CONTRIBUTING.md). +This repository is no longer maintained. Pull requests and issues will not be +addressed. + +To contribute to the current Elasticsearch docs, refer to the [Elasticsearch +repo](https://github.com/elastic/elasticsearch/). ## License diff --git a/book-extra-title-page.html b/book-extra-title-page.html index 22aaf3511..d27cc141c 100644 --- a/book-extra-title-page.html +++ b/book-extra-title-page.html @@ -47,16 +47,9 @@

- If you would like to purchase an eBook or printed version of this book once it is complete, you can do so from O'Reilly Media:
+ If you would like to purchase an eBook or printed version of this book, you can do so from O'Reilly Media:
  Buy this book from O'Reilly Media
- We welcome feedback – if you spot any errors or would like to suggest improvements, please open an issue on the GitHub repo.
\ No newline at end of file + diff --git a/page_header.html b/page_header.html index bf090c627..51e66c0f8 100644 --- a/page_header.html +++ b/page_header.html @@ -1,4 +1,12 @@ -This information may not apply to the latest version of Elasticsearch. -For the most up to date information, see the current version of the - -Elasticsearch Reference. +

+ WARNING: This documentation covers Elasticsearch 2.x. The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
+ This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

From 28208796cf13db44b2b7b0cedd5f457a98491c00 Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Fri, 17 Sep 2021 19:45:43 -0400 Subject: [PATCH 106/107] consistent book title --- README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 93ec8064a..818753d1f 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,6 @@ -# The Definitive Guide to Elasticsearch +# Elasticsearch: The Definitive Guide -This repository contains the source for the legacy [Definitive Guide to -Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html) +This repository contains the source for the legacy [Elasticsearch: The Definitive Guide](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html) documentation and is no longer maintained. For the latest information, see the current From 18e98480be5a5343acdadb27e8dbf2a728c1f0c8 Mon Sep 17 00:00:00 2001 From: James Rodewig <40268737+jrodewig@users.noreply.github.com> Date: Fri, 17 Sep 2021 19:56:38 -0400 Subject: [PATCH 107/107] consistent repo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 818753d1f..893b1d7ce 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ This repository is no longer maintained. Pull requests and issues will not be addressed. To contribute to the current Elasticsearch docs, refer to the [Elasticsearch -repo](https://github.com/elastic/elasticsearch/). +repository](https://github.com/elastic/elasticsearch/). ## License