This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit 641b303

Finished scaling chapter

1 parent 5701e7c commit 641b303

9 files changed: +440 −7 lines changed

410_Scaling.asciidoc (+7)

@@ -18,5 +18,12 @@ include::410_Scaling/50_Index_templates.asciidoc[]
 
 include::410_Scaling/55_Retiring_data.asciidoc[]
 
+include::410_Scaling/60_Index_per_user.asciidoc[]
 
+include::410_Scaling/65_Shared_index.asciidoc[]
 
+include::410_Scaling/70_Faking_it.asciidoc[]
+
+include::410_Scaling/75_One_big_user.asciidoc[]
+
+include::410_Scaling/80_Scale_is_not_infinite.asciidoc[]

410_Scaling/10_Intro.asciidoc (+7)

@@ -18,5 +18,12 @@ are aware of those limitations and work with them, the growing process will be
 pleasant. If you treat Elasticsearch badly, you could be in for a world of
 pain.
 
+The default settings in Elasticsearch will take you a long way but, to get the
+most bang for your buck, you need to think about how data flows through your
+system. We will talk about two common data flows: <<time-based>>, like log
+events or social network streams, where relevance is driven by recency, and
+<<user-based>>, where a large document corpus can be subdivided by user or
+customer.
+
 This chapter will help you to make the right decisions up front, to avoid
 nasty surprises later on.

410_Scaling/45_Index_per_timeframe.asciidoc (+7 −5)

@@ -1,5 +1,5 @@
 [[time-based]]
-=== Time-based documents
+=== Time-based data
 
 One of the most common use cases for Elasticsearch is for logging, so common
 in fact that Elasticsearch provides an integrated logging platform called the
@@ -48,16 +48,18 @@ DELETE /logs/event/_query
 <1> Deletes all documents where Logstash's `@timestamp` field is
     older than 90 days.
 
-But this approach is very inefficient. Remember that when you delete a
+But this approach is *very inefficient*. Remember that when you delete a
 document, it is only marked as deleted (see <<deletes-and-updates>>). It won't
 be physically deleted until the segment containing it is merged away.
 
 Instead, use an _index per-timeframe_. You could start out with an index per
 year -- `logs_2014` -- or per month -- `logs_2014-10`. Perhaps, when your
 website gets really busy, you need to switch to an index per day --
-`logs_2014-10-24`. The point is: you can adjust your indexing capacity as you
-go. There is no need to make a final decision up front.
+`logs_2014-10-24`. Purging old data is easy -- just delete old indices.
 
+This approach has the advantage of allowing you to scale as and when you need
+to. You don't have to make any difficult decisions up front. Every day is a
+new opportunity to change your indexing timeframes to suit the current demand.
 Apply the same logic to how big you make each index. Perhaps all you need is
 one primary shard per week initially. Later, maybe you need 5 primary shards
 per day. It doesn't matter -- you can adjust to new circumstances at any
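The index-per-timeframe naming scheme above can be sketched in Python. The helper below is hypothetical (it is not part of the book's examples); it derives an index name such as `logs_2014-10-24` from an event's timestamp and a chosen granularity, which is all an indexing pipeline needs in order to switch timeframes as demand changes:

```python
from datetime import datetime

# Hypothetical sketch: map an event timestamp to a time-based index name.
# Switching granularity only changes where *new* documents are indexed;
# existing indices are untouched, and old data is purged by deleting
# whole indices rather than deleting individual documents.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}

def index_for(timestamp: datetime, granularity: str = "month") -> str:
    return "logs_" + timestamp.strftime(FORMATS[granularity])

event = datetime(2014, 10, 24, 13, 15)
print(index_for(event, "year"))   # logs_2014
print(index_for(event, "month"))  # logs_2014-10
print(index_for(event, "day"))    # logs_2014-10-24
```

Because the name is computed per document at index time, moving from monthly to daily indices is a one-line configuration change rather than a migration.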
410_Scaling/60_Index_per_user.asciidoc (new file, +39)

@@ -0,0 +1,39 @@
[[user-based]]
=== User-based data

Often, users start using Elasticsearch because they need to add full-text
search or analytics to an existing application. They create a single index
which holds all of their documents. Gradually, others in the company realise
how much benefit Elasticsearch brings, and they want to add their data to
Elasticsearch as well.

Fortunately, Elasticsearch supports
http://en.wikipedia.org/wiki/Multitenancy[multitenancy], so each new user can
have their own index in the same cluster. Occasionally, somebody will want to
search across the documents for all users, which they can do by searching
across all indices, but most of the time, users are interested only in their
own documents.

Some users have more documents than others, and some users will have heavier
search loads than others, so the ability to specify how many primary shards
and replica shards each index should have fits well with the index-per-user
model. Similarly, busier indices can be allocated to stronger boxes with shard
allocation filtering. (See <<migrate-indices>>.)

TIP: Don't just use the default number of primary shards for every index.
Think about how much data that index needs to hold. It may be that all you
need is one shard -- any more is a waste of resources.

Most users of Elasticsearch can stop here. A simple index-per-user approach
is sufficient for the majority of cases.

In exceptional cases, you may find that you need to support a large number of
users, all with similar needs. An example might be hosting a search engine
for thousands of email forums. Some forums may have a huge amount of traffic,
but the majority of forums are quite small. Dedicating an index with a single
shard to a small forum is overkill -- a single shard could hold the data for
many forums.

What we need is a way to share resources across users, to give the impression
that each user has their own index without wasting resources on small users.

410_Scaling/65_Shared_index.asciidoc (new file, +152)

@@ -0,0 +1,152 @@
[[shared-index]]
=== Shared index

We can use a large shared index for the many smaller forums by indexing
the forum identifier in a field and using it as a filter:

[source,json]
------------------------------
PUT /forums
{
  "settings": {
    "number_of_shards": 10 <1>
  },
  "mappings": {
    "post": {
      "properties": {
        "forum_id": { <2>
          "type":  "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT /forums/post/1
{
  "forum_id": "baking", <2>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Create an index large enough to hold thousands of smaller forums.
<2> Each post must include a `forum_id` to identify which forum it belongs
    to.

We can use the `forum_id` as a filter to search within a single forum. The
filter will exclude most of the documents in the index (those from other
forums), and filter caching will ensure that responses are fast:

[source,json]
------------------------------
GET /forums/post/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <1>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The `term` filter is cached by default.

This approach works, but we can do better. The posts from a single forum
would fit easily onto one shard, but currently they are scattered across all
10 shards in the index. This means that every search request has to be
forwarded to a primary or replica of all 10 shards. What would be ideal is to
ensure that all the posts from a single forum are stored on the same shard.

In <<routing-value>>, we explained that a document is allocated to a
particular shard using this formula:

    shard = hash(routing) % number_of_primary_shards

The `routing` value defaults to the document's `_id`, but we can override that
and provide our own custom routing value, such as the `forum_id`. All
documents with the same `routing` value will be stored on the same shard:

[source,json]
------------------------------
PUT /forums/post/1?routing=baking <1>
{
  "forum_id": "baking", <1>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> Using the `forum_id` as the routing value ensures that all posts from the
    same forum are stored on the same shard.
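The routing formula can be illustrated with a short sketch. Elasticsearch actually hashes the routing string with Murmur3; the CRC32 stand-in below exists only to keep the illustration deterministic, so treat the specific shard numbers as hypothetical:

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 10

def shard_for(routing: str) -> int:
    # Stand-in for Elasticsearch's internal hash (really Murmur3):
    #   shard = hash(routing) % number_of_primary_shards
    return zlib.crc32(routing.encode("utf-8")) % NUMBER_OF_PRIMARY_SHARDS

# The same routing value always maps to the same shard, which is why
# routing every post on its forum_id co-locates a forum's documents:
assert shard_for("baking") == shard_for("baking")

# With the default routing (the document's _id), posts scatter across shards:
print(sorted({shard_for(doc_id) for doc_id in "12345"}))
```

The formula also explains why the number of primary shards is fixed at index-creation time: changing the divisor would reroute every existing document.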

When we search for posts in a particular forum, we can pass the same `routing`
value to ensure that the search request is run only on the single shard that
holds our documents:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { <2>
          "forum_id": "baking"
        }
      }
    }
  }
}
------------------------------
<1> The query is run only on the shard that corresponds to this `routing` value.
<2> We still need the filter, as a single shard can hold posts from many forums.

Multiple forums can be queried by passing a comma-separated list of `routing`
values, and including each `forum_id` in a `terms` filter:

[source,json]
------------------------------
GET /forums/post/_search?routing=baking,cooking,recipes <1>
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "terms": {
          "forum_id": [ "baking", "cooking", "recipes" ]
        }
      }
    }
  }
}
------------------------------
<1> The query is run only on the shards corresponding to these `routing` values.

While this approach is technically efficient, it looks a bit clumsy because of
the need to specify `routing` values and `terms` filters on every query or
indexing request. Things look a lot better once we add index aliases into the
mix.

410_Scaling/70_Faking_it.asciidoc (new file, +72)

@@ -0,0 +1,72 @@
[[faking-it]]
=== Faking index-per-user with aliases

To keep things simple and clean, we would like our application to believe that
we have a dedicated index per user -- or per forum, in our example -- even if
the reality is that we are using one big <<shared-index,shared index>>. To do
that, we need some way to hide the `routing` value and the filter on
`forum_id`.

Index aliases allow us to do just that. When you associate an alias with an
index, you can also specify a filter and routing values:

[source,json]
------------------------------
PUT /forums/_alias/baking
{
  "routing": "baking",
  "filter": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

Now, we can treat the `baking` alias as if it were its own index. Documents
indexed into the `baking` alias automatically get the custom routing value
applied:

[source,json]
------------------------------
PUT /baking/post/1 <1>
{
  "forum_id": "baking", <1>
  "title":    "Easy recipe for ginger nuts",
  ...
}
------------------------------
<1> We still need the `forum_id` field for the filter to work, but
    the custom routing value is now implicit.

Queries run against the `baking` alias are run just on the shard associated
with the custom routing value, and the results are automatically filtered by
the filter we specified:

[source,json]
------------------------------
GET /baking/post/_search
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------

Multiple aliases can be specified when searching across multiple forums:

[source,json]
------------------------------
GET /baking,recipes/post/_search <1>
{
  "query": {
    "match": {
      "title": "ginger nuts"
    }
  }
}
------------------------------
<1> Both `routing` values are applied, and results can match either filter.
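Provisioning one such alias per forum is easily scripted. The helper below is a hypothetical sketch that builds the request path and body shown above for any forum ID; actually sending the request (with the HTTP client of your choice) is left out:

```python
def alias_request(index: str, forum_id: str) -> tuple:
    # Hypothetical helper: build the PUT path and body that define a
    # filtered, routed alias for one forum on the shared index.
    path = "/{}/_alias/{}".format(index, forum_id)
    body = {
        "routing": forum_id,
        "filter": {"term": {"forum_id": forum_id}},
    }
    return path, body

path, body = alias_request("forums", "baking")
print(path)  # /forums/_alias/baking
```

Generating the alias at forum-creation time means the rest of the application never needs to know about routing or filters at all.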

410_Scaling/75_One_big_user.asciidoc (new file, +67)

@@ -0,0 +1,67 @@
[[one-big-user]]
=== One big user

Big, popular forums start out as small forums. One day, we will find that one
shard in our shared index is doing a lot more work than the other shards,
because it holds the documents for a forum which has become very popular. That
forum now needs its own index.

The index aliases that we're using to fake an index-per-user give us a clean
migration path for the big forum.

The first step is to create a new index dedicated to the forum, with the
appropriate number of shards to allow for expected growth:

[source,json]
------------------------------
PUT /baking_v1
{
  "settings": {
    "number_of_shards": 3
  }
}
------------------------------

The next step is to migrate the data from the shared index into the dedicated
index, which can be done using <<scan-scroll,scan and scroll>> and the
<<bulk,`bulk` API>>. As soon as the migration is finished, the index alias
can be updated to point to the new index:

[source,json]
------------------------------
POST /_aliases
{
  "actions": [
    { "remove": { "alias": "baking", "index": "forums"    }},
    { "add":    { "alias": "baking", "index": "baking_v1" }}
  ]
}
------------------------------

Updating the alias is atomic; it's like throwing a switch. Your application
continues talking to the `baking` alias and is completely unaware that it now
points to a new dedicated index.

The dedicated index no longer needs the filter or the routing values. We can
just rely on the default sharding that Elasticsearch does using each
document's `_id` field.

The last step is to remove the old documents from the shared index, which can
be done with a `delete-by-query` request, using the original routing value and
forum ID:

[source,json]
------------------------------
DELETE /forums/post/_query?routing=baking
{
  "query": {
    "term": {
      "forum_id": "baking"
    }
  }
}
------------------------------

The beauty of this index-per-user model is that it allows you to reduce
resources, keeping costs low, while still giving you the flexibility to scale
out when necessary, and with zero downtime.
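The atomic swap at the heart of this migration is a single `_aliases` request. As a hypothetical sketch, the request body can be built like this for any forum being promoted to its own index (the names `forums` and `baking_v1` follow the example above):

```python
def promote_forum(alias: str, shared_index: str, dedicated_index: str) -> dict:
    # Hypothetical sketch: build the POST /_aliases body that atomically
    # repoints an alias from the shared index to a dedicated one.
    # Both actions happen in one request, so no search ever sees zero
    # indices (or both indices) behind the alias.
    return {
        "actions": [
            {"remove": {"alias": alias, "index": shared_index}},
            {"add": {"alias": alias, "index": dedicated_index}},
        ]
    }

print(promote_forum("baking", "forums", "baking_v1"))
```

Because remove and add travel in the same request, clients querying the `baking` alias never observe an intermediate state during the switch.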
