Add SigTerms Sense snippets

polyfractal · clintongormley · commit 6fff3afec8d0 · 2014-11-30T12:55:59.000+01:00
diff --git a/300_Aggregations/75_sigterms.asciidoc b/300_Aggregations/75_sigterms.asciidoc
@@ -1,13 +1,13 @@
 
 === significant_terms Demo
 
-Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing 
+Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing
 statistics, you need to have a certain threshold of data for it to become effective.
 That means we won't be able to index a small amount of example data for the demo.
 
 Instead, we have a pre-prepared dataset of around 80,000 documents.  This is
 saved as a snapshot (for more information about snapshots and restore, see
-<<backing-up-your-cluster>>) in our public demo repository.  You can "restore" 
+<<backing-up-your-cluster>>) in our public demo repository.  You can "restore"
 this dataset into your cluster by using these commands:
 
 [source,js]
@@ -26,6 +26,7 @@ POST /_snapshot/sigterms/snapshot/_restore <3>
 
 GET /mlmovies,mlratings/_recovery <4>
 ----
+// SENSE: 300_Aggregations/75_sigterms.json
 <1> Register a new read-only URL repository pointing at the demo snapshot
 <2> (Optional) Inspect the repository to learn details about available snapshots
 <3> Begin the Restore process.  This will download two indices into your cluster: `mlmovies`
@@ -69,12 +70,13 @@ GET mlmovies/_search <1>
          },
          ....
 ----
-<1> Execute a search without a query, so we can see a random sampling of docs.
+// SENSE: 300_Aggregations/75_sigterms.json
+<1> Execute a search without a query, so that we can see a random sampling of docs.
 
 Each document in `mlmovies` represents a single movie.  The two important pieces
 of data are the `_id` of the movie and the `title` of the movie.  You can ignore
 `offset` and `bytes`; they are artifacts of the process used to extract this
-data from the original CSV files. There are 10,681 movies in this dataset.  
+data from the original CSV files. There are 10,681 movies in this dataset.
 
 Now let's look at `mlratings`:
 
@@ -105,7 +107,7 @@ GET mlratings/_search
                ],
                "user": 1
             }
-         }, 
+         },
          ...
 ----
 
@@ -186,9 +188,9 @@ since we are interested only in the aggregation results.
 <3> Finally, find the most popular movies by using a `terms` bucket.
 
 We perform the search on the `mlratings` index, and apply a filter for the ID of
-_Talladega Nights_.  Since aggregations operate on query scope, this will 
-effectively filter the aggregation results to only the users who recommended 
-_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most 
+_Talladega Nights_.  Since aggregations operate on query scope, this will
+effectively filter the aggregation results to only the users who recommended
+_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most
 popular movies.  We are requesting the top six results, since it is likely
 that _Talladega Nights_ itself will be returned as a hit (and we don't want
 to recommend the same movie).
@@ -271,7 +273,7 @@ well-liked, which means they are popular on everyone's recommendations.  The
 list is basically a recommendation of popular movies, not recommendations related
 to _Talladega Nights_.
 
-This is easily verified by running the aggregation again, but without the filter 
+This is easily verified by running the aggregation again, but without the filter
 on _Talladega Nights_.  This will give a top-five most popular movie list:
 
 [source,js]
@@ -303,7 +305,7 @@ discriminating recommender.
 ==== Recommending Based on Statistics
 
 Now that the scene is set, let's try using `significant_terms`.  `significant_terms` will analyze
-the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and 
+the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and
 determine what movies are most popular. ((("statistics, movie recommendations based on (example)"))) It will then construct a list of
 popular films for everyone (the _background_ group) and compare the two.
 
@@ -356,11 +358,11 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
          "doc_count": 271, <1>
          "buckets": [
             {
-               "key": 46970, 
+               "key": 46970,
                "key_as_string": "46970",
-               "doc_count": 271, 
+               "doc_count": 271,
                "score": 256.549815498155,
-               "bg_count": 271 
+               "bg_count": 271
             },
             {
                "key": 52245, <2>
@@ -402,17 +404,17 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
 ----
 <1> The top-level `doc_count` shows the number of docs in the foreground group.
 <2> Each bucket lists the key (in this case, the movie ID) being aggregated.
-<3> A `doc_count` for that bucket.  
-<4> And a background count, which shows the rate at which this value appears in 
+<3> A `doc_count` for that bucket.
+<4> And a background count, which shows how often this value appears in
 the entire background.
 
-You can see that the first bucket we get back is _Talladega Nights_.  It is 
+You can see that the first bucket we get back is _Talladega Nights_.  It is
 found in all 271 documents, which is not surprising.  Let's look at the next bucket:
 key `52245`.
 
 This ID corresponds to _Blades of Glory_, a comedy about male figure skating
 that also stars Will Ferrell.  We can see that it was recommended 59 times by
-the people who also liked _Talladega Nights_.  This means that 21% of the foreground 
+the people who also liked _Talladega Nights_.  This means that 21% of the foreground
 group recommended _Blades of Glory_ (`59 / 271 = 0.2177`).
 
 In contrast, _Blades of Glory_ was recommended only 185 times in the entire dataset,
diff --git a/snippets/300_Aggregations/75_sigterms.json b/snippets/300_Aggregations/75_sigterms.json
@@ -0,0 +1,108 @@
+# Register the SigTerms demo repository
+PUT /_snapshot/sigterms
+{
+    "type": "url",
+    "settings": {
+        "url": "http://download.elasticsearch.org/definitiveguide/sigterms_demo/"
+    }
+}
+
+# Inspect the repo to see what snapshots are available
+GET /_snapshot/sigterms/_all
+
+# Begin the Restore process
+POST /_snapshot/sigterms/snapshot/_restore
+
+# Monitor the Restore process as shards download
+GET /mlmovies,mlratings/_recovery
+
+
+### Recommendations based on Popularity
+
+# Inspect the movie data with an empty search
+GET mlmovies/_search
+GET mlratings/_search
+
+# Find the ID of "Talladega Nights"
+GET mlmovies/_search
+{
+  "query": {
+    "match": {
+      "title": "Talladega Nights"
+    }
+  }
+}
+
+# Use a terms agg to find most popular
+GET mlratings/_search?search_type=count
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "term": {
+          "movie": 46970
+        }
+      }
+    }
+  },
+  "aggs": {
+    "most_popular": {
+      "terms": {
+        "field": "movie",
+        "size": 6
+      }
+    }
+  }
+}
+
+# Correlate IDs back to original titles
+GET mlmovies/_search
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "ids": {
+          "values": [2571,318,296,2959,260]
+        }
+      }
+    }
+  }
+}
+
+# Find most popular movies overall
+GET mlratings/_search?search_type=count
+{
+  "aggs": {
+    "most_popular": {
+      "terms": {
+        "field": "movie",
+        "size": 5
+      }
+    }
+  }
+}
+
+
+### Recommendations based on Statistics
+
+# Replace terms agg with SigTerms
+GET mlratings/_search?search_type=count
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "term": {
+          "movie": 46970
+        }
+      }
+    }
+  },
+  "aggs": {
+    "most_sig": {
+      "significant_terms": {
+        "field": "movie",
+        "size": 6
+      }
+    }
+  }
+}
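Review note: the prose touched by this diff compares how often a movie appears in the foreground group (59 of 271 docs for key `52245`) with how often it appears in the whole dataset. A minimal Python sketch of that comparison follows. The function names are hypothetical, the ratio is only an illustration and not the score Elasticsearch computes, and the background total is left as a parameter because the text shown here does not state it.

```python
def term_rate(term_count: int, group_total: int) -> float:
    """Fraction of documents in a group that contain the term."""
    if group_total <= 0:
        raise ValueError("group_total must be positive")
    return term_count / group_total


def rate_ratio(fg_count: int, fg_total: int, bg_count: int, bg_total: int) -> float:
    """Foreground rate divided by background rate.

    Illustrative only: this is NOT the significant_terms score, just the
    intuition of "more common in the foreground than in the background".
    """
    bg_rate = term_rate(bg_count, bg_total)
    if bg_rate == 0:
        raise ValueError("term never appears in the background")
    return term_rate(fg_count, fg_total) / bg_rate


# Figures quoted in the chapter text: bucket 52245 has doc_count 59,
# and the foreground group contains 271 docs.
blades_fg_rate = term_rate(59, 271)
print(f"{blades_fg_rate:.4f}")  # 0.2177
```

To compare against the background, pass the bucket's `bg_count` and the total number of documents in the index to `rate_ratio`. A result well above 1 suggests the term is unusually common in the foreground.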