This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit 6fff3af

polyfractal authored and clintongormley committed Nov 30, 2014
Add SigTerms Sense snippets
1 parent a98a19d commit 6fff3af

2 files changed: +127 -17 lines changed


‎300_Aggregations/75_sigterms.asciidoc

+19 -17
@@ -1,13 +1,13 @@

 === significant_terms Demo

-Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing
+Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing
 statistics, you need to have a certain threshold of data for it to become effective.
 That means we won't be able to index a small amount of example data for the demo.

 Instead, we have a pre-prepared dataset of around 80,000 documents. This is
 saved as a snapshot (for more information about snapshots and restore, see
-<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
+<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
 this dataset into your cluster by using these commands:

 [source,js]
@@ -26,6 +26,7 @@ POST /_snapshot/sigterms/snapshot/_restore <3>

 GET /mlmovies,mlratings/_recovery <4>
 ----
+// SENSE: 300_Aggregations/20_basic_example.json
 <1> Register a new read-only URL repository pointing at the demo snapshot
 <2> (Optional) Inspect the repository to learn details about available snapshots
 <3> Begin the Restore process. This will download two indices into your cluster: `mlmovies`
@@ -69,12 +70,13 @@ GET mlmovies/_search <1>
 },
 ....
 ----
-<1> Execute a search without a query, so we can see a random sampling of docs.
+// SENSE: 300_Aggregations/20_basic_example.json
+<1> Execute a search without a query, so that we can see a random sampling of docs.

 Each document in `mlmovies` represents a single movie. The two important pieces
 of data are the `_id` of the movie and the `title` of the movie. You can ignore
 `offset` and `bytes`; they are artifacts of the process used to extract this
-data from the original CSV files. There are 10,681 movies in this dataset.
+data from the original CSV files. There are 10,681 movies in this dataset.

 Now let's look at `mlratings`:

@@ -105,7 +107,7 @@ GET mlratings/_search
 ],
 "user": 1
 }
-},
+},
 ...
 ----

@@ -186,9 +188,9 @@ since we are interested only in the aggregation results.
 <3> Finally, find the most popular movies by using a `terms` bucket.

 We perform the search on the `mlratings` index, and apply a filter for the ID of
-_Talladega Nights_. Since aggregations operate on query scope, this will
-effectively filter the aggregation results to only the users who recommended
-_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most
+_Talladega Nights_. Since aggregations operate on query scope, this will
+effectively filter the aggregation results to only the users who recommended
+_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most
 popular movies. We are requesting the top six results, since it is likely
 that _Talladega Nights_ itself will be returned as a hit (and we don't want
 to recommend the same movie).
@@ -271,7 +273,7 @@ well-liked, which means they are popular on everyone's recommendations. The
 list is basically a recommendation of popular movies, not recommendations related
 to _Talladega Nights_.

-This is easily verified by running the aggregation again, but without the filter
+This is easily verified by running the aggregation again, but without the filter
 on _Talladega Nights_. This will give a top-five most popular movie list:

 [source,js]
@@ -303,7 +305,7 @@ discriminating recommender.
 ==== Recommending Based on Statistics

 Now that the scene is set, let's try using `significant_terms`. `significant_terms` will analyze
-the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and
+the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and
 determine what movies are most popular. ((("statistics, movie recommendations based on (example)"))) It will then construct a list of
 popular films for everyone (the _background_ group) and compare the two.

@@ -356,11 +358,11 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
 "doc_count": 271, <1>
 "buckets": [
 {
-"key": 46970,
+"key": 46970,
 "key_as_string": "46970",
-"doc_count": 271,
+"doc_count": 271,
 "score": 256.549815498155,
-"bg_count": 271
+"bg_count": 271
 },
 {
 "key": 52245, <2>
@@ -402,17 +404,17 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
 ----
 <1> The top-level `doc_count` shows the number of docs in the foreground group.
 <2> Each bucket lists the key (for example, movie ID) being aggregated.
-<3> A `doc_count` for that bucket.
-<4> And a background count, which shows the rate at which this value appears in
+<3> A `doc_count` for that bucket.
+<4> And a background count, which shows the rate at which this value appears in
 the entire background.

-You can see that the first bucket we get back is _Talladega Nights_. It is
+You can see that the first bucket we get back is _Talladega Nights_. It is
 found in all 271 documents, which is not surprising. Let's look at the next bucket:
 key `52245`.

 This ID corresponds to _Blades of Glory_, a comedy about male figure skating
 that also stars Will Ferrell. We can see that it was recommended 59 times by
-the people who also liked _Talladega Nights_. This means that 21% of the foreground
+the people who also liked _Talladega Nights_. This means that 21% of the foreground
 group recommended _Blades of Glory_ (`59 / 271 = 0.2177`).

 In contrast, _Blades of Glory_ was recommended only 185 times in the entire dataset,
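
The foreground arithmetic quoted in the last hunk can be checked with a short Python sketch. The function name `foreground_rate` is hypothetical and not part of the commit. Only the foreground figures (59 of 271) come from the text; the hunk ends before the size of the background set is stated, so no background rate is computed here.

```python
def foreground_rate(bucket_doc_count: int, foreground_doc_count: int) -> float:
    """Share of the foreground group in which a given term appears.

    Hypothetical helper for illustration; it mirrors the `59 / 271`
    calculation in the asciidoc text, nothing more.
    """
    if foreground_doc_count <= 0:
        raise ValueError("foreground_doc_count must be positive")
    return bucket_doc_count / foreground_doc_count


# Figures quoted in the text: 59 of the 271 users who recommended
# Talladega Nights also recommended Blades of Glory.
rate = foreground_rate(59, 271)
print(f"{rate:.4f}")  # prints 0.2177
```

The same helper applied to the first bucket (271 of 271) gives 1.0, which matches the observation that _Talladega Nights_ appears in every foreground document.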
+108
@@ -0,0 +1,108 @@
+# Register the SigTerms demo repository
+PUT /_snapshot/sigterms
+{
+  "type": "url",
+  "settings": {
+    "url": "http://download.elasticsearch.org/definitiveguide/sigterms_demo/"
+  }
+}
+
+# Inspect the repo to see what snapshots are available
+GET /_snapshot/sigterms/_all
+
+# Begin the Restore process
+POST /_snapshot/sigterms/snapshot/_restore
+
+# Monitor the Restore process as shards download
+GET /mlmovies,mlratings/_recovery
+
+
+### Recommendations based on Popularity
+
+# Inspect the movie data with an empty search
+GET mlmovies/_search
+GET mlratings/_search
+
+# Find the ID of "Talladega Nights"
+GET mlmovies/_search
+{
+  "query": {
+    "match": {
+      "title": "Talladega Nights"
+    }
+  }
+}
+
+# Use a terms agg to find most popular
+GET mlratings/_search?search_type=count
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "term": {
+          "movie": 46970
+        }
+      }
+    }
+  },
+  "aggs": {
+    "most_popular": {
+      "terms": {
+        "field": "movie",
+        "size": 6
+      }
+    }
+  }
+}
+
+# Correlate IDs back to original titles
+GET mlmovies/_search
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "ids": {
+          "values": [2571,318,296,2959,260]
+        }
+      }
+    }
+  }
+}
+
+# Find most popular movies overall
+GET mlratings/_search?search_type=count
+{
+  "aggs": {
+    "most_popular": {
+      "terms": {
+        "field": "movie",
+        "size": 5
+      }
+    }
+  }
+}
+
+
+### Recommendations based on Statistics
+
+# Replace terms agg with SigTerms
+GET mlratings/_search?search_type=count
+{
+  "query": {
+    "filtered": {
+      "filter": {
+        "term": {
+          "movie": 46970
+        }
+      }
+    }
+  },
+  "aggs": {
+    "most_sig": {
+      "significant_terms": {
+        "field": "movie",
+        "size": 6
+      }
+    }
+  }
+}
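
The filtered `terms` request and the `significant_terms` request in this file differ only in the aggregation name and type. A minimal Python sketch, assuming a hypothetical helper `build_request` that is not part of the commit, makes that explicit by building both bodies from the same filter. The field name `movie`, the ID `46970`, and the aggregation names are taken from the snippets; nothing is sent to a cluster.

```python
import json


def build_request(agg_name: str, agg_type: str, movie_id: int, size: int) -> dict:
    """Build a search body that filters on one movie and aggregates on `movie`.

    Hypothetical helper for illustration; the body layout copies the
    Sense snippets above (a `filtered` query wrapping a `term` filter).
    """
    return {
        "query": {
            "filtered": {
                "filter": {"term": {"movie": movie_id}}
            }
        },
        "aggs": {
            agg_name: {
                agg_type: {"field": "movie", "size": size}
            }
        },
    }


popular = build_request("most_popular", "terms", 46970, 6)
significant = build_request("most_sig", "significant_terms", 46970, 6)

# The query part is identical; only the aggregation block changes.
assert popular["query"] == significant["query"]
print(json.dumps(significant, indent=2))
```

Keeping the filter in one place shows that the foreground set (users who recommended movie `46970`) is the same for both requests, so any difference in the results comes from the aggregation type alone.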
