This repository was archived by the owner on Sep 21, 2021. It is now read-only.
300_Aggregations/75_sigterms.asciidoc
+19-17
@@ -1,13 +1,13 @@
1
1
2
2
=== significant_terms Demo
3
3
4
-
Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing
4
+
Because the `significant_terms` aggregation((("significant_terms aggregation", "demonstration of")))((("aggregations", "significant_terms", "demonstration of"))) works by analyzing
5
5
statistics, you need to have a certain threshold of data for it to become effective.
6
6
That means we won't be able to index a small amount of example data for the demo.
7
7
8
8
Instead, we have a pre-prepared dataset of around 80,000 documents. This is
9
9
saved as a snapshot (for more information about snapshots and restore, see
10
-
<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
10
+
<<backing-up-your-cluster>>) in our public demo repository. You can "restore"
11
11
this dataset into your cluster by using these commands:
12
12
13
13
[source,js]
@@ -26,6 +26,7 @@ POST /_snapshot/sigterms/snapshot/_restore <3>
26
26
27
27
GET /mlmovies,mlratings/_recovery <4>
28
28
----
29
+
// SENSE: 300_Aggregations/20_basic_example.json
29
30
<1> Register a new read-only URL repository pointing at the demo snapshot
30
31
<2> (Optional) Inspect the repository to learn details about available snapshots
31
32
<3> Begin the Restore process. This will download two indices into your cluster: `mlmovies`
@@ -69,12 +70,13 @@ GET mlmovies/_search <1>
69
70
},
70
71
....
71
72
----
72
-
<1> Execute a search without a query, so we can see a random sampling of docs.
73
+
// SENSE: 300_Aggregations/20_basic_example.json
74
+
<1> Execute a search without a query, so that we can see a random sampling of docs.
73
75
74
76
Each document in `mlmovies` represents a single movie. The two important pieces
75
77
of data are the `_id` of the movie and the `title` of the movie. You can ignore
76
78
`offset` and `bytes`; they are artifacts of the process used to extract this
77
-
data from the original CSV files. There are 10,681 movies in this dataset.
79
+
data from the original CSV files. There are 10,681 movies in this dataset.
78
80
79
81
Now let's look at `mlratings`:
80
82
@@ -105,7 +107,7 @@ GET mlratings/_search
105
107
],
106
108
"user": 1
107
109
}
108
-
},
110
+
},
109
111
...
110
112
----
111
113
@@ -186,9 +188,9 @@ since we are interested only in the aggregation results.
186
188
<3> Finally, find the most popular movies by using a `terms` bucket.
187
189
188
190
We perform the search on the `mlratings` index, and apply a filter for the ID of
189
-
_Talladega Nights_. Since aggregations operate on query scope, this will
190
-
effectively filter the aggregation results to only the users who recommended
191
-
_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most
191
+
_Talladega Nights_. Since aggregations operate on query scope, this will
192
+
effectively filter the aggregation results to only the users who recommended
193
+
_Talladega Nights_. Finally, we execute ((("terms aggregation", "movie recommendations (example)")))a `terms` aggregation to bucket the most
192
194
popular movies. We are requesting the top six results, since it is likely
193
195
that _Talladega Nights_ itself will be returned as a hit (and we don't want
194
196
to recommend the same movie).
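
As a rough sketch of the request described above, the search body could be assembled like this in Python. The field name `movie` and the aggregation name `most_popular` are assumptions made for illustration (neither appears in this excerpt), and the `bool`/`filter` form is just one way to express a non-scoring filter; the original request may be written differently.

```python
import json

# Movie ID that appears as the first bucket key in the results (Talladega Nights).
TALLADEGA_NIGHTS_ID = 46970


def build_popular_movies_request(movie_id, size=6):
    """Filter ratings to one movie, then bucket the most popular movies."""
    return {
        "size": 0,  # only the aggregation results are of interest
        "query": {
            "bool": {
                # "movie" is an assumed field name, not taken from the source
                "filter": {"term": {"movie": movie_id}}
            }
        },
        "aggs": {
            "most_popular": {  # hypothetical aggregation name
                "terms": {"field": "movie", "size": size}
            }
        },
    }


body = build_popular_movies_request(TALLADEGA_NIGHTS_ID)
print(json.dumps(body, indent=2))  # body for a search against the mlratings index
```

Asking for six buckets instead of five leaves room to drop the bucket for the filtered movie itself.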
@@ -271,7 +273,7 @@ well-liked, which means they are popular on everyone's recommendations. The
271
273
list is basically a recommendation of popular movies, not recommendations related
272
274
to _Talladega Nights_.
273
275
274
-
This is easily verified by running the aggregation again, but without the filter
276
+
This is easily verified by running the aggregation again, but without the filter
275
277
on _Talladega Nights_. This will give a top-five most popular movie list:
276
278
277
279
[source,js]
@@ -303,7 +305,7 @@ discriminating recommender.
303
305
==== Recommending Based on Statistics
304
306
305
307
Now that the scene is set, let's try using `significant_terms`. `significant_terms` will analyze
306
-
the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and
308
+
the group of people who enjoy _Talladega Nights_ (the _foreground_ group) and
307
309
determine what movies are most popular. ((("statistics, movie recommendations based on (example)"))) It will then construct a list of
308
310
popular films for everyone (the _background_ group) and compare the two.
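
The comparison between the two groups can be sketched in a few lines of Python. This is a simplified illustration, not the aggregation's actual implementation: it assumes a JLH-style score (the absolute change in frequency multiplied by the relative change), and the counts in the usage example are hypothetical.

```python
def significance(fg_count, fg_total, bg_count, bg_total):
    """Score how much more common a term is in the foreground than in the background.

    Assumes a JLH-style heuristic: (fg_rate - bg_rate) * (fg_rate / bg_rate).
    """
    if fg_total <= 0 or bg_total <= 0 or bg_count <= 0:
        raise ValueError("totals and background count must be positive")
    fg_rate = fg_count / fg_total  # share of the foreground group containing the term
    bg_rate = bg_count / bg_total  # share of the whole dataset containing the term
    if fg_rate <= bg_rate:
        return 0.0  # not over-represented in the foreground, so not significant
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


# Hypothetical counts: a niche film that the foreground group loves,
# and a blockbuster that everyone recommends.
niche = significance(fg_count=50, fg_total=200, bg_count=100, bg_total=10_000)
blockbuster = significance(fg_count=120, fg_total=200, bg_count=5_000, bg_total=10_000)

print(niche > blockbuster)  # the niche film ranks higher despite fewer raw recommendations
```

The blockbuster has more raw recommendations in the foreground, but it is nearly as common in the background, so its score stays low.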
309
311
@@ -356,11 +358,11 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
356
358
"doc_count": 271, <1>
357
359
"buckets": [
358
360
{
359
-
"key": 46970,
361
+
"key": 46970,
360
362
"key_as_string": "46970",
361
-
"doc_count": 271,
363
+
"doc_count": 271,
362
364
"score": 256.549815498155,
363
-
"bg_count": 271
365
+
"bg_count": 271
364
366
},
365
367
{
366
368
"key": 52245, <2>
@@ -402,17 +404,17 @@ extra ((("buckets", "returned by significant_terms aggregation")))metadata:
402
404
----
403
405
<1> The top-level `doc_count` shows the number of docs in the foreground group.
404
406
<2> Each bucket lists the key (for example, movie ID) being aggregated.
405
-
<3> A `doc_count` for that bucket.
406
-
<4> And a background count, which shows the rate at which this value appears in
407
+
<3> A `doc_count` for that bucket.
408
+
<4> And a background count, which shows the rate at which this value appears in
407
409
the entire background.
408
410
409
-
You can see that the first bucket we get back is _Talladega Nights_. It is
411
+
You can see that the first bucket we get back is _Talladega Nights_. It is
410
412
found in all 271 documents, which is not surprising. Let's look at the next bucket:
411
413
key `52245`.
412
414
413
415
This ID corresponds to _Blades of Glory_, a comedy about male figure skating
414
416
that also stars Will Ferrell. We can see that it was recommended 59 times by
415
-
the people who also liked _Talladega Nights_. This means that 21% of the foreground
417
+
the people who also liked _Talladega Nights_. This means that 21% of the foreground
416
418
group recommended _Blades of Glory_ (`59 / 271 = 0.2177`).
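
The foreground rate quoted above is a single division, shown here as a small Python check that uses only the counts from the response:

```python
foreground_total = 271      # top-level doc_count: people who liked Talladega Nights
blades_in_foreground = 59   # of those, how many also recommended Blades of Glory

foreground_rate = blades_in_foreground / foreground_total
print(f"{foreground_rate:.4f}")  # 0.2177
```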
417
419
418
420
In contrast, _Blades of Glory_ was recommended only 185 times in the entire dataset,