Revisiting the Social-graph

We are going to end this chapter with a bit of a thought experiment by revisiting the social-graph example that we used earlier in [_social_graph_filter].

In that example, we wanted to find the tweets of all the users whom a particular person followed. Since it is possible for a user to follow a large number of users, we used the lookup capability of the terms filter. This allowed us to avoid sending a list of 10,000 terms in the query itself.

But even though we have optimized the delivery of terms to the filter (by extracting them from a document instead of sending them over the wire), the underlying process is fundamentally the same. We are still performing 10,000 individual term lookups, and this only gets worse as you continue to follow more people.

So while the lookup capability has helped considerably, it is a band-aid and not a true fix. The problem isn’t the filter…the problem is how the data has been organized. In our old example, we had a document-per-user which listed who that user followed:

PUT /my_index/user_following/1
{ "following" : [2, 4] }

This document is updated as needed and used for the terms filter lookup. But by centralizing the data into a single document, we are forced to use a terms filter with potentially thousands of terms, regardless of whether those terms arrive in the request body or through a lookup.
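
For comparison, the lookup form of the terms filter might look roughly like the request below. This is a sketch only: the `tweets` type and the `user_id` field are assumed names chosen for illustration, while `index`, `type`, `id`, and `path` are the lookup parameters of the terms filter. However the IDs are delivered, every entry in `following` still becomes a separate term to match.

```js
# Assumed names: the "tweets" type and the "user_id" field are illustrative only
GET /my_index/tweets/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "user_id" : {
            "index" : "my_index",
            "type" : "user_following",
            "id" : "1",
            "path" : "following"
          }
        }
      }
    }
  }
}
```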

Let’s invert the structure and decentralize the data. Instead of storing who a user follows in a separate document, let’s store who a user is followed by right in the user document. Our user documents become the source of "following data", rather than a secondary document:

PUT /my_index/users/2
{
    "name" : "Zach",
    "joined" : "2014-10-28",
    "followed_by" : [1, 5, 10]
}

And now, instead of a terms filter with thousands of terms, we can use just a single term filter looking for a single term:

GET /my_index/users/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : {
          "followed_by" : 1
        }
      }
    }
  }
}

The results are the same as before, but we’ve boiled our query down to a single filter. We gain several advantages in both performance and simplicity:

* Avoids a document lookup to get the list of IDs. Even a fast lookup is slower than no lookup at all.
* Caches a single filter instead of potentially thousands.
* Keeps the overhead of updating documents the same, because in both cases we update only a single document.
* Avoids a secondary document type.
* Simplifies the query structure.
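
To illustrate the update cost: when a user follows someone new, only the followed user's document changes. The request below is a minimal sketch, assuming that user 7 starts following Zach (user 2) and that inline scripting is enabled on the cluster. The `new_follower` parameter name is hypothetical, and the script syntax is the one used by the 1.x update API.

```js
# Append user 7 to the followed_by array of user 2 (new_follower is a hypothetical parameter name)
POST /my_index/users/2/_update
{
  "script" : "ctx._source.followed_by += new_follower",
  "params" : {
    "new_follower" : 7
  }
}
```

Unfollowing would be the mirror image: removing a single ID from the same array.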

All of that from simply reorganizing our data. You’ll see that this is a very common pattern in Elasticsearch. There are many ways to tackle any particular problem, but some arrangements of data work better than others.

In particular, try to retrain your brain to stop thinking in terms of normalized relations. The first architecture (a centralized document with all the "following" data) is very natural to people coming from a relational database.

Moving that data to the user documents themselves may seem unnatural, but in many cases it works substantially better, as seen here. When thinking about data organization and query structure, think about how you would like to search for your data rather than how you would like to store it.
