Cleanup database language, start overhaul of mappings/types

polyfractal · polyfractal · commit 4737f14c4f84 · 2016-04-13T15:02:03.000-04:00
diff --git a/010_Intro/25_Tutorial_Indexing.asciidoc b/010_Intro/25_Tutorial_Indexing.asciidoc
@@ -16,7 +16,7 @@ So, sit back and enjoy a whirlwind tour of what Elasticsearch is capable of.
 We happen((("employee directory, building (example)"))) to work for _Megacorp_, and as part of HR's new _"We love our
 drones!"_ initiative, we have been tasked with creating an employee directory.
 The directory is supposed to foster employer empathy and
-real-time, synergistic, dynamic collaboration, so it has a few 
+real-time, synergistic, dynamic collaboration, so it has a few
 business requirements:
 
 * Enable data to contain multi value tags, numbers, and full text.
@@ -34,17 +34,10 @@ of an _employee document_: a single document represents a single
 employee.  The act of storing data in Elasticsearch is called _indexing_, but
 before we can index a document, we need to decide _where_ to store it.
 
-In Elasticsearch, a document belongs to a _type_, and those((("types"))) types live inside
-an _index_. ((("indices")))You can draw some (rough) parallels to a traditional relational database:
 
-----
-Relational DB  ⇒ Databases ⇒ Tables ⇒ Rows      ⇒ Columns
-Elasticsearch  ⇒ Indices   ⇒ Types  ⇒ Documents ⇒ Fields
-----
-
-An Elasticsearch cluster can((("clusters", "indices (databases) in")))((("databases", "in clusters"))) contain multiple _indices_ (databases), which in
-turn contain multiple _types_ (tables).((("tables"))) These types hold multiple _documents_
-(rows), and ((("rows")))each document has((("fields")))((("columns"))) multiple _fields_ (columns).
+An Elasticsearch cluster can((("clusters", "indices in"))) contain multiple _indices_, which in
+turn contain multiple _types_.((("types"))) These types hold multiple _documents_,
+and each document has((("fields"))) multiple _fields_.
 
 .Index Versus Index Versus Index
 **************************************************
@@ -108,11 +101,11 @@ information:
 
 +megacorp+::
       The index name
-      
+
 +employee+::
       The type name
-      
-+1+::          
+
++1+::
       The ID of this particular employee
 
 The request body--the JSON document--contains all the information about
@@ -147,6 +140,3 @@ PUT /megacorp/employee/3
 }
 --------------------------------------------------
 // SENSE: 010_Intro/25_Index.json
-
-
-
diff --git a/030_Data/05_Document.asciidoc b/030_Data/05_Document.asciidoc
@@ -39,26 +39,30 @@ other objects. In Elasticsearch, the term _document_ has a specific meaning. It
 to the top-level, or root object that((("root object"))) is serialized into JSON and
 stored in Elasticsearch under a unique ID.
 
+WARNING: Field names can be any valid string, but _may not_ include periods.
+
 === Document Metadata
 
 A document doesn't consist only of its data.((("documents", "metadata"))) It also has
 _metadata_&#x2014;information _about_ the document.((("metadata, document"))) The three required metadata
 elements are as follows:
 
 
- `_index`::  
+ `_index`::
    Where the document lives
-   
- `_type`::   
+
+ `_type`::
    The class of object that the document represents
-   
- `_id`::     
+
+ `_id`::
    The unique identifier for the document
 
 ==== _index
 
-An _index_ is like a database in a relational database; it's the place
-we store and index related data.((("indices", "_index, in document metadata")))
+An _index_ is a collection of documents that should be grouped together for a
+common reason.  For example, you may store all your products in a `products` index,
+while all your sales transactions go in `sales`.  Although it is possible to store
+unrelated data together in a single index, it is often considered an anti-pattern.
 
 [TIP]
 ====
@@ -76,28 +80,23 @@ underscore, and cannot contain commas. Let's use `website` as our index name.
 
 ==== _type
 
-In applications, we use objects to represent _things_ such as a user, a blog
-post, a comment, or an email. Each object belongs to a _class_ that defines
-the properties or data associated with an object. Objects in the `user` class
-may have a name, a gender, an age, and an email address.
-
-In a relational database, we usually store objects of the same class in the
-same table, because they share the same data structure. For the same reason, in
-Elasticsearch we use the same _type_ for ((("types", "&#x5f;type, in document metadata)))documents that represent the same
-class of _thing_, because they share the same data structure.
+Data may be grouped loosely together in an index, but often there are sub-partitions
+inside that data which may be useful to explicitly define.  For example, all your
+products may go inside a single index, but you may have different categories of products,
+such as "electronics", "kitchen", and "lawn-care".
 
-Every _type_ has its own <<mapping,mapping>> or schema ((("mapping (types)")))((("schema definition, types")))definition, which
-defines the data structure for documents of that type, much like the columns
-in a database table. Documents of all types can be stored in the same index,
-but the _mapping_ for the type tells Elasticsearch how the data in each
-document should be indexed.
+The documents all share an identical (or very similar) schema: they have a title,
+description, product code, and price.  They just happen to belong to sub-categories
+under the umbrella of "Products".
 
-We show how to specify and manage mappings in <<mapping>>, but for now
-we will rely on Elasticsearch to detect our document's data structure
-automatically.
+Elasticsearch exposes a feature called _types_ which allows you to logically
+partition data inside of an index.  Documents in different types may have different
+fields, but it is best if they are highly similar.  We'll talk more about the restrictions
+and applications of types in <<mapping>>.
 
 A `_type` name can be lowercase or uppercase, but shouldn't begin with an
-underscore or contain commas.((("types", "names of")))  We will use `blog` for our type name.
+underscore or period.  It also may not contain commas,((("types", "names of")))
+and is limited to a length of 256 characters. We will use `blog` for our type name.
 
 ==== _id
 
diff --git a/070_Index_Mgmt/25_Mappings.asciidoc b/070_Index_Mgmt/25_Mappings.asciidoc
@@ -8,10 +8,8 @@ documents of that type may have, ((("fields", "datatypes")))the datatype of each
 `integer`, or `date`&#x2014;and how those fields should be indexed and stored by
 Lucene.
 
-In <<document>>, we said that a type is like a table in a relational database.
-While this is a useful way to think about types initially, it is worth
-explaining in more detail exactly what a type is and how they are implemented
-on top of Lucene.
+Types can be useful abstractions for partitioning similar-but-not-identical data.
+But due to how Lucene operates, they come with some restrictions.
 
 ==== How Lucene Sees Documents
 
@@ -28,8 +26,7 @@ also be _stored_ unchanged so that they can be retrieved later.
 ==== How Types Are Implemented
 
 Elasticsearch types are ((("types", "implementation in Elasticsearch")))implemented on top of this simple foundation. An index
-may have several types, each with its own mapping, and documents of any of
-these types may be stored in the same index.
+may have several types, and documents of any of these types may be stored in the same index.
 
 Because Lucene has no concept of document types, the type name of each
 document is stored with the document in a metadata field called `_type`.((("type field"))) When
@@ -56,86 +53,106 @@ index called `name`:
 
 ==== Avoiding Type Gotchas
 
-The fact that documents of different types can be added to the same index
-introduces some unexpected((("types", "gotchas, avoiding"))) complications.
+This leads to an interesting thought experiment: what happens if you have two
+different types, each with an identically named field but mapped differently
+(e.g. one is a string, the other is a number)?
 
-Imagine that we have two types in our index: `blog_en` for blog posts in
-English, and `blog_es` for blog posts in Spanish.  Both types have a
-`title` field, but one type uses the `english` analyzer and
-the other type uses the `spanish` analyzer.
+Well, the short answer is that bad things happen and Elasticsearch won't allow you
+to define this mapping at all.  You'd receive an exception when attempting to
+configure the mapping.
 
-The problem is illustrated by the following query:
+The longer answer is that each Lucene index contains a single, flat schema
+for all fields.  A particular field is mapped either as a string or as a number, but
+not both.  And because types are a mechanism added by Elasticsearch _on top_
+of Lucene (in the form of a metadata `_type` field), all types in an index
+ultimately share the same mapping.
+
+Take for example this mapping of two types in the `data` index:
 
 [source,js]
 --------------------------------------------------
-GET /_search
 {
-    "query": {
-        "match": {
-            "title": "The quick brown fox"
-        }
-    }
+   "data": {
+      "mappings": {
+         "people": {
+            "properties": {
+               "name": {
+                  "type": "string"
+               },
+               "address": {
+                  "type": "string"
+               }
+            }
+         },
+         "transactions": {
+            "properties": {
+               "timestamp": {
+                  "type": "date",
+                  "format": "strict_date_optional_time"
+               },
+               "message": {
+                  "type": "string"
+               }
+            }
+         }
+      }
+   }
 }
 --------------------------------------------------
 
-
-We are searching in the `title` field in both types.  The query string needs
-to be analyzed, but which analyzer does it use: `spanish` or `english`? It
-will use the analyzer for the first `title` field that it finds, which
-will be correct for some docs and incorrect for the others.
-
-We can avoid this problem either by naming the fields differently--for example, `title_en` and `title_es`&#x2014;or by explicitly including the type name in the
-field name and querying each field separately:
+Each type defines two fields (`"name"`/`"address"` and `"timestamp"`/`"message"`
+respectively).  It may look like they are independent, but under the covers Lucene
+will create a single mapping which would look something like this:
 
 [source,js]
 --------------------------------------------------
-GET /_search
 {
-    "query": {
-        "multi_match": { <1>
-            "query":    "The quick brown fox",
-            "fields": [ "blog_en.title", "blog_es.title" ]
+   "data": {
+      "mappings": {
+        "_type": {
+          "type": "string",
+          "index": "not_analyzed"
+        },
+        "name": {
+          "type": "string"
-        }
-    }
+        },
+        "address": {
+          "type": "string"
+        },
+        "timestamp": {
+          "type": "long"
+        },
+        "message": {
+          "type": "string"
+        }
+      }
+   }
 }
 --------------------------------------------------
-<1> The `multi_match` query runs a `match` query on multiple fields
-    and combines the results.
-
-Our new query uses the `english` analyzer for the field `blog_en.title` and
-the `spanish` analyzer for the field `blog_es.title`, and combines the results
-from both fields into an overall relevance score.
+NOTE: This is not valid mapping syntax; it is used only for demonstration.
 
-This solution can help when both fields have the same datatype, but consider
-what would happen if you indexed these two documents into the same index:
+The mappings are essentially _flattened_ into a single, global schema for the
+entire index.  And that's why two types cannot define conflicting fields:
+Lucene wouldn't know what to do when the mappings are flattened together.
 
-* Type: user
+==== Type Takeaways
 
-[source,js]
---------------------------------------------------
- { "login": "john_smith" }
---------------------------------------------------
-
-[role="pagebreak-before"]
-* Type: event
-
-[source,js]
---------------------------------------------------
- { "login": "2014-06-01" }
---------------------------------------------------
+So what's the takeaway from this discussion?  Technically, multiple types
+may live in the same index as long as their fields do not conflict (either because
+the fields are mutually exclusive, or because they share identical fields).
 
-Lucene doesn't care that one field contains a string and the other field
-contains a date. It will happily index the byte values from both fields.
+Practically though, the important lesson is this:  types are useful when you need
+to discriminate between different segments of a single collection. The overall "shape" of the
+data is identical (or nearly so) between the different segments.
 
-However, if we now try to _sort_ on the `event.login` field, Elasticsearch
-needs to load the values in the `login` field into memory. As we said in
-<<fielddata-intro>>, it loads the values for  _all documents_ in the index
-regardless of their type.
+Types are not as well suited for _entirely different types of data_.  If your two
+types have mutually exclusive sets of fields, that means half your index is going to
+contain "empty" values (the fields will be _sparse_), which will eventually cause performance
+problems.  In these cases, it's much better to utilize two independent indices.
 
-It will try to load these values either as a string or as a date, depending on
-which `login` field it sees first. This will either produce unexpected results
-or fail outright.
+In summary:
 
-TIP: To ensure that you don't run into these conflicts, it is advisable to
-ensure that fields with the _same name_ are mapped in the _same way_ in every
-type in an index.
+- **Good:** `kitchen` and `lawn-care` types inside the `products` index, because
+the two types share essentially the same schema.
+- **Bad:** `products` and `logs` types inside the `data` index, because the two types are
+mutually exclusive.  Separate these into their own indices.