This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit 4737f14

Cleanup database language, start overhaul of mappings/types
1 parent 268071a commit 4737f14

File tree

3 files changed

+114
-108
lines changed

010_Intro/25_Tutorial_Indexing.asciidoc

+7-17
Original file line number | Diff line number | Diff line change
@@ -16,7 +16,7 @@ So, sit back and enjoy a whirlwind tour of what Elasticsearch is capable of.
1616
We happen((("employee directory, building (example)"))) to work for _Megacorp_, and as part of HR's new _"We love our
1717
drones!"_ initiative, we have been tasked with creating an employee directory.
1818
The directory is supposed to foster employer empathy and
19-
real-time, synergistic, dynamic collaboration, so it has a few
19+
real-time, synergistic, dynamic collaboration, so it has a few
2020
business requirements:
2121

2222
* Enable data to contain multi value tags, numbers, and full text.
@@ -34,17 +34,10 @@ of an _employee document_: a single document represents a single
3434
employee. The act of storing data in Elasticsearch is called _indexing_, but
3535
before we can index a document, we need to decide _where_ to store it.
3636

37-
In Elasticsearch, a document belongs to a _type_, and those((("types"))) types live inside
38-
an _index_. ((("indices")))You can draw some (rough) parallels to a traditional relational database:
3937

40-
----
41-
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
42-
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
43-
----
44-
45-
An Elasticsearch cluster can((("clusters", "indices (databases) in")))((("databases", "in clusters"))) contain multiple _indices_ (databases), which in
46-
turn contain multiple _types_ (tables).((("tables"))) These types hold multiple _documents_
47-
(rows), and ((("rows")))each document has((("fields")))((("columns"))) multiple _fields_ (columns).
38+
An Elasticsearch cluster can((("clusters", "indices in"))) contain multiple _indices_, which in
39+
turn contain multiple _types_. These types hold multiple _documents_,
40+
and each document has((("fields"))) multiple _fields_.
4841

4942
.Index Versus Index Versus Index
5043
**************************************************
@@ -108,11 +101,11 @@ information:
108101

109102
+megacorp+::
110103
The index name
111-
104+
112105
+employee+::
113106
The type name
114-
115-
+1+::
107+
108+
+1+::
116109
The ID of this particular employee
117110

118111
The request body--the JSON document--contains all the information about
@@ -147,6 +140,3 @@ PUT /megacorp/employee/3
147140
}
148141
--------------------------------------------------
149142
// SENSE: 010_Intro/25_Index.json
150-
151-
152-

030_Data/05_Document.asciidoc

+24-25
Original file line number | Diff line number | Diff line change
@@ -39,26 +39,30 @@ other objects. In Elasticsearch, the term _document_ has a specific meaning. It
3939
to the top-level, or root object that((("root object"))) is serialized into JSON and
4040
stored in Elasticsearch under a unique ID.
4141

42+
WARNING: Field names can be any valid string, but _may not_ include periods.
43+
4244
=== Document Metadata
4345

4446
A document doesn't consist only of its data.((("documents", "metadata"))) It also has
4547
_metadata_—information _about_ the document.((("metadata, document"))) The three required metadata
4648
elements are as follows:
4749

4850

49-
`_index`::
51+
`_index`::
5052
Where the document lives
51-
52-
`_type`::
53+
54+
`_type`::
5355
The class of object that the document represents
54-
55-
`_id`::
56+
57+
`_id`::
5658
The unique identifier for the document
5759

5860
==== _index
5961

60-
An _index_ is like a database in a relational database; it's the place
61-
we store and index related data.((("indices", "_index, in document metadata")))
62+
An _index_ is a collection of documents that should be grouped together for a
63+
common reason. For example, you may store all your products in a `products` index,
64+
while all your sales transactions go in `sales`. Although it is possible to store
65+
unrelated data together in a single index, it is often considered an anti-pattern.
6266

6367
[TIP]
6468
====
@@ -76,28 +80,23 @@ underscore, and cannot contain commas. Let's use `website` as our index name.
7680

7781
==== _type
7882

79-
In applications, we use objects to represent _things_ such as a user, a blog
80-
post, a comment, or an email. Each object belongs to a _class_ that defines
81-
the properties or data associated with an object. Objects in the `user` class
82-
may have a name, a gender, an age, and an email address.
83-
84-
In a relational database, we usually store objects of the same class in the
85-
same table, because they share the same data structure. For the same reason, in
86-
Elasticsearch we use the same _type_ for ((("types", "_type, in document metadata)))documents that represent the same
87-
class of _thing_, because they share the same data structure.
83+
Data may be grouped loosely together in an index, but often there are sub-partitions
84+
inside that data which may be useful to explicitly define. For example, all your
85+
products may go inside a single index. But you have different categories of products,
86+
such as "electronics", "kitchen" and "lawn-care".
8887

89-
Every _type_ has its own <<mapping,mapping>> or schema ((("mapping (types)")))((("schema definition, types")))definition, which
90-
defines the data structure for documents of that type, much like the columns
91-
in a database table. Documents of all types can be stored in the same index,
92-
but the _mapping_ for the type tells Elasticsearch how the data in each
93-
document should be indexed.
88+
The documents all share an identical (or very similar) schema: they have a title,
89+
description, product code, price. They just happen to belong to sub-categories
90+
under the umbrella of "Products".
9491

95-
We show how to specify and manage mappings in <<mapping>>, but for now
96-
we will rely on Elasticsearch to detect our document's data structure
97-
automatically.
92+
Elasticsearch exposes a feature called _types_ which allows you to logically
93+
partition data inside of an index. Documents in different types may have different
94+
fields, but it is best if they are highly similar. We'll talk more about the restrictions
95+
and applications of types in <<mapping>>.
9896

9997
A `_type` name can be lowercase or uppercase, but shouldn't begin with an
100-
underscore or contain commas.((("types", "names of"))) We will use `blog` for our type name.
98+
underscore or period. It also may not contain commas,((("types", "names of")))
99+
and is limited to a length of 256 characters. We will use `blog` for our type name.
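
The naming rules above can be checked before a type is created. The following is a minimal sketch in Python, not part of Elasticsearch: `validate_type_name` is a hypothetical helper, and it assumes the 256 limit counts characters and is an inclusive maximum.

```python
def validate_type_name(name):
    """Return a list of rule violations for a proposed type name (empty if OK).

    Hypothetical helper: it only encodes the rules stated in the text and
    assumes the length limit is an inclusive maximum of 256 characters.
    """
    problems = []
    if name.startswith("_"):
        problems.append("must not begin with an underscore")
    if name.startswith("."):
        problems.append("must not begin with a period")
    if "," in name:
        problems.append("must not contain commas")
    if len(name) > 256:
        problems.append("must not exceed 256 characters")
    return problems


print(validate_type_name("blog"))
print(validate_type_name("_blog,drafts"))
```

Returning every violation at once, rather than stopping at the first, makes the helper more useful for reporting problems back to a user.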
101100

102101
==== _id
103102

070_Index_Mgmt/25_Mappings.asciidoc

+83-66
Original file line number | Diff line number | Diff line change
@@ -8,10 +8,8 @@ documents of that type may have, ((("fields", "datatypes")))the datatype of each
88
`integer`, or `date`&#x2014;and how those fields should be indexed and stored by
99
Lucene.
1010

11-
In <<document>>, we said that a type is like a table in a relational database.
12-
While this is a useful way to think about types initially, it is worth
13-
explaining in more detail exactly what a type is and how they are implemented
14-
on top of Lucene.
11+
Types can be useful abstractions for partitioning similar-but-not-identical data.
12+
But because of how Lucene operates, they come with some restrictions.
1513

1614
==== How Lucene Sees Documents
1715

@@ -28,8 +26,7 @@ also be _stored_ unchanged so that they can be retrieved later.
2826
==== How Types Are Implemented
2927

3028
Elasticsearch types are ((("types", "implementation in Elasticsearch")))implemented on top of this simple foundation. An index
31-
may have several types, each with its own mapping, and documents of any of
32-
these types may be stored in the same index.
29+
may have several types, and documents of any of these types may be stored in the same index.
3330

3431
Because Lucene has no concept of document types, the type name of each
3532
document is stored with the document in a metadata field called `_type`.((("type field"))) When
@@ -56,86 +53,106 @@ index called `name`:
5653

5754
==== Avoiding Type Gotchas
5855

59-
The fact that documents of different types can be added to the same index
60-
introduces some unexpected((("types", "gotchas, avoiding"))) complications.
56+
This leads to an interesting thought experiment: what happens if you have two
57+
different types, each with an identically named field but mapped differently
58+
(e.g. one is a string, the other is a number)?
6159

62-
Imagine that we have two types in our index: `blog_en` for blog posts in
63-
English, and `blog_es` for blog posts in Spanish. Both types have a
64-
`title` field, but one type uses the `english` analyzer and
65-
the other type uses the `spanish` analyzer.
60+
Well, the short answer is that bad things happen and Elasticsearch won't allow you
61+
to define this mapping at all. You'd receive an exception when attempting to
62+
configure the mapping.
6663

67-
The problem is illustrated by the following query:
64+
The longer answer is that each Lucene index contains a single, flat schema
65+
for all fields. A particular field is either mapped as a string, or a number, but
66+
not both. And because types are a mechanism added by Elasticsearch _on top_
67+
of Lucene (in the form of a metadata `_type` field), all types in Elasticsearch
68+
ultimately share the same mapping.
69+
70+
Take for example this mapping of two types in the `data` index:
6871

6972
[source,js]
7073
--------------------------------------------------
71-
GET /_search
7274
{
73-
"query": {
74-
"match": {
75-
"title": "The quick brown fox"
76-
}
77-
}
75+
"data": {
76+
"mappings": {
77+
"people": {
78+
"properties": {
79+
"name": {
80+
"type": "string"
81+
},
82+
"address": {
83+
"type": "string"
84+
}
85+
}
86+
},
87+
"transactions": {
88+
"properties": {
89+
"timestamp": {
90+
"type": "date",
91+
"format": "strict_date_optional_time"
92+
},
93+
"message": {
94+
"type": "string"
95+
}
96+
}
97+
}
98+
}
99+
}
78100
}
79101
--------------------------------------------------
80102

81-
82-
We are searching in the `title` field in both types. The query string needs
83-
to be analyzed, but which analyzer does it use: `spanish` or `english`? It
84-
will use the analyzer for the first `title` field that it finds, which
85-
will be correct for some docs and incorrect for the others.
86-
87-
We can avoid this problem either by naming the fields differently--for example, `title_en` and `title_es`&#x2014;or by explicitly including the type name in the
88-
field name and querying each field separately:
103+
Each type defines two fields (`"name"`/`"address"` and `"timestamp"`/`"message"`
104+
respectively). It may look like they are independent, but under the covers Lucene
105+
will create a single mapping which would look something like this:
89106

90107
[source,js]
91108
--------------------------------------------------
92-
GET /_search
93109
{
94-
"query": {
95-
"multi_match": { <1>
96-
"query": "The quick brown fox",
97-
"fields": [ "blog_en.title", "blog_es.title" ]
110+
"data": {
111+
"mappings": {
112+
"_type": {
113+
"type": "string",
114+
"index": "not_analyzed"
115+
},
116+
"name": {
117+
"type": "string"
98118
},
99-
}
119+
"address": {
120+
"type": "string"
121+
},
122+
"timestamp": {
123+
"type": "long"
124+
},
125+
"message": {
126+
"type": "string"
127+
}
128+
}
129+
}
100130
}
101131
--------------------------------------------------
102-
<1> The `multi_match` query runs a `match` query on multiple fields
103-
and combines the results.
104-
105-
Our new query uses the `english` analyzer for the field `blog_en.title` and
106-
the `spanish` analyzer for the field `blog_es.title`, and combines the results
107-
from both fields into an overall relevance score.
132+
_Note: This is not actually valid mapping syntax, just used for demonstration_
108133

109-
This solution can help when both fields have the same datatype, but consider
110-
what would happen if you indexed these two documents into the same index:
134+
The mappings are essentially _flattened_ into a single, global schema for the
135+
entire index. And that's why two types cannot define conflicting fields:
136+
Lucene wouldn't know what to do when the mappings are flattened together.
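
The flattening described above can be sketched in a few lines of Python. This is an illustration only, not how Elasticsearch or Lucene implement it: `flatten_mappings` and `MappingConflict` are hypothetical names, the `robots` type is made up for the example, and the sketch treats any difference between two definitions of the same field as a conflict.

```python
class MappingConflict(Exception):
    """Raised when two types map the same field name differently (hypothetical)."""


def flatten_mappings(type_mappings):
    """Merge per-type field mappings into one index-wide schema.

    type_mappings: dict of type name -> dict of field name -> field definition.
    """
    flat = {}
    owner = {}  # field name -> type that first defined it
    for type_name, fields in type_mappings.items():
        for field, definition in fields.items():
            if field in flat and flat[field] != definition:
                raise MappingConflict(
                    f"field '{field}' is mapped as {flat[field]} in type "
                    f"'{owner[field]}' but as {definition} in type '{type_name}'"
                )
            flat.setdefault(field, definition)
            owner.setdefault(field, type_name)
    return flat


# The two types from the `data` index example.
data_index = {
    "people": {
        "name": {"type": "string"},
        "address": {"type": "string"},
    },
    "transactions": {
        "timestamp": {"type": "date", "format": "strict_date_optional_time"},
        "message": {"type": "string"},
    },
}

flat_schema = flatten_mappings(data_index)
print(sorted(flat_schema))

# A made-up third type that maps "name" as a number conflicts with "people".
conflicting = dict(data_index, robots={"name": {"type": "long"}})
try:
    flatten_mappings(conflicting)
except MappingConflict as err:
    print("rejected:", err)
```

Because the merged schema is keyed by field name alone, the type name plays no part in resolving a field, which is why the conflict cannot be worked around from inside a single index.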
111137

112-
* Type: user
138+
==== Type Takeaways
113139

114-
[source,js]
115-
--------------------------------------------------
116-
{ "login": "john_smith" }
117-
--------------------------------------------------
118-
119-
[role="pagebreak-before"]
120-
* Type: event
121-
122-
[source,js]
123-
--------------------------------------------------
124-
{ "login": "2014-06-01" }
125-
--------------------------------------------------
140+
So what's the takeaway from this discussion? Technically, multiple types
141+
may live in the same index as long as their fields do not conflict (either because
142+
the fields are mutually exclusive, or because they share identical fields).
126143

127-
Lucene doesn't care that one field contains a string and the other field
128-
contains a date. It will happily index the byte values from both fields.
144+
Practically though, the important lesson is this: types are useful when you need
145+
to discriminate between different segments of a single collection. The overall "shape" of the
146+
data is identical (or nearly so) between the different segments.
129147

130-
However, if we now try to _sort_ on the `event.login` field, Elasticsearch
131-
needs to load the values in the `login` field into memory. As we said in
132-
<<fielddata-intro>>, it loads the values for _all documents_ in the index
133-
regardless of their type.
148+
Types are not as well suited for _entirely different types of data_. If your two
149+
types have mutually exclusive sets of fields, that means half your index is going to
150+
contain "empty" values (the fields will be _sparse_), which will eventually cause performance
151+
problems. In these cases, it's much better to utilize two independent indices.
134152

135-
It will try to load these values either as a string or as a date, depending on
136-
which `login` field it sees first. This will either produce unexpected results
137-
or fail outright.
153+
In summary:
138154

139-
TIP: To ensure that you don't run into these conflicts, it is advisable to
140-
ensure that fields with the _same name_ are mapped in the _same way_ in every
141-
type in an index.
155+
- **Good:** `kitchen` and `lawn-care` types inside the `products` index, because
156+
the two types share essentially the same schema
157+
- **Bad:** `products` and `logs` types in the `data` index, because the two types are
158+
mutually exclusive. Separate these into their own indices.
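
One rough way to judge which case applies is to measure how much the field sets of two types overlap. The sketch below is an assumption-laden illustration, not an Elasticsearch feature: `field_overlap` is a hypothetical helper, the field names are invented from the examples in the text, and no particular cut-off value is implied.

```python
def field_overlap(fields_a, fields_b):
    """Jaccard similarity of two field-name collections.

    1.0 means the types share every field; 0.0 means the field sets are
    mutually exclusive, so a shared index would be sparse.
    """
    a, b = set(fields_a), set(fields_b)
    if not a and not b:
        return 1.0  # two empty schemas are trivially identical
    return len(a & b) / len(a | b)


# Illustrative field names only.
kitchen = ["title", "description", "product_code", "price"]
lawn_care = ["title", "description", "product_code", "price"]
logs = ["timestamp", "message"]

print(field_overlap(kitchen, lawn_care))  # types that belong in one index
print(field_overlap(kitchen, logs))       # types better split into two indices
```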
