
Commit 0494371
Edited 220_Token_normalization/20_Removing_diacritics.asciidoc with Atlas code editor
1 parent 10bf37d

1 file changed: 29 additions, 32 deletions

220_Token_normalization/20_Removing_diacritics.asciidoc (+29 −32)
@@ -1,17 +1,16 @@
 [[asciifolding-token-filter]]
-=== You have an accent
+=== You Have an Accent
 
-English only uses diacritics (like `´`, `^` and `¨`) for imported words --
-like `rôle`, ++déjà++ and `däis` -- but usually they are optional. ((("diacritics")))((("tokens", "normalizing", "diacritics"))) Other
+English uses diacritics (like `´`, `^`, and `¨`) only for imported words--like `rôle`, ++déjà++, and `däis`--but usually they are optional. ((("diacritics")))((("tokens", "normalizing", "diacritics"))) Other
 languages require diacritics in order to be correct. Of course, just because
 words are spelled correctly in your index doesn't mean that the user will
 search for the correct spelling.
 
 It is often useful to strip diacritics from words, allowing `rôle` to match
-`role` and vice versa. With Western languages, this can be done with the
+`role`, and vice versa. With Western languages, this can be done with the
 `asciifolding` character filter.((("asciifolding character filter"))) Actually, it does more than just strip
 diacritics. It tries to convert many Unicode characters into a simpler ASCII
-representation, including:
+representation:
 
 * `ß` => `ss`
 * `æ` => `ae`

@@ -22,7 +21,7 @@ representation, including:
 * `⁶` => `6`
 
 Like the `lowercase` filter, the `asciifolding` filter doesn't require any
-configuration but can be included directly in a custom analyzer:
+configuration but can be included directly in a `custom` analyzer:
 
 [source,js]
 --------------------------------------------------
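The analyzer definition itself falls outside this hunk. Based on the surrounding text, and on the `folding` analyzer name that appears later in this commit's context lines, a minimal sketch of such a custom analyzer (the exact body in the book may differ):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "asciifolding" ]
                }
            }
        }
    }
}
--------------------------------------------------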
@@ -45,32 +44,31 @@ My œsophagus caused a débâcle <1>
 --------------------------------------------------
 <1> Emits `my`, `oesophagus`, `caused`, `a`, `debacle`
 
-==== Retaining meaning
+==== Retaining Meaning
 
 Of course, when you strip diacritical marks from a word, you lose meaning.
 For instance, consider((("diacritics", "stripping, meaning loss from"))) these three ((("Spanish", "stripping diacritics, meaning loss from")))Spanish words:
 
-[horizontal]
-`esta`:: Feminine form of the adjective ``this'' as in ``esta silla'' (this
-chair) or ``esta'' (this one).
+`esta`::
+Feminine form of the adjective _this_, as in _esta silla_ (this chair) or _esta_ (this one).
 
-`ésta`:: An archaic form of `esta`.
+`ésta`::
+An archaic form of `esta`.
 
-`está`:: The third person form of the verb ``estar'' (to be), as in ``está
-feliz'' (he is happy).
+`está`::
+The third-person form of the verb _estar_ (to be), as in _está feliz_ (he is happy).
 
 While we would like to conflate the first two forms, they differ in meaning
 from the third form, which we would like to keep separate. Similarly:
 
-[horizontal]
-`sé`:: The first person form of the verb ``saber'' (to know) as in ``Yo
-sé'' (I know).
+`sé`::
+The first-person form of the verb _saber_ (to know), as in _Yo sé_ (I know).
 
-`se`:: The third person reflexive pronoun used with many verbs, such as
-``se sabe'' (it is known).
+`se`::
+The third-person reflexive pronoun used with many verbs, such as _se sabe_ (it is known).
 
-Unfortunately, there is no easy way to separate out words that should have
-their diacritics removed and words that shouldn't. And it is quite likely
+Unfortunately, there is no easy way to separate words that should have
+their diacritics removed from words that shouldn't. And it is quite likely
 that your users won't know either.
 
 Instead, we index the text twice: once in the original form and once with

@@ -99,8 +97,8 @@ PUT /my_index/_mapping/my_type
 <2> The `title.folded` field uses the `folding` analyzer, which strips
 the diacritical marks.((("folding analyzer")))
 
-You can test out the field mappings using the `analyze` API on the sentence
-``Esta está loca'' (This woman is crazy):
+You can test the field mappings by using the `analyze` API on the sentence
+_Esta está loca_ (This woman is crazy):
 
 [source,js]
 --------------------------------------------------
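The `analyze` requests themselves are elided here. A sketch of how testing each field might look, assuming the query-string `field` parameter of the 1.x-era `analyze` API that this book targets:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=title <1>
Esta está loca

GET /my_index/_analyze?field=title.folded <2>
Esta está loca
--------------------------------------------------
<1> Should emit `esta`, `está`, `loca`, keeping the diacritic
<2> Should emit `esta`, `esta`, `loca`, with the diacritic folded away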
@@ -125,7 +123,7 @@ PUT /my_index/my_type/2
 --------------------------------------------------
 
 Now we can search across both fields, using the `multi_match` query in
-<<most-fields,`most_fields` mode>> to combine the scores from each field.
+<<most-fields,`most_fields` mode>> to combine the scores from each field:
 
 
 [source,js]
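The query body sits between these hunks and is not shown. Given the `_validate/query?explain` output discussed in the next hunk, the `multi_match` query plausibly looks something like this sketch:

[source,js]
--------------------------------------------------
GET /my_index/_search
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "está loca",
            "fields": [ "title", "title.folded" ]
        }
    }
}
--------------------------------------------------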
@@ -159,40 +157,39 @@ GET /my_index/_validate/query?explain
 }
 --------------------------------------------------
 
-It searches for the original form of the word `está` in the `title` field,
+The `multi_match` query searches for the original form of the word (`está`) in the `title` field,
 and the form without diacritics `esta` in the `title.folded` field:
 
     (title:está title:loca )
     (title.folded:esta title.folded:loca)
 
-It doesn't matter whether the user searches for `esta` or `está` -- both
+It doesn't matter whether the user searches for `esta` or `está`; both
 documents will match because the form without diacritics exists in the
 `title.folded` field. However, only the original form exists in the `title`
 field. This extra match will push the document containing the original form of
 the word to the top of the results list.
 
-We use the `title.folded` field to ``widen the net'' in order to match more
+We use the `title.folded` field to _widen the net_ in order to match more
 documents, and use the original `title` field to push the most relevant
 document to the top. This same technique can be used wherever an analyzer is
-used to increase matches at the expense of meaning.
+used, to increase matches at the expense of meaning.
 
 [TIP]
 =================================================
 
-The `asciifolding` filter does have an option called `preserve_original` which
+The `asciifolding` filter does have an option called `preserve_original` that
 allows you to index the((("asciifolding character filter", "preserve_original option"))) original token and the folded token in the same
 position in the same field. With this option enabled, you would end up with
-something like:
+something like this:
 
     Position 1     Position 2
    --------------------------
    (ésta,esta)        loca
    --------------------------
 
 While this appears to be a nice way to save space, it does mean that you have
-no way of saying ``Give me an exact match on the original word''. Mixing
-tokens with and without diacritics can also end up interfering with term
-frequency counts, resulting in less reliable relevance calcuations.
+no way of saying, ``Give me an exact match on the original word.'' Mixing
+tokens with and without diacritics can also end up interfering with term-frequency counts, resulting in less-reliable relevance calculations.
 
 As a rule, it is cleaner to index each field variant into a separate field,
 as we have done in this section.
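The TIP mentions `preserve_original` without showing its configuration. The option is set on a custom token-filter definition; a minimal sketch of enabling it (the index, filter, and analyzer names here are illustrative, not from the book):

[source,js]
--------------------------------------------------
PUT /my_index2
{
    "settings": {
        "analysis": {
            "filter": {
                "folding_preserve": { <1>
                    "type":              "asciifolding",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "folding_original": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "folding_preserve" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Emits both the original token (`ésta`) and the folded token (`esta`) at the same position.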
