This repository was archived by the owner on Sep 21, 2021. It is now read-only.
220_Token_normalization/20_Removing_diacritics.asciidoc (+29 -32)
@@ -1,17 +1,16 @@
 [[asciifolding-token-filter]]
-=== You have an accent
+=== You Have an Accent
 
-English only uses diacritics (like `´`, `^` and `¨`) for imported words --
-like `rôle`, ++déjà++ and `däis` -- but usually they are optional. ((("diacritics")))((("tokens", "normalizing", "diacritics"))) Other
+English uses diacritics (like `´`, `^`, and `¨`) only for imported words--like `rôle`, ++déjà++, and `däis`--but usually they are optional. ((("diacritics")))((("tokens", "normalizing", "diacritics"))) Other
 languages require diacritics in order to be correct. Of course, just because
 words are spelled correctly in your index doesn't mean that the user will
 search for the correct spelling.
 
 It is often useful to strip diacritics from words, allowing `rôle` to match
-`role` and vice versa. With Western languages, this can be done with the
+`role`, and vice versa. With Western languages, this can be done with the
 `asciifolding` token filter.((("asciifolding token filter"))) Actually, it does more than just strip
 diacritics. It tries to convert many Unicode characters into a simpler ASCII
-representation, including:
+representation:
 
 * `ß` => `ss`
 * `æ` => `ae`
@@ -22,7 +21,7 @@ representation, including:
 * `⁶` => `6`
 
 Like the `lowercase` filter, the `asciifolding` filter doesn't require any
-configuration but can be included directly in a custom analyzer:
+configuration but can be included directly in a `custom` analyzer:
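
The folding behavior described in the hunks above can be approximated outside Elasticsearch. The following Python sketch is an illustration only, not the filter's actual implementation: `fold_to_ascii` and `SPECIAL_CASES` are hypothetical names, and the two-entry table stands in for the much larger mapping the real filter applies. It assumes that Unicode NFKD decomposition followed by removal of combining marks is a fair stand-in for diacritic stripping.

```python
import unicodedata

# Hypothetical table for characters that have no Unicode decomposition.
# The real asciifolding filter handles far more characters than these two.
SPECIAL_CASES = {"ß": "ss", "æ": "ae"}


def fold_to_ascii(text):
    """Approximate ASCII folding: expand special cases, decompose, drop marks."""
    # Step 1: replace characters that NFKD cannot decompose.
    expanded = "".join(SPECIAL_CASES.get(ch, ch) for ch in text)
    # Step 2: split accented letters into base letter + combining mark,
    # and map compatibility forms such as superscripts to plain characters.
    decomposed = unicodedata.normalize("NFKD", expanded)
    # Step 3: discard the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["rôle", "déjà", "däis", "ß", "æ", "⁶"]:
    print(word, "=>", fold_to_ascii(word))
```

A hand-written table is needed because characters such as `ß` and `æ` are single code points with no decomposition, so normalization alone leaves them untouched.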

 Of course, when you strip diacritical marks from a word, you lose meaning.
 For instance, consider((("diacritics", "stripping, meaning loss from"))) these three ((("Spanish", "stripping diacritics, meaning loss from")))Spanish words:
 
-[horizontal]
-`esta`:: Feminine form of the adjective ``this'' as in ``esta silla'' (this
-chair) or ``esta'' (this one).
+`esta`::
+Feminine form of the adjective _this_, as in _esta silla_ (this chair) or _esta_ (this one).
 
-`ésta`:: An archaic form of `esta`.
+`ésta`::
+An archaic form of `esta`.
 
-`está`:: The third person form of the verb ``estar'' (to be), as in ``está
-feliz'' (he is happy).
+`está`::
+The third-person form of the verb _estar_ (to be), as in _está feliz_ (he is happy).
 
 While we would like to conflate the first two forms, they differ in meaning
 from the third form, which we would like to keep separate. Similarly:
 
-[horizontal]
-`sé`:: The first person form of the verb ``saber'' (to know) as in ``Yo
-sé'' (I know).
+`sé`::
+The first-person form of the verb _saber_ (to know), as in _Yo sé_ (I know).
 
-`se`:: The third person reflexive pronoun used with many verbs, such as
-``se sabe'' (it is known).
+`se`::
+The third-person reflexive pronoun used with many verbs, such as _se sabe_ (it is known).
 
-Unfortunately, there is no easy way to separate out words that should have
-their diacritics removed and words that shouldn't. And it is quite likely
+Unfortunately, there is no easy way to separate words that should have
+their diacritics removed from words that shouldn't. And it is quite likely
 that your users won't know either.
 
 Instead, we index the text twice: once in the original form and once with
@@ -99,8 +97,8 @@ PUT /my_index/_mapping/my_type
 <2> The `title.folded` field uses the `folding` analyzer, which strips
 the diacritical marks.((("folding analyzer")))
 
-You can test out the field mappings using the `analyze` API on the sentence
-``Esta está loca'' (This woman is crazy):
+You can test the field mappings by using the `analyze` API on the sentence
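
To make the "index the text twice" idea concrete, here is a small Python simulation of the two fields. It is a sketch under stated assumptions, not Elasticsearch code: `strip_marks`, `analyze`, and `matches` are hypothetical helpers, whitespace splitting plus lowercasing stands in for the real analyzers, and the unfolded field is assumed to be named `title` (only `title.folded` appears in the hunk above).

```python
import unicodedata


def strip_marks(token):
    """Remove combining marks after NFKD decomposition (assumed folding step)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, fold):
    """Simplified analyzer: lowercase, split on whitespace, optionally fold."""
    tokens = text.lower().split()
    return [strip_marks(t) for t in tokens] if fold else tokens


# Index the same sentence twice: once as-is, once with diacritics stripped.
sentence = "Esta está loca"
doc = {
    "title": analyze(sentence, fold=False),
    "title.folded": analyze(sentence, fold=True),
}


def matches(field, query):
    """True if any analyzed query term appears in the given field's tokens."""
    fold = field == "title.folded"
    return any(term in doc[field] for term in analyze(query, fold))


print(doc["title"])
print(doc["title.folded"])
print(matches("title", "ésta"), matches("title.folded", "ésta"))
```

In this simulation the archaic spelling `ésta` misses the unfolded field but still matches the folded one, while the unfolded field continues to distinguish `esta` from `está`.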