The Japanese (kuromoji) analysis plugin integrates the Lucene kuromoji analysis module into {es}.

The `kuromoji` analyzer uses the following analysis chain:

- `CJKWidthCharFilter` from Lucene
- `kuromoji_tokenizer`
- `kuromoji_baseform` token filter
- `kuromoji_part_of_speech` token filter
- `ja_stop` token filter
- `kuromoji_stemmer` token filter
- {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from the `kuromoji_tokenizer`.
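For example, the following is a minimal sketch (the index name is a placeholder, and `userdict_ja.txt` is assumed to exist in the config directory, as described later on this page) of configuring the `kuromoji` analyzer with these settings:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search",
            "user_dictionary": "userdict_ja.txt"
          }
        }
      }
    }
  }
}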
The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC
dictionary to split text into tokens. The dictionary includes some full-width
characters, such as `ｏ` and `ｆ`. If a text contains full-width characters,
the tokenizer can produce unexpected tokens.

For example, the `kuromoji_tokenizer` tokenizer converts the text
`Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ` to the tokens `[ culture, o, f, japan ]`
instead of `[ culture, of, japan ]`.
To avoid this, add the `icu_normalizer` character filter to a custom analyzer
based on the `kuromoji` analyzer. The `icu_normalizer` character filter
converts full-width characters to their normal equivalents.

First, duplicate the `kuromoji` analyzer to create the basis for a custom
analyzer. Then add the `icu_normalizer` character filter to the custom
analyzer. For example:
PUT index-00001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "kuromoji_normalize": { (1)
            "char_filter": [
              "icu_normalizer" (2)
            ],
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "cjk_width",
              "ja_stop",
              "kuromoji_stemmer",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}
1. Creates a new custom analyzer, `kuromoji_normalize`, based on the `kuromoji` analyzer.
2. Adds the `icu_normalizer` character filter to the analyzer.
The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal iteration marks
(odoriji) to their expanded form. It accepts the following settings:

`normalize_kanji`
- Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`
- Indicates whether kana iteration marks should be normalized. Defaults to `true`.
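For example, here is a minimal sketch (the index, character filter, and analyzer names are placeholders) of a custom analyzer that applies the iteration mark character filter before tokenization:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [ "iteration_mark" ]
          }
        }
      }
    }
  }
}

With this analyzer, text such as 時々 should be expanded to 時時 before it is tokenized.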
The `kuromoji_tokenizer` accepts the following settings:

`mode`
- The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

  `normal`
  - Normal segmentation, no decomposition for compounds. Example output:

        関西国際空港
        アブラカダブラ

  `search`
  - Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

        関西, 関西国際空港, 国際, 空港
        アブラカダブラ

  `extended`
  - Extended mode outputs unigrams for unknown words. Example output:

        関西, 関西国際空港, 国際, 空港
        ア, ブ, ラ, カ, ダ, ブ, ラ

`discard_punctuation`
- Whether punctuation should be discarded from the output. Defaults to `true`.

`lenient`
- Whether the `user_dictionary` should be deduplicated on the provided `text`. Defaults to `false`, in which case duplicates generate an error.

`user_dictionary`
- The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary` may be appended to the default dictionary. The dictionary should have the following CSV format:

      <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
`nbest_cost`/`nbest_examples`
- Additional expert user parameters `nbest_cost` and `nbest_examples` can be used to include additional tokens that are most likely according to the statistical model. If both parameters are used, the largest number of both is applied.

  `nbest_cost`
  - The `nbest_cost` parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the `nbest_cost` value of the best path.

  `nbest_examples`
  - The `nbest_examples` parameter can be used to find a `nbest_cost` value based on examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that, for the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we would like a cost that gives us 箱根 (Hakone) and 成田 (Narita) respectively.
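As an illustration only (the index, tokenizer, and analyzer names are placeholders), a tokenizer using `nbest_examples` might be defined like this:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest_tokenizer": {
            "type": "kuromoji_tokenizer",
            "nbest_examples": "/箱根山-箱根/成田空港-成田/"
          }
        },
        "analyzer": {
          "nbest_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_nbest_tokenizer"
          }
        }
      }
    }
  }
}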
Then, to use the `userdict_ja.txt` file saved earlier, create an analyzer as follows:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}
The above `analyze` request returns the following:
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}
`discard_compound_token`
- Whether original compound tokens should be discarded from the output with `search` mode. Defaults to `false`. Example output with `search` or `extended` mode and this option set to `true`:

      関西, 国際, 空港
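For example, here is a sketch (the index and tokenizer names are placeholders) of a tokenizer that uses `search` mode with `discard_compound_token` enabled:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_no_compound": {
            "type": "kuromoji_tokenizer",
            "mode": "search",
            "discard_compound_token": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_no_compound"
          }
        }
      }
    }
  }
}

Analyzing 関西国際空港 with this analyzer should return 関西, 国際, 空港 without the full compound token.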
NOTE: If a text contains full-width characters, the `kuromoji_tokenizer`
tokenizer can produce unexpected tokens. To avoid this, add the
`icu_normalizer` character filter to your analyzer, as described in the
section on normalizing full-width characters above.
The `kuromoji_baseform` token filter replaces terms with their
`BaseFormAttribute`. This acts as a lemmatizer for verbs and adjectives. Example:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}
Which responds with:
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}
The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`
- An array of part-of-speech tags that should be removed. It defaults to the `stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.
For example:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}
Which responds with:
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}
The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`
- Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "romaji_readingform" ]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" (2)
}
1. Returns `スシ`.
2. Returns `sushi`.
The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character by removing this character
(U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`
- Katakana words shorter than the minimum length are not stemmed (default is `4`).
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" (2)
}
1. Returns `コピー`.
2. Returns `サーバ`.
The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}
The above request returns:
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters. For example:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}
Which results in:
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}
The `hiragana_uppercase` token filter normalizes small letters (捨て仮名) in hiragana into standard letters.
This filter is useful if you want to search against old style Japanese text such as
patents, legal documents, contract policies, etc.
For example:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "hiragana_uppercase"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ちょっとまって"
}
Which results in:
{
  "tokens": [
    {
      "token": "ちよつと",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "まつ",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "て",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
The `katakana_uppercase` token filter normalizes small letters (捨て仮名) in katakana into standard letters.
This filter is useful if you want to search against old style Japanese text such as
patents, legal documents, contract policies, etc.
For example:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "katakana_uppercase"
            ]
          }
        }
      }
    }
  }
}
GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ストップウォッチ"
}
Which results in:
{
  "tokens": [
    {
      "token": "ストツプウオツチ",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
The `kuromoji_completion` token filter adds Japanese romanized tokens to the term attributes along with the original tokens (surface forms).
GET _analyze
{
  "analyzer": "kuromoji_completion",
  "text": "寿司" (1)
}
1. Returns `寿司`, `susi` (Kunrei-shiki) and `sushi` (Hepburn-shiki).
The `kuromoji_completion` token filter accepts the following settings:

`mode`
- The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

  `index`
  - Simple romanization. Expected to be used when indexing.

  `query`
  - Input Method aware romanization. Expected to be used when querying.

  Defaults to `index`.
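As a sketch of how these modes might be wired up (the index, filter, and analyzer names here are placeholders), you could define one completion filter per mode and use them in separate index-time and search-time analyzers:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "completion_index_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "completion_index_filter" ]
          },
          "completion_query_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "completion_query_filter" ]
          }
        },
        "filter": {
          "completion_index_filter": {
            "type": "kuromoji_completion",
            "mode": "index"
          },
          "completion_query_filter": {
            "type": "kuromoji_completion",
            "mode": "query"
          }
        }
      }
    }
  }
}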