Japanese (kuromoji) analysis plugin

The Japanese (kuromoji) analysis plugin integrates the Lucene kuromoji analysis module into {es}.

kuromoji analyzer

The kuromoji analyzer uses the following analysis chain (the same chain that the duplicated custom analyzer below reproduces):

- kuromoji_tokenizer
- kuromoji_baseform token filter
- kuromoji_part_of_speech token filter
- cjk_width token filter
- ja_stop token filter
- kuromoji_stemmer token filter
- lowercase token filter

It supports the mode and user_dictionary settings from kuromoji_tokenizer.
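
For a quick check of the analysis chain, you can call the _analyze API with the bundled analyzer directly (a minimal sketch, assuming the plugin is installed). Based on the search-mode example output shown under kuromoji_tokenizer below, the text 関西国際空港 should produce the tokens 関西, 関西国際空港, 国際, and 空港:

GET _analyze
{
  "analyzer": "kuromoji",
  "text": "関西国際空港"
}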

Normalize full-width characters

The kuromoji_tokenizer tokenizer uses characters from the MeCab-IPADIC dictionary to split text into tokens. The dictionary includes some full-width characters, such as ｏ and ｆ. If a text contains full-width characters, the tokenizer can produce unexpected tokens.

For example, the kuromoji_tokenizer tokenizer converts the text Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ to the tokens [ culture, o, f, japan ] instead of [ culture, of, japan ].

To avoid this, add the icu_normalizer character filter to a custom analyzer based on the kuromoji analyzer. The icu_normalizer character filter converts full-width characters to their normal equivalents.

First, duplicate the kuromoji analyzer to create the basis for a custom analyzer. Then add the icu_normalizer character filter to the custom analyzer. For example:

PUT index-00001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "kuromoji_normalize": {                 (1)
            "char_filter": [
              "icu_normalizer"                    (2)
            ],
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "cjk_width",
              "ja_stop",
              "kuromoji_stemmer",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}
  1. Creates a new custom analyzer, kuromoji_normalize, based on the kuromoji analyzer.

  2. Adds the icu_normalizer character filter to the analyzer.
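
To verify the normalization, run the full-width sample text through the new analyzer (a quick sketch against the index-00001 example above; note that the icu_normalizer character filter is provided by the separate analysis-icu plugin). The request should now return the tokens [ culture, of, japan ]:

GET index-00001/_analyze
{
  "analyzer": "kuromoji_normalize",
  "text": "Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}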

kuromoji_iteration_mark character filter

The kuromoji_iteration_mark character filter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. It accepts the following settings:

normalize_kanji

Indicates whether kanji iteration marks should be normalized. Defaults to true.

normalize_kana

Indicates whether kana iteration marks should be normalized. Defaults to true.
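
As a minimal sketch, the following request defines an analyzer that applies the character filter with both settings spelled out explicitly (the index, filter, and analyzer names are illustrative). With it, a text such as 時々 would be expanded to 時時 before tokenization:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [ "iteration_mark" ]
          }
        }
      }
    }
  }
}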

kuromoji_tokenizer

The kuromoji_tokenizer accepts the following settings:

mode

The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

normal

Normal segmentation, no decomposition for compounds. Example output:

関西国際空港
アブラカダブラ
search

Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

関西, 関西国際空港, 国際, 空港
アブラカダブラ
extended

Extended mode outputs unigrams for unknown words. Example output:

関西, 関西国際空港, 国際, 空港
ア, ブ, ラ, カ, ダ, ブ, ラ
discard_punctuation

Whether punctuation should be discarded from the output. Defaults to true.

lenient

Whether duplicate entries in the user_dictionary should be deduplicated rather than rejected. Defaults to false, in which case duplicate entries generate an error.

user_dictionary

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

You can also inline the rules directly in the tokenizer definition using the user_dictionary_rules option:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
nbest_cost/nbest_examples

Additional expert user parameters nbest_cost and nbest_examples can be used to include additional tokens that are most likely according to the statistical model. If both parameters are set, the larger of the two resulting cost values is applied.

nbest_cost

The nbest_cost parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the nbest_cost value of the best path.

nbest_examples

The nbest_examples parameter can be used to derive an nbest_cost value from examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we would like a cost that gives us 箱根 (Hakone) and 成田 (Narita) respectively.
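
As a sketch, either parameter can be set directly on a custom tokenizer (the index and tokenizer names are illustrative; the nbest_examples value is the one discussed above):

PUT kuromoji_nbest_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest": {
            "type": "kuromoji_tokenizer",
            "nbest_examples": "/箱根山-箱根/成田空港-成田/"
          }
        },
        "analyzer": {
          "nbest_analyzer": {
            "tokenizer": "kuromoji_nbest"
          }
        }
      }
    }
  }
}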

Then, to use the userdict_ja.txt dictionary saved earlier, create an analyzer as follows:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}
discard_compound_token

Whether the original compound tokens should be discarded from the output in search mode. Defaults to false. Example output in search or extended mode with this option set to true:

関西, 国際, 空港
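
A configuration sketch with this option enabled (the tokenizer name is illustrative):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_no_compounds": {
            "type": "kuromoji_tokenizer",
            "mode": "search",
            "discard_compound_token": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_no_compounds"
          }
        }
      }
    }
  }
}
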
Note
If a text contains full-width characters, the kuromoji_tokenizer tokenizer can produce unexpected tokens. To avoid this, add the icu_normalizer character filter to your analyzer. See Normalize full-width characters.

kuromoji_baseform token filter

The kuromoji_baseform token filter replaces terms with their BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}

which responds with:

{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}

kuromoji_part_of_speech token filter

The kuromoji_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags

An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-kuromoji.jar.

For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}

Which responds with:

{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_readingform token filter

The kuromoji_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:

use_romaji

Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined kuromoji_readingform filter, use_romaji is set to true. The default when defining a custom kuromoji_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "romaji_readingform" ]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" (2)
}
  1. Returns スシ.

  2. Returns sushi.

kuromoji_stemmer token filter

The kuromoji_stemmer token filter normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

minimum_length

Katakana words shorter than the minimum length are not stemmed (default is 4).

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" (2)
}
  1. Returns コピー.

  2. Returns サーバ.

ja_stop token filter

The ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}

The above request returns:

{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_number token filter

The kuromoji_number token filter normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters. For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}

Which results in:

{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}

hiragana_uppercase token filter

The hiragana_uppercase token filter normalizes small letters (捨て仮名) in hiragana into standard letters. This filter is useful if you want to search against old-style Japanese text, such as patents, legal documents, or contract policies.

For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "hiragana_uppercase"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ちょっとまって"
}

Which results in:

{
  "tokens": [
    {
      "token": "ちよつと",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "まつ",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "て",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

katakana_uppercase token filter

The katakana_uppercase token filter normalizes small letters (捨て仮名) in katakana into standard letters. This filter is useful if you want to search against old-style Japanese text, such as patents, legal documents, or contract policies.

For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "katakana_uppercase"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ストップウォッチ"
}

Which results in:

{
  "tokens": [
    {
      "token": "ストツプウオツチ",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}

kuromoji_completion token filter

The kuromoji_completion token filter adds Japanese romanized tokens to the term attributes along with the original tokens (surface forms).

GET _analyze
{
  "analyzer": "kuromoji_completion",
  "text": "寿司" (1)
}
  1. Returns 寿司, susi (Kunrei-shiki) and sushi (Hepburn-shiki).

The kuromoji_completion token filter accepts the following settings:

mode

The mode determines which romanization strategy the filter applies. It can be set to:

index

Simple romanization. Expected to be used when indexing.

query

Input Method aware romanization. Expected to be used when querying.

Defaults to index.
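
As a sketch, a custom filter can be defined per mode so that indexing and querying each use the matching romanization (the filter and analyzer names are illustrative):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "completion_index_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "my_completion_index" ]
          },
          "completion_query_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "my_completion_query" ]
          }
        },
        "filter": {
          "my_completion_index": {
            "type": "kuromoji_completion",
            "mode": "index"
          },
          "my_completion_query": {
            "type": "kuromoji_completion",
            "mode": "query"
          }
        }
      }
    }
  }
}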