Nori

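As of Elasticsearch 6.4, the Korean morphological analyzer Nori is available as an official plugin, so you no longer need to install a separate third-party analyzer as before.

Perhaps because Elasticsearch targets a global audience, Nori is not included by default. Install it with the following command (run from the Elasticsearch bin directory) and then restart Elasticsearch.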

./elasticsearch-plugin install analysis-nori

The examples use httpie, which is easier to use than curl.

Creating the index

Create an index named nori_test that uses the nori analyzer.

Create index

echo '{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "korean": {
                        "type": "nori",
                        "stopwords": "_korean_"
                    }
                }
            }
        }
    }
}' |  \
  http PUT http://localhost:9200/nori_test \
  Content-Type:application/json

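Now run morphological analysis on the phrase "지리산남악제 및 군민의날" (roughly, "Jirisan Namak Festival and County Residents' Day").

Morphological analysis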

echo '{
  "analyzer": "korean",
  "text" : "지리산남악제 및 군민의날."
}
' |  \
  http GET http://localhost:9200/nori_test/_analyze \
  Content-Type:application/json

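지리산 (Jirisan) is a proper noun, yet it is split into "지리" and "산". "남악제" is likewise the name of a regional festival, but only "남악" is kept and "제" is dropped. The gaps in the position values (3, 4, and 6 are missing) indicate that the nori analyzer also removed "및" and "의".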


{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "지리",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "산",
            "type": "word"
        },
        {
            "end_offset": 5,
            "position": 2,
            "start_offset": 3,
            "token": "남악",
            "type": "word"
        },
        {
            "end_offset": 11,
            "position": 5,
            "start_offset": 9,
            "token": "군민",
            "type": "word"
        },
        {
            "end_offset": 13,
            "position": 7,
            "start_offset": 12,
            "token": "날",
            "type": "word"
        }
    ]
}
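
The missing position numbers in the response above are the quickest way to spot tokens that the analyzer removed. The following is a minimal sketch of how they could be listed with standard shell tools. It assumes the response was saved to a file; the temporary file and the `missing_positions` helper are hypothetical names, and a trimmed copy of the response above stands in for real httpie output.

```shell
# Hypothetical file: in practice, redirect the httpie output of the
# _analyze request into it. A trimmed copy of the response above is used here.
RESPONSE="$(mktemp)"
cat > "$RESPONSE" <<'EOF'
{"tokens": [
  {"token": "지리", "position": 0},
  {"token": "산",   "position": 1},
  {"token": "남악", "position": 2},
  {"token": "군민", "position": 5},
  {"token": "날",   "position": 7}
]}
EOF

# Collect the position values in order, then print every number that is
# skipped between two neighbouring positions.
missing_positions() {
  grep -o '"position": *[0-9][0-9]*' "$1" | sed 's/[^0-9]//g' |
    awk 'NR > 1 { for (p = prev + 1; p < $1; p++) print p } { prev = $1 }'
}

missing_positions "$RESPONSE"   # prints 3, 4 and 6, one per line
```

Judging by the offsets, positions 3, 4, and 6 correspond to "제", "및", and "의" in the input text.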

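Using a custom analyzer

To solve the problems above, define a custom analyzer in the index.

Changing an analyzer on an existing index requires closing the index, updating its settings, and opening it again, so first close the index with the following command.

Close index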

http POST http://localhost:9200/nori_test/_close \
  Content-Type:application/json

Next, add a tokenizer, and an analyzer that uses it, to the index as follows.

Add the analyzer to the index

echo '{
    "analysis": {
        "tokenizer": {
            "nori_user_dict": {
                "type": "nori_tokenizer",
                "decompound_mode": "mixed",
                "user_dictionary": "userdict_ko.txt"
            }
        },
        "analyzer": {
            "my_analyzer": {
                "type": "custom",
                "tokenizer": "nori_user_dict"
            }
        }
    }
}' |  \
  http PUT http://localhost:9200/nori_test/_settings \
  Content-Type:application/json

The nori tokenizer documentation describes each tokenizer setting in detail. In summary:

nori_user_dict

Declares that the new tokenizer is named nori_user_dict. Reference this name when configuring an analyzer.

type

Set this to nori_tokenizer.

decompound_mode

Specifies how the tokenizer handles compound noun tokens. Three values are available:

Setting	Meaning	Examples
none	Compound nouns are not decomposed.	가거도항 => 가거도항; 가곡역 => 가곡역
discard	Compound nouns are decomposed and the original form is discarded.	가거도항 => 가거도, 항; 가곡역 => 가곡, 역
mixed	Compound nouns are decomposed and the original form is also kept.	가거도항 => 가거도항, 가거도, 항; 가곡역 => 가곡역, 가곡, 역
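
One way to compare the three modes side by side is to send an inline tokenizer definition to the _analyze API instead of changing index settings each time. The sketch below only builds and prints the request bodies. The `analyze_body` helper is a hypothetical name, and the commented-out httpie call assumes a node on localhost:9200 with analysis-nori installed and a version whose _analyze API accepts inline tokenizer definitions.

```shell
# Build an _analyze request body with an inline nori_tokenizer definition.
# $1 = decompound_mode (none, discard, or mixed), $2 = text to analyze
analyze_body() {
  printf '{"tokenizer":{"type":"nori_tokenizer","decompound_mode":"%s"},"text":"%s"}\n' "$1" "$2"
}

for mode in none discard mixed; do
  analyze_body "$mode" "가거도항"
  # To send the body (assumes a running node with analysis-nori installed):
  # analyze_body "$mode" "가거도항" | http GET http://localhost:9200/_analyze Content-Type:application/json
done
```

The helper does no JSON escaping, so it is only suitable for plain text without quotes or backslashes.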

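user_dictionary

Specifies the user-defined dictionary. The dictionary file must be located under the Elasticsearch config directory.

Create the dictionary with an editor or with the commands below. The file has to exist on every node before the analyzer that references it is loaded.

Create the dictionary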

touch config/userdict_ko.txt
echo "지리산" >> config/userdict_ko.txt
echo "남악제" >> config/userdict_ko.txt
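
Because `>>` appends unconditionally, running the commands above twice leaves duplicate lines in the dictionary. A minimal sketch of an idempotent variant follows. `ES_CONFIG` and `add_entry` are hypothetical names introduced for illustration, and a temporary directory stands in for the real config directory when `ES_CONFIG` is unset.

```shell
# Assumption: ES_CONFIG is the Elasticsearch config directory. A temporary
# directory is used when it is unset so the sketch can run anywhere.
ES_CONFIG="${ES_CONFIG:-$(mktemp -d)}"
DICT="$ES_CONFIG/userdict_ko.txt"
touch "$DICT"

# Append an entry only if that exact line is not already in the dictionary.
# grep flags: -q quiet, -x match the whole line, -F treat the pattern as a fixed string
add_entry() {
  grep -qxF -e "$1" "$DICT" || printf '%s\n' "$1" >> "$DICT"
}

add_entry "지리산"
add_entry "남악제"
add_entry "지리산"   # already present, so nothing is appended

cat "$DICT"
```

The nori documentation also describes dictionary entries that spell out a custom segmentation by listing the parts after the compound on the same line; check the documentation for your version before relying on that format.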

Now open the index so that it can be used again.

Open index

http POST http://localhost:9200/nori_test/_open \
  Content-Type:application/json

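Analyzing the same text with the new analyzer shows that 지리산 and 남악제 are now kept as single tokens. Because my_analyzer defines only a tokenizer and no token filters, "및" and "의" also appear in the output.

Morphological analysis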

echo '{
  "analyzer": "my_analyzer",
  "text" : "지리산남악제 및 군민의날."
}
' |  \
  http GET http://localhost:9200/nori_test/_analyze \
  Content-Type:application/json


Morphological analysis result

{
    "tokens": [
        {
            "token": "지리산",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "남악제",
            "start_offset": 3,
            "end_offset": 6,
            "type": "word",
            "position": 1
        },
        {
            "token": "및",
            "start_offset": 7,
            "end_offset": 8,
            "type": "word",
            "position": 2
        },
        {
            "token": "군민",
            "start_offset": 9,
            "end_offset": 11,
            "type": "word",
            "position": 3
        },
        {
            "token": "의",
            "start_offset": 11,
            "end_offset": 12,
            "type": "word",
            "position": 4
        },
        {
            "token": "날",
            "start_offset": 12,
            "end_offset": 13,
            "type": "word",
            "position": 5
        }
    ]
}
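
To see at a glance what the custom analyzer changed, the token lists of the two responses can be compared. This is a rough sketch with standard shell tools. The temporary files and the `tokens` helper are hypothetical stand-ins, and trimmed copies of the two responses from this article replace real httpie output.

```shell
# Trimmed copies of the two responses shown in this article; the temp files
# are stand-ins for wherever you save the real httpie output.
OLD="$(mktemp)"; NEW="$(mktemp)"
cat > "$OLD" <<'EOF'
{"tokens": [{"token": "지리"}, {"token": "산"}, {"token": "남악"},
            {"token": "군민"}, {"token": "날"}]}
EOF
cat > "$NEW" <<'EOF'
{"tokens": [{"token": "지리산"}, {"token": "남악제"}, {"token": "및"},
            {"token": "군민"}, {"token": "의"}, {"token": "날"}]}
EOF

# Print one token per line from an _analyze response file.
tokens() { grep -o '"token": *"[^"]*"' "$1" | cut -d'"' -f4; }

# Tokens emitted by my_analyzer that the original korean analyzer did not emit.
tokens "$OLD" > "$OLD.list"
tokens "$NEW" | grep -vxF -f "$OLD.list"   # prints 지리산, 남악제, 및, 의
```

The pattern matching is deliberately simple and assumes token values contain no escaped quotes; a JSON-aware tool is a better fit for anything beyond a quick check.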
