Configuring the IK Chinese Analyzer for Elasticsearch
1. Environment
Windows 10, JDK 1.8, Elasticsearch 6.0.0
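2. Installing the plugin
From the bin directory, run the following command:
elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip
After a successful install, an analysis-ik folder appears under the plugins directory. You can also install the plugin manually; see https://github.com/medcl/elasticsearch-analysis-ik for details.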
Then restart Elasticsearch. If the startup log reports that the analysis-ik plugin was loaded, the installation succeeded.
Note: the IK analyzer version must match your Elasticsearch version exactly!
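Since the plugin version has to track the Elasticsearch version, it can help to derive the download URL from the version string instead of typing it twice. The following is a minimal sketch: the URL pattern is inferred only from the single v6.0.0 link above, so it is an assumption that other releases follow the same layout, and ik_plugin_url is a hypothetical helper name.

```python
import re

# Pattern inferred from the v6.0.0 link in this article (assumption:
# other releases are published under the same layout).
RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_plugin_url(es_version):
    """Build the IK plugin URL for the given Elasticsearch version."""
    # Require a full three-part version so "6.0" is not silently accepted.
    if not re.fullmatch(r"\d+\.\d+\.\d+", es_version):
        raise ValueError("expected a version like 6.0.0, got %r" % es_version)
    return RELEASE_URL.format(version=es_version)
```

Calling ik_plugin_url("6.0.0") reproduces the link used in the install command, so the same version string feeds both the Elasticsearch download and the plugin install.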
3. Testing
After restarting Elasticsearch, run a tokenization test:
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1 }
  ]
}
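The same test can be scripted outside the console. The sketch below only builds the HTTP request and checks a response; it assumes Elasticsearch listens on http://localhost:9200 (not stated in this article), and build_analyze_request and offsets_match are hypothetical helper names. It also assumes the offsets count characters, which holds for the sample text above.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_smart",
                          base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def offsets_match(text, response):
    """Check that each token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )


# To send the request against a running node:
#   with urllib.request.urlopen(build_analyze_request("中华人民共和国国歌")) as r:
#       response = json.load(r)
```

For the response shown above, offsets_match returns True: the first token covers characters 0 to 7 and the second covers 7 to 9.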
4. Custom dictionary
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "王者荣耀是最好玩的游戏"
}

{
  "tokens": [
    { "token": "王者", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "荣耀", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "是", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 2 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 3 },
    { "token": "好玩", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 4 },
    { "token": "的", "start_offset": 8, "end_offset": 9, "type": "CN_CHAR", "position": 5 },
    { "token": "游戏", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 6 }
  ]
}
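Notice that "王者荣耀" (the game title Honor of Kings) is split into two tokens, because the dictionary contains no entry for it. We can create our own dictionary and add "王者荣耀" to it to get the result we want.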
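To see why a missing dictionary entry causes the split, here is a toy forward maximum matching segmenter. It is not IK's actual algorithm, only a simplified illustration of dictionary-driven segmentation; the small word set is made up for this example.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, only for illustration.
base_words = {"王者", "荣耀", "好玩", "游戏"}
text = "王者荣耀是最好玩的游戏"
before = forward_max_match(text, base_words)
after = forward_max_match(text, base_words | {"王者荣耀"})
```

With the base word set, before starts with the two tokens "王者" and "荣耀"; once "王者荣耀" is in the set, after starts with the single token "王者荣耀". A real analyzer is more elaborate, but the effect of adding a dictionary entry is the same in spirit.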
Create the dictionary file ...\elasticsearch-6.0.0\config\analysis-ik\custom\mydict.dic, saved as UTF-8 with one word per line:
王者荣耀
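If the word list is generated rather than typed by hand, a small script can write the dictionary file. This is a sketch under assumptions: write_dictionary is a hypothetical helper, and the one-word-per-line UTF-8 layout follows the mydict.dic file described above.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write words to a dictionary file, one per line, encoded as UTF-8.

    Blank entries are dropped and duplicates are removed, keeping order.
    """
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    path = Path(path)
    # Create the custom/ folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

For example, write_dictionary(r"custom\mydict.dic", ["王者荣耀"]) writes a single-entry file; the path must be adjusted to the analysis-ik config folder of your installation.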
Then register it in the configuration file ...\elasticsearch-6.0.0\config\analysis-ik\IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
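Before restarting, it can be useful to confirm which dictionaries the configuration file actually registers. The sketch below reads the entry elements with Python's standard XML parser; configured_dictionaries is a hypothetical helper, and splitting multiple paths on ";" is an assumption about how several dictionaries would be listed in one entry.

```python
import xml.etree.ElementTree as ET


def configured_dictionaries(xml_bytes, key="ext_dict"):
    """Return the dictionary paths listed under the given entry key.

    Commented-out entries are ignored by the parser, so only active
    settings are reported.
    """
    root = ET.fromstring(xml_bytes)
    for entry in root.findall("entry"):
        if entry.get("key") == key:
            # Assumption: several paths may be separated by semicolons.
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []
```

Reading the file with open(path, "rb").read() and passing the bytes in should yield ["custom/mydict.dic"] for the configuration shown above, and an empty list for the "ext_stopwords" key.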
After restarting Elasticsearch so that the new dictionary is loaded, run the analysis again. The result is now:
{
  "tokens": [
    { "token": "王者荣耀", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "是", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 1 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 2 },
    { "token": "好玩", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 3 },
    { "token": "的", "start_offset": 8, "end_offset": 9, "type": "CN_CHAR", "position": 4 },
    { "token": "游戏", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 5 }
  ]
}
The same configuration file also lets you set up a custom stop-word dictionary, remote extension dictionaries, and so on.