Installing the Elasticsearch IK analyzer plugin and hot-updating its dictionaries

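Prerequisites

If you find anything incorrect, please leave a comment; corrections are much appreciated!

Elasticsearch is installed
The plugin version installed here is 7.3.1, which must match the Elasticsearch version. Elasticsearch is installed under /home/elk/elasticsearch-7.3.1. If you want to install Elasticsearch 7.3.1, you can follow "Installing Elasticsearch & Kibana on CentOS 7".

Kibana is installed
In this tutorial Kibana is installed under /home/elk/kibana-7.3.1-linux-x86_64. Kibana's Dev Tools are used to run the tokenization checks.

The IK plugin package is downloaded
The package used here is elasticsearch-analysis-ik-7.3.1.zip.
Find the matching version on the official download page and download it.

System and user
The operating system is CentOS 7 and the user is elk. A dedicated directory for software packages, /home/elk/soft, already exists. Kibana is reachable at 192.168.1.14:5601.

Plugin installation

Uploading the plugin package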

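Use xftp to upload the plugin package elasticsearch-analysis-ik-7.3.1.zip to the /home/elk/soft directory on the virtual machine.

Installing the plugin

Run all of the following commands as the elk user.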

    # Stop the Elasticsearch service
    jps | grep Elasticsearch | awk '{print $1}' | xargs kill
    # Confirm that Elasticsearch has stopped: if jps no longer lists an Elasticsearch process, it stopped cleanly
    jps | grep Elasticsearch
    # Install the plugin; type y and press Enter when prompted
    ~/elasticsearch-7.3.1/bin/elasticsearch-plugin install file:///home/elk/soft/elasticsearch-analysis-ik-7.3.1.zip
    # List the installed plugins
    ~/elasticsearch-7.3.1/bin/elasticsearch-plugin list
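If you would rather check the result in a script than read the list by eye, a small helper like the one below can do it. This is only a sketch: the function name has_plugin is made up for this example, and it assumes the IK plugin shows up in the list under the name analysis-ik.

```shell
#!/bin/sh
# has_plugin NAME
# Reads the output of `elasticsearch-plugin list` on stdin and succeeds
# only if one line is exactly NAME.
# (has_plugin is a hypothetical helper, not part of Elasticsearch.)
has_plugin() {
    grep -qxF -- "$1"
}
```

For example: ~/elasticsearch-7.3.1/bin/elasticsearch-plugin list | has_plugin analysis-ik && echo "ik plugin installed"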

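Plugin configuration

The IK plugin supports custom dictionaries from both local and remote sources.
Local dictionary: files in the ~/elasticsearch-7.3.1/config/analysis-ik/ directory are used as dictionaries.
Remote dictionary: a URL points to the extension dictionary.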

Local dictionary configuration

    # Create a custom extension dictionary file
    touch ~/elasticsearch-7.3.1/config/analysis-ik/my_extra.dic
    # Create a stopword file
    touch ~/elasticsearch-7.3.1/config/analysis-ik/my_stopword.dic
    # Edit the IK analyzer configuration file
    vim ~/elasticsearch-7.3.1/config/analysis-ik/IKAnalyzer.cfg.xml

Configure it as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="ext_dict">my_extra.dic</entry>
        <entry key="ext_stopwords">my_stopword.dic</entry>
    </properties>

This configuration specifies one custom dictionary file, my_extra.dic, and one custom stopword file, my_stopword.dic.

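Remote dictionary configuration

See the official configuration documentation.

Remote dictionaries support dynamic refreshing of hot words.

Update condition
The HTTP response must carry two headers, Last-Modified and ETag. Both are strings; as soon as either one changes, the plugin fetches the new word list and updates the dictionary.

Response content requirements
The HTTP response body must contain one word per line, with \n as the line separator.

    # Install nginx. You can also install Apache, or use nginx or Apache running on another machine
    yum install -y epel-release
    yum install -y nginx
    # By default the nginx web root is /usr/share/nginx/html
    # Create the directory /usr/share/nginx/html/ik
    mkdir -p /usr/share/nginx/html/ik
    # Create the custom dictionary file and the custom stopword file
    touch /usr/share/nginx/html/ik/my_extra.dic
    touch /usr/share/nginx/html/ik/my_stopword.dic
    # Start nginx at boot (optional)
    systemctl enable nginx
    # Start nginx
    systemctl start nginx
    # Verify that the dictionary files are reachable. If the two commands below print nothing,
    # nginx started successfully and the (currently empty) files can be accessed
    curl http://192.168.1.14/ik/my_extra.dic
    curl http://192.168.1.14/ik/my_stopword.dic
    # Edit the IK analyzer configuration file
    vim ~/elasticsearch-7.3.1/config/analysis-ik/IKAnalyzer.cfg.xml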

Configure it as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="remote_ext_dict">http://192.168.1.14/ik/my_extra.dic</entry>
        <entry key="remote_ext_stopwords">http://192.168.1.14/ik/my_stopword.dic</entry>
    </properties>

This configuration specifies one custom dictionary file, http://192.168.1.14/ik/my_extra.dic,
and one custom stopword file, http://192.168.1.14/ik/my_stopword.dic.
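Since the update condition depends entirely on the Last-Modified and ETag headers, it can be useful to look at what your web server actually returns for a dictionary file. The helper below is a rough sketch of one way to do that. The name dict_fingerprint is invented for this example; it only filters header text and knows nothing about the plugin's internals.

```shell
#!/bin/sh
# dict_fingerprint
# Reads raw HTTP response headers on stdin and prints only the
# Last-Modified and ETag headers as name=value lines
# (header names lower-cased, carriage returns removed).
# (dict_fingerprint is a hypothetical helper for illustration.)
dict_fingerprint() {
    tr -d '\r' | awk -F': ' '
        tolower($1) == "last-modified" || tolower($1) == "etag" {
            name = tolower($1)
            sub(/^[^:]*: /, "")   # strip "Header-Name: " from the line
            print name "=" $0
        }'
}
```

Usage: curl -sI http://192.168.1.14/ik/my_extra.dic | dict_fingerprint (curl -I sends a HEAD request, so only headers come back). Run it before and after editing the file; if the output differs, at least one of the two headers has changed, which is the condition described above.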

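Testing the plugin

    # Start Elasticsearch
    ~/elasticsearch-7.3.1/bin/elasticsearch -d
    # Start Kibana; started this way, it prints its log output to the terminal
    ~/kibana-7.3.1-linux-x86_64/bin/kibana &

Open Kibana at 192.168.1.14:5601 and click the Dev Tools icon on the main page.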

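Default tokenization test

The tests below cover only the local dictionary. Testing a remote dictionary works essentially the same way, except that Elasticsearch does not need to be restarted.

Run the following request in the left pane of Dev Tools:

    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": ["ElasticSearch是一个基于Lucene的搜索服务器"]
    }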

The response is:

    {
      "tokens" : [
        { "token" : "elasticsearch", "start_offset" : 0, "end_offset" : 13, "type" : "ENGLISH", "position" : 0 },
        { "token" : "是", "start_offset" : 13, "end_offset" : 14, "type" : "CN_CHAR", "position" : 1 },
        { "token" : "一个", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 2 },
        { "token" : "基于", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 3 },
        { "token" : "lucene", "start_offset" : 18, "end_offset" : 24, "type" : "ENGLISH", "position" : 4 },
        { "token" : "的", "start_offset" : 24, "end_offset" : 25, "type" : "CN_CHAR", "position" : 5 },
        { "token" : "搜索", "start_offset" : 25, "end_offset" : 27, "type" : "CN_WORD", "position" : 6 },
        { "token" : "服务器", "start_offset" : 27, "end_offset" : 30, "type" : "CN_WORD", "position" : 7 }
      ]
    }

The out-of-the-box tokenization is already quite good.
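If Kibana is not available, the same request can probably be sent with curl. The sketch below assumes Elasticsearch answers HTTP on 192.168.1.14:9200; 9200 is the default HTTP port, but this tutorial never states the address, so adjust ES_URL to your setup. The function names analyze_body and ik_analyze are invented for this example.

```shell
#!/bin/sh
# Assumed Elasticsearch HTTP address (not given in this tutorial).
ES_URL=${ES_URL:-http://192.168.1.14:9200}

# analyze_body ANALYZER TEXT
# Prints the JSON body for the _analyze API.
# No JSON escaping is done, so TEXT must not contain double quotes
# or backslashes.
analyze_body() {
    printf '{"analyzer":"%s","text":["%s"]}\n' "$1" "$2"
}

# ik_analyze ANALYZER TEXT
# Sends the body to the _analyze endpoint and prints the response.
ik_analyze() {
    curl -s -H 'Content-Type: application/json' \
         -X POST "$ES_URL/_analyze?pretty" \
         -d "$(analyze_body "$1" "$2")"
}
```

For example: ik_analyze ik_smart "ElasticSearch是一个基于Lucene的搜索服务器"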

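Custom extension word test

Suppose we want 搜索服务器 ("search server") to be treated as a single word, and we also want the word 的 to be filtered out.

    # Add a custom word; separate multiple words with newlines
    echo 搜索服务器 > ~/elasticsearch-7.3.1/config/analysis-ik/my_extra.dic
    # Add a custom stopword; separate multiple stopwords with newlines
    echo 的 > ~/elasticsearch-7.3.1/config/analysis-ik/my_stopword.dic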

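At present the local dictionary (local file) approach does not support hot updates, so Elasticsearch must be restarted before the changes take effect.

    # Stop the Elasticsearch service
    jps | grep Elasticsearch | awk '{print $1}' | xargs kill
    # Confirm that Elasticsearch has stopped: if jps no longer lists an Elasticsearch process, it stopped cleanly
    jps | grep Elasticsearch
    # Start Elasticsearch
    ~/elasticsearch-7.3.1/bin/elasticsearch -d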

Run the same request again in Dev Tools:

    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": ["ElasticSearch是一个基于Lucene的搜索服务器"]
    }

The response is:

    {
      "tokens" : [
        { "token" : "elasticsearch", "start_offset" : 0, "end_offset" : 13, "type" : "ENGLISH", "position" : 0 },
        { "token" : "是", "start_offset" : 13, "end_offset" : 14, "type" : "CN_CHAR", "position" : 1 },
        { "token" : "一个", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 2 },
        { "token" : "基于", "start_offset" : 16, "end_offset" : 18, "type" : "CN_WORD", "position" : 3 },
        { "token" : "lucene", "start_offset" : 18, "end_offset" : 24, "type" : "ENGLISH", "position" : 4 },
        { "token" : "搜索服务器", "start_offset" : 25, "end_offset" : 30, "type" : "CN_WORD", "position" : 5 }
      ]
    }

The result now treats 搜索服务器 as a single word, and the character 的 has been removed.
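With the remote dictionary, the equivalent step is to edit the file that nginx serves, without restarting Elasticsearch. Because the file must stay one word per line, a small helper that appends a word only when it is missing can keep repeated edits tidy. This is a sketch under assumptions: add_word is a name invented here, and the user running it needs write permission on the dictionary file.

```shell
#!/bin/sh
# add_word DICT_FILE WORD
# Appends WORD as its own line to DICT_FILE unless an identical line
# already exists. Creates the file if it is missing.
# (add_word is a hypothetical helper for illustration.)
add_word() {
    dict_file=$1
    word=$2
    touch "$dict_file"
    if ! grep -qxF -- "$word" "$dict_file"; then
        # If the file does not end with a newline, add one first so the
        # new word does not get glued onto the previous line.
        if [ -s "$dict_file" ] && [ -n "$(tail -c 1 "$dict_file")" ]; then
            printf '\n' >> "$dict_file"
        fi
        printf '%s\n' "$word" >> "$dict_file"
    fi
}
```

For example: add_word /usr/share/nginx/html/ik/my_extra.dic 搜索服务器 followed by add_word /usr/share/nginx/html/ik/my_stopword.dic 的. Once the plugin notices the changed headers, running the same _analyze request again should show the new tokenization.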