ES集成IK分词器

# 40.ES集成IK分词器

由于ES使用的也是默认分析器，对中文的分析结果不好，本文就来解决这个问题

‍

# 查看分析器的分词效果

在集成中文分析器之前，我们先看看分词效果，不然集成后也看不出效果

我们可以使用postman，发送这样的请求：

http://127.0.0.1:9200/_analyze?analyzer=standard&pretty=true&text=Crazy Thursday V me fifty

参数说明：

_analyze：说明要分析
analyzer：要使用的分析器
pretty：美化下返回的结果
text：要分析的文本

‍

分析结果：可以看到每个词是一个关键字。

{
    "tokens": [
        {
            "token": "crazy",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "thursday",
            "start_offset": 6,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "v",
            "start_offset": 15,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "me",
            "start_offset": 17,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "fifty",
            "start_offset": 20,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

‍

而如果我们要分析的内容是中文：

get http://127.0.0.1:9200/_analyze?analyzer=standard&pretty=true&text=我是程序员

‍

返回结果：可以看到每个字一个关键词

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "程",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "序",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

‍

# 集成IK分析器

支持中文分词的分词器有很多，word分词器、庖丁解牛、盘古分词、Ansj分词等，但我们常用的还是下面要介绍的IK分词器。

在我的百度云网盘 (opens new window)中，分享了IK的插件，路径为编程资料/Java相关/06.主流框架/20.Elasticsearch/ik-analyzer.zip，读者可以下载；

下载后，我们将这个压缩包放到ES的plugins目录下，然后解压缩，可以看到这样的内容：

‍

主要是该插件依赖的一些jar包，以及一些配置文件。plugin-descriptor.properties是插件的配置，这里我们不用动。

‍

然后我们重启ES，启动日志中能看到加载了ik插件：

‍

# 测试

IK提供了两个分词算法ik_smart（最少切分）和 ik_max_word（最细粒度划分），我们分别来试一下

‍

我们在postman中发送请求：

http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=我是程序员

‍

返回结果：

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "程序员",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

‍

如果是ik_max_word：

http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=我是程序员

1
2

‍

结果：

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "程序员",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "程序",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 4
        }
    ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

‍

# 在索引库中使用分析器

‍

# 初始化索引

之前我们创建索引的时候，使用的是标准分析器，由于不能修改type的mappings（虽然可以新增type并指定分析器），我们删除所有索引后新建。

‍

发送PUT请求：

{
    "mappings": {
        "hello": {
            "properties": {
                "id": {
                    "type":"long",
                    "store": true
                },
                "title": {
                    "type":"text",
                    "store": true,
                    "analyzer": "ik_smart"
                },
                "content": {
                    "type":"text",
                    "store": true,
                    "analyzer": "ik_smart"
                }
            }
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

‍

返回结果：创建成功

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "blog"
}

1
2
3
4
5

‍

# 添加数据

我们新增下数据（和之前数据一样）

请求URL：POST 127.0.0.1:9200/blog/hello/2

{
    "id": 2,
    "title": "关于买房",
    "content": "虚假的挥霍：花几千块钱去看看海。真正的挥霍：花100万交个首付买一个名义面积70平，实际面积55平，再还30年贷款的小户型住房"
}

1
2
3
4
5

‍

请求URL：POST 127.0.0.1:9200/blog/hello/3

{
    "id": 3,
    "title": "关于梦想",
    "content": "编程对我而言，就像是一颗小小的，微弱的希望的种子，我甚至都不愿意让人看见它。生怕有人看见了便要嘲讽它，它太脆弱了，经不起别人的质疑"
}

1
2
3
4
5

‍

请求URL：POST 127.0.0.1:9200/blog/hello/4

{
    "id": 4,
    "title": "关于打工",
    "content": "其实，我对公司是有点失望的。当初给我司定位为大厂，是高于公司简介的水平的。我是希望进来后，公司能够拼一把，快速成长起来的。大厂这个层级不是只发工资就可以的，公司需要有体系化的待遇。你们给的待遇，它的价值在哪里? 公司是否作出了壁垒形成了核心竞争力? 公司的待遇，和其他公司的差异化在哪里？为什么我来你这上班，我不能去别人那上班吗? 公司需要听从员工的意见，而不是想做什么就做什么，我需要符合劳动法的公司。"
}

1
2
3
4
5

‍

# 测试

‍

我们请求URL： POST http://127.0.0.1:9200/blog/hello/_search，发送的请求体：

{
    "query": {
        "term": {
            "title": "梦想"
        }
    }
}

1
2
3
4
5
6
7

‍

响应结果：

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.25811607,
        "hits": [
            {
                "_index": "blog",
                "_type": "hello",
                "_id": "3",
                "_score": 0.25811607,
                "_source": {
                    "id": 3,
                    "title": "关于梦想",
                    "content": "编程对我而言，就像是一颗小小的，微弱的希望的种子，我甚至都不愿意让人看见它。生怕有人看见了便要嘲讽它，它太脆弱了，经不起别人的质疑"
                }
            }
        ]
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

‍

也可以用query_string，还是请求POST http://127.0.0.1:9200/blog/hello/_search。请求体：

{
    "query": {
        "query_string": {
            "default_field": "content",
            "query": "当初给我司定位为大厂"
        }
    }
}

1
2
3
4
5
6
7
8

‍

响应结果：

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 3.5317287,
        "hits": [
            {
                "_index": "blog",
                "_type": "hello",
                "_id": "4",
                "_score": 3.5317287,
                "_source": {
                    "id": 4,
                    "title": "关于打工",
                    "content": "其实，我对公司是有点失望的。当初给我司定位为大厂，是高于公司简介的水平的。我是希望进来后，公司能够拼一把，快速成长起来的。大厂这个层级不是只发工资就可以的，公司需要有体系化的待遇。你们给的待遇，它的价值在哪里? 公司是否作出了壁垒形成了核心竞争力? 公司的待遇，和其他公司的差异化在哪里？为什么我来你这上班，我不能去别人那上班吗? 公司需要听从员工的意见，而不是想做什么就做什么，我需要符合劳动法的公司。"
                }
            }
        ]
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

‍

在GitHub上编辑此页

上次更新: 2024/1/23 08:20:41

← ES索引库的查询 ES集群→