Lucene 入门案例

# 20.Lucene 入门案例

接下来我们用 Lucene 做一个小案例 ‍

# 需求

实现一个文件的搜索功能，通过关键字搜索文件，凡是文件名或文件内容包括关键字的文件都需要找出来。还可以根据中文词语进行查询，并且需要支持多个条件查询。本案例中的原始内容就是磁盘上的文件，如下图：

‍

# 环境准备

新建一个 Maven 项目，导入依赖：

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>7.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>7.4.0</version>
</dependency>

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

说明：

lucene-core：核心
lucene-analyzers-common：分析器
lucene-queryparser：查询分析器

‍

注意 7.4.0 要求 Java8 及以上。

或者从官网下载最新版：Apache Lucene (opens new window)，历史版本可以在这里 (opens new window) 下载。注意有 Lucene 有很多 jar 包，按需引入

为了方便，我们还引入 commons-io 和 Junit

‍

# 创建索引

步骤如下：

创建一个 Director 对象，指定索引库保存的位置。
基于 Directory 对象创建一个 IndexWriter 对象（用来写索引）
读取磁盘上的文件，对应每个文件创建一个文档对象。
创建 field 对象，将 field 添加到 document 对象中。
使用 IndexWriter 对象将 document 对象写入索引库，此过程进行索引创建，并将索引和 document 对象写入索引库。
关闭 IndexWriter 对象。

‍

我们新建一个临时目录用来存放索引，例如我用的是 D:\temp\index。虽然也可以将索引保存在内存中，到那时这样容易丢失索引，且索引本身创建就耗时，这里就先不演示了。

然后我们创建 D:\temp\searchsource 目录，里面存放我们的原始文档（可以在我的 Gitee (opens new window) 或 GitHub (opens new window) 上下载）

‍

第一步：创建一个 Director 对象，指定索引库保存的位置。

Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());

‍

第二步：基于 Directory 对象创建一个 IndexWriter 对象（用来写索引），需要传入一个配置对象 IndexWriterConfig，这里我们用默认的。

IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());

‍

第三、四步：读取磁盘上的文件，对应每个文件创建一个文档对象：

首先得到一个 FileList
然后读取每个文件的名字，路径，内容，大小
创建对应的 Filed 字段
创建一个新的 Document 对象，然后填充 Field 字段
写入索引库

File dir = new File("D:\\temp\\searchsource");
File[] files = dir.listFiles();
for(File file : files) {
    String fileName = file.getName();
    String filePath = file.getPath();
    String fileContent = FileUtils.readFileToString(file, "utf-8");
    long fileSize = FileUtils.sizeOf(file);

    // 创建Field
    // 参数1：域的名称, 参数2：域的内容, 参数3：是否存储
    Field fieldName = new TextField("name", fileName, Field.Store.YES);
    Field fieldPath = new TextField("path", filePath, Field.Store.YES);
    Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
    Field fieldSize = new TextField("size", String.valueOf(fileSize), Field.Store.YES);

    // 创建文档对象
    Document document = new Document();
    document.add(fieldName);
    document.add(fieldPath);
    document.add(fieldContent);
    document.add(fieldSize);

    //  4. 创建field对象，将field添加到document对象中。

    //  5. 使用IndexWriter对象将document对象写入索引库，此过程进行索引创建，并将索引和document对象写入索引库。
    indexWriter.addDocument(document);

}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

‍

最后关闭即可：

indexWriter.close();

‍

完整代码：

package com.peterjxl.lucene;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import java.io.File;

public class LuceneFirst {
 
    @Test
    public void createIndex() throws Exception {
        //  1. 创建一个Director对象，指定索引库保存的位置。
        Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());

        //  2. 基于Directory对象创建一个IndexWriter对象（用来写索引）
        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());

        //  3. 读取磁盘上的文件，对应每个文件创建一个文档对象。
        File dir = new File("D:\\temp\\searchsource");
        File[] files = dir.listFiles();
        for(File file : files) {
            String fileName = file.getName();
            String filePath = file.getPath();
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            long fileSize = FileUtils.sizeOf(file);

            // 4. 创建Field
            // 参数1：域的名称, 参数2：域的内容, 参数3：是否存储
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("size", String.valueOf(fileSize), Field.Store.YES);

            // 4. 将field添加到document对象中。
            Document document = new Document();
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);
        
            //  5. 使用IndexWriter对象将document对象写入索引库，此过程进行索引创建，并将索引和document对象写入索引库。
            indexWriter.addDocument(document);
        }

        //  6. 关闭IndexWriter对象。
        indexWriter.close();
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

‍

我们运行，然后打开 D:\temp\index，可以看到有一些索引文件：

‍

# 使用 Luke

索引文件是二进制文件，如果我们想直接看是不行的；我们可以借住一些工具来查看，例如 Luke。Luke 是一个用于 Lucene/Solr/Elasticsearch 搜索引擎的，方便开发和诊断的 GUI（可视化）工具。

Luke 的 GitHub 地址：DmitryKey/luke (opens new window)，这里我们使用 7.4.0 的，可以去 GitHub (opens new window) 上下载或去我的百度云 (opens new window) 上下载，相关路径为 Java相关/06.主流框架/10.Lucene

注意：

由于 Luke 的兼容性不太好，不同版本的 Lucene 生成的索引要使用对应版本的 Luke 进行分析，如果版本过低会导致无法正确解析索引。
下载后，直接启动 luke.bat 即可，该项目是使用 JavaFX 开发的。需要注意的是
此版本的 Luke 是 jdk9 编译的，按理来说运行此工具还需要 JDK9，但实测 Java8 也可以正常使用

‍

运行后，我们选择索引所在的位置，然后确定：

‍

然后我们就可以看到索引库的概况：

例如索引库的基本信息：索引位置，Field 的数量等
下边还可以看到关键字的内容，例如关键字是什么，频率是多少（Frequency）等

‍

在 Documents 选修课，还能看到每个文档的内容。例如目前是 0 号文档，下方则是该文档的各个字段（之前我们设置了 Field.Store 为 Yes，所以存储了）

‍

还有 search 标签，可以试试搜索功能：

‍

之后的 Luke 的其他选项卡就用的较少，这里就不一一介绍了。

‍

# 查询索引

接下来我们继续写代码，实现查询索引的功能。步骤：

第一步：创建一个 Directory 对象，也就是索引库存放的位置。
第二步：创建一个 indexReader 对象，需要指定 Directory 对象。
第三步：创建一个 indexsearcher 对象，需要指定 IndexReader 对象
第四步：创建一个 TermQuery 对象，指定查询的域和查询的关键词。
第五步：执行查询。得到一个 TopDocs 对象，包含查询结果的总记录数，和文档列表
第六步：返回查询结果。遍历查询结果并输出。
第七步：关闭 IndexReader 对象

‍

代码如下：

@Test
public void searchIndex() throws Exception {
    // 1. 第一步：创建一个Directory对象，也就是索引库存放的位置。
    Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());

    // 2. 第二步：创建一个indexReader对象，需要指定Directory对象。
    IndexReader indexReader = DirectoryReader.open(directory);

    // 3. 第三步：创建一个indexsearcher对象，需要指定IndexReader对象
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    // 4. 第四步：创建一个TermQuery对象，指定查询的域和查询的关键词。
    Query query = new TermQuery(new Term("content", "spring"));


    // 5. 第五步：执行查询。得到一个TopDocs对象，包含查询结果的总记录数，和文档列表
    // 参数1：查询对象 参数2：查询结果返回的最大记录数
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("查询出来的总记录数：" + topDocs.totalHits);

    // 6. 第六步：返回查询结果。遍历查询结果并输出。
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        // 取文档id
        int docId = scoreDoc.doc;
        // 根据id取文档对象
        Document document = indexSearcher.doc(docId);
        // 取文档的属性
        System.out.println("name: " + document.get("name"));
        System.out.println("path: " + document.get("path"));
        System.out.println("size: " + document.get("size"));
        //System.out.println("content: " + document.get("content"));
        System.out.println("-------------分割线-----------------");
    }

    // 7. 第七步：关闭IndexReader对象
    indexReader.close();
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

‍

为了方便，我们不输出文档的内容，太多了。运行结果：

查询出来的总记录数：6
name: spring_README.txt
path: D:\temp\searchsource\spring_README.txt
size: 3257
-------------分割线-----------------
name: springmvc.txt
path: D:\temp\searchsource\springmvc.txt
size: 2124
-------------分割线-----------------
name: 2.Serving Web Content.txt
path: D:\temp\searchsource\2.Serving Web Content.txt
size: 35
-------------分割线-----------------
name: 1.create web page.txt
path: D:\temp\searchsource\1.create web page.txt
size: 47
-------------分割线-----------------
name: spring.txt
path: D:\temp\searchsource\spring.txt
size: 82
-------------分割线-----------------
name: cxf_README.txt
path: D:\temp\searchsource\cxf_README.txt
size: 3770
-------------分割线-----------------

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

‍

# 源码

已将源码上传到 Gitee (opens new window) 或 GitHub (opens new window) 上。并且创建了分支 demo1，读者可以通过切换分支来查看本文的示例代码

‍

上次更新: 2024/10/3 10:01:52

← Lucene 概述分析器→