ES近义词匹配

玉兰 • 2023-05-30 20:08 • 杂文

ES近义词匹配

ES近义词匹配搜索需要用户提供一张满足相应格式的近义词表，并在创建索引时设计将该表放入settings中。

近义词表的可以直接以字符串的形式写入settings中也可以放入文本文件中，由es读取。

近义词表格式

近义词表需要满足以下格式要求：

A => B,C格式
- 这种格式在搜索时会将搜索词A替换成B、C，且B，C互不为同义词
A,B,C,D 格式

这种格式得分情况讨论：

当expand == true时，这种格式等价于A,B,C,D => A,B,C,D即ABCD互为同义词
当expand == false时，这种格式等价于A,B,C,D => A，即ABCD四个词在搜索时会被替换成A

如何使用近义词表进行查询

建立索引

PUT /fond_goods
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_whitespace":{
          "tokenizer":"whitespace",
          "filter": ["synonymous_filter"]
        }
      },
      "filter": {
        "synonymous_filter":{
          "type": "synonym",
          "expand": true
          "synonyms": [
            "A, B, C, D"
            ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code":{
        "type": "keyword"
      },
      "context":{
        "type": "text",
        "analyzer": "my_whitespace"
      },
      "color":{
        "type": "text",
        "analyzer": "my_whitespace"
      }
    }
  }
}

参数解释

expand 默认值为 true 。
lenient 默认值为false 若lenient值为true， es会忽略转换近义词文件时的报错。值得注意的是，只有当遇到近义词无法转换时出现的异常才会被忽略掉，具体例子可以参考官网 [ https://www.elastic.co/guide/en/elasticsearch/reference/7.16/analysis-synonym-tokenfilter.html ]。
synonyms近义词表，即开始所说要按格式填写的近义词表。
synonyms也可替换成synonyms_path，此时需要填写一个外部文件的路径。该文件可以是某个外部的网页，也可以是存放在本地的文件。
format 当该参数值为wordnet时，可以使用wordnet英文词汇数据库中的近义词。

使用案例

构建索引

PUT /fond_goods
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_whitespace":{................................................................ I
          "tokenizer":"whitespace",
          "filter": ["synonymous_filter"]
        }
      },
      "filter": {
        "synonymous_filter":{
          "type": "synonym",
          "synonyms_path": "synonym.txt"................................................. II
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code":{
        "type": "keyword"
      },
      "context":{
        "type": "text",
        "analyzer": "my_whitespace"
      },
      "color":{
        "type": "text",
        "analyzer": "my_whitespace"
      }
    }
  }
}

注：

  I：`my_whitespace`为自定义分词器

  II：此处的synonyms_path为es文件夹中以config文件夹为基准的相对路径

在相应路径中存入近义词文件

Women,women,girl,girls
yellow,orange,wheat
blue,skyblue
white,snow,silver
dress,dresses,skirt,skirts
autumn,fall
shirt,shirts
A,B,C

存入测试数据

POST _bulk
{"index" : {"_index" : "fond_goods", "_id":1}}
{"code" : 1,"context" : "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt", "color": "red"}
{"index" : {"_index" : "fond_goods", "_id":2}}
{"code" : 2,"context" : "2021 warmth pullover sweater fall", "color": "blue"}
{"index" : {"_index" : "fond_goods", "_id":3}}
{"code" : 3,"context" : "early autumn elegant dress women dress 2021 autumn new long sleeve", "color": "yellow"}
{"index" : {"_index" : "fond_goods", "_id":4}}
{"code" : 4,"context" : "2021 autumn new  sweater yama autumn and winter female  autumn and winter dot cardigan knitted coat", "color": "snow"}
{"index" : {"_index" : "fond_goods", "_id":5}}
{"code" : 5,"context" : "za satin party dinner skirts suits woemn sexy bandage shirts and high split skirt elegant luxurious female dinner sets", "color": "white"}
{"index" : {"_index" : "fond_goods", "_id":6}}
{"code" : 6,"context" : "big bow tie sweet puff sleeve shirt dress long sleeve shirt skirt solid color shirt dress short skirt ", "color": "moss green"}
{"index" : {"_index" : "fond_goods", "_id":7}}
{"code" : 7,"context" : "casual button plaid short skirts women streetwear a-line summer skirts female high waist yellow autumn short skirts", "color": "skyblue "}
{"index" : {"_index" : "fond_goods", "_id":8}}
{"code" : 8,"context" : "muslim middle east women fashion dress abaya long dress muslim dress arab dress dres", "color": "orange"}
{"index" : {"_index" : "fond_goods", "_id":9}}
{"code" : 9,"context" : "sexy white party dresses autumn winter sexy mini dresses women fashion solid color off shoulder short", "color": "wheat"}
{"index" : {"_index" : "fond_goods", "_id":10}}
{"code" : 10,"context" : "women green patchwork buttons bodycon mini dresses all-match office ladies long shirt dresses autumn party vestidos new", "color": "silver"}
{"index" : {"_index" : "fond_goods_demo", "_id":11}}
{"code" : 11,"context" : "A", "color": "silver"}
{"index" : {"_index" : "fond_goods_demo", "_id":12}}
{"code" : 12,"context" : "B", "color": "silver"}
{"index" : {"_index" : "fond_goods_demo", "_id":13}}
{"code" : 13,"context" : "C", "color": "silver"}

简单应用

简单尝试一下近义词库查询

查询条件

GET fond_goods/_search
{
  "query": {
    "match": {
      "context": "A"
    }
  }
}

查询结果

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.7354302,
    "hits" : [
      {
        "_index" : "fond_goods",
        "_type" : "_doc",
        "_id" : "11",
        "_score" : 2.7354302,
        "_source" : {
          "code" : 11,
          "context" : "A",
          "color" : "silver"
        }
      },
      {
        "_index" : "fond_goods",
        "_type" : "_doc",
        "_id" : "12",
        "_score" : 2.7354302,
        "_source" : {
          "code" : 12,
          "context" : "B",
          "color" : "silver"
        }
      },
      {
        "_index" : "fond_goods",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 2.7354302,
        "_source" : {
          "code" : 13,
          "context" : "C",
          "color" : "silver"
        }
      }
    ]
  }
}

删除数据

删除语句

POST fond_goods/_delete_by_query
{
  "query": {
    "match": {
      "context": "A"
    }
  }
}

删除结果

{
  "took" : 5,
  "timed_out" : false,
  "total" : 3,
  "deleted" : 3,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

我们一共插入了三条A、B、C这组同义词的数据，一共删除了三条数据；可以看出，在删除时，我们也将A的近义词B、C给删除了

结论

我们使用A为查询条件，但结果中出现了B、C的数据，即近义词查询成功
我们以A为查询条件，而结果的相关性打分中，B、C的得分与A一致，即表明在查询时，A、B、C是完全等价的，es的相关性打分无法做出区分
在根据条件删除数据时，近义词的数据也会一同删除

动态更新近义词文件

es本身提供的近义词功能是在项目启动时读取近义词表文件，并且每一次近义词表文件有更新时都得重启才能再次读取，这就给我们项目使用带来了很大的不便性。

可以使用一款叫做 elasticsearch-analysis-dynamic-synonym的es插件来动态读取近义词文件

插件地址

https://github.com/bells/elasticsearch-analysis-dynamic-synonym

插件使用方法

插件使用方法在项目中有详细介绍，这里简单介绍一下

拷贝项目到本地
将项目打包
在es的 plugins/ 文件夹中新建dynamic-synonym文件夹
将target/releases/elasticsearch-analysis-dynamic-synonym-{version}.zip文件解压到dynamic-synonym中
创建es索引时将同义词配置中的"type": "synonym"

      "filter": {
        "synonymous_filter":{
          "type": "synonym",
          "synonyms_path": "synonym.txt"
        }
      }

修改成"type": "dynamic_synonym"

      "filter": {
        "synonymous_filter":{
          "type": "dynamic_synonym",
          "synonyms_path": "synonym.txt"
        }
      }

注：该插件还提供了一个可选参数interval，即刷新同义词文件时间间隔，默认值为60s

他与原有操作一致，至此，每隔60s，es会自动获取一次同义词文件修改时间，如有变化，es会重新载入同义词文件

同义词查询原理

分词

想了解同义词查询的原理就必须先了解es的分词（Trem）。ES中的分词（Analysis）就是把一段文本拆分成一系列的单词，也叫做文本分析。在es中，分析器（Analyzer）负责处理这一系列操作。

分词演示

ES的分词器主要由字符过滤器（Character Filter）、分词器（Tokenizer）、分词过滤器（Token Filter）组成。

字符过滤器（Character Filter）
1. 以字符流的形式接受文本，并可以通过添加、删除或更改字符来转化文本。
2. 一个Analyzer可以由0个或多个字符过滤器
分词器（Tokenizer）
1. 对经过字符过滤器过滤后的文本按照一定规则分词。一个Analyzer只允许有一个分词器
分词过滤器（Token Filter）
1. 针对分词后的token再次进行过滤，可以增删和修改token，一个分词器中可以有多个token过滤器

同义词过滤器

同义词查询的关键其实就是自定义Token过滤器。该过滤器在收到分词器发过来的数据（我暂时将其称之为分词数据）时，会先读取用户存放的近义词文件，比对分词数据。当出现同义词时，Token过滤器就按照近义词文件配置的规则选定带搜索词组，进行同义词搜索。

我们可以拿之前的索引做个试验：我们的索引使用的是自定义的分析器my_whitespace，其中分词器是whitespace空格分词器，而token Filter 使用的是自定义的近义词过滤器。由上述可知，我们自定义的分析器与官方自带的whitespace分析器唯一的差别就在token Filter上。

我们使用官方的`whitespace`分析器来看一下分词情况：

GET fond_goods/_analyze
{
  "analyzer": "whitespace",
  "field":"context", 
  "text": "A"
}

结果

{
  "tokens" : [
    {
      "token" : "A",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

在经过分析器后，字符A被分成了 "A"这一个分词

再来尝试一个长度更长的字符串

GET fond_goods/_analyze
{
  "analyzer": "whitespace",
  "field":"context", 
  "text": "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt"
}

结果

{
  "tokens" : [
    {
      "token" : "ruffled",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "shirt",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "for",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "women",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "2021",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "fall",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "slim",
      "start_offset" : 34,
      "end_offset" : 38,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "fit",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "pure",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "color",
      "start_offset" : 48,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "all",
      "start_offset" : 54,
      "end_offset" : 57,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "matching",
      "start_offset" : 58,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "off-neck",
      "start_offset" : 67,
      "end_offset" : 75,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "lantern",
      "start_offset" : 76,
      "end_offset" : 83,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "long",
      "start_offset" : 84,
      "end_offset" : 88,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "sleeve",
      "start_offset" : 89,
      "end_offset" : 95,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "slim",
      "start_offset" : 96,
      "end_offset" : 100,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "women",
      "start_offset" : 101,
      "end_offset" : 106,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "short",
      "start_offset" : 107,
      "end_offset" : 112,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "shirt",
      "start_offset" : 113,
      "end_offset" : 118,
      "type" : "word",
      "position" : 19
    }
  ]
}

结果

可以看到，whitespace分析器将输入字符串按照空格拆分成了如上结果

我们再来试试自定义的分析器

GET fond_goods/_analyze
{
  "analyzer": "my_whitespace",
  "field":"context", 
  "text": "A"
}

结果

{
  "tokens" : [
    {
      "token" : "A",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "B",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "C",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

经过分析器后，A这个字符被分成了 A、B、C三个分词，且在type字段上有作区分，A被标记为word，B、C被标记为SYNONYM

我们再尝试一下长字符串（注：在近义词文件中，我们定义了shirt,shirts为一组近义词；Women,women,girl,girls为一组近义词）

GET fond_goods/_analyze
{
  "analyzer": "my_whitespace",
  "field":"context", 
  "text": "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt"
}

结果

{
  "tokens" : [
    {
      "token" : "ruffled",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "shirt",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shirts",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "for",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "women",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "Women",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "girl",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "girls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "2021",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "fall",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "autumn",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "slim",
      "start_offset" : 34,
      "end_offset" : 38,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "fit",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "pure",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "color",
      "start_offset" : 48,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "all",
      "start_offset" : 54,
      "end_offset" : 57,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "matching",
      "start_offset" : 58,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "off-neck",
      "start_offset" : 67,
      "end_offset" : 75,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "lantern",
      "start_offset" : 76,
      "end_offset" : 83,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "long",
      "start_offset" : 84,
      "end_offset" : 88,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "sleeve",
      "start_offset" : 89,
      "end_offset" : 95,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "slim",
      "start_offset" : 96,
      "end_offset" : 100,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "women",
      "start_offset" : 101,
      "end_offset" : 106,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "Women",
      "start_offset" : 101,
      "end_offset" : 106,
      "type" : "SYNONYM",
      "position" : 17
    },
    {
      "token" : "girl",
      "start_offset" : 101,
      "end_offset" : 106,
      "type" : "SYNONYM",
      "position" : 17
    },
    {
      "token" : "girls",
      "start_offset" : 101,
      "end_offset" : 106,
      "type" : "SYNONYM",
      "position" : 17
    },
    {
      "token" : "short",
      "start_offset" : 107,
      "end_offset" : 112,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "shirt",
      "start_offset" : 113,
      "end_offset" : 118,
      "type" : "word",
      "position" : 19
    },
    {
      "token" : "shirts",
      "start_offset" : 113,
      "end_offset" : 118,
      "type" : "SYNONYM",
      "position" : 19
    }
  ]
}

可以看到，shirt、women两个字符串经过分析器后被分词为了shirt, shirts以及 women, Women, girl, girls两组分词，且都做了相应标识。

参考文章

同义词搜索原理部分参考

https://blog.csdn.net/woshixubo123/article/details/121774972

以及

https://blog.csdn.net/woshixubo123/article/details/121898514

两篇文章

其他均来自于官网或者自己举的例子

版权声明：
作者：玉兰
链接：https://www.techfm.club/p/50491.html
来源：TechFM
文章版权归作者所有，未经允许请勿转载。

THE END

GitHub

二维码

朝气蓬勃岁月

< <上一篇

深夜宜瞎想

下一篇>>

搜索内容

ES近义词匹配

ES近义词匹配

近义词表格式

如何使用近义词表进行查询

建立索引

参数解释

使用案例

构建索引

在相应路径中存入近义词文件

存入测试数据

简单应用

简单尝试一下近义词库查询

删除数据

结论

动态更新近义词文件

插件地址

插件使用方法

同义词查询原理

分词

同义词过滤器

我们使用官方的`whitespace`分析器来看一下分词情况：

我们再来试试自定义的分析器

参考文章

取消回复

共有 0 条评论

Ads

ES近义词匹配

ES近义词匹配

近义词表格式

如何使用近义词表进行查询

建立索引

参数解释

使用案例

构建索引

在相应路径中存入近义词文件

存入测试数据

简单应用

简单尝试一下近义词库查询

删除数据

结论

动态更新近义词文件

插件地址

插件使用方法

同义词查询原理

分词

同义词过滤器

我们使用官方的whitespace分析器来看一下分词情况：

我们再来试试自定义的分析器

参考文章

取消回复

共有 0 条评论

Ads

我们使用官方的`whitespace`分析器来看一下分词情况：