Neo4j Fast way to match fuzzy text property

2020-06-04 05:19发布

I have a reasonable number of nodes (roughly 60,000)

(:Document {title:"A title"})

Given a title, I want to find the matching node, if it exists. The problem is that the title I'm given is not consistent. That is, sometimes the beginning of a new word is Capital, sometimes it is all lower case. Sometimes Key-Words are combined with Kebab case, sometimes they are written normally as key words.

To compensate for that I use apoc and the Levenshtein distance between the given title and every node and only accept a node as a match if it is below some threshold:

MATCH (a:Document)
WHERE apoc.text.distance(a.title, "A title") < 10
RETURN a

This does not scale well. Currently a single lookup takes about 700 ms which is too slow, given that this will likely grow to somewhere around 150,000 nodes.

I was thinking of storing / caching the occurrence of alternative titles in an alias:[...] property of the node and build an index over all aliases, but I don't know if this is possible in Neo4j.

What is the fastest way to "fuzzy find" a title given a large database of nodes?

2条回答
老娘就宠你
2楼-- · 2020-06-04 06:17

In Neo4j 3.5 (currently on beta03), there are FTS (Full-Text Search) capabilities.

EDIT : I have written a detailed blog post about FTS in Neo4j : https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html

You can query then your documents using the Lucene Classic Query Parser Syntax.

Create the index :

CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])

Import some documents :

LOAD CSV WITH HEADERS FROM "file:///docs.csv" AS row
CREATE (n:Document) SET n = row

Query documents with title containing "heavy toll"

CALL db.index.fulltext.queryNodes('documents', 'title: "heavy toll"')
YIELD node, score
RETURN node.title, score

╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node.title"                                                          │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│3.7325966358184814│
└──────────────────────────────────────────────────────────────────────┴──────────────────┘

Query for same title with a typo :

CALL db.index.fulltext.queryNodes('documents', 'title: \\"heavy~ tall~\\"')
YIELD node, score
RETURN node.title, score

Notice the escaping of the quotes => \" , the string passed to the underlying parser should contain the quotes in order to perform a phrase query instead of a boolean query.

Also the tidle next to the terms indicate to perform a fuzzy search using the Damarau-Levenshtein algo.

╒══════════════════════════════════════════════════════════════════════╤═════════════════════╕
│"node.title"                                                          │"score"              │
╞══════════════════════════════════════════════════════════════════════╪═════════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│0.868073046207428    │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Prisons Run by C.E.O.s? Privatization Under Trump Could Carry a Heavy│0.4014900326728821   │
│ Price - The New York Times"                                          │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"‘All Talk,’ ‘No Action,’ Says Trump in Twitter Attack on Civil Rights│0.28181418776512146  │
│ Icon - The New York Times"                                           │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Immigrants Head to Washington to Rally While Obama Is Still There - T│0.24634429812431335  │
│he New York Times"                                                    │                     │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
查看更多
够拽才男人
3楼-- · 2020-06-04 06:17

Indexing as noted in the answer by Christophe Willemsen is definitely needed for speeding up the search but I would also like to point a another historic function that might be a better fit with your "fuzzy find":

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.

查看更多
登录 后发表回答