I have a reasonable number of nodes (roughly 60,000)
(:Document {title:"A title"})
Given a title, I want to find the matching node, if it exists. The problem is that the title I'm given is not consistent. That is, sometimes the beginning of a new word is Capital, sometimes it is all lower case. Sometimes Key-Words are combined with Kebab case, sometimes they are written normally as key words.
To compensate for that I use apoc and the Levenshtein distance between the given title and every node and only accept a node as a match if it is below some threshold:
MATCH (a:Document)
WHERE apoc.text.distance(a.title, "A title") < 10
RETURN a
This does not scale well. Currently a single lookup takes about 700 ms which is too slow, given that this will likely grow to somewhere around 150,000 nodes.
I was thinking of storing / caching the occurrence of alternative titles in an alias:[...]
property of the node and build an index over all aliases, but I don't know if this is possible in Neo4j.
What is the fastest way to "fuzzy find" a title given a large database of nodes?
In Neo4j 3.5 (currently on beta03), there are FTS (Full-Text Search) capabilities.
EDIT : I have written a detailed blog post about FTS in Neo4j : https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html
You can query then your documents using the Lucene Classic Query Parser Syntax.
Create the index :
Import some documents :
Query documents with title containing "heavy toll"
Query for same title with a typo :
Notice the escaping of the quotes => \" , the string passed to the underlying parser should contain the quotes in order to perform a phrase query instead of a boolean query.
Also the
tidle
next to the terms indicate to perform a fuzzy search using the Damarau-Levenshtein algo.Indexing as noted in the answer by Christophe Willemsen is definitely needed for speeding up the search but I would also like to point a another historic function that might be a better fit with your "fuzzy find":
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.