I'm still evaluating Neo4j vs. OrientDB. Most importantly, I need Lucene as the full-text index engine. So I created the same schema with the same data (300 million rows) in both databases. I'm also experienced with querying different things in both systems. I used the Standard Analyzer on both sides. The OrientDB test query results are all fine and really good in terms of reliability and speed. The speed of Neo4j is also OK, but the results are rather poor in most cases. So let's come to the issues I have with Neo4j Lucene indexing. For each one I give an example of how the query looks in OrientDB and which result set you should get from it.
So in these examples, there are Applns that have title(s). Titles are indexed with Lucene in both databases. Applns also have an ID, just to demonstrate the ordering. After each query I have some questions about it. It would be great to get some feedback or even answers to them.
Query #0: One word query with no order
Well, this query is very simple. It tests how the databases behave when there is just a single word and nothing else. As you can see, the Neo4j results are much longer than the ones from OrientDB. OrientDB uses TFIDF to keep the results short and more relevant to the actual search. As you can see, the first result in OrientDB is the title SOLAR. That one is completely missing in Neo4j.
In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID LIMIT 10
SOLAR RADIATION SHIELDING PARTICULATE AND SOLAR RADIATION SHIELDING RESIN MATERIAL DISPERSED WITH ... 38321319
Solar module for cooling solar cells on the underside of a solar panel has air inlet and outlet openings ... 12944121
Solar construction component for solar thermal assemblies, solar thermal assembly, method for operating a solar... 324146113
...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" LIMIT 10
SOLAR 24900187
Solar unit and solar apparatus 1876343
Solar module with solar concentrator 13496706
...
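For intuition about why a TFIDF-style ranking would put the one-word title first, here is a minimal Python sketch. It assumes a simplified single-term score (square root of the term frequency, divided by the square root of the number of tokens), which is only loosely modelled on Lucene's classic scoring and is not the exact formula either database uses. The tokenizer and the small stop-word set are hypothetical stand-ins for the Standard Analyzer.

```python
import math
import re

# Hypothetical, tiny stand-in for the Standard Analyzer's stop-word list.
STOP_WORDS = {"a", "an", "and", "for", "of", "on", "the", "with"}


def tokenize(title):
    """Lowercase the title, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def simple_score(title, term):
    """Simplified single-term score: sqrt(tf) / sqrt(number of tokens).

    The idf factor is left out because it is the same for every title
    when the query consists of one term only.
    """
    tokens = tokenize(title)
    if not tokens:
        return 0.0
    tf = tokens.count(term.lower())
    return math.sqrt(tf) / math.sqrt(len(tokens))


titles = [
    "Solar construction component for solar thermal assemblies",
    "Solar unit and solar apparatus",
    "SOLAR",
]

# Highest score first: short titles dominated by the search term win.
ranked = sorted(titles, key=lambda t: simple_score(t, "solar"), reverse=True)
for title in ranked:
    print(round(simple_score(title, "solar"), 3), title)
```

Under these assumptions the bare title SOLAR scores highest, because every token it contains is the search term, while long titles are penalized by the length factor even if they mention the term more than once.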
Questions:
- Why is Neo4j not using TFIDF or what do they use instead?
- Is Neo4j able to use some ordering of the keyword match?
- Is it possible to change TFIDF to something else in OrientDB?
Query #1: One word query with order by ID
Neo4j orders by ID before using TFIDF. As known from Query #0, Neo4j is not using TFIDF, so it is basically just working from the raw results of the Lucene query. OrientDB, by contrast, still searches by good TFIDF scores first and then orders.
In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
Stackable flat-roof/floor frame for solar panels 318
Method for producing contact for solar cells 636
Solar cell and fabrication method thereof 1217
...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" ORDER BY ID ASC LIMIT 10
Solar unit and solar apparatus 1876343
Solar module with solar concentrator 13496706
SOLAR TRACKER FOR SOLAR COLLECTOR 16543688
...
Questions:
- What would a search in OrientDB look like that is ordered by ID and still matches the best TFIDF results?
- Is there a way in Neo4j to order the Lucene match before ordering by the ID?
Query #2: One word with using a star search
Star search had no influence on the Neo4j results. OrientDB results changed in a good way.
In Neo4j: START n=node:titles('title:solar*') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
Stackable flat-roof/floor frame for solar panels 318
Method for producing contact for solar cells 636
Solar cell and fabrication method thereof 1217
...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER BY ID ASC LIMIT 10
High performance solar methane generator 8354701
All-plastic honeycomb solar water-heater 8355379
Plate type solar energy heat collector plate core and its manufacturing method 8356173
...
Questions:
- Does Neo4j ignore star searches?
Query #3: Searching for 2 words separated by a space
The strange thing here is that you need to change 'title:solar panel' to the query shown here. Otherwise you just get errors. OrientDB seems good so far.
In Neo4j: START n=node:titles(title="solar panel") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
- Returned 0 rows in 817 ms
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar panel" ORDER BY ID ASC LIMIT 10
SOLAR PANEL 1584567
SOLAR PANEL 1616547
SOLAR PANEL 2078382
SOLAR PANEL 2078383
Solar panel 2178466
...
Questions:
- Why does Neo4j need a special query here just to avoid throwing an error?
- Why is the query failing and not returning anything? I know that Neo4j is searching for lowercase letters here, so it's case sensitive. But why is it like this? I mean, I use the default analyzer and the Neo4j Lucene doc says it's true, which means to_lower_letter.
Query #4: Now searching for the same query in capital letters
The same issue as in #3. Neo4j only returns the results that are written in capital letters. The OrientDB results look fine again.
In Neo4j: START n=node:titles(title="SOLAR PANEL") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
SOLAR PANEL 348800
SOLAR PANEL 420683
SOLAR PANEL 1393804
SOLAR PANEL 1584567
SOLAR PANEL 1616547
...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL" ORDER BY ID ASC LIMIT 10
SOLAR PANEL 1584567
SOLAR PANEL 1616547
SOLAR PANEL 2078382
SOLAR PANEL 2078383
Solar panel 2178466
...
Questions:
- Same question as in #3: how to search with to_lower_letter?
Query #5: Combining two words and using the star search
Here I want to combine a word search with a star search. But with the equals search I'm not able to find matches, because it treats the star as a literal character in the title. And I'm not able to write 'title:SOLAR PANEL*' either. That's also forbidden. In OrientDB everything is fine.
In Neo4j: START n=node:titles(title="SOLAR PANEL*") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
- Returned 0 rows in 895 ms
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL*" ORDER BY ID ASC LIMIT 10
SOLAR PANELS 1405717
SOLAR PANEL 1584567
SOLAR PANEL 1616547
SOLAR PANEL 2705081
Solar Panel 2766555
...
Questions:
- How can you combine some words with the star search in Neo4j?
Query #6: Counting query results
The last thing I really need is a fast lookup of how many results there are overall. Here Neo4j is finding a result way faster but always finding fewer matches than OrientDB. For Solar the counts are fairly close to each other. But in another test they were not that close.
In Neo4j: START n=node:titles("title:Solar") RETURN count(*)
143211 in 220 sec
In OrientDB: SELECT count(*) title FROM Appln WHERE title LUCENE "Solar" LIMIT -1
148029 in 50 sec
Questions:
- How can these lookup times be improved on both systems?
- Why do both systems find a different number of matches? This also happens with other keywords. Maybe a different indexing engine is used?
Well, that is everything for now. If you need any other query, just tell me and I'll deliver it. I think it's very important to compare the Lucene implementations, because with millions of nodes Lucene has so many advantages. Thanks for any small tip.
Btw: please don't give tips about using Java code instead for the query. I want to use Cypher because the request shall be done in the browser, like in OrientDB. I know that everything here can easily be done with Java code. Thank you.
Well, I want to share what I have found out about my issues so far:
Info about Query #0, #1 and #2:
In OrientDB, ordering before searching is currently slow.
The reason it is that slow is that it does not correspond with Lucene.
Fixing Query #3,#4 and #5:
The query is not correct. The equals form is a direct match and not the fuzzy one. So the title="solar panel" form from Query #3
needs to be replaced by a Lucene query string in which the phrase itself is quoted.
It is really bad that you need to escape things in Cypher. Here the order of the two words is important. But there is another way to say it, with each word as its own term,
which is also really bad: if you imagine you just have a string and want to ask Neo4j for results, you need a parser. But here the order of the words does not matter.
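The remark above about needing a parser can be illustrated with a small Python sketch that turns a plain search string into a Lucene query string, once as a quoted phrase (word order matters) and once as separate required terms (word order does not matter). The function names are hypothetical, the escaping follows the classic Lucene query syntax as far as I know it, and the lowercasing assumes the index stores lowercased terms; none of this is an official Neo4j API.

```python
import re

# Characters and operators with special meaning in the classic Lucene
# query syntax; each one gets a backslash in front of it.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(word):
    """Backslash-escape Lucene special characters in a single word."""
    return _SPECIAL.sub(r"\\\1", word)


def phrase_query(field, text):
    """Build a phrase query such as title:"solar panel" (order matters)."""
    escaped = " ".join(escape_term(w) for w in text.split())
    return f'{field}:"{escaped}"'


def all_words_query(field, text, prefix_last=False):
    """Build title:solar AND title:panel (order does not matter).

    With prefix_last=True a star is appended to the last word, which
    gives a combined word and star search such as title:panel*.
    Words are lowercased on the assumption that the index is lowercased.
    """
    words = [escape_term(w.lower()) for w in text.split()]
    if not words:
        raise ValueError("search text must contain at least one word")
    if prefix_last:
        words[-1] += "*"
    return " AND ".join(f"{field}:{w}" for w in words)


print(phrase_query("title", "solar panel"))
print(all_words_query("title", "SOLAR PANEL", prefix_last=True))
```

The resulting string would still have to be handed to the index query in Cypher. Passing it as a query parameter, where the client supports that, would avoid escaping the inner quotes by hand.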
Fixing Query #6:
OrientDB is currently working on making the counting faster (milliseconds). This is planned for the 2.0 release in a few days.
Neo4j has no plans for this.