I'm using Neo4j 3.0.1 Community, and I have a few GB of data. The data becomes outdated very quickly (2-3 times per day), so I have to create the new data first and only then delete the old data (so that at any point in time some data is available).
The problem is that Neo4j doesn't reuse the space of deleted nodes/relationships. I'm deleting with `MATCH (n) WHERE condition DETACH DELETE n`.
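To make the pattern concrete, each refresh looks roughly like this (the `Item` label and `version` property are only illustrations of the condition I actually use):

```
// 1. Bulk-import the new snapshot, tagging every node with the new version.
CREATE (:Item {version: 13, name: 'example'});

// 2. Once the new data is in place, drop everything from older versions.
MATCH (n:Item) WHERE n.version < 13
DETACH DELETE n;
```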
I can see that the nodes are being deleted (their count stays constant at ~30M), but the store size keeps growing: after 12 updates it is almost exactly 12x bigger than it should be.
I found the earlier post Neo4J database size / shrinking about store-utils, but I would like to find a better solution.
I also found an old question (from version 1.x), neostore.* file size after deleting millions node, but the answer there simply doesn't work, at least in my case.
There is some advice to delete all database files and simply create a new database, but that would require stopping the service, which shouldn't happen.
I also found information saying that in order to reuse the space you need to restart the DB first; I tried that as well and it didn't work.
Is there a way to effectively free/reuse the space of deleted nodes/relationships? Am I missing some configuration, or is this available only in the Enterprise edition?
EDIT:
I finally had some time to test this: I ran a scenario in which the data was refreshed several times, restarting the server a few times as well. The tests were made on Neo4j 3.0.0 on Windows 10. The results are (not yet allowed to embed images):
Each column shows the storage size after successive updates, a blue line marks a Neo4j server restart, and the last column (separated by a brown line) is the size after running store-utils.
As described earlier, the size grows quite fast and, contrary to the documentation, a restart doesn't help. Only store-utils helps (it compacts all store files except neostore.nodestore.db), but integrating store-utils into a production setup would be a hard and messy solution.
Can anyone give me a hint as to why the storage keeps growing?
Starting with Neo4j 3.0.4, Enterprise Edition does support reuse for node ids and relationship ids without the need to restart the instance. This works both for single instance and HA deployments.
To enable that feature you need to set the following in `neo4j.conf`:
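If I remember the option name correctly (double-check it against the documentation for your exact version), it is the id-reuse override:

```
# Assumed setting name - verify against the Neo4j 3.0.4+ Enterprise docs.
dbms.ids.reuse.types.override=NODE,RELATIONSHIP
```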
You can restart your server after you have created your new data, so that the next time you create data it will reuse the blocks you freed the previous time. This leaves you with only 2x the volume (if you have to keep the old data around until the new data is in place).
You should still use store-utils to compact your store for the first time.
After heavy testing I finally found the main source of the problem: it turns out I was doing a hard shutdown of the Neo4j server, which it cannot handle, and as a result it struggled with deleting nodes/relationships and reusing the space they occupied.
Let's start from the beginning. I was running Neo4j under Docker (with docker-compose). My scenario, very briefly: every few hours I start a process that adds a few GB of nodes, and once it is done I remove the nodes from the previous run. Sometimes I have to update a Neo4j plugin or do some other job that requires restarting the server, and that is where the problem starts. I restart it with docker-compose, which by default never waits for Neo4j to quit gracefully (I have to customize that now that I know about the problem); instead it kills the container immediately. In debug.log there is no trace of the server stopping.

Neo4j doesn't handle this and ends up doing something very strange. When I start the server again, it rolls back the node-id counter, the relationship-id counter and others, and it doesn't free the space behind deleted nodes/relationships; at least it never rolls back the nodes and relationships themselves. Of course my delete operations were successfully committed in a transaction, so this is not a case of reverting uncommitted changes. After a few restarts and imports the database size is multiplied by the number of imports, and the node counters are heavily overstated.
I realize it's mostly my fault for killing Neo4j, but the behaviour is still not ideal in my opinion.
There is also another related issue. I ran an almost 24-hour test without any restarts, during which I repeated my scenario over 20 times. I was very surprised by the growing duration of each import (leaving the growing database size aside):
| import no. | creating nodes time | deleting nodes time |
| --- | --- | --- |
| 1 | 20 minutes | 0 minutes (nothing to delete yet) |
| 2 | 20 minutes | 8 minutes |
| 3 | 20 minutes | 12 minutes |
| ... | ... | ... |
| ~20 | 20 minutes | over 80 minutes |
As you can see, nodes/relationships are very probably not removed immediately (maybe they are actually cleaned up during a stop/start), so my delete script has to do a lot of extra work.
This is, in simplified form, the code I use for removing old data (the label and property names below are placeholders):
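```
// "Item" and "version" are placeholders; each import tags its nodes with a
// version number, and older versions are removed in batches to keep
// transactions small.
MATCH (n:Item) WHERE n.version < {currentVersion}
WITH n LIMIT 50000
DETACH DELETE n
RETURN count(*)
// run repeatedly until the returned count is 0
```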
I can probably solve the issue of killing Neo4j by making sure it gets a chance to stop gracefully when I restart the Docker container, but I'm not sure there is a way to handle the growing size and the growing deletion time (other than restarting Neo4j after every update).
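For the graceful-shutdown part, something along these lines in the compose file should be enough (service name, image tag and volumes are placeholders for my actual setup):

```
# Sketch of a docker-compose.yml: give Neo4j time to shut down cleanly before
# docker-compose falls back to SIGKILL (the default grace period is only 10s).
version: '3'
services:
  neo4j:
    image: neo4j:3.0
    stop_grace_period: 5m
    volumes:
      - ./data:/data
```

Alternatively, `docker-compose stop -t 300` passes a longer timeout on the command line.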
I'm describing the issue so that maybe it will help somebody someday, or help the Neo4j team improve their product, because it is the most enjoyable DB I've ever worked with, despite the issues I have to deal with.