Has anyone successfully used Tokyo Cabinet / Tokyo Tyrant with large datasets? I am trying to upload a subgraph of the Wikipedia dataset. After hitting about 30 million records, I see an exponential slowdown. This occurs with both the HDB and BDB databases. For the HDB case I adjusted bnum to 2-4x the expected number of records, with only a slight speedup. I also set xmsiz to 1GB or so, but ultimately I still hit a wall.
It seems that Tokyo Tyrant is basically an in-memory database, and once you exceed the xmsiz or your RAM you get a barely usable database. Has anyone else encountered this problem before? Were you able to solve it?
I think I may have cracked this one, and I haven't seen this solution anywhere else. On Linux, there are generally two reasons that Tokyo starts to slow down. Let's go through the usual culprits. First, your bnum may be set too low: you want it to be at least half the number of items in the hash (preferably more). Second, you want to set your xmsiz close to the size of the bucket array. To get the size of the bucket array, just create an empty db with the correct bnum and Tokyo will initialize the file to the appropriate size. (For example, bnum=200000000 is approx 1.5GB for an empty db.)
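To make the sizing arithmetic concrete, here is a rough sketch in Python. The function name and the 8-bytes-per-bucket figure are my assumptions, not something Tokyo Cabinet exposes: the per-bucket size is only inferred from the number quoted above (bnum=200000000 giving about 1.5GB), so treat the result as an estimate and check it against a real empty file.

```python
# Rough sizing helper for a Tokyo Cabinet hash database.
# ASSUMPTION: 8 bytes per bucket, inferred from the figure quoted above
# (bnum=200000000 -> roughly 1.5GB). Verify against a real empty db file.

def suggest_tuning(expected_records, bnum_factor=2, bytes_per_bucket=8):
    """Return (bnum, xmsiz_bytes) for the expected number of records."""
    bnum = int(expected_records * bnum_factor)
    xmsiz = bnum * bytes_per_bucket  # size of the bucket array, in bytes
    return bnum, xmsiz


if __name__ == "__main__":
    bnum, xmsiz = suggest_tuning(100_000_000)
    print(f"bnum={bnum} xmsiz={xmsiz} (~{xmsiz / 2**30:.2f} GiB)")
```

If the size of the empty file you create differs from this estimate, trust the file.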
But now, you'll notice that it still slows down, albeit a bit farther along. We found that the trick was to turn off journalling in the filesystem -- for some reason the journalling (on ext3) spikes as your hash file size grows beyond 2-3GB. (The way we realized this was spikes in I/O that did not correspond to changes of the file on disk, alongside CPU bursts from the kjournald daemon.)
For Linux, just unmount your ext3 partition and remount it as ext2. Build your db, and remount as ext3. With journalling disabled we could build databases with 180M keys without a problem.
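Before kicking off a long build it may be worth confirming that the partition really is mounted as ext2. The following is a minimal sketch, assuming a Linux box where /proc/mounts is readable; the helper name and the /data/tokyo/casket.tch path are made up for illustration.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on ties the later entry wins.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        # Hypothetical location of the Tokyo Cabinet file.
        print(filesystem_type("/data/tokyo/casket.tch", f.read()))
```

If this prints ext3 rather than ext2, the journal is still in play.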
Tokyo scales wonderfully!! But you have to set your bnum and xmsiz appropriately. bnum should be roughly 0.5 to 4 times the number of records you are planning to store. xmsiz should match the size of the bucket array that your bnum produces. Also set opts=l if you are planning to store more than 2GB.
See Greg Fodor's post above about getting the value for size of xmsiz. Be careful to note that when setting xmsiz the value is in bytes.
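As a sketch of how these options fit together, the snippet below assembles a database name with '#'-separated tuning options, which is the form ttserver accepts as its database argument. The helper name, the file path and the example numbers are placeholders of mine, not recommended values.

```python
import shlex


def tyrant_db_spec(path, bnum, xmsiz_bytes, large=True):
    """Build a database name with '#'-separated tuning options."""
    parts = [path, f"bnum={bnum}", f"xmsiz={xmsiz_bytes}"]
    if large:
        parts.append("opts=l")  # large option, for files over 2GB
    return "#".join(parts)


if __name__ == "__main__":
    # Hypothetical path and sizes; xmsiz is given in bytes.
    spec = tyrant_db_spec("/data/tokyo/casket.tch", 200_000_000, 1_600_000_000)
    # Quote the spec so the shell does not treat '#' as a comment.
    print("ttserver", shlex.quote(spec))
```

Quoting matters here because an unquoted '#' can be read by the shell as the start of a comment.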
Finally, if you are using a disk-based hash it is very, very, VERY important to turn off journaling on the filesystem that the Tokyo data lives on. This is true for Linux, Mac OS X and probably Windows, though I have not tested it there yet.
If journaling is turned on you will see severe drops in performance as you approach 30+ million rows. With journaling turned off and the other options set appropriately, Tokyo is a great tool.
Tokyo Cabinet's key-value store is really good. I think people call it slow because they use Tokyo Cabinet's table-like store.
If you want to store document data, use MongoDB or some other NoSQL engine.
I am starting to work on a solution to add sharding to Tokyo Cabinet, called Shardy.
http://github.com/cardmagic/shardy/tree/master