What scalability problems have you encountered using NoSQL data stores?

Posted 2019-01-12 12:57

NoSQL refers to non-relational data stores that break with the history of relational databases and ACID guarantees. Popular open source NoSQL data stores include:

  • Cassandra (tabular, written in Java, used by Cisco, WebEx, Digg, Facebook, IBM, Mahalo, Rackspace, Reddit and Twitter)
  • CouchDB (document, written in Erlang, used by BBC and Engine Yard)
  • Dynomite (key-value, written in Erlang, used by Powerset)
  • HBase (tabular, written in Java, used by Bing)
  • Hypertable (tabular, written in C++, used by Baidu)
  • Kai (key-value, written in Erlang)
  • MemcacheDB (key-value, written in C, used by Reddit)
  • MongoDB (document, written in C++, used by Electronic Arts, Github, NY Times and Sourceforge)
  • Neo4j (graph, written in Java, used by some Swedish universities)
  • Project Voldemort (key-value, written in Java, used by LinkedIn)
  • Redis (key-value, written in C, used by Craigslist, Engine Yard and Github)
  • Riak (key-value, written in Erlang, used by Comcast and Mochi Media)
  • Ringo (key-value, written in Erlang, used by Nokia)
  • Scalaris (key-value, written in Erlang, used by OnScale)
  • Terrastore (document, written in Java)
  • ThruDB (document, written in C++, used by JunkDepot.com)
  • Tokyo Cabinet/Tokyo Tyrant (key-value, written in C, used by Mixi.jp (Japanese social networking site))

I'd like to know about specific problems you - the SO reader - have solved using data stores and what NoSQL data store you used.

Questions:

  • What scalability problems have you used NoSQL data stores to solve?
  • What NoSQL data store did you use?
  • What database did you use prior to switching to a NoSQL data store?

I'm looking for first-hand experiences, so please do not answer unless you have that.

14 Answers
干净又极端 · 2019-01-12 13:55

We moved part of our data from MySQL to MongoDB, not so much for scalability but because it is a better fit for files and non-tabular data.

In production we currently store:

  • 25 thousand files (60GB)
  • 130 million other "documents" (350GB)

with a daily turnover of around 10GB.

The database is deployed in a "paired" configuration on two nodes (6x450GB SAS RAID 10), with Apache/WSGI/Python clients using the MongoDB Python API (PyMongo). The disk setup is probably overkill, but it's what we use for MySQL.
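
For flavor, here is a minimal sketch of that kind of client code (assuming PyMongo 3+; the database, collection and field names are made up, and GridFS is just the usual way to store files in MongoDB, not necessarily our exact setup):

    from pymongo import MongoClient
    import gridfs

    # Connect to the deployment (host/port here are placeholders).
    client = MongoClient("mongodb://localhost:27017")
    db = client["appdata"]  # hypothetical database name

    # Non-tabular "documents" go straight into a collection.
    db.documents.insert_one({"source": "import", "payload": {"k": "v"}})

    # Large files are usually stored via GridFS, which chunks a blob
    # across documents so it isn't limited by the per-document size cap.
    fs = gridfs.GridFS(db)
    file_id = fs.put(b"...file bytes...", filename="report.pdf")

    print(db.documents.count_documents({}), file_id)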

Apart from some issues with PyMongo thread pools and the blocking nature of the MongoDB server, it has been a good experience.

萌系小妹纸 · 2019-01-12 13:55

I switched from MySQL (InnoDB) to Cassandra for an M2M system that basically stores time series from sensors for each device. Each data point is indexed by (device_id, date) and (device_id, type_of_sensor, date). The MySQL version contained 20 million rows.
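
As a rough sketch of how such a model maps onto Cassandra, here is roughly how it would look in today's CQL with the DataStax Python driver (a sketch only, not our actual schema; keyspace, table and column names are made up):

    from cassandra.cluster import Cluster

    # Connect to a local node (contact points are placeholders).
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # A keyspace for the sensor data; replication settings are illustrative.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS telemetry
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # One partition per (device, sensor type), rows clustered by time,
    # mirroring the (device_id, type_of_sensor, date) index above.
    session.execute("""
        CREATE TABLE IF NOT EXISTS telemetry.readings (
            device_id   text,
            sensor_type text,
            ts          timestamp,
            value       double,
            PRIMARY KEY ((device_id, sensor_type), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)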

MySQL:

  • Set up in master-master replication. A few problems appeared around loss of synchronization; it was stressful and, especially in the beginning, could take hours to fix.
  • Insertion time wasn't a problem, but querying required more and more memory as the data grew. The problem is that the indexes are treated as a whole, while in my case only a very thin slice of them actually needed to be in memory (only a few percent of the devices were frequently monitored, and mostly only their most recent data).
  • It was hard to back up. Rsync can't do fast backups of big InnoDB table files.
  • It quickly became clear that altering the schema of the heavy tables wasn't feasible, because it took way too much time (hours).
  • Importing data took hours (even with indexing deferred to the end). The best rescue plan was to always keep a few copies of the database (data files + logs).
  • Moving from one hosting company to another was a really big deal; replication had to be handled very carefully.

Cassandra:

  • Even easier to install than MySQL.
  • Requires a lot of RAM. A 2GB instance couldn't run it in the first versions; it can now work on a 1GB instance, but that's not ideal (way too many data flushes). Giving it 8GB was enough in our case.
  • Once you understand how to organize your data, storing it is easy. Querying is a little more complex, but once you get your head around it, it is really fast; you can hardly make a mistake unless you really want to (there's a small query sketch after this list).
  • If the previous step was done right, it is and stays super fast.
  • It almost seems like the data is organized to be backed up: all new data is added as new files. Personally (though it's not good practice), I flush data every night and before every shutdown (usually for an upgrade) so that restoring takes less time, because there are fewer commit logs to read. It doesn't create many files, as they get compacted.
  • Importing data is fast as hell, and the more hosts you have, the faster it goes. Exporting and importing gigabytes of data isn't a problem anymore.
  • Not having a schema is a very interesting thing, because you can make your data evolve to follow your needs. That might mean having different versions of your data at the same time in the same column family.
  • Adding a host was easy (though not fast), but I haven't done it on a multi-datacenter setup.
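
Continuing the CQL sketch from above, a typical time-slice read stays within one partition, which is why it stays fast (device and sensor names are again made up):

    # Fetch the latest readings for one device/sensor pair; the partition
    # key routes the query to a single replica set, and the clustering
    # order makes "most recent N" a sequential read.
    rows = session.execute(
        "SELECT ts, value FROM telemetry.readings "
        "WHERE device_id = %s AND sensor_type = %s LIMIT 100",
        ("device-42", "temperature"))
    for row in rows:
        print(row.ts, row.value)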

Note: I have also used Elasticsearch (document-oriented, based on Lucene), and I think it should be considered a NoSQL database. It is distributed, reliable and often fast (though some complex queries can perform quite badly).
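
A minimal sketch of what that looks like from Python, assuming the official elasticsearch client (8.x-style signatures) and a made-up index:

    from elasticsearch import Elasticsearch

    # Connect to a local node (URL is a placeholder).
    es = Elasticsearch("http://localhost:9200")

    # Index a sensor reading as a JSON document; no schema needed up front.
    es.index(index="readings", document={
        "device_id": "device-42",
        "sensor_type": "temperature",
        "value": 21.5,
    })

    # Structured (or full-text) queries over the same documents.
    hits = es.search(index="readings",
                     query={"term": {"device_id": "device-42"}})
    print(hits["hits"]["total"])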
