NoSql solution to store 20[TB] of data, as vector/

2019-07-03 16:04发布

问题:

I need to build a system to efficiently store & maintain a huge amount (20 [TB]) of data (and be able to access it in 'vector' form). Here are my dimensions:

(1) time (given as an integer of the form YYYYMMDDHHMMSS)

(2) field (a string of any given length, representing a name of a hospital)

(3) instrumentID (an integer representing a uniqueID for the instrument)

I will need a way to be able to store data individually, meaning, something like:

STORE 23789.46 as the data for instrumentID = 5 on field = 'Nhsdg' on time = 20040713113500

Yet, I would need the following query to to run FAST: give me all instruments for field 'X' on timestamp 'Y'.

In order to build these systems, I am given 60 duo-core machines (each with 1GB of RAM, 1.5TB disk)

Any recommendation on a suitable NoSQL soltuion (that would ideally work with python)?

NOTE: the system will first store historical data (which is roughly 20[TB]). Every day I will add just about 200[MB] at most. I just need a solution that would scale and scale. my use case would be just a simple query: give me all instruments for field 'X' on timestamp 'Y'

回答1:

MongoDB scales awesomely and supports many of the indexing features you'd typically find in an RDBMS such as compound key indexes. You can use a compound index on the name and time attributes in your data. Then you can retrieve all instrument readings with a particular name and date range.

[Now in the simple case where you're strictly interested in just that one basic query and nothing else, you can just combine the name and timestamp and call that your key, which would work in any key-value store...]

HBase is another excellent option. You can use a composite row key on name and date.

As others have mentioned, you can definitely use a relational database. MySQL & PostgreSQL can certainly handle the load and table partitioning might be desirable in this scenario as well since your dealing with time ranges. You can use bulk loading (and disabling indexes during loading) to decrease insertion time.



标签: python nosql