I have an Excel sheet with a million rows. Each row has 100 columns.
Each row represents an instance of a class with 100 attributes, and the column values are the values of these attributes.
What data structure is best suited for storing these million instances?
Thanks
It really depends on how you need to access this data and what you want to optimize for – like, space vs. speed.
- If you want to optimize for space, well, you could just serialize and compress the data, but that would likely be useless if you need to read/manipulate the data.
- If you access by index, the simplest thing is an array of arrays.
- If you instead use an array of objects, where each object holds your 100 attributes, you get a cleaner way to structure your code (encapsulation!); see the sketch after this list.
- If you need to query/search the data, it really depends on the kind of queries. You may want to have a look at BST data structures...
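A minimal sketch of the two layouts, assuming purely for illustration that the attributes are doubles and that a hypothetical `Record` class names them `attribute0`, `attribute1`, and so on:

```java
public class LayoutSketch {

    // Option 1: array of arrays -- plain indexed access, table[row][column].
    static double[][] arrayOfArrays(int rows, int cols) {
        return new double[rows][cols];
    }

    // Option 2: array of objects -- each row becomes an instance with named fields.
    static class Record {
        double attribute0;   // ... attribute1 through attribute99 in the real class
        double attribute1;
    }

    public static void main(String[] args) {
        int rows = 1_000;    // kept small here; the real data set would use 1_000_000
        int cols = 100;

        double[][] table = arrayOfArrays(rows, cols);
        table[42][7] = 3.14;                          // access by row/column index

        Record[] records = new Record[rows];
        records[42] = new Record();
        records[42].attribute0 = 3.14;                // access by named attribute

        System.out.println(table[42][7] + " " + records[42].attribute0);
    }
}
```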
One million rows with 100 values each, where each value uses 8 bytes of memory, is only 800 MB, which will easily fit into the memory of most PCs, especially 64-bit ones. Try to make the type of each column as compact as possible.
A more efficient way of storing the data is by column, i.e. you have one array per column, each with a primitive data type. I suspect you don't even need to do this.
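A minimal sketch of the column-oriented layout, assuming (purely for illustration) that the columns include an `int` id, a `double` price and a `short` quantity; picking the narrowest primitive per column keeps the footprint small:

```java
public class ColumnStore {
    // One primitive array per column instead of one object per row.
    final int[] id;
    final double[] price;
    final short[] quantity;
    // ... one array per remaining column

    ColumnStore(int rows) {
        id = new int[rows];
        price = new double[rows];
        quantity = new short[rows];
    }

    void set(int row, int id, double price, short quantity) {
        this.id[row] = id;
        this.price[row] = price;
        this.quantity[row] = quantity;
    }

    double priceOf(int row) {
        return price[row];   // reading one column only touches that array
    }

    public static void main(String[] args) {
        ColumnStore store = new ColumnStore(1_000_000);
        store.set(42, 42, 3.14, (short) 7);
        System.out.println(store.priceOf(42));
    }
}
```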
If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. hundreds of GB off-heap with 1 GB in heap).
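A minimal sketch of the memory-mapped approach, assuming a fixed layout of 100 doubles per row and a hypothetical file name `table.bin`; a single `MappedByteBuffer` is limited to 2 GB, so the 800 MB table above fits in one mapping:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OffHeapTable {
    static final int ROWS = 1_000_000, COLS = 100;

    public static void main(String[] args) throws IOException {
        long size = (long) ROWS * COLS * Double.BYTES;   // 800 MB, stored off-heap
        try (FileChannel ch = FileChannel.open(Path.of("table.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);

            // write row 42, column 7
            buf.putDouble((42 * COLS + 7) * Double.BYTES, 3.14);

            // read it back -- the OS pages the data in and out as needed
            System.out.println(buf.getDouble((42 * COLS + 7) * Double.BYTES));
        }
    }
}
```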
If you want to store all the data in memory, you can use one of the implementations of Guava's Table: typically ArrayTable for dense tables, or HashBasedTable if most cells are expected to be empty. Otherwise, a database (probably with some cache system like Ehcache or Terracotta) would be a better choice.
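A minimal sketch with ArrayTable; the column names and the small row count here are made-up placeholders for the real 100 columns and million rows:

```java
import com.google.common.collect.ArrayTable;
import com.google.common.collect.Table;

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class GuavaTableSketch {
    public static void main(String[] args) {
        // Row keys are the row indices; column keys are the attribute names.
        List<Integer> rowKeys = IntStream.range(0, 1_000).boxed().collect(Collectors.toList());
        List<String> columnKeys = List.of("price", "quantity", "weight"); // ... 100 columns in practice

        // ArrayTable allocates the full dense grid up front.
        Table<Integer, String, Double> table = ArrayTable.create(rowKeys, columnKeys);

        table.put(42, "price", 3.14);
        System.out.println(table.get(42, "price"));   // 3.14
        System.out.println(table.row(42));            // all values of row 42
    }
}
```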
Your best option would be to store them in a table in an actual database, like Postgres etc. These are optimised for exactly this kind of workload!
For that kind of data I would prefer a MySQL database, because it is fast and can handle a data set of that size.
The best option would be a database that can store a large amount of data and provide fast access, such as Oracle, MS SQL Server, MySQL, or any other database that is fast and scales to large data sets.
If you really have a million rows or more with 100 values each, I doubt it will all fit into your memory... or is there a special reason for keeping it in memory, for example poor performance with a database?
Since you want to have random access, I'd use a persistence provider like Hibernate and a database of your choice (for example MySQL).
But be aware that the way you use the persistence provider has a great impact on performance. For example, you should use batch inserts (note that with Hibernate, JDBC batching is disabled for IDENTITY-generated ids, so prefer a sequence-based or assigned id).
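A minimal sketch of the batch-insert pattern with Hibernate, assuming a hypothetical `Row` entity and a batch size of 50; `hibernate.jdbc.batch_size` would also need to be set to a matching value in your Hibernate configuration:

```java
import jakarta.persistence.Entity;          // javax.persistence on older Hibernate versions
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

@Entity
class Row {
    @Id @GeneratedValue(strategy = GenerationType.SEQUENCE)  // SEQUENCE keeps JDBC batching enabled
    Long id;
    double attribute0;   // ... up to attribute99 in the real entity
}

public class BatchInsert {
    static final int BATCH_SIZE = 50;   // should match hibernate.jdbc.batch_size

    static void insertAll(SessionFactory sessionFactory, Iterable<Row> rows) {
        try (Session session = sessionFactory.openSession()) {
            Transaction tx = session.beginTransaction();
            int i = 0;
            for (Row row : rows) {
                session.persist(row);
                if (++i % BATCH_SIZE == 0) {
                    session.flush();   // push the current batch to the database
                    session.clear();   // detach persisted rows to keep the heap small
                }
            }
            tx.commit();
        }
    }
}
```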