In light of this article, I am wondering what people's experiences are with storing massive datasets (say, >10,000,000 objects) in memory using arrays to store data fields, instead of instantiating millions of objects and racking up the memory overhead (say, 12-24 bytes per object, depending on which article you read). Data per property varies from item to item, so I can't use a strict Flyweight pattern, but I would envision something similar.
My idea of this sort of representation is that one has a 'template object'...
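    class Thing : IThing // IThing (not shown) is assumed to declare A, B, C and D
    {
        public double A { get; set; }
        public double B { get; set; }
        public int C { get; set; }
        public string D { get; set; }
    }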
And then a container object with a method of creating an object on request...
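    class ContainerOfThings
    {
        // One array per field; index i across all four arrays describes one Thing.
        double[] ContainerA;
        double[] ContainerB;
        int[] ContainerC;
        string[] ContainerD;

        public ContainerOfThings(int total)
        {
            ContainerA = new double[total];
            ContainerB = new double[total];
            ContainerC = new int[total];
            ContainerD = new string[total];
        }

        public IThing GetThingAtPosition(int position)
        {
            IThing thing = new Thing(); // probably best done as a factory instead
            thing.A = ContainerA[position];
            thing.B = ContainerB[position];
            thing.C = ContainerC[position];
            thing.D = ContainerD[position];
            return thing;
        }
    }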
So that's a simple strategy, but not a very versatile one. For example, one can't create a subset (as a List) of 'Thing' without duplicating data and defeating the purpose of array field storage. I haven't been able to find good examples, so I would appreciate either links or code snippets of better ways to handle this scenario from someone who's done it, or a better idea.
You make an Array of System.Array with an element for each property in your type. The size of these sub-arrays is equal to the number of objects you have. Property access would be:
This will allow you to use value type arrays instead of arrays of object.
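One way to read the suggestion above, as a rough sketch only: keep one `System.Array` per property and cast the column back to its typed array on access. The class name `ThingColumns`, the accessor names, and the column order (0 = A, 1 = B, 2 = C, 3 = D) are assumptions made for illustration.

```csharp
using System;

// Hypothetical column store: one System.Array per property of Thing.
// The column order (0 = A, 1 = B, 2 = C, 3 = D) is an assumption.
public class ThingColumns
{
    private readonly Array[] columns;

    public ThingColumns(int count)
    {
        columns = new Array[]
        {
            new double[count], // A
            new double[count], // B
            new int[count],    // C
            new string[count], // D
        };
    }

    // Casting the column back to its typed array means elements are
    // read and written as value types, without boxing.
    public double GetA(int row) => ((double[])columns[0])[row];
    public void SetA(int row, double value) => ((double[])columns[0])[row] = value;

    public int GetC(int row) => ((int[])columns[2])[row];
    public void SetC(int row, int value) => ((int[])columns[2])[row] = value;
}
```

Accessors for B and D would follow the same pattern. Going through `Array.GetValue` instead of the cast would box every element, which is what the typed cast avoids.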
It depends on your concrete scenario. Depending on how often your objects are created, you can:
If the objects are serializable, save them in a MemoryMappedFile (trading middling-to-low performance for low memory consumption).
Share the fields between different objects: if objects initially have default values, keep all the defaults in a separate base and allocate new space only when a value becomes different from the default. (This makes sense for reference types, naturally.)
Another solution is to save the objects to a SQLite database, which is much easier to manage than MemoryMappedFiles because you can use simple SQL.
The choice is up to you, as it depends on your concrete project requirements.
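For the MemoryMappedFile option, a minimal sketch might look like the following. It only stores the fixed-width fields (A, B, C); the class name `MappedThingStore` and the 20-byte record layout are assumptions, and variable-length strings would need a separate scheme.

```csharp
using System;
using System.IO.MemoryMappedFiles;

// Hypothetical store that keeps the numeric fields of each Thing
// in a memory-mapped region instead of on the managed heap.
public sealed class MappedThingStore : IDisposable
{
    private const int RecordSize = 8 + 8 + 4; // double A, double B, int C
    private readonly MemoryMappedFile file;
    private readonly MemoryMappedViewAccessor view;

    public MappedThingStore(int count)
    {
        // A null map name creates a mapping that is not shared by name.
        file = MemoryMappedFile.CreateNew(null, (long)count * RecordSize);
        view = file.CreateViewAccessor();
    }

    public void Write(int index, double a, double b, int c)
    {
        long offset = (long)index * RecordSize;
        view.Write(offset, a);
        view.Write(offset + 8, b);
        view.Write(offset + 16, c);
    }

    public (double A, double B, int C) Read(int index)
    {
        long offset = (long)index * RecordSize;
        return (view.ReadDouble(offset), view.ReadDouble(offset + 8), view.ReadInt32(offset + 16));
    }

    public void Dispose()
    {
        view.Dispose();
        file.Dispose();
    }
}
```

To back the data with a real file on disk, `MemoryMappedFile.CreateFromFile` could be used in place of `CreateNew`.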
I guess there are several ways to approach this, and indeed you are onto a possible solution to limit the data in memory. However, I'm not sure that reducing your structure by even 24? bytes is going to do you a whole lot of good. Your structure is around 79 bytes (for a 15 char string) = 8 + 8 + 4 + 24? + 4 + 1 + (2 * character length), so your total gain is at best 25%. That doesn't seem very useful, since you'd have to be in a position where 10 million * 80 bytes fits in memory and 10 million * 100 bytes does not. That would mean that you're designing a solution that is on the edge of disaster: too many large strings, too many records, or some other program hogging memory, and your machine is out of memory.
If you need to support random access to n small records, where n = 10 million, then you should aim to design for at least 2n or 10n. Perhaps you're already considering this in your 10 million? Either way, there are plenty of technologies that can support this type of data being accessed.
One possibility is if the string is limited in Max Length (ml), of a reasonable size (say 255) then you can go to a simple ISAM store. Each record would be 8 + 8 + 4 + 255 bytes and you can simply offset into a flat file to read them. If the record size is variable or possibly large then you will want to use a different storage format for this and store offsets into the file.
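A fixed-width record file of this kind could be sketched roughly as below. The class name `FlatThingFile`, the field order, and the choice of zero-padded ASCII for the string are assumptions made for illustration; the record size of 8 + 8 + 4 + 255 bytes is the one given above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical fixed-width record file: 8 (A) + 8 (B) + 4 (C) + 255 (D) bytes per record.
// Assumes D is ASCII, at most 255 bytes, and contains no NUL characters.
public sealed class FlatThingFile : IDisposable
{
    private const int MaxStringBytes = 255;
    private const int RecordSize = 8 + 8 + 4 + MaxStringBytes;
    private readonly FileStream stream;

    public FlatThingFile(string path)
    {
        stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    public void Write(int index, double a, double b, int c, string d)
    {
        byte[] text = Encoding.ASCII.GetBytes(d);
        if (text.Length > MaxStringBytes) throw new ArgumentException("string too long", nameof(d));

        byte[] record = new byte[RecordSize];
        BitConverter.GetBytes(a).CopyTo(record, 0);
        BitConverter.GetBytes(b).CopyTo(record, 8);
        BitConverter.GetBytes(c).CopyTo(record, 16);
        text.CopyTo(record, 20); // remaining bytes stay zero as padding

        // The record's position is simply index * RecordSize.
        stream.Seek((long)index * RecordSize, SeekOrigin.Begin);
        stream.Write(record, 0, record.Length);
    }

    public (double A, double B, int C, string D) Read(int index)
    {
        byte[] record = new byte[RecordSize];
        stream.Seek((long)index * RecordSize, SeekOrigin.Begin);
        int read = 0;
        while (read < RecordSize)
        {
            int n = stream.Read(record, read, RecordSize - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        string d = Encoding.ASCII.GetString(record, 20, MaxStringBytes).TrimEnd('\0');
        return (BitConverter.ToDouble(record, 0), BitConverter.ToDouble(record, 8),
                BitConverter.ToInt32(record, 16), d);
    }

    public void Dispose() => stream.Dispose();
}
```

Because every record has the same size, no index structure is needed for access by position; lookups by key would still need something like the B+Tree mentioned below.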
Another possibility, if you're looking up values by some key, is something like an embedded database or a BTree, ideally one where you can disable some of the disk consistency guarantees to gain performance. As it happens, I wrote a BPlusTree for client-side caches of large volumes of data. Detailed information on using the B+Tree is here.
Actually, the ADO.NET DataTable uses a similar approach to store its data. Maybe you should look at how it is implemented there. You'll need a DataRow-like object that internally holds a pointer to the table and the index of the row data. This would be the most lightweight solution, I believe.
In your case: a) If you construct the Thing each time you call the GetThingAtPosition method, you create an object on the heap that duplicates information already in your table, plus the "object overhead" data.
b) If you need to access each item in your ContainerOfThings, the required memory will be doubled, plus an overhead of 12 bytes * number of objects. In such a scenario it would be better to have a simple array of things without creating them on the fly.
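The DataRow-like object described above could be sketched as a small struct that holds only a table reference and a row index. The names `ThingTable`, `ThingRow`, and `Where` are hypothetical; this is not how DataTable itself is implemented, just the same idea in miniature.

```csharp
using System;
using System.Collections.Generic;

// Column storage as in the question; names here are illustrative.
public class ThingTable
{
    public readonly double[] A;
    public readonly double[] B;
    public readonly int[] C;
    public readonly string[] D;

    public ThingTable(int total)
    {
        A = new double[total];
        B = new double[total];
        C = new int[total];
        D = new string[total];
    }

    // Returns a handle, not a copy: no field data is duplicated.
    public ThingRow this[int index] => new ThingRow(this, index);

    // A subset is just a list of handles into the same arrays.
    public List<ThingRow> Where(Func<ThingRow, bool> predicate)
    {
        var result = new List<ThingRow>();
        for (int i = 0; i < A.Length; i++)
        {
            ThingRow row = this[i];
            if (predicate(row)) result.Add(row);
        }
        return result;
    }
}

// A struct holding only the table reference and the row index.
public struct ThingRow
{
    private readonly ThingTable table;
    public readonly int Index;

    public ThingRow(ThingTable table, int index)
    {
        this.table = table;
        Index = index;
    }

    // Reads and writes go straight through to the column arrays.
    public double A { get { return table.A[Index]; } set { table.A[Index] = value; } }
    public double B { get { return table.B[Index]; } set { table.B[Index] = value; } }
    public int C { get { return table.C[Index]; } set { table.C[Index] = value; } }
    public string D { get { return table.D[Index]; } set { table.D[Index] = value; } }
}
```

Since `ThingRow` is a struct, handing one out does not allocate on the heap, and a subset costs one reference plus one int per element rather than a copy of each field.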
Your question implies there is a problem. Has the memory usage proved to be a problem?
At 100 bytes per item, 10 million items comes to about 1 GB. So I'm wondering about the app and whether this is a problem. Is the app going to run on a dedicated 64-bit box with, say, 8 GB of RAM?
If there is a fear, you could test the fear by an integration test. Instantiate say 20 million of these items and run some performance tests.
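A rough way to run that kind of test is to compare managed heap usage before and after allocating each layout. The sketch below uses `GC.GetTotalMemory`; the helper names (`MemoryProbe`, `ObjectsBytes`, `ColumnsBytes`) and the nested `Thing` class are made up for illustration, and the numbers it produces are approximate.

```csharp
using System;

public static class MemoryProbe
{
    // Stand-in for the question's Thing; D is left null, so string data is not counted.
    private sealed class Thing { public double A; public double B; public int C; public string D; }

    // Approximate managed bytes retained by whatever 'allocate' returns.
    public static long Measure(Func<object> allocate)
    {
        long before = GC.GetTotalMemory(true); // true forces a full collection first
        object keepAlive = allocate();
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(keepAlive); // keep the allocation reachable until after measuring
        return after - before;
    }

    // One heap object per item, referenced from an array.
    public static long ObjectsBytes(int count) => Measure(() =>
    {
        var things = new Thing[count];
        for (int i = 0; i < count; i++) things[i] = new Thing { A = i, B = i, C = i };
        return things;
    });

    // One array per field, as in ContainerOfThings.
    public static long ColumnsBytes(int count) => Measure(() =>
        new object[] { new double[count], new double[count], new int[count], new string[count] });
}
```

Calling `MemoryProbe.ObjectsBytes(20000000)` and `MemoryProbe.ColumnsBytes(20000000)` on the target machine gives a first impression of the difference; real strings and access-time measurements would need to be added for a meaningful performance test.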
But of course it all comes down to the app domain. I have had specialised apps that used more RAM than this and worked fine. The cost of hardware is often way less than the cost of software (yes, it comes down to the app domain again).
See ya
Unfortunately, OO can't abstract away the performance issues (saturation of bandwidth being one). It's a convenient paradigm, but it comes with limitations.
I like your idea, and I use this as well... and guess what, we're not the first to think of this ;-). I've found that it does require a bit of a mind shift though.
May I refer you to the J community? See:
That's not a C# (or Java) group. They're a good bunch. Typically the array needs to be treated as a first-class object. In C#, it's not nearly as flexible, and it can be a frustrating structure to work with.
There are various OO patterns for large dataset problems... but if you are asking a question like this, probably it is time to go a little more functional. Or at least functional for problem solving / prototyping.