In light of this article, I am wondering what people's experiences are with storing massive datasets (say, more than 10,000,000 objects) in memory using arrays to store data fields, instead of instantiating millions of objects and racking up the memory overhead (say, 12-24 bytes per object, depending on which article you read). Data per property varies from item to item, so I can't use a strict Flyweight pattern, but I would envision something similar.
My idea of this sort of representation is that one has a 'template object'...
class Thing
{
    public double A;
    public double B;
    public int C;
    public string D;
}
And then a container object with a method of creating an object on request...
class ContainerOfThings
{
    public double[] ContainerA;
    public double[] ContainerB;
    public int[] ContainerC;
    public string[] ContainerD;

    public ContainerOfThings(int total)
    {
        // create one parallel array per field
        ContainerA = new double[total];
        ContainerB = new double[total];
        ContainerC = new int[total];
        ContainerD = new string[total];
    }

    public Thing GetThingAtPosition(int position)
    {
        Thing thing = new Thing(); // probably best done as a factory instead
        thing.A = ContainerA[position];
        thing.B = ContainerB[position];
        thing.C = ContainerC[position];
        thing.D = ContainerD[position];
        return thing;
    }
}
So that's a simple strategy, but not very versatile: for example, one can't create a subset (as a List) of Things without duplicating data and defeating the purpose of array-field storage. I haven't been able to find good examples, so I would appreciate either links or code snippets of better ways to handle this scenario from someone who's done it... or a better idea.
I guess there are several ways to approach this, and indeed you are onto a possible solution to limit the data in memory. However, I'm not sure that reducing your structure by even 24? bytes is going to do you a whole lot of good. Your structure is around 79 bytes (for a 15-char string): 8 + 8 + 4 + 24? + 4 + 1 + (2 * character length), so your total gain is at best 25%. That doesn't seem very useful, since you'd have to be in a position where 10 million * 80 bytes fits in memory and 10 million * 100 bytes does not. That would mean you're designing a solution that is on the edge of disaster: too many large strings, too many records, or some other program hogging memory, and your machine is out of memory.
If you need to support random access to n small records, where n = 10 million, then you should aim to design for at least 2n or 10n. Perhaps you're already considering this in your 10 million? Either way, there are plenty of technologies that can support this type of data access.
One possibility is that if the string has a reasonable maximum length (say 255), you can go to a simple ISAM store. Each record would be 8 + 8 + 4 + 255 bytes, and you can simply offset into a flat file to read them. If the record size is variable or possibly large, then you will want to use a different storage format and store offsets into the file.
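A minimal sketch of that fixed-width idea, assuming a hypothetical FlatFileStore class and UTF-8 strings zero-padded to 255 bytes (the 8 + 8 + 4 + 255 record from above); record i lives at offset i * RecordSize, so random access is one Seek plus one read:

using System;
using System.IO;
using System.Text;

class FlatFileStore : IDisposable
{
    const int StringBytes = 255;                    // fixed space for the string field
    const int RecordSize = 8 + 8 + 4 + StringBytes; // double + double + int + string

    readonly FileStream file;

    public FlatFileStore(string path)
    {
        file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    public void Write(long index, double a, double b, int c, string d)
    {
        byte[] padded = new byte[StringBytes];            // zero-padded;
        Encoding.UTF8.GetBytes(d, 0, d.Length, padded, 0); // assumes d fits in 255 bytes

        file.Seek(index * RecordSize, SeekOrigin.Begin);
        var w = new BinaryWriter(file);
        w.Write(a);
        w.Write(b);
        w.Write(c);
        w.Write(padded);
        w.Flush();
    }

    public void Read(long index, out double a, out double b, out int c, out string d)
    {
        file.Seek(index * RecordSize, SeekOrigin.Begin);
        var r = new BinaryReader(file);
        a = r.ReadDouble();
        b = r.ReadDouble();
        c = r.ReadInt32();
        d = Encoding.UTF8.GetString(r.ReadBytes(StringBytes)).TrimEnd('\0');
    }

    public void Dispose() { file.Dispose(); }
}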
Another possibility is that if you're looking up values by some key, then I would recommend something like an embedded database or a B-Tree, one where you can disable some of the disk-consistency guarantees to gain performance. As it happens, I wrote a BPlusTree for client-side caches of large volumes of data. Detailed information on using the B+Tree is here.
Actually, the ADO.NET DataTable uses a similar approach to store its data; maybe you should look at how it is implemented there.
So you'll need a DataRow-like object that internally holds a pointer to the table and the index of its row data. This would be the most lightweight solution, I believe.
In your case:
a) If you construct a Thing each time you call the GetThingAtPosition method, you create an object on the heap that duplicates information already in your table, plus the "object overhead" data.
b) If you need to access each item in your ContainerOfThings, the required memory will be doubled, plus 12 bytes * number of objects of overhead. In that scenario it would be better to have a simple array of Things without creating them on the fly.
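Something like this sketch, say (ThingCursor is a hypothetical name, and it assumes the container's arrays are accessible to it): the cursor holds only a container reference and a row index, so no field data is ever copied.

struct ThingCursor
{
    readonly ContainerOfThings table; // the "table" pointer
    readonly int row;                 // the row index

    public ThingCursor(ContainerOfThings table, int row)
    {
        this.table = table;
        this.row = row;
    }

    // each property reads straight out of the backing arrays; nothing is duplicated
    public double A { get { return table.ContainerA[row]; } }
    public double B { get { return table.ContainerB[row]; } }
    public int C { get { return table.ContainerC[row]; } }
    public string D { get { return table.ContainerD[row]; } }
}

A subset of Things can then be a List<ThingCursor> (or just a List<int> of row indices) without duplicating any field data.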
Your question implies there is a problem. Has the memory usage proved to be a problem?
If it's 100 bytes per item, then that sounds like about 1 GB. So I'm wondering about the app and whether this is actually a problem. Is the app to run on a dedicated 64-bit box with, say, 8 GB of RAM?
If there is a fear, you could test the fear with an integration test: instantiate, say, 20 million of these items and run some performance tests.
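For instance, a minimal sketch of such a test against the Thing class from the question (the count and the shared string literal are placeholders; a shared literal understates real per-item string cost):

using System;
using System.Diagnostics;

class MemoryFearTest
{
    static void Main()
    {
        const int n = 20000000; // the feared object count
        long before = GC.GetTotalMemory(true);
        var sw = Stopwatch.StartNew();

        var things = new Thing[n];
        for (int i = 0; i < n; i++)
            things[i] = new Thing { A = i, B = i * 0.5, C = i, D = "abc" };

        sw.Stop();
        long after = GC.GetTotalMemory(true);
        Console.WriteLine("{0} items: {1} ms, ~{2} MB of managed heap",
            n, sw.ElapsedMilliseconds, (after - before) / (1024 * 1024));
        GC.KeepAlive(things);
    }
}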
But of course it does all come down to the app domain. I have had specialised apps that use more RAM than this and have worked fine. The cost of hardware is often far less than the cost of software (yeah, it comes down to the app domain again).
See ya
Unfortunately, OO can't abstract away the performance issues (saturation of bandwidth being one). It's a convenient paradigm, but it comes with limitations.
I like your idea, and I use this as well... and guess what, we're not the first to think of this ;-). I've found that it does require a bit of a mind shift though.
May I refer you to the J community? See:
http://www.JSoftware.com
That's not a C# (or Java) group, but they're a good bunch. Typically, the array needs to be treated as a first-class object; in C# it's not nearly as flexible, and it can be a frustrating structure to work with in C#.
There are various OO patterns for large-dataset problems, but if you are asking a question like this, it is probably time to go a little more functional, or at least functional for problem solving / prototyping.
I've done such a thing for the rapidSTORM project, where several million sparsely populated objects need to be cached (localization microscopy). While I can't really give you good code snippets (too many dependencies), I found that the implementation was very quick and straightforward with Boost Fusion: I Fusionized the structure, built a vector for each element type, and then wrote a quite straightforward accessor for those vectors that reconstructed each element.
(D'oh, I just noticed that you tagged the question, but maybe my C++ answer helps as well)
[update 2011-07-19]
there's a new version available now: http://www.mediafire.com/file/74fxj7u1n0ppcq9/MemStorageDemo-6639584-2011_07_19-12_47_00.zip
i am still trying to debug some of the refcounting, which is annoying, but from a fresh xUnit session i am able to run a test that creates 10 million objects. (it occurs to me that i have the string size cut down right now for testing, but i had it running with 10 million strings of variable length from 3 to 15 bytes; i haven't had a chance to try bigger than that yet.) on my system i was going from approx ~1.95G load to ~2.35G with the 10 million objects, and i still haven't done anything about the strings other than a very simple backing class that uses actual managed strings.
anyway, i think it's fairly well working, although there's definitely optimizing left to be done on the backing store, and i also think a bit of work can be done on the iterators if necessary, depending on how much data you process at once. unfortunately i'm just not going to be able to look at it again until probably late tomorrow or the next day.
anyway, so here's the basic ideas:
MemoryArray class: uses unmanaged memory allocated by Marshal.AllocHGlobal() to store the structures. i was testing with one big MemoryArray, but member-wise there are only a few fields, and i think as long as you keep the array size fairly large, splitting it up won't make much difference in memory consumption. MemoryArray implements IEnumerable<MemoryArrayItem>, which is what i've been using to fill the arrays in testing.
MemoryArray is intended for holding the regularly-sized chunks of data from any object being backed. there are some things you can do with the enumerators using pointer math that i have not yet implemented; i am currently returning new objects every time, so that's a big chunk of the size, i believe. in the prototype class i based this prototype on, i was able to use very regular pointer math for doing traversals; however, i think the way i was doing it would be mainly useful for very quick traversals and probably not for interoperability.
MemoryArray has just a standard indexer that grabs the requested element using pointer math on the IntPtr that represents the head of the unmanaged data, which is allocated upon construction. i have also implemented a notional 2D indexer, where you can pass in an array of ints representing dimensions to the Array and then perform A[x,y] on it. it's just a quick little example of how it could work.
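(for readers without the zip: a minimal sketch of that indexer idea, assuming fixed-size items in a single unmanaged block; the names here are illustrative, not the exact classes from the download)

using System;
using System.Runtime.InteropServices;

// sketch: one unmanaged block, item i lives at head + i * itemSize
public sealed class MemoryArraySketch : IDisposable
{
    readonly IntPtr head;  // start of the block from AllocHGlobal
    readonly int itemSize; // bytes per item
    readonly int length;

    public MemoryArraySketch(int length, int itemSize)
    {
        this.length = length;
        this.itemSize = itemSize;
        head = Marshal.AllocHGlobal(new IntPtr((long)length * itemSize));
    }

    // standard indexer: pure pointer math on the head IntPtr
    public IntPtr this[int i]
    {
        get
        {
            if (i < 0 || i >= length) throw new ArgumentOutOfRangeException("i");
            return new IntPtr(head.ToInt64() + (long)i * itemSize);
        }
    }

    public void Dispose() { Marshal.FreeHGlobal(head); }
}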
one thing i haven't implemented yet is any sort of subsectioning, but i do think a subsection is appropriate for the project so i will probably implement that when i get a chance.
MemoryArrayEnumerator class: i chose to implement actual enumerator classes instead of enumerator functions. the enumerator basically just takes a MemoryArray and provides the stock enumerator functions to return the actual MemoryArrayItem objects.
MemoryArrayItem class: this class doesn't do much other than hold the appropriate pointer information based on the start and position in the array. it's the specific classes implemented on top of (to the side of?) this object that actually do the pointer work to get the data out.
then there are a few more support classes: MemoryStringArray is the variable-sized memory backing chunk that only does strings for the time being, and then there's the auto-disposing class (AutoDisposer) and a generic class to handle the attaches and detaches (AutoReference<>).
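(the MemoryStringArray in the zip is, as mentioned above, a very simple backing class that uses actual managed strings; a sketch of the idea, matching the GetString/ModifyString/AddString calls used below, though the real one may differ)

using System.Collections.Generic;

// sketch: fixed-size records store an int handle; the handle indexes into this list
public class MemoryStringArraySketch
{
    readonly List<string> strings = new List<string>();

    // returns the handle to embed in the unmanaged record
    public int AddString(string value)
    {
        strings.Add(value);
        return strings.Count - 1;
    }

    public string GetString(int handle) { return strings[handle]; }

    public void ModifyString(int handle, string value) { strings[handle] = value; }
}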
now, on top of this foundation are the specific classes. each of the three types (array/enumerator/item) is implemented specifically for the objects you're looking at. in the more massive version of this project that i gave up on, and that this is a spike for, i have more generic handling of the offsets and such, so that you're not so tied to concrete classes; but even as they are, these are pretty useful. my original implementation was all concrete classes with no real base.
currently i have them all implemented as separate classes that you pass references to; so the TestArray class is passed a MemoryArray in its constructor, and the same goes for the enumerator and item. i know there would be some processing power to be saved there, and i think there's a good possibility for space savings as well, if i can figure out a good way to implement them as descendants rather than just having a copy of the underlying classes. i wanted to get the basic feel for it first, though, and that seemed the most straightforward way to go. the issue is that it's another layer of indirection.
TestArray and TestArrayEnumerator turned out not to do much other than pass through the functionality of MemoryArray and MemoryArrayEnumerator. the main issue in those classes is just getting the pointers carved up, passed out, and into the items that use them.
but TestArrayItem is where the pointers are actually switched into real data; here's the file. i snipped a big section of comments that goes through some options for a better way to handle the variable-length backing store (it's still in the actual file in the link given above), and please excuse the multitude of comments where i leave myself notes about what i'm thinking as i work on it :)
TestArrayItem.cs
// ----------------------------------------------
// rights lavished upon all with love
// see 'license/unlicense.txt'
// ♥ 2011, shelley butterfly - public domain
// ----------------------------------------------

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace MemStorageDemo
{
    public unsafe class TestArrayItem
    {
        public static MemoryStringArray s_MemoryArrayStore = new MemoryStringArray();

        public static int TestArrayItemSize_bytes =
            sizeof(double) * 2
            + sizeof(int)
            + sizeof(int);

        // hard-coding here; this is another place that things could be a little more generic if you wanted
        // and if performance permitted; for instance, creating a dictionary of offsets based on index. also,
        // perhaps a dictionary with key strings to allow indexing using field names of the object.
        private enum EFieldOffset
        {
            DoubleTheFirstOffset = 0,
            DoubleTheSecondOffset = 8,
            IntTheFirstOffset = 16,
            StringTheFirstHandleOffset = 20
        }

        private MemoryArrayItem myMemoryArrayItem;
        private MemoryStringArray myStringStore;

        // constructor that uses the static string array store
        public TestArrayItem(MemoryArrayItem parMemoryArrayItem) :
            this(parMemoryArrayItem, s_MemoryArrayStore)
        {
        }

        // constructor for getting the item at its memory block without any initialization (e.g. existing item)
        public TestArrayItem(MemoryArrayItem parMemoryArrayItem, MemoryStringArray parStringStore)
        {
            myMemoryArrayItem = parMemoryArrayItem;
            myStringStore = parStringStore;
        }

        // constructor for getting the item at its memory block and initializing it (e.g. adding new items);
        // chains to the two-argument constructor so the string store is set, and allocates a fresh string
        // handle via CreateStringTheFirst instead of writing through an uninitialized one
        public TestArrayItem(MemoryArrayItem parMemoryArrayItem, double parDoubleTheFirst, double parDoubleTheSecond, int parIntTheFirst, string parStringTheFirst) :
            this(parMemoryArrayItem, s_MemoryArrayStore)
        {
            DoubleTheFirst = parDoubleTheFirst;
            DoubleTheSecond = parDoubleTheSecond;
            IntTheFirst = parIntTheFirst;
            CreateStringTheFirst(parStringTheFirst);
        }

        // if you end up in a situation where the compiler isn't giving you equivalent performance to just
        // doing the array math directly in the properties, you could always just do the math directly in
        // the properties.
        //
        // it reads much cleaner the way i have it set up, and there's a lot less code duplication, so without
        // actually determining empirically that i needed to do so, i would stick with the function calls.
        private IntPtr GetPointerAtOffset(EFieldOffset parFieldOffset)
        { return myMemoryArrayItem.ObjectPointer + (int)parFieldOffset; }

        private double* DoubleTheFirstPtr
        { get { return (double*)GetPointerAtOffset(EFieldOffset.DoubleTheFirstOffset); } }

        public double DoubleTheFirst
        {
            get { return *DoubleTheFirstPtr; }
            set { *DoubleTheFirstPtr = value; }
        }

        private double* DoubleTheSecondPtr
        { get { return (double*)GetPointerAtOffset(EFieldOffset.DoubleTheSecondOffset); } }

        public double DoubleTheSecond
        {
            get { return *DoubleTheSecondPtr; }
            set { *DoubleTheSecondPtr = value; }
        }

        // ahh wishing for a preprocessor about now
        private int* IntTheFirstPtr
        { get { return (int*)GetPointerAtOffset(EFieldOffset.IntTheFirstOffset); } }

        public int IntTheFirst
        {
            get { return *IntTheFirstPtr; }
            set { *IntTheFirstPtr = value; }
        }

        // okay, since we're using the StringArray backing store in the example, we just need to get the
        // handle stored in our block, and then copy the data from that address
        private int* StringTheFirstHandlePtr
        { get { return (int*)GetPointerAtOffset(EFieldOffset.StringTheFirstHandleOffset); } }

        public string StringTheFirst
        {
            get { return myStringStore.GetString(*StringTheFirstHandlePtr); }
            set { myStringStore.ModifyString(*StringTheFirstHandlePtr, value); }
        }

        public void CreateStringTheFirst(string WithValue)
        {
            *StringTheFirstHandlePtr = myStringStore.AddString(WithValue);
        }

        public override string ToString()
        {
            return string.Format("{0:X8}: {{ {1:0.000}, {2:0.000}, {3}, {4} }} {5:X8}",
                (int)DoubleTheFirstPtr, DoubleTheFirst, DoubleTheSecond, IntTheFirst, StringTheFirst,
                (int)myMemoryArrayItem.ObjectPointer);
        }
    }
}
so, that's the real magic: just implementing functions that figure out the right pointers based on the info about the fields. as it stands, i think it's a pretty good candidate for code gen, well, assuming i get the attach/detach stuff working right. i hid a lot of the manual memory management behind auto-type pointers, and i think it will be worth the debugging in the long run...
anyway, that's about it. i'll hopefully be back with internet this evening or tomorrow, and i'll do my best to check in. unfortunately i'm about to go give the cable modem back to the cable company, so i won't be in business unless we go to a mcdonalds or something :) hope this has been some sort of help; i'll debug the issues with it so that there's at least a functional base to work from. i know this isn't the first time i've thought about writing a library like this, and i imagine others have too.
previous content
i have used something like this for a COM library that i created to interop with the pre-compiled win32/64 FFTW dll. for this question we really need something a little more generic than what i had, so i started working on something last week that would be sufficient as a decent generic library for these types of uses, with extensible memory management, multiple dimensions, slicing, etc.
well, i finally admitted to myself yesterday that (a) it was going to be another few days before it was ready, and (b) i needed a spike solution to figure out some of the last bits anyway. so i decided to do a first cut at a lower level of abstraction that attempts to meet the needs addressed in your question. unfortunately we are about to move and i have to do some packing, but i think it's probably enough to address your question.
i am going to continue work on the example this week, and i will update here with new code and attempt to distill enough info from it to make a post that explains it. but if you're still interested, here's the last i'll probably be able to post until later in the week:
http://www.mediafire.com/file/a7yq53ls18q7bvf/EfficientStorage-6639584.zip
it's just a VS2010 solution with the basic necessities for doing what i think meets your needs. there are lots of comments sprinkled through, but feel free to ask questions and i will check back as soon as i have internet... either way it's been a fun project.
after completing the example i intend to finish off the first iteration of the full library and release it somewhere; i will update here with the link when i have it.
obligatory warning: it compiles but has not been tested; i'm sure there are issues, it's a pretty big topic.