C# serialize large array to disk

Posted 2020-07-20 03:05

Question:

I have a very large graph stored in a single-dimensional array (about 1.1 GB), which I am able to hold in memory on my machine, which runs Windows XP with 2 GB of RAM and 2 GB of virtual memory. I am able to generate the entire data set in memory; however, when I try to serialize it to disk using the BinaryFormatter, the file grows to about 50 MB and then I get an out-of-memory exception. The code I am using to write it is the same code I use for all of my smaller problems:

StateInformation[] diskReady = GenerateStateGraph();
BinaryFormatter bf = new BinaryFormatter();
using (Stream file = File.OpenWrite(@"C:\temp\states.dat"))
{
    bf.Serialize(file, diskReady);
}

The search algorithm is very lightweight, and I am able to perform searches on this graph with no problems once it is in memory.

I really have 3 questions:

  1. Is there a more reliable way to write a large data set to disk? I guess you can define "large" as a data set whose size approaches the amount of available memory, though I am not sure how accurate that is.

  2. Should I move to a more database centric approach?

  3. Can anyone point me to some literature on reading portions of a large data set from a disk file in C#?

Answer 1:

Write the entries to the file yourself. One simple solution would be something like:

// Requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
StateInformation[] diskReady = GenerateStateGraph();
BinaryFormatter bf = new BinaryFormatter();
using (Stream file = File.OpenWrite(@"C:\temp\states.dat"))
{
  foreach (StateInformation si in diskReady)
    using (MemoryStream ms = new MemoryStream())
    {
      // Serialize one element at a time, not the whole array.
      bf.Serialize(ms, si);
      byte[] ser = ms.ToArray();
      int len = ser.Length;
      // Write the length as four little-endian bytes; mask and shift before casting.
      file.WriteByte((byte)(len & 0x000000FF));
      file.WriteByte((byte)((len & 0x0000FF00) >> 8));
      file.WriteByte((byte)((len & 0x00FF0000) >> 16));
      file.WriteByte((byte)((len & 0x7F000000) >> 24));
      file.Write(ser, 0, len);
    }
}

No more memory than that needed for a single serialised StateInformation object is required at a time. To deserialise, you read four bytes, reconstruct the length, create a buffer of that size, fill it, and deserialise it.
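The read side described here could be sketched roughly as follows. This is only an illustration of the framing, not a tested drop-in: the class and method names (LengthPrefixedRecords, WriteRecord, ReadRecords) are made up for the example, and it deals in raw byte[] payloads so that it does not depend on any particular serializer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LengthPrefixedRecords
{
    // Writes one record: a 4-byte little-endian length followed by the payload.
    public static void WriteRecord(Stream output, byte[] payload)
    {
        int len = payload.Length;
        output.WriteByte((byte)(len & 0xFF));
        output.WriteByte((byte)((len >> 8) & 0xFF));
        output.WriteByte((byte)((len >> 16) & 0xFF));
        output.WriteByte((byte)((len >> 24) & 0xFF));
        output.Write(payload, 0, len);
    }

    // Lazily yields each payload, so only one record is buffered at a time.
    public static IEnumerable<byte[]> ReadRecords(Stream input)
    {
        byte[] header = new byte[4];
        while (true)
        {
            int got = ReadFully(input, header, 4);
            if (got == 0) yield break; // clean end of file
            if (got < 4) throw new EndOfStreamException("Truncated length prefix.");

            // Rebuild the length from the four little-endian bytes.
            int len = header[0] | (header[1] << 8) | (header[2] << 16) | (header[3] << 24);
            if (len < 0) throw new InvalidDataException("Negative record length.");

            byte[] payload = new byte[len];
            if (ReadFully(input, payload, len) < len)
                throw new EndOfStreamException("Truncated record.");
            yield return payload;
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until done or EOF.
    private static int ReadFully(Stream input, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = input.Read(buffer, total, count - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

With the BinaryFormatter approach above, each yielded payload would then be wrapped in a MemoryStream and passed to bf.Deserialize, with the result cast to StateInformation.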

All of the above could be seriously optimised for speed, memory use and disk-size if you create a more specialised format, but the above goes to show the principle.



Answer 2:

My experience with larger sets of information like this is that it is better to write them to disk manually rather than use the built-in serialization.

This may not be practical depending on how complex your StateInformation class is, but if it is fairly simple you could write and read the binary data manually using a BinaryWriter and BinaryReader instead. These allow you to read and write most value types directly to the stream, in a predetermined order dictated by your code.

This option should allow you to read and write your data quickly, although it is awkward if you later want to add information to StateInformation or remove some, because you will have to manage upgrading your files.
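As a rough sketch of the BinaryWriter/BinaryReader approach: the question never shows what StateInformation contains, so the fields below (Id, Cost, Neighbours) are purely hypothetical, as are the StateFile class and the FormatVersion header, which is just one possible way to make later format changes detectable.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical shape; the real StateInformation is not shown in the question.
public struct StateInformation
{
    public int Id;
    public double Cost;
    public int[] Neighbours;
}

public static class StateFile
{
    // Bump this whenever the field layout changes, so old files can be detected.
    private const int FormatVersion = 1;

    public static void Write(Stream output, StateInformation[] states)
    {
        using (var w = new BinaryWriter(output, Encoding.UTF8, true)) // leave stream open
        {
            w.Write(FormatVersion);
            w.Write(states.Length);
            foreach (StateInformation s in states)
            {
                // Fields are written in a fixed order; Read must mirror it exactly.
                w.Write(s.Id);
                w.Write(s.Cost);
                w.Write(s.Neighbours.Length);
                foreach (int n in s.Neighbours) w.Write(n);
            }
        }
    }

    public static StateInformation[] Read(Stream input)
    {
        using (var r = new BinaryReader(input, Encoding.UTF8, true)) // leave stream open
        {
            int version = r.ReadInt32();
            if (version != FormatVersion)
                throw new InvalidDataException("Unsupported format version " + version);

            var states = new StateInformation[r.ReadInt32()];
            for (int i = 0; i < states.Length; i++)
            {
                states[i].Id = r.ReadInt32();
                states[i].Cost = r.ReadDouble();
                states[i].Neighbours = new int[r.ReadInt32()];
                for (int j = 0; j < states[i].Neighbours.Length; j++)
                    states[i].Neighbours[j] = r.ReadInt32();
            }
            return states;
        }
    }
}
```

Because the writer streams each field straight to the file, no second in-memory copy of the whole array is built during the write.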



Answer 3:

What is contained in StateInformation? Is it a class? A struct?

If you simply want an easy-to-use container format that serializes easily to disk, create a typed DataSet, store the information in it, then use its WriteXml() method to persist it to disk. To load the contents back into memory, create an empty DataSet and call ReadXml().
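A minimal sketch of that round trip is below. It uses a plain (untyped) DataSet built in code rather than a designer-generated typed one, and the table and column names ("States", "Id", "Cost") are invented for illustration, since the real fields of StateInformation are unknown.

```csharp
using System;
using System.Data;
using System.IO;

public static class DataSetRoundTrip
{
    // Builds a small example DataSet; the columns are hypothetical.
    public static DataSet Build()
    {
        var ds = new DataSet("StateGraph");
        DataTable t = ds.Tables.Add("States");
        t.Columns.Add("Id", typeof(int));
        t.Columns.Add("Cost", typeof(double));
        t.Rows.Add(1, 2.5);
        t.Rows.Add(2, 4.0);
        return ds;
    }

    public static void Save(DataSet ds, string path)
    {
        // WriteSchema embeds the column types so they survive the round trip.
        ds.WriteXml(path, XmlWriteMode.WriteSchema);
    }

    public static DataSet Load(string path)
    {
        var ds = new DataSet();
        ds.ReadXml(path); // picks up the inline schema written by Save
        return ds;
    }
}
```

Keep in mind that a DataSet still holds every row in memory and XML is verbose, so this favours convenience over file size.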

If StateInformation is a struct containing only value types, you can look at MemoryMappedFile to store and use the contents of the array by referencing the file directly, treating it as memory. This approach is quite a bit more complicated than the DataSet, but has its own set of advantages.
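For illustration, here is one way the memory-mapped approach might look. The StateRecord struct and the MappedStates helper are hypothetical stand-ins (the real StateInformation layout is not shown), and System.IO.MemoryMappedFiles requires .NET Framework 4 or later. The point is that a single record can be read by index without loading the rest of the file.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

// Hypothetical fixed-size record; it must contain only value types.
[StructLayout(LayoutKind.Sequential)]
public struct StateRecord
{
    public int Id;
    public double Cost;
}

public static class MappedStates
{
    private static readonly int RecordSize = Marshal.SizeOf(typeof(StateRecord));

    // Creates the file and writes every record at offset index * RecordSize.
    public static void WriteAll(string path, StateRecord[] records)
    {
        long capacity = (long)records.Length * RecordSize; // must be greater than zero
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, capacity))
        using (var view = mmf.CreateViewAccessor())
        {
            for (int i = 0; i < records.Length; i++)
                view.Write((long)i * RecordSize, ref records[i]);
        }
    }

    // Maps only the bytes of one record and reads it back.
    // Real code would likely keep the mapping open instead of reopening per call.
    public static StateRecord ReadAt(string path, long index)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(index * RecordSize, RecordSize))
        {
            StateRecord record;
            view.Read(0, out record);
            return record;
        }
    }
}
```

Mapping a small window per lookup keeps address-space use low, which matters on a 32-bit system where a 1.1 GB file may not fit in one contiguous view.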