Question:
I need to work with very large arrays of small types (int or float arrays). I'm targeting x64 only, on machines with a lot of RAM; physical memory is never the issue in my scenarios. While looking at the documentation for gcAllowVeryLargeObjects I noticed this point:
• The maximum index in any single dimension is 2,147,483,591 (0x7FFFFFC7) for byte arrays and arrays of single-byte structures, and 2,146,435,071 (0X7FEFFFFF) for other types.
Now my issue is that I actually "need" to work with arrays larger than that. What would be the appropriate workaround here? Creating arrays of arrays, or other abstractions?
Knowing that I mostly need to access those arrays sequentially (never random reads, but often different segments being read sequentially by different threads, potentially 100+ threads at once), what would my best bet be?
I may need to hold arrays of up to 65 536 000 000 elements or more.
Answer 1:
If you really must break the array length limit then you'll have to split the array into chunks of a suitable size. You can wrap those chunks together in a container that has the appropriate semantics, like the BigArrayOfLong object that James McCaffrey blogged about a while back. There are numerous others like it.
The basic idea is you use a jagged array to allocate the space you're going to use. Note that a multi-dimensional array won't give you any advantage since it is still a single object, while a jagged array is a smaller array of arrays, each of which is its own object in (probably not contiguous) memory.
Here's a very simple (and not particularly optimal) implementation:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

public class HugeArray<T> : IEnumerable<T>
    where T : struct
{
    // Number of elements per chunk, scaled by the element size so each chunk
    // stays well below the 2 GB per-object limit.
    public static int arysize = (Int32.MaxValue >> 4) / Marshal.SizeOf<T>();

    public readonly long Capacity;
    private readonly T[][] content;

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= Capacity)
                throw new IndexOutOfRangeException();
            int chunk = (int)(index / arysize);
            int offset = (int)(index % arysize);
            return content[chunk][offset];
        }
        set
        {
            if (index < 0 || index >= Capacity)
                throw new IndexOutOfRangeException();
            int chunk = (int)(index / arysize);
            int offset = (int)(index % arysize);
            content[chunk][offset] = value;
        }
    }

    public HugeArray(long capacity)
    {
        Capacity = capacity;
        int nChunks = (int)(capacity / arysize);
        int nRemainder = (int)(capacity % arysize);
        // Allocate a jagged array: full-size chunks first, plus a smaller
        // final chunk for any remainder.
        if (nRemainder == 0)
            content = new T[nChunks][];
        else
            content = new T[nChunks + 1][];
        for (int i = 0; i < nChunks; i++)
            content[i] = new T[arysize];
        if (nRemainder > 0)
            content[content.Length - 1] = new T[nRemainder];
    }

    public IEnumerator<T> GetEnumerator()
    {
        // Walk the chunks in order so the whole structure enumerates as one sequence.
        return content.SelectMany(c => c).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
This one is statically allocated, but it's not too hard to make one that grows to fit demand. Just make sure that the block size you specify isn't completely out of range. I've gone with a calculation based on the item size just in case.
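A quick usage sketch of the class above (the element count is the figure from the question, about 244 GiB of int data, so it is only realistic on the kind of machine described there; with this chunk size each inner array stays far below the 2 GB per-object limit):
// Sketch only: allocates roughly 244 GiB of managed memory.
var huge = new HugeArray<int>(65536000000L);
huge[50000000000L] = 42;                  // 64-bit index, split into chunk/offset internally
Console.WriteLine(huge[50000000000L]);    // prints 42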
Answer 2:
You could just avoid using real arrays and simulate them via a stream.
If you want it seekable (which you do), you're limited to long offsets (2^63, since Stream positions are signed 64-bit).
Then you simply seek to index * n bytes and read n bytes.
If you use 4-byte elements (int32 or float, n = 4), that leaves room for roughly 2.3e+18 positions.
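A minimal sketch of that idea over a flat file of 4-byte values, using FileStream and BinaryReader (the path and element type are placeholders; each of the question's reader threads could open its own stream over the same file and read its segment sequentially):
using System.IO;

// Read element i of a flat binary file of ints by seeking to i * 4 bytes.
static int ReadInt32At(string path, long index)
{
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new BinaryReader(stream))
    {
        // Seek takes a long offset, so indexes beyond 2^31 are fine.
        stream.Seek(index * sizeof(int), SeekOrigin.Begin);
        return reader.ReadInt32();
    }
}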
Answer 3:
This sounds like a problem for distributed computing, something like Google MapReduce.
When it gets too big for your current infrastructure, scale it out to more boxes.
Answer 4:
Well, I am pretty sure you can't have an array with 6,500,000,000 elements, because usually that is bigger than the computer's memory (no operating system will give a piece of software that much memory), and probably for other reasons as well.
If for some reason you believe you can get that much RAM, but you think an array is too small, you can try working with objects that are based on a linked list (like a stack, or even a linked list itself).
A linked list is not limited by an index range (as long as it fits in your RAM).
Answer 5:
I'm writing this as a solution, but I'm hoping someone will have something better to offer that I can mark as the accepted answer.
Since the limit is on a single dimension of an array, one solution would be to use a multidimensional array and simply index into it by calculating positions as if it were a 1D array:
//pseudocode: sizeOfArrays is the length of each dimension
long index = someLargeNumber;
int index1 = (int)(index / sizeOfArrays);
int index2 = (int)(index % sizeOfArrays);
var data = myVeryLargeMultidimensionalArray[index1, index2];
Answer 6:
My suggestion is to use native code (i.e. C++ on x64), as C# is just not well suited to looping over that many elements. Think carefully about what information you need to extract from that data before attempting to load that much of it into RAM.
Answer 7:
Sounds to me like you should be using a stream. A MemoryStream should be fine as long as you dispose of chunks after you've read them.
My guess is that whatever populates your array is running way faster than whatever consumes it? If that is the case, you could use the stream simply as a buffer. When the buffer reaches a critical mass, block new entries whilst you clear the backlog. It sounds like you've got enough memory for that not to be an issue.
Your buffer contents can be handed off in chunks to the parallel library, with an index being maintained so you know the current position.
Pseudo-code being:
- receive new item and add to mega memory stream (memory will page out to the pagefile here, so RAM is even less of an issue if you've also got crazy amounts of disk!)
TASK THREAD (replicated for each algorithm):
- while buffer has items
  - read object from buffer
  - process object
If you want to leverage parallel processing within each task, stream a block of objects first and pass them to your method as a collection alongside a starting index so you can still deduce the current item index.
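A minimal sketch of that buffering pattern using a bounded BlockingCollection<T> (my choice of type here, the answer doesn't name one; the capacity, element type and producer loop are placeholders):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded buffer: Add blocks once the capacity is reached, which gives the
// "block new entries while you clear the backlog" behaviour described above.
var buffer = new BlockingCollection<int>(boundedCapacity: 1000000);

// Producer: whatever populates the data (placeholder loop).
var producer = Task.Run(() =>
{
    for (int i = 0; i < 100000000; i++)
        buffer.Add(i);                 // blocks while the buffer is full
    buffer.CompleteAdding();
});

// Consumer task (replicate one per algorithm, as in the pseudo-code above).
var consumer = Task.Run(() =>
{
    foreach (var item in buffer.GetConsumingEnumerable())
    {
        // process item
    }
});

Task.WaitAll(producer, consumer);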