How to implement a single byte based string?
The application uses a large list of words.
The words come from SQL and are varchar (single byte).
Each word also has an Int32 ID.
We download the words into a Dictionary<Int32, string> for performance.
The problem is the Dictionary gets so large that we get an out-of-memory exception.
We end up splitting up the data.
The app hits the list so much that hitting SQL for each request is not an option.
The database is already very active.
Dynamically paging into and out of the Dictionary is not an option - it is bound to a ListView and works great with virtualization.
Words are only loaded at night - the user just needs a static list.
They use the words to search and process other data but they don't process the words.
Since the data is single-byte chars, I thought I could just implement a single-byte-based word:
public class StringByte1252 : Object, IComparable, IComparable<StringByte1252>
{
    static Encoding win1252 = Encoding.GetEncoding("Windows-1252");

    public Int32 ID { get; private set; }
    public byte[] Bytes { get; private set; }
    public string Value { get { return win1252.GetString(Bytes); } }
    public Int32 Length { get { return Bytes.Length; } }

    public int CompareTo(object obj)
    {
        if (obj == null)
        {
            return 1;
        }
        StringByte1252 other = obj as StringByte1252;
        if (other == null)
        {
            throw new ArgumentException("A StringByte1252 object is required for comparison.", "obj");
        }
        return this.CompareTo(other);
    }

    public int CompareTo(StringByte1252 other)
    {
        if (object.ReferenceEquals(other, null))
        {
            return 1;
        }
        return string.Compare(this.Value, other.Value, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(Object obj)
    {
        // Check for null and compare run-time types.
        StringByte1252 item = obj as StringByte1252;
        if (item == null) return false;
        // Compare by ID so Equals stays consistent with GetHashCode
        // (comparing Bytes with == would only compare array references).
        return this.ID == item.ID;
    }

    public override int GetHashCode() { return ID; }

    public StringByte1252(Int32 id, byte[] bytes) { ID = id; Bytes = bytes; }
}
The above works, but it is NOT more memory efficient than Dictionary<Int32, string>.
The Dictionary with regular two-byte (UTF-16) characters actually uses slightly less memory.
Where did I go wrong?
Does a byte array take more space than the sum of the bytes?
Is there a way to achieve a single-byte string?
A byte array definitely takes as much or more space than the sum of its contents, because of the way an array is built. Arrays are made of blocks of a specific size. To be able to differentiate between the elements of the array there has to be a fixed size for the elements, so the items don't get confused. It is like saying you want to store decimal numbers up to 9999: to be able to store them "together" you would have to fill the gaps with leading zeros, like this: 1, 5, 32, 1293, 12 = 00010005003212930012.
A word is made of characters. To be able to represent a character you need to find the smallest number of possible characters to use, and from that define the basic construction unit of the array. There are 26 letters in the alphabet, 52 counting capitals and small ones, and with other symbols you still end up with fewer than 128 possibilities, which leads you to choosing 7 bits. Memory is made out of 8-bit blocks (bytes), so you should either settle for those and encode with ASCII, or find a way to manipulate the data to use only 7 bits for each character, saving a byte for every 8th character. I suppose there are open solutions for this, though I know of none.
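For what it's worth, a rough sketch of what that 7-bit packing could look like (my own illustration, not an existing library; it saves one byte per 8 characters at the cost of bit manipulation):

using System;
using System.Collections;
using System.Text;

static class SevenBitPacker
{
    // Packs a string of 7-bit characters into ceil(n * 7 / 8) bytes.
    public static byte[] Pack(string s)
    {
        var bits = new BitArray(s.Length * 7);
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] > 127) throw new ArgumentException("Only 7-bit characters are supported.");
            for (int b = 0; b < 7; b++)
                bits[i * 7 + b] = (s[i] & (1 << b)) != 0;
        }
        var bytes = new byte[(bits.Length + 7) / 8];
        bits.CopyTo(bytes, 0);
        return bytes;
    }

    // Unpacks charCount characters back into a string.
    public static string Unpack(byte[] bytes, int charCount)
    {
        var bits = new BitArray(bytes);
        var sb = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++)
        {
            int c = 0;
            for (int b = 0; b < 7; b++)
                if (bits[i * 7 + b]) c |= 1 << b;
            sb.Append((char)c);
        }
        return sb.ToString();
    }
}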
A string is simply an array of characters. In C, strings are essentially byte arrays. Try using a byte[] array.
As I am running Linux at the moment I find it hard to test, but you could also perhaps try and use:
Even though a char is twice the size of a byte, you only get a substantial difference in memory footprint if the strings are quite long.
Memory is allocated in blocks that can be, for example, 16 bytes (this varies between platforms and implementations). That means a string that is one character long may take up as much memory as a string that is six characters long, because both need two memory blocks to hold the character data plus the overhead of the string object.
With the overhead of the reference in the dictionary, the overhead in the string object, and the overhead of partly unused memory blocks, you need about 16 characters per string on average before you get below 50% overhead.
With that much overhead, it's hard to reduce the memory footprint by only reducing the size of the data.
You might look for a solution where you have less overhead per item, for example one giant string (or byte array) for the character data, with a starting index for each word inside that big string.
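A minimal sketch of that last idea (my own illustration, assuming all words are concatenated into one big string and each word is addressed by its start index, the next start marking its end):

using System.Collections.Generic;
using System.Text;

class PackedWords
{
    readonly string _data;    // all words concatenated, no separators
    readonly int[] _starts;   // _starts[i] = where word i begins in _data

    public PackedWords(IList<string> words)
    {
        var sb = new StringBuilder();
        _starts = new int[words.Count + 1];
        for (int i = 0; i < words.Count; i++)
        {
            _starts[i] = sb.Length;
            sb.Append(words[i]);
        }
        _starts[words.Count] = sb.Length;   // sentinel marking the end
        _data = sb.ToString();
    }

    // Word i is the slice between its start index and the next start index.
    public string this[int i]
    {
        get { return _data.Substring(_starts[i], _starts[i + 1] - _starts[i]); }
    }
}

The characters are still two bytes each here, but the per-string object overhead disappears.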
An array has approximately 50 bytes of overhead in the 64-bit runtime. In the 32-bit runtime, it's a little less: perhaps 40 bytes. There's the standard .NET allocation overhead (24 bytes in the 64-bit runtime), and then there's all the metadata for the array: number of dimensions, length, etc. You can't save memory by using individual byte arrays to store short strings.
One way is to allocate a very large array of bytes and store the strings in that array, UTF-8 encoded. Your dictionary becomes a Dictionary<int, int>, with the Value being an index into the array. I showed how to do this in my article Reducing Memory Required for Strings. I was able to save about 50% over normal string allocation that way. See the article for more detail.
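A rough sketch of the general idea (my own simplification, not the code from the article), with each word stored as a 1-byte length prefix followed by its UTF-8 bytes, and a Dictionary<int, int> mapping word ID to the entry's offset:

using System;
using System.Collections.Generic;
using System.Text;

class Utf8Pool
{
    byte[] _pool = new byte[1 << 20];   // grows as needed
    int _used;
    readonly Dictionary<int, int> _offsets = new Dictionary<int, int>();

    public void Add(int id, string word)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(word);
        if (bytes.Length > byte.MaxValue)
            throw new ArgumentException("Word too long for a 1-byte length prefix.");
        if (_used + bytes.Length + 1 > _pool.Length)
            Array.Resize(ref _pool, _pool.Length * 2);
        _offsets[id] = _used;
        _pool[_used++] = (byte)bytes.Length;               // 1-byte length prefix
        Buffer.BlockCopy(bytes, 0, _pool, _used, bytes.Length);
        _used += bytes.Length;
    }

    public string Get(int id)
    {
        int offset = _offsets[id];
        int length = _pool[offset];
        return Encoding.UTF8.GetString(_pool, offset + 1, length);
    }
}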
Another problem is that Dictionary overhead is something like 24 bytes per entry. That's pretty expensive if you have a whole bunch of small objects. You might consider instead making a list of structures, sorting it by ID, and using binary search. It's not the O(1) access that Dictionary gives you, but for a user interface it could be plenty fast enough. Your overhead then would be 8 bytes per entry. The struct would be something like:
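using System;

// A minimal guess at the struct described above: the word ID paired with the
// offset of the word's bytes in the shared pool (8 bytes per entry).
struct WordEntry : IComparable<WordEntry>
{
    public int Id;       // 4 bytes: the word ID
    public int Offset;   // 4 bytes: index of the word's bytes in the pool
    public int CompareTo(WordEntry other) { return Id.CompareTo(other.Id); }
}

(This struct is my reading of the description, not code from the answer. With the List<WordEntry> sorted by Id, List<T>.BinarySearch(new WordEntry { Id = id }) finds an entry in O(log n).)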
I came to the conclusion that a simple struct is the best memory performer for storing chars as single bytes.
These are my assumptions:
Memory is allocated 4 bytes at a time.
Bytes at the end of a single variable or an array that do not reach a 4-byte boundary are wasted.
A byte pool eliminates the wasted bytes on individual words, but at the cost of an index to the word's starting point in the pool plus the word length.
Even Dictionary<Int32, string> has waste.
Any odd-length string wastes 2 bytes (16 bits).
Consider the Int32 word index.
A full Int32 index wastes 1 byte, because only 24 bits of index are ever needed (see the count below).
In the case of the pool, the pool offset is only an Int32, so the pool clearly cannot hold 2^32 words anyway.
And a single object in .NET cannot exceed the 4 GB that 32 bits can address in any case.
The best case for a word plus its index is 8 bytes.
32 bits (4 GB) - 8 bits = 24 bits,
so the maximum word-plus-index count is 2^24 = 16,777,216.
The struct uses that one spare byte of the index to hold one char of the word.
A byte is a byte. The struct does not have to store the length, because it can derive the length from the content.
How to pack:
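A minimal sketch of how the packing might look, assuming the low 24 bits of the Int32 hold the word index, the high byte holds the first character, and the remaining Windows-1252 bytes live in a byte array whose length gives the rest of the word:

using System;
using System.Text;

struct Word1252bytes
{
    static readonly Encoding Win1252 = Encoding.GetEncoding("Windows-1252");

    // Low 24 bits: word index (max 2^24 = 16,777,216 words).
    // High 8 bits: the first character of the word.
    readonly int _indexAndFirstChar;
    // The remaining characters, one Windows-1252 byte each.
    // The word length is derived from the content: 1 + _rest.Length.
    readonly byte[] _rest;

    // Assumes a non-empty word (at least one character).
    public Word1252bytes(int index, string word)
    {
        byte[] bytes = Win1252.GetBytes(word);
        _indexAndFirstChar = (index & 0x00FFFFFF) | (bytes[0] << 24);
        _rest = new byte[bytes.Length - 1];
        Buffer.BlockCopy(bytes, 1, _rest, 0, _rest.Length);
    }

    public int Index { get { return _indexAndFirstChar & 0x00FFFFFF; } }

    public string Value
    {
        get
        {
            var bytes = new byte[_rest.Length + 1];
            bytes[0] = (byte)(_indexAndFirstChar >> 24);
            Buffer.BlockCopy(_rest, 0, bytes, 1, _rest.Length);
            return Win1252.GetString(bytes);
        }
    }
}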
It hydrates fast. My test case is 6 million words, and the Word1252bytes list builds just as fast as the Dictionary (about 20 seconds).
Comparison of size:
With a byte pool, a byte of the word index can be used for the word length.
This limits the word length to 256 but saves space in the string pool.
The above struct wins or ties at every word size under the stated assumptions. Index is the size of the index in bits: for the Dictionary and the above struct it is 32;
for the byte pool it is 64 (the word index plus the pool offset).
Waste is the bytes that do not fall on a 4-byte boundary.
Where the byte pool could compress further is by finding strings that are already present in the pool and reusing them. On small strings the pool is at roughly a 30% disadvantage, so it would need a high match rate. On larger strings the chance of finding a match is low. The problem there is search time: on a list of over 1,000,000 words, even a 10 ms search per word is 10,000 seconds = 2.78 hours.