How to implement string with 1 byte char (and save

2019-09-06 15:18发布

How to implement a single byte based string?

Application uses a large list of words.
Words come from SQL and is varchar (single byte).
Each word also has in Int32 ID.
Download the words to:

Dictionionary<Int32,string> 

for performance.

Problem is the Dictionary gets so large that will get an out of memory exception.
We end up splitting up the data.
The app hits the list so much that hitting SQL for each request is not an option.
The database is already very active.
Dynamically paging into and out of the Dictionary is not an option - it is bound to ListView and with virtualiztion works great.
Words are only loaded at night - the user just needs a static list.
They use the words to search and process other data but they don't process the words.

Since it is char thought could just implement a single byte based word:

public class StringByte1252 : Object, IComparable, IComparable<StringByte1252>
{
    static Encoding win1252 = Encoding.GetEncoding("Windows-1252");

    public Int32 ID { get; private set; }
    public byte[] Bytes { get; private set; }

    public string Value { get { return win1252.GetString(Bytes); } }
    public Int32 Length { get { return Bytes.Length; } }

    public int CompareTo(object obj)
    {
        if (obj == null)
        {
            return 1;
        }
        StringByte1252 other = obj as StringByte1252;
        if (other == null)
        {
            throw new ArgumentException("A StringByte1252 object is required for comparison.", "obj");
        }
        return this.CompareTo(other);
    }
    public int CompareTo(StringByte1252 other)
    {
        if (object.ReferenceEquals(other, null))
        {
            return 1;
        }
        return string.Compare(this.Value, other.Value, StringComparison.OrdinalIgnoreCase);
    }
    public override bool Equals(Object obj)
    {
        //Check for null and compare run-time types.
        if (obj == null || !(obj is StringByte1252)) return false;
        StringByte1252 item = (StringByte1252)obj;
        return (this.Bytes == item.Bytes);
    }
    public override int GetHashCode() { return ID; }

    public StringByte1252(Int32 id, byte[] bytes) { ID = id; Bytes = bytes; } 
}

This above works but it is NOT more memory efficient than

Dictionionary<Int32,string>

Dictionary with Int16 based characters actually uses slightly less memory.

Where did I go wrong?
Does a byte array take more space than the sum of the bytes?
Is there a way to achieve single byte string?

4条回答
神经病院院长
2楼-- · 2019-09-06 15:52

A byte array definitly takes more/as much space as the sum of its contents, because of the way an array is built. Arrays are made of blocks of specific size. Do be able to differentiate between different elements of the array, there has to be a fixed size for the elements, so digits don't get confused. It is like saying you want to store decimal numbers up to 9999, so to be able to store them "together", you would have to fill the gaps with leading zero as such:1,5,32,1293,12 = 00010005003212930012. A word is made of characters. To be able to represent a character you need to find the smallest number of possible characters to use, and from that to define the basic construction unit of the array. As there are 26 characters in the alphabet, capitals and small ones are 52 together, and with other symbols you might get to less than 128 posibillities, leading you to choosing 7bits. The memory is made out of 8bit blocks - bytes, so you should either settle with these and encode with ASCII or find a way to manipulate the data to use only 7bits for each character, and by every 8th character save a byte. I suppose there are open solutions to it, though I know none.

A string is simply an array of characters. In c, strings are byte[]. Try using a byte[] array.

As I am running linux at the moment I find it hard to test, but you could also perhaps try and use:

byte[] bytes = Encoding.ASCII.GetBytes("Hello World!");
查看更多
聊天终结者
3楼-- · 2019-09-06 15:53

Eventhough a char is twice the size of a byte, you would only get a substantial difference in memory footprint if the strings are quite long.

Memory is allocated in blocks that can be for example 16 bytes (may vary between platforms and implementations). That means that a string that is one character long may take up as much memory as a string that is six characters long, because both needs two memory blocks to hold the character data and the overhead of the string object.

With the overhead of the reference in the dictionary, the overhead in the string object and the overhead of partly unused memory blocks, you need to put about 16 characters in the string by average before you have less than 50% overhead.

With that much overhead, it's hard to reduce the memory footprint by only reducing the size of the data.

You might look for a solution where you have less overhead for each item, like for example one giant string (or byte array) for the character data, and specifying the starting index for each string inside that big string.

查看更多
欢心
4楼-- · 2019-09-06 15:57

An array has approximately 50 bytes of overhead in the 64-bit runtime. In the 32-bit runtime, it's a little less: perhaps 40 bytes. There's the standard .NET allocation overhead (24 bytes in the 64-bit runtime), and then there's all the metadata for the array: number of dimensions, length, etc. You can't save memory by using individual byte arrays to store short strings.

One way is to allocate a very large array of bytes and store the strings in that array, UTF-8 encoded. Your dictionary becomes a Dictionary<int,int>, with the Value being an index into the array.

I showed how to do this in my article Reducing Memory Required for Strings. I was able to save about 50% over normal string allocation that way. See the article for more detail.

Another problem is that Dictionary overhead is something like 24 bytes per entry. That's pretty expensive if you have a whole bunch of small objects. You might consider instead making list of structures, sorting it by ID, and using binary search. It's not the O(1) access that Dictionary gives you, but for user interface it could be plenty fast enough. Your overhead then would be 8 bytes per entry.

The struct would be something like:

struct WordEntry
{
    public readonly int Id;
    public readonly int IndexIntoStringTable;
}
查看更多
相关推荐>>
5楼-- · 2019-09-06 15:58

Came to the conclusion a simple struct is the best memory performer for storing char as single byte.

This is my assumption.
Memory is allocated 4 bytes at a time.
Bytes at the end of single variable or an array not on a 4 byte boundary are wasted.

A byte pool eliminates wasted bytes on individual words but at the cost of an index to the starting point of the pool and word length.

Assuming even Dictionionary Int32, string has waste.
Any odd length string will waste 16 bytes.

Consider the Int32 word index.
Int32 is a waste of 1 byte.
In the case of the pool the pool index is only Int32 so clearly it cannot hold Int32 words.
In the case of object size in .NET there is 4 GB limit.
Best case of word plus index is 8 bytes.
32(4GB) - 8 = 24
Max word plus index count is 2 ** 24 = 16,777,216.

This struct uses one byte from the index as one char.
A byte is a byte. The struct does not have to store the length as it can derive the length from the content.

public struct Word1252bytes 
{
    static Encoding win1252 = Encoding.GetEncoding("Windows-1252");
    private UInt64 packed;
    private byte[] bytes;
    public Int32 Key { get { return (Int32)(packed & ((1 << 24) - 1)); } }
    private byte[] Bytes
    {
        get
        {
            // yes a lot of work to salage just one byte out of the key 
            // but a byte is a byte and the design objective is size
            byte[] bytesT = new byte[Length];
            bytesT[0] = (byte)((packed >> 24) & ((1 << 8) - 1));
            for (int i = 0; i < bytes.Length; i++) bytesT[i + 1] = bytes[i];
            return bytesT;
        }
    }
    public Int32 Length { get { return bytes.Length + 1; } }
    public String Value
    {
        get
        {
            return win1252.GetString(Bytes);
        }
    }
    public Word1252bytes(UInt64 Packed, byte[] Bytes) 
    { 
        packed = Packed;
        bytes = Bytes;
    }
}

How to pack

pack32 = (UInt32)(bits24wordI) | ((UInt32)charB[0] << 24);
byte[] bytesT = new byte[bits8wLen - 1];
for (int i = 1; i < bits8wLen; i++) bytesT[i - 1] = charB[i];
iWordsList.Add(new Word1252bytes(pack32, bytesT));

It hydrates fast. My test case is 6 million words and Word1252bytes builds just as fast as dictionary (about 20 seconds)

Comparison of size.
With a byte pool can use a byte of the word index for word length.
This limits the word length to 256 but saves spaced in the string pool.

The above struct win or ties at every words size with the stated assumptions. Index is the size of the index. For Dict and the above struct it is 32.
For byte pool it is 64 - the word index and the pool index.
Waste is bytes not on a 4 byte boundary.

        Index   Content Waste   Total
exactly 1               
dict16  dict16  32  16      16  64
word8   pool    64  8           72
word8   imbed   32  0       0   32

exactly 2               
dict16  dict16  32  32      0   64
word8   pool    64  16          80
word8   imbed   32  8       24  64

exactly 3               
dict16  dict16  32  48      16  96
word8   pool    64  24          88
word8   imbed   32  16      16  64

exactly 4               
dict16  dict16  32  64      0   96
word8   pool    64  32          96
word8   imbed   32  24      8   64

exactly 5               
dict16  dict16  32  80      16  128
word8   pool    64  40          104
word8   imbed   32  32  0   64

exactly 6               
dict 16 dict16  32  96      0   128
word 8  pool    64  48          112
word 8  imbed   32  40      24  96

exactly 7               
dict 16 dict16  32  112     16  160
word 8  pool    64  56          120
word 8  imbed   32  48      16  96

exactly 8               
dict 16 dict16  32  128     0   160
word 8  pool    64  64          128
word 8  imbed   32  56      8   96

exactly 9               
dict 16 dict16  32  144     16  192
word 8  pool    64  72          136
word 8  imbed   32  64      0   96

exactly 10              
dict 16 dict16  32  160      0  192
word 8  pool    64  80          144
word 8  imbed   32  72       24 128

exactly 254             
dict 16 dict16  32  4064    0   4096
word 8  pool    64  2032        2096
word 8  imbed   32  2024    24  2080

exactly 255             
dict 16 dict16  32  4080    16  4128
word 8  pool    64  2040        2104
word 8  imbed   32  2032    16  2080

exactly 256             
dict 16 dict16  32  4096    0   4128
word 8  pool    64  2048        2112
word 8  imbed   32  2040    8   2080

exactly 257             
dict 16 dict16  32  4112    16  4160
word 8  pool    64  2056        2120
word 8  imbed   32  2048    0   2080

Where the byte pool could compress is searching for the string already present in the pool. On small strings the pool is at a 30% disadvantage so it would need to have a high rate of matching. On larger strings the chance of finding a match is low. Problem there is search time. On list over 1,000,000 even a 10 ms search time is 10,000 seconds = 2.78 hours.

查看更多
登录 后发表回答