I want to convert a vector of signed char into a vector of unsigned char.
I want to preserve the value range for each type.
I mean that the value range of a signed char element is -128 to +127, while the value range of an unsigned char element is 0 to 255.
Without intrinsics I can do it like this:
#include <iostream>
int main(int argc,char* argv[])
{
typedef signed char schar;
typedef unsigned char uchar;
schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};
uchar b[32] = {0};
for(int i=0;i<32;i++)
b[i] = 0xFF & ~(0x7F ^ a[i]);
return 0;
}
So using AVX2 I wrote the following program:
#include <immintrin.h>
#include <iostream>
int main(int argc,char* argv[])
{
typedef signed char schar;
typedef unsigned char uchar;
schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};
uchar b[32] = {0};
__m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
__m256i _b;
__m256i _cst1 = _mm256_set1_epi8(0x7F);
__m256i _cst2 = _mm256_set1_epi8(0xFF);
_a = _mm256_xor_si256(_a,_cst1);
_a = _mm256_andnot_si256(_cst2,_a);
// The way I do the conversion is inspired by an algorithm from OpenCV.
// Conversion from epi8 -> epi16
_b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
_a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);
// convert from epi16 -> epu8.
_b = _mm256_packus_epi16(_b,_a);
_mm256_stream_si256(reinterpret_cast<__m256i*>(b),_b);
return 0;
}
When I display the variable b, it is completely empty (all zeros).
I also checked the following variations:
#include <immintrin.h>
#include <iostream>
int main(int argc,char* argv[])
{
typedef signed char schar;
typedef unsigned char uchar;
schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};
uchar b[32] = {0};
__m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
__m256i _b;
__m256i _cst1 = _mm256_set1_epi8(0x7F);
__m256i _cst2 = _mm256_set1_epi8(0xFF);
// The way I do the conversion is inspired by an algorithm from OpenCV.
// Conversion from epi8 -> epi16
_b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
_a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);
// convert from epi16 -> epu8.
_b = _mm256_packus_epi16(_b,_a);
_b = _mm256_xor_si256(_b,_cst1);
_b = _mm256_andnot_si256(_cst2,_b);
_mm256_stream_si256(reinterpret_cast<__m256i*>(b),_b);
return 0;
}
and :
#include <immintrin.h>
#include <iostream>
int main(int argc,char* argv[])
{
typedef signed char schar;
typedef unsigned char uchar;
schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};
uchar b[32] = {0};
__m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
__m256i _b;
__m256i _cst1 = _mm256_set1_epi8(0x7F);
__m256i _cst2 = _mm256_set1_epi8(0xFF);
// The way I do the conversion is inspired by an algorithm from OpenCV.
// Conversion from epi8 -> epi16
_b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
_a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);
_a = _mm256_xor_si256(_a,_cst1);
_a = _mm256_andnot_si256(_cst2,_a);
_b = _mm256_xor_si256(_b,_cst1);
_b = _mm256_andnot_si256(_cst2,_b);
_b = _mm256_packus_epi16(_b,_a);
_mm256_stream_si256(reinterpret_cast<__m256i*>(b),_b);
return 0;
}
My investigation shows that part of the issue is related to the andnot operation, but I can't figure out why.
The variable b should contain the following sequence :
[127, 126, 125, 132, 133, 134, 121, 120, 137, 138, 117, 140, 141, 142, 143, 144, 145, 0, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160].
Thanks in advance for any help.
Yeah, the "andnot" definitely looks sketchy. Since the _cst2 values are set to 0xFF, this operation will AND your _b vector with zero. I think you mixed up the order of arguments: _mm256_andnot_si256(x, y) computes (~x) & y, so it's the first argument that gets inverted. See the reference.
I don't understand the rest of the guff with conversions etc. either. You just need this:
__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_xor_si256( _a, _mm256_set1_epi8( 0x7f ) );
_b = _mm256_andnot_si256( _b, _mm256_set1_epi8( 0xff ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );
An alternative solution is to just add 128, but I'm not certain of the implications of overflow in this case:
__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_add_epi8( _a, _mm256_set1_epi8( 0x80 ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );
One final important thing is that your a and b arrays MUST have 32-byte alignment. If you are using C++11 you can use alignas:
alignas(32) signed char a[32] = { -1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,
-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32 };
alignas(32) unsigned char b[32] = {0};
Otherwise you will need to use non-aligned load and store instructions, i.e. _mm256_loadu_si256 and _mm256_storeu_si256. But those don't have the same non-temporal cache properties as the stream instructions.
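For example, the add-128 version from above with unaligned access might look like this (a sketch, with the same caveat about overflow as before):
__m256i _a, _b;
_a = _mm256_loadu_si256( reinterpret_cast<const __m256i*>(a) ); // no alignment requirement
_b = _mm256_add_epi8( _a, _mm256_set1_epi8( 0x80 ) );           // shift each byte's range by 128
_mm256_storeu_si256( reinterpret_cast<__m256i*>(b), _b );       // plain (cached) store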
You're just talking about adding 128 to each byte, right? That shifts the range from [-128..127] to [0..255]. The trick for adding 128 when you can only use 8-bit operands is to subtract -128. However, adding 0x80 works as well when the result is truncated to 8 bits (because of two's complement). Adding is good because it doesn't matter which order the operands are in, so the compiler can use a load-and-add instruction (folding the memory operand into the load).
Adding/subtracting -128, with the carry/borrow stopped by the element boundary, is equivalent to xor (aka carryless add). Using pxor could be a small advantage on Intel Core2 through Broadwell, since Intel must have thought it was worth it to add paddb/w/d/q hardware on port 0 for Skylake (giving them one per 0.333c throughput, like pxor). (Thanks @harold for pointing this out.) Both instructions only require SSE2.
XOR is also potentially useful for SWAR unaligned cleanup, or for SIMD architectures that don't have a byte-size add/subtract operation.
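As a sketch, here's the XOR form of the whole conversion (the add form appears below); loadu/storeu are used so alignment doesn't matter:
__m256i signed_bytes   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
// Flipping the sign bit with XOR is exactly add/subtract -128 with the carry
// confined to each byte: it remaps [-128..127] onto [0..255].
__m256i unsigned_bytes = _mm256_xor_si256(signed_bytes, _mm256_set1_epi8(-128));
_mm256_storeu_si256(reinterpret_cast<__m256i*>(b), unsigned_bytes);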
You shouldn't use _a for your variable name. _ names are reserved. I tend to use names like veca or va, and preferably something more descriptive for temporaries (like a_unpacked).
__m256i signed_bytes = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
__m256i unsigned_bytes = _mm256_add_epi8(signed_bytes, _mm256_set1_epi8(-128));
Yes, it's that simple; you don't need two's-complement bithacks. For one thing, your way needs two separate 32B masks, which increases your cache footprint. (But see What are the best instruction sequences to generate vector constants on the fly? You (or the compiler) could generate the vector of -128 bytes using 3 instructions, or with a broadcast-load from a 4B constant.)
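For instance, here is one plausible 3-instruction sequence for materializing set1_epi8(-128) without a 32B memory constant (my sketch of the idea, not necessarily the optimal sequence):
__m256i t     = _mm256_setzero_si256();
__m256i ones  = _mm256_cmpeq_epi16(t, t);          // all bits set; compilers emit vpcmpeq same,same
__m256i words = _mm256_slli_epi16(ones, 15);       // 0x8000 in every 16-bit element
__m256i m128s = _mm256_packs_epi16(words, words);  // signed saturation: -32768 -> -128 in every byte
(The in-lane behaviour of vpacksswb doesn't matter here, since every element is the same.) The broadcast-load alternative is a vpbroadcastd or vpbroadcastb from a small constant, which compilers can emit for _mm256_set1_epi8(-128) on their own.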
Only use _mm256_stream_load_si256 for I/O (e.g. reading from video RAM). Don't use it for reading from "normal" (writeback) memory; it doesn't do what you think it does. (I don't think it has any particular downside, though; it just works like a normal vmovdqa load.) I put some links about that in another answer I wrote recently.
Streaming stores are useful for normal (writeback) memory regions, but only if you're not going to read that memory again any time soon. If you are going to read it again soon, you should probably do this conversion from signed to unsigned on the fly in the code that reads the data, because it's super-cheap. Just keep your data in one format or the other, and convert on the fly in code that needs it the other way. Only needing one copy of it in cache is a huge win compared to saving one instruction in some loops.
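For example (a sketch, assuming a and n are the signed source and its length, with n a multiple of 32; the byte sum via _mm256_sad_epu8 just stands in for whatever the consumer really does):
__m256i acc = _mm256_setzero_si256();
const __m256i flip = _mm256_set1_epi8(-128);
for (size_t i = 0; i < n; i += 32) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    v = _mm256_xor_si256(v, flip);   // signed -> unsigned on the fly: one cheap ALU op
    // consume the unsigned bytes immediately while they're in a register:
    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(v, _mm256_setzero_si256()));
}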
Also google "cache blocking" (aka loop tiling) and read about optimizing your code to work in small chunks to increase computational density. (Do as much stuff as possible with data while it's in cache.)
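A hypothetical shape for that, if you do keep a separate conversion pass (TILE, convert_signed_to_unsigned and process_unsigned are made-up names for illustration; std::min needs <algorithm>):
const size_t TILE = 16 * 1024;  // assumed tile size: tune so one tile fits in L1 or L2
for (size_t base = 0; base < n; base += TILE) {
    size_t len = std::min(TILE, n - base);
    convert_signed_to_unsigned(a + base, b + base, len);  // hypothetical pass 1 over the tile
    process_unsigned(b + base, len);                      // pass 2 runs while the tile is still hot in cache
}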