Fast bit shift of a byte array - CMAC subkeys

I need to implement as fast as possible left bit shift of a 16-byte array in JavaCard.

I tried this code:

private static final void rotateLeft(final byte[] output, final byte[] input) {
         short carry = 0;
         short i = (short) 16;
         do {
             --i;
             carry = (short)((input[i] << 1) | carry);
             output[i] = (byte)carry;
             carry = (short)((carry >> 8) & 1);
         } while (i > 0);
}

Any ideas how to improve the performace? I was thinking about some Util.getShort(...) and Util.setShort(...) magic, but I did not manage to make it work faster then the implementation above.

This is one part of CMAC subkeys computation and it is done quite often, unfortunately. In case you know some faster way to compute CMAC subkeys (both subkeys in one loop or something like that), please, let me know.

标签： performance cryptography bit-manipulation bit-shift javacard

3条回答

smile是对你的礼貌

2楼-- · 2019-03-27 15:44

When it comes for speed, known length, hard-coded version is the fastest (but ugly). If you need to shift more than one bit, ensure to update the code accordingly.

output[0] = (byte)((byte)(input[0] << 1) | (byte)((input[1] >> 7) & 1));
output[1] = (byte)((byte)(input[1] << 1) | (byte)((input[2] >> 7) & 1));
output[2] = (byte)((byte)(input[2] << 1) | (byte)((input[3] >> 7) & 1));
output[3] = (byte)((byte)(input[3] << 1) | (byte)((input[4] >> 7) & 1));
output[4] = (byte)((byte)(input[4] << 1) | (byte)((input[5] >> 7) & 1));
output[5] = (byte)((byte)(input[5] << 1) | (byte)((input[6] >> 7) & 1));
output[6] = (byte)((byte)(input[6] << 1) | (byte)((input[7] >> 7) & 1));
output[7] = (byte)((byte)(input[7] << 1) | (byte)((input[8] >> 7) & 1));
output[8] = (byte)((byte)(input[8] << 1) | (byte)((input[9] >> 7) & 1));
output[9] = (byte)((byte)(input[9] << 1) | (byte)((input[10] >> 7) & 1));
output[10] = (byte)((byte)(input[10] << 1) | (byte)((input[11] >> 7) & 1));
output[11] = (byte)((byte)(input[11] << 1) | (byte)((input[12] >> 7) & 1));
output[12] = (byte)((byte)(input[12] << 1) | (byte)((input[13] >> 7) & 1));
output[13] = (byte)((byte)(input[13] << 1) | (byte)((input[14] >> 7) & 1));
output[14] = (byte)((byte)(input[14] << 1) | (byte)((input[15] >> 7) & 1));
output[15] = (byte)(input[15] << 1);

And use RAM byte array!

0人赞添加讨论(0) 举报

混吃等死

3楼-- · 2019-03-27 15:50

It might help to cache CMAC subkeys when signing repeatedly using the same key (i.e. the same DESFire EV1 session key). The subkeys are always the same for the given key.

I think David's answer could be even faster if it used two local variables to cache the values read twice from the same offset of the input array (from my observations on JCOP, the array access is quite expensive even for transient arrays).

EDIT: I can provide the following implementation which does 4 bit right shift using short (32-bit int variant for cards supporting it would be even faster):

short pom=0; // X000 to be stored next
short pom2; // loaded value
short pom3; // 0XXX to be stored next
short curOffset=PARAMS_TRACK2_OFFSET;
while(curOffset<16) {
    pom2=Util.getShort(mem_PARAMS, curOffset);
    pom3=(short)(pom2>>>4);
    curOffset=Util.setShort(mem_RAM, curOffset, (short)(pom|pom3));
    pom=(short)(pom2<<12);
}

Beware, this code assumes same offsets in source and destination.

You can unroll this loop and use constant parameters if desired.

0人赞添加讨论(0) 举报

男人必须洒脱

4楼-- · 2019-03-27 16:05

This is the fastest algorithm to rotate arbitrary number of bits I could come up with (I rotate array of 8 bytes, you can easily transform it to shifting 16 instead):

Use EEPROM to create a masking table for your shifts. Mask is just increasing amounts of 1s from the right:

final static byte[] ROTL_MASK = {
    (byte) 0x00, //shift 0: 00000000 //this one is never used, we don't do shift 0.
    (byte) 0x01, //shift 1: 00000001
    (byte) 0x03, //shift 2: 00000011
    (byte) 0x07, //shift 3: 00000111
    (byte) 0x0F, //shift 4: 00001111
    (byte) 0x1F, //shift 5: 00011111
    (byte) 0x3F, //shift 6: 00111111
    (byte) 0x7F  //shift 7: 01111111
};

Then you first use Util.arrayCopyNonAtomic for quick swap of bytes if shift is larger than 8:

final static byte BITS = 8;
//swap whole bytes:
Util.arrayCopyNonAtomic(in, (short) (shift/BITS), out, (short) 0, (short) (8-(shift/BITS)));
Util.arrayCopyNonAtomic(in, (short) 0, out, (short) (8-(shift/BITS)), (short) (shift/BITS));
shift %= BITS; //now we need to shift only up to 8 remaining bits

if (shift > 0) {
    //apply masks
    byte mask = ROTL_MASK[shift];
    byte comp = (byte) (8 - shift);

    //rotate using masks
    out[8] = in[0]; // out[8] is any auxiliary variable, careful with bounds!
    out[0] = (byte)((byte)(in[0] << shift) | (byte)((in[1] >> comp) & mask));
    out[1] = (byte)((byte)(in[1] << shift) | (byte)((in[2] >> comp) & mask));
    out[2] = (byte)((byte)(in[2] << shift) | (byte)((in[3] >> comp) & mask));
    out[3] = (byte)((byte)(in[3] << shift) | (byte)((in[4] >> comp) & mask));
    out[4] = (byte)((byte)(in[4] << shift) | (byte)((in[5] >> comp) & mask));
    out[5] = (byte)((byte)(in[5] << shift) | (byte)((in[6] >> comp) & mask));
    out[6] = (byte)((byte)(in[6] << shift) | (byte)((in[7] >> comp) & mask));
    out[7] = (byte)((byte)(in[7] << shift) | (byte)((in[8] >> comp) & mask));
}

You can additionally remove mask variable and use direct reference to the table instead.

Using this rather than naive implementation of bit-wise rotation proved to be about 450% - 500% faster.

0人赞添加讨论(0) 举报

Fast bit shift of a byte array - CMAC subkeys

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间