16 is a power of 2. Unlike decimal (How do I print an integer in Assembly Level Programming without printf from the c library?) or other bases that aren't a power of 2, we don't need division, and we can extract the most-significant digit first instead of the least-significant digit and counting backwards.
Each 4-bit group of bits maps to one hex digit. We can use shifts or rotates, and AND masks, to extract each 4-bit chunk of the input as a 4-bit integer.
Unfortunately the 0..9 a..f hex digits are not contiguous in the ASCII character set (http://www.asciitable.com/). We either need conditional behaviour (a branch or cmov), or we can use a lookup table. A lookup table is typical the most efficient for instruction count and performance; modern CPUs have very fast L1d caches that make repeated loads of nearby bytes very cheap.
;; NASM syntax, i386 System V calling convention
global itohex
itohex: ; inputs: char* output, unsigned number
push edi ; save a call-preserved register for scratch space
mov edi, [esp+8] ; out pointer
mov eax, [esp+12] ; number
mov ecx, 8 ; 8 hex digits, fixed width zero-padded
.digit_loop: ; do {
rol eax, 4 ; rotate the high 4 bits to the bottom
mov edx, eax
and edx, 0x0f ; and isolate 4-bit integer in EDX
movzx edx, byte [hex_lut + edx]
mov [edi], dl ; copy a character from the lookup table
inc edi ; loop forward in the output buffer
dec ecx
jnz .digit_loop ; }while(--ecx)
pop edi
ret
section .rodata
hex_lut: db "0123456789abcdef"
Until BMI2 (shrx
/ rorx
), x86 lacks a copy-and-shift instruction, so rotating in-place and then copy/AND is hard to beat. Modern x86 (Intel and AMD) has 1-cycle latency for rotates (https://agner.org/optimize/), so this loop-carried dependency chain doesn't become a bottleneck. (There are too many instructions in the loop for it to run at even 1 cycle per iteration even on 5-wide Ryzen.)
Even if we optimized by using a cmp / jb
with an end pointer to enable cmp/jcc fusion on Ryzen, it's still 7 uops, more than the pipeline can handle in 1 cycle. dec/jcc
macro-fusion into a single uop only happens on Intel Sandybridge-family; AMD only fuses cmp or test with jcc. I used mov ecx,8
and dec/jnz for human readability; lea ecx, [edi+8]
and cmp edi, ecx / jb .digit_loop
is smaller overall, and more efficient on more CPUs.
Test program:
// hex.c converts argv[1] to integer and passes it to itohex
#include <stdio.h>
#include <stdlib.h>
void itohex(char buf[8], unsigned num);
int main(int argc, char**argv) {
unsigned num = strtoul(argv[1], NULL, 0); // allow any base
char buf[9] = {0};
itohex(buf, num); // writes the first 8 bytes of the buffer, leaving a 0-terminated C string
puts(buf);
}
compile with:
nasm -felf32 -g -Fdwarf itohex.asm
gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o
test runs:
$ ./a.out 12315
0000301b
$ ./a.out 12315123
00bbe9f3
$ ./a.out 999999999
3b9ac9ff
$ ./a.out 9999999999 # apparently glibc strtoul saturates on overflow
ffffffff
$ ./a.out 0x12345678 # strtoul with base=0 can parse hex input, too
12345678
Alternate implementations:
Conditional instead of lookup-table: takes several more instructions, and will typically be slower. But it doesn't need any static data. It could be done with branching instead of cmov
, but that would be even slower most of the time. (It won't predict well, assuming a random mix of 0..9 and a..f digits.)
Just for fun, this version starts at the end of the buffer and decrements a pointer. (And the loop condition uses a pointer-compare.) You could have it stop once EDX becomes zero, and use EDI+1 as the start of the number, if you don't want leading zeros.
Using a cmp eax,9
/ ja
instead of cmov
is left as an exercise for the reader. A 16-bit version of this could use different registers (like maybe BX as a temporary) to still allow lea cx, [bx + 'a'-10]
copy-and-add. Or just add
/cmp
and jcc
, if you want to avoid cmov
for compat with ancient CPUs that don't support P6 extensions.
;; NASM syntax, i386 System V calling convention
itohex: ; inputs: char* output, unsigned number
itohex_conditional:
push edi ; save a call-preserved register for scratch space
push ebx
mov edx, [esp+16] ; number
mov ebx, [esp+12] ; out pointer
lea edi, [ebx + 7] ; First output digit will be written at buf+7, then we count backwards
.digit_loop: ; do {
mov eax, edx
and eax, 0x0f ; isolate the low 4 bits in EAX
lea ecx, [eax + 'a'-10] ; possible a..f value
add eax, '0' ; possible 0..9 value
cmp ecx, 'a'
cmovae eax, ecx ; use the a..f value if it's in range.
; for better ILP, another scratch register would let us compare before 2x LEA,
; instead of having the compare depend on an LEA or ADD result.
mov [edi], al ; *ptr-- = c;
dec edi
shr edx, 4
cmp edi, ebx ; alternative: jnz on flags from EDX to not write leading zeros.
jae .digit_loop ; }while(ptr >= buf)
pop ebx
pop edi
ret
Check for off-by-1 errors by using a number that has both 9
and a
in its hex digits:
$ nasm -felf32 -g -Fdwarf itohex.asm && gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o && ./a.out 0x19a2d0fb
19a2d0fb
SIMD with SSE2, SSSE3, and AVX512
Most of these SIMD versions could be used with two packed 32-bit integers as input, with the low and high 8 bytes of the result vector containing separate results that you can store separately with movq
and movhps
.
Depending on your shuffle control, this is exactly like using it for one 64-bit integer.
SSSE3 pshufb
parallel lookup table. No need to mess around with loops, we can do this with a few SIMD operations, on CPUs that have pshufb
. (SSSE3 is not baseline even for x86-64; it was new with Intel Core2 and AMD Bulldozer).
pshufb
is a byte shuffle that's controlled by a vector, not an immediate (unlike all earlier SSE1/SSE2/SSE3 shuffles). With a fixed destination and a variable shuffle-control, we can use it as a parallel lookup table to do 16x lookups in parallel (from a 16 entry table of bytes in a vector).
So we load the whole integer into a vector register, and unpack its nibbles to bytes with a bit-shift and punpcklbw
. Then use a pshufb
to map those nibbles to hex digits.
That leaves us with the ASCII digits an XMM register with the least significant digit as the lowest byte of the register. Since x86 is little-endian, there's no free way to store them to memory in the opposite order, with the MSB first.
We can use an extra pshufb
to reorder the ASCII bytes into printing order, or use bswap
on the input in an integer register (and reverse the nibble -> byte unpacking). If the integer is coming from memory, going through an integer register for bswap
kinda sucks (especially for AMD Bulldozer-family), but if you have the integer in a GP register in the first place it's pretty good.
;; NASM syntax, i386 System V calling convention
section .rodata
hex_lut: db "0123456789abcdef"
low_nibble_mask: times 16 db 0x0f
reverse_8B: db 7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8
;reverse_16B: db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
global itohex_ssse3 ; tested, works
itohex_ssse3:
mov eax, [esp+4] ; out pointer
movd xmm1, [esp+8] ; number
movdqa xmm0, xmm1
psrld xmm1, 4 ; right shift: high nibble -> low (with garbage shifted in)
punpcklbw xmm0, xmm1 ; interleave low/high nibbles of each byte into a pair of bytes
pand xmm0, [low_nibble_mask] ; zero the high 4 bits of each byte (for pshufb)
; unpacked to 8 bytes, each holding a 4-bit integer
movdqa xmm1, [hex_lut]
pshufb xmm1, xmm0 ; select bytes from the LUT based on the low nibble of each byte in xmm0
pshufb xmm1, [reverse_8B] ; printing order is MSB-first
movq [eax], xmm1 ; store 8 bytes of ASCII characters
ret
;; The same function for 64-bit integers would be identical with a movq load and a movdqu store.
;; but you'd need reverse_16B instead of reverse_8B to reverse the whole reg instead of each 8B half
It's possible to pack the AND mask and the pshufb control into one 16-byte vector, similar to itohex_AVX512F
below.
AND_shuffle_mask: times 8 db 0x0f ; low half: 8-byte AND mask
db 7,6,5,4,3,2,1,0 ; high half: shuffle constant that will grab the low 8 bytes in reverse order
Load it into a vector register and use it as an AND mask, then use it as a pshufb
control to grab the low 8 bytes in reverse order, leaving them in the high 8. Your final result (8 ASCII hex digits) will be in the top half of an XMM register, so use movhps [eax], xmm1
. On Intel CPUs, this is still only 1 fused-domain uop, so it's just as cheap as movq
. But on Ryzen, it costs a shuffle on top of a store. Plus, this trick is useless if you want to convert two integers in parallel, or a 64-bit integer.
SSE2, guaranteed available in x86-64:
Without SSSE3 pshufb
, we need to rely on scalar bswap
to put the bytes in printing right order, and punpcklbw
the other way to interleave with the high nibble of each pair first.
Instead of a table lookup, we simply add '0'
, and add another 'a' - ('0'+10)
for digits greater than 9 (to put them into the 'a'..'f'
range). SSE2 has a packed byte compare for greater-than, pcmpgtb
. Along with a bitwise AND, that's all we need to conditionally add something.
itohex: ; tested, works.
global itohex_sse2
itohex_sse2:
mov edx, [esp+8] ; number
mov ecx, [esp+4] ; out pointer
;; or enter here for fastcall arg passing. Or rdi, esi for x86-64 System V. SSE2 is baseline for x86-64
bswap edx
movd xmm0, edx
movdqa xmm1, xmm0
psrld xmm1, 4 ; right shift: high nibble -> low (with garbage shifted in)
punpcklbw xmm1, xmm0 ; interleave high/low nibble of each byte into a pair of bytes
pand xmm1, [low_nibble_mask] ; zero the high 4 bits of each byte
; unpacked to 8 bytes, each holding a 4-bit integer, in printing order
movdqa xmm0, xmm1
pcmpgtb xmm1, [vec_9]
pand xmm1, [vec_af_add] ; digit>9 ? 'a'-('0'+10) : 0
paddb xmm0, [vec_ASCII_zero]
paddb xmm0, xmm1 ; conditional add for digits that were outside the 0..9 range, bringing them to 'a'..'f'
movq [ecx], xmm0 ; store 8 bytes of ASCII characters
ret
;; would work for 64-bit integers with 64-bit bswap, just using movq + movdqu instead of movd + movq
section .rodata
align 16
vec_ASCII_zero: times 16 db '0'
vec_9: times 16 db 9
vec_af_add: times 16 db 'a'-('0'+10)
; 'a' - ('0'+10) = 39 = '0'-9, so we could generate this from the other two constants, if we were loading ahead of a loop
; 'A'-('0'+10) = 7 = 0xf >> 1. So we could generate this on the fly from an AND. But there's no byte-element right shift.
low_nibble_mask: times 16 db 0x0f
This version needs more vector constants than most others. 4x 16 bytes is 64 bytes, which fits in one cache line. You might want to align 64
before the first vector instead of just align 16
, so they all come from the same cache line.
This could even be implemented with only MMX, using only 8-byte constants, but then you'd need an emms
so it would probably only be a good idea on very old CPUs which don't have SSE2, or which split 128-bit operations into 64-bit halves (e.g. Pentium-M or K8). On modern CPUs with mov-elimination for vector registers (like Bulldozer and IvyBrige), it only works on XMM registers, not MMX. I did arrange the register usage so the 2nd movdqa
is off the critical path, but I didn't do that for the first.
AVX can save a movdqa
, but more interesting is with AVX2 we can potentially produce 32 bytes of hex digits at a time from large inputs. 2x 64-bit integers or 4x 32-bit integers; use a 128->256-bit broadcast load to replicate the input data into each lane. From there, in-lane vpshufb ymm
with a control vector that read from the low or high half of each 128-bit lane should set you up with the nibbles for the low 64 bits of input unpacked in the low lane, and the nibbles for the high 64 bits of input unpacked in the high lane.
Or if the input numbers come from different sources, maybe vinserti128
the high one might be worth it on some CPUs, vs. just doing separate 128-bit operations.
AVX512VBMI (Cannonlake, not present in Skylake-X) has a 2-register byte shuffle vpermt2b
that could combine the puncklbw
interleaving with byte-reversing. Or even better, we have VPMULTISHIFTQB
which can extract 8 unaligned bytes from each qword of the source. We can use this to extract the nibbles we want into the order we want directly, avoiding a separate right-shift instruction. (It still comes with garbage bits, though.)
To use this for 64-bit integers, use a broadcast source and a control vector that grabs the high 32 bits of the input qword in the bottom of the vector, and the low 32 bits in the top of the vector.
To use this for more than 64 bits of input, use vpmovzxdq
to zero-extend each input dword into a qword, setting up for vpmultishiftqb
with the same 28,24,...,4,0 control pattern in each qword. (e.g. producing a zmm vector of output from a 256-bit vector of input, or four dwords -> a ymm reg to avoid clock-speed limits and other effects of actually running a 512-bit AVX512 instruction.)
itohex_AVX512VBMI: ; and AVX1. Tested with SDE
vmovq xmm1, [multishift_control]
vpmultishiftqb xmm0, xmm1, qword [esp+8]{1to2} ; number, plus 4 bytes of garbage. Or a 64-bit number
mov ecx, [esp+4] ; out pointer
;; VPERMB ignores high bits of the selector byte, unlike pshufb which zeroes if the high bit is set
;; and it takes the bytes to be shuffled as the optionally-memory operand, not the control
vpermb xmm1, xmm0, [hex_lut] ; use the low 4 bits of each byte as a selector
vmovq [ecx], xmm1 ; store 8 bytes of ASCII characters
ret
;; For 64-bit integers: vmovdqa load [multishift_control], and use a vmovdqu store.
section .rodata
align 16
hex_lut: db "0123456789abcdef"
multishift_control: db 28, 24, 20, 16, 12, 8, 4, 0
; 2nd qword only needed for 64-bit integers
db 60, 56, 52, 48, 44, 40, 36, 32
# I don't have an AVX512 CPU, so I used Intel's Software Development Emulator
$ /opt/sde-external-8.4.0-2017-05-23-lin/sde -- ./a.out 0x1235fbac
1235fbac
vpermb xmm
is not lane-crossing because there's only one lane involved (unlike vpermb ymm
or zmm). But unfortunately on CannonLake (according to instlatx64 results), it still has 3-cycle latency so pshufb
would be better for latency. But pshufb
requires masking the control vector so it's worse for throughput, assuming vpermb
is only 1 uop. In a loop where we can keep the vector constants in registers (instead of memory operands), it only saves 1 instruction instead of 2.
Or with AVX512F, we can use merge-masking to right-shift one dword while leaving the other unmodified, after broadcasting the number into an XMM register. Then we only need a single-register byte shuffle, vpshufb
, to interleave nibbles and byte-reverse. But then you need a constant in a mask register which takes a couple instructions to create. It would be a bigger win in a loop converting multiple integers to hex.
For a non-looping stand-alone version of the function, I used two halves of one 16-byte constant for different things: set1_epi8(0x0f)
in the top half, and 8 bytes of pshufb
control vector in the low half. This doesn't save a lot because EVEX broadcast memory operands allow vpandd xmm0, xmm0, dword [AND_mask]{1to4}
, only requiring 4 bytes of space for a constant.
itohex_AVX512F: ;; and AVX1. Saves a pshufb. tested with SDE
vpbroadcastd xmm0, [esp+8] ; number. can't use a broadcast memory operand for vpsrld because we need merge-masking into the old value
mov edx, 1<<3 ; element #3
kmovd k1, edx
vpsrld xmm0{k1}, xmm0, 4 ; top half: low dword: low nibbles unmodified (merge masking). 2nd dword: high nibbles >> 4
vmovdqa xmm2, [nibble_interleave_AND_mask]
vpand xmm0, xmm0, xmm2 ; zero the high 4 bits of each byte (for pshufb), in the top half
vpshufb xmm0, xmm0, xmm2 ; interleave nibbles from the high two dwords into the low qword of the vector
vmovdqa xmm1, [hex_lut]
vpshufb xmm1, xmm1, xmm0 ; select bytes from the LUT based on the low nibble of each byte in xmm0
mov ecx, [esp+4] ; out pointer
vmovq [ecx], xmm1 ; store 8 bytes of ASCII characters
ret
section .rodata
align 16
;hex_lut: db "0123456789abcdef"
nibble_interleave_AND_mask: db 15,11, 14,10, 13,9, 12,8 ; shuffle constant that will interleave nibbles from the high half
times 8 db 0x0f ; high half: 8-byte AND mask