I'm trying to encode a binary file into base64.
However, I'm stuck on a few steps, and I'm not sure this is the right way to think about it; see the comments in the code below:
SECTION .bss ; Section containing uninitialized data
BUFFLEN equ 6 ; We read the file 6 bytes at a time
Buff: resb BUFFLEN ; Text buffer itself
SECTION .data ; Section containing initialised data
B64Str: db "000000"
B64LEN equ $-B64Str
Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SECTION .text ; Section containing code
global _start ; Linker needs this to find the entry point!
_start:
nop ; This no-op keeps gdb happy...
; Read a buffer full of text from stdin:
Read:
mov eax,3 ; Specify sys_read call
mov ebx,0 ; Specify File Descriptor 0: Standard Input
mov ecx,Buff ; Pass offset of the buffer to read to
mov edx,BUFFLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
mov ebp,eax ; Save # of bytes read from file for later
cmp eax,0 ; If eax=0, sys_read reached EOF on stdin
je Done ; Jump If Equal (to 0, from compare)
; Set up the registers for the process buffer step:
mov esi,Buff ; Place address of file buffer into esi
mov edi,B64Str ; Place address of line string into edi
xor ecx,ecx ; Clear line string pointer to 0
; Get 6 bits from the input
; Convert them to a B64 char
; Print the char
; Proceed to the next 6 bits
; All done! Let's end this party:
Done:
mov eax,1 ; Code for Exit Syscall
mov ebx,0 ; Return a code of zero
int 80H ; Make kernel call
In plain terms, it should do the following:
1) Hex value:
7C AA 78
2) Binary value:
0111 1100 1010 1010 0111 1000
3) Groups of 6 bits:
011111 001010 101001 111000
4) Convert to numbers:
31 10 41 56
5) Each number maps to a letter, digit or symbol:
31 = f
10 = K
41 = p
56 = 4
So the final output is: fKp4
My questions are:
How do I get the 6 bits, and how do I convert those bits into a char?
There are two main ways to implement this: either a generic loop that can pick out any 6 bits, or fixed code that handles 24 bits (3 bytes) of input at a time. The fixed version produces exactly 4 base64 characters and ends on a byte boundary, so the next 24 bits can be read from offset +3.
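To make the first option concrete, here is a minimal sketch of a generic bit-accumulator loop in NASM syntax. The register roles (ebx as remaining-byte counter, ecx as pending-bit counter) and the label names are assumptions made for this illustration; it reuses the Base64 table from the question, assumes at least one input byte, and leaves out the final partial group and the '=' padding. It has not been assembled or run, so check it in a debugger.

```nasm
; Sketch only (not run): generic loop, 8 bits in, 6 bits out.
; Assumed on entry (hypothetical register roles):
;   esi = input pointer      ebx = number of input bytes (must be > 0)
;   edi = output pointer     Base64 = 64-char table from the question
GenericEncode:
        xor     eax, eax            ; eax = bit accumulator
        xor     ecx, ecx            ; ecx = count of pending bits in eax
.next_byte:
        shl     eax, 8              ; make room for 8 new bits
        mov     al, [esi]           ; append the next input byte
        inc     esi
        add     ecx, 8
.emit:
        sub     ecx, 6              ; bits that stay pending after this group
        mov     edx, eax
        shr     edx, cl             ; bring the oldest 6 pending bits to bits 0..5
        and     edx, 0x3F           ; drop already-emitted bits above them
        movzx   edx, byte [Base64+edx] ; 0..63 -> base64 character
        mov     [edi], dl
        inc     edi
        cmp     ecx, 6
        jae     .emit               ; 12 pending bits yield two characters
        dec     ebx
        jnz     .next_byte
        ; here ecx = 0, 2 or 4 leftover bits;
        ; the last partial group and '=' padding are NOT handled in this sketch
```

Tracing it by hand with the bytes 7C AA 78 from the question, the loop emits f, K, p, 4 in that order. The rest of this answer covers the second, fixed 24-bit option.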
For the fixed 24-bit version, let's say you have esi pointing to the source binary data, padded with enough zero bytes that reading slightly past the end of the input is safe (3 bytes past it in the worst case), and edi pointing to an output buffer of at least ((input_length+2)/3*4) bytes. That size already counts the '=' padding characters that B64 requires at the end of the sequence.
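The size formula can be evaluated with plain integer division. A minimal sketch, assuming the input length is already in ecx (a hypothetical register choice, not something the code in this answer sets up):

```nasm
; Sketch (not run): eax = (input_length+2)/3*4
; Assumed on entry: ecx = input length in bytes (hypothetical register choice)
        lea     eax, [ecx+2]        ; eax = input_length + 2
        xor     edx, edx            ; div divides edx:eax, so clear the high half
        mov     ebx, 3
        div     ebx                 ; eax = number of 3-byte groups, rounded up
        shl     eax, 2              ; 4 output characters per group
        ; e.g. 1 -> 4, 3 -> 4, 4 -> 8, 900 -> 1200
```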
; convert 3 bytes of input into four B64 characters of output
mov eax,[esi] ; read 3 bytes of input
; (reads actually 4B, 1 will be ignored)
add esi,3 ; advance pointer to next input chunk
bswap eax ; first input byte as MSB of eax
shr eax,8 ; throw away the 1 junk byte (LSB after bswap)
; produce 4 base64 characters backward (last group of 6b is converted first)
; (to keep the logic of 6b group extraction simple: "shr eax,6" + "and edx,0x3F")
mov edx,eax ; get copy of last 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (4th)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bl,[Base64+edx] ; convert 0-63 value into B64 character (3rd)
shl ebx,16 ; make room in ebx for the next characters (4th+3rd now in upper 16b)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (2nd)
; here eax contains exactly only 6 bits (zero extended to 32b)
mov bl,[Base64+eax] ; convert 0-63 value into B64 character (1st)
mov [edi],ebx ; store four B64 characters as output
add edi,4 ; advance output pointer
After the last group of 3B input you must overwrite the end of the last output with the proper number of '=' characters, to replace the fake 'A' characters produced from the zero padding. I.e. 1B of input (needs 8 bits, 2x B64 chars) => output ends with '=='; 2B of input (needs 16 bits, 3x B64 chars) => output ends with '='; 3B of input => all 24 bits used => 4 valid B64 chars.
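The fix-up can be sketched as follows, assuming ecx holds how many real input bytes (1, 2 or 3) the final group contained and edi already points just past the 4 characters written for it. Both are assumptions about surrounding code that this answer does not show, and the label names are made up for the example.

```nasm
; Sketch (not run): overwrite the tail of the last 4-character group with '='.
; Assumed on entry (hypothetical):
;   ecx = real input bytes in the final group (1, 2 or 3)
;   edi = output pointer, already advanced past the final 4 characters
; Worked example with the table from the question:
;   7C       -> 011111 000000 000000 000000 -> "fAAA" -> "fA=="
;   7C AA    -> 011111 001010 101000 000000 -> "fKoA" -> "fKo="
;   7C AA 78 -> 011111 001010 101001 111000 -> "fKp4" (no padding)
PadLast:
        cmp     ecx, 3
        jae     .pad_done           ; full group: all 4 characters are valid
        mov     byte [edi-1], '='   ; 1 or 2 bytes: the 4th character is padding
        cmp     ecx, 2
        je      .pad_done
        mov     byte [edi-2], '='   ; 1 byte: the 3rd character is padding too
.pad_done:
```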
If you don't want to read the whole file into memory and build the whole output in memory, you can use in/out buffers of limited length, e.g. 900B input -> 1200B output, and process the input in 900B blocks. Or you can use a 3B -> 4B in/out buffer and drop the pointer advancing completely (or even drop esi/edi and use fixed memory addresses), since you will then have to load the input and store the output separately on every iteration.
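Putting the 900B -> 1200B idea together with the 3-byte routine and the '=' fix-up, a complete program might look like the sketch below. The names InBuf, OutBuf, INLEN, OUTLEN and the labels are made up for this illustration, and the build command in the comment is only one plausible way to assemble it. It assumes the 32-bit Linux int 80h interface used in the question, and that every read except the last returns a multiple of 3 bytes (which holds for a redirected regular file, but is not guaranteed for pipes or terminals). It has not been assembled or run.

```nasm
; b64blk.asm - sketch (not run): base64-encode stdin to stdout in 900-byte blocks.
; Possible build: nasm -f elf32 b64blk.asm && ld -m elf_i386 -o b64blk b64blk.o
INLEN   equ 900                     ; multiple of 3, so only the last block is padded
OUTLEN  equ INLEN/3*4               ; 1200

SECTION .bss
InBuf:  resb INLEN+4                ; +4 slack for the zero fill after the data
OutBuf: resb OUTLEN

SECTION .data
Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

SECTION .text
global _start
_start:
ReadBlock:
        mov     eax, 3              ; sys_read
        xor     ebx, ebx            ; fd 0 = stdin
        mov     ecx, InBuf
        mov     edx, INLEN
        int     80h
        cmp     eax, 0
        jle     Done                ; 0 = EOF, negative = error
        mov     ebp, eax            ; ebp = input bytes still to encode
        mov     dword [InBuf+eax], 0 ; zero the 4 bytes after the data
        mov     esi, InBuf
        mov     edi, OutBuf
Group:
        mov     eax, [esi]          ; 3 input bytes (+1 ignored)
        add     esi, 3
        bswap   eax
        shr     eax, 8              ; eax = 24 bits, first input byte on top
        mov     edx, eax
        shr     eax, 6
        and     edx, 0x3F
        mov     bh, [Base64+edx]    ; 4th character
        mov     edx, eax
        shr     eax, 6
        and     edx, 0x3F
        mov     bl, [Base64+edx]    ; 3rd character
        shl     ebx, 16
        mov     edx, eax
        shr     eax, 6
        and     edx, 0x3F
        mov     bh, [Base64+edx]    ; 2nd character
        mov     bl, [Base64+eax]    ; 1st character
        mov     [edi], ebx
        add     edi, 4
        sub     ebp, 3              ; ebp becomes 0, -1 or -2 after the last group
        ja      Group               ; loop while more than 3 bytes were left
        jz      Write               ; length was a multiple of 3: no padding
        mov     byte [edi-1], '='
        cmp     ebp, -1
        je      Write               ; 2 real bytes in the last group: one '='
        mov     byte [edi-2], '='   ; 1 real byte: two '='
Write:
        mov     eax, 4              ; sys_write
        mov     ebx, 1              ; fd 1 = stdout
        mov     ecx, OutBuf
        mov     edx, edi
        sub     edx, ecx            ; edx = number of characters produced
        int     80h
        jmp     ReadBlock
Done:
        mov     eax, 1              ; sys_exit
        xor     ebx, ebx            ; return code 0
        int     80h
```

Keeping INLEN a multiple of 3 is the design choice that matters here: every full block then ends on a group boundary, so '=' can only appear at the very end of the output.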
Disclaimer: this code is written to be straightforward, not performant. You asked how to extract 6 bits and how to convert a value into a character, so I think sticking to basic x86 asm instructions is best.
I'm not even sure how to make it perform better without profiling the code for bottlenecks and experimenting with other variants. The partial register usage (bh, bl vs ebx) will likely be costly, so there is very likely a better solution (or maybe even a SIMD-optimized version for larger input blocks).
Also, I didn't debug this code; I just wrote it here in the answer, so proceed with caution and check in a debugger whether and how it works.
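One variant that avoids the bh/bl partial-register writes mentioned above is to extract each 6-bit group with its own shift and store the characters one byte at a time. This is a sketch under the same assumed setup as the 3-byte routine (esi = input, edi = output, Base64 table from the question). It has not been run or profiled, so whether it is actually faster remains an open question.

```nasm
; Sketch (not run, not profiled): 3 bytes -> 4 characters without writing bh/bl.
; Same assumed setup as the 3-byte routine: esi = input, edi = output.
        mov     eax, [esi]          ; 3 input bytes (+1 ignored)
        add     esi, 3
        bswap   eax
        shr     eax, 8              ; eax = 24 bits, first input byte on top

        mov     edx, eax
        shr     edx, 18             ; bits 23..18
        movzx   edx, byte [Base64+edx]
        mov     [edi], dl           ; 1st character

        mov     edx, eax
        shr     edx, 12
        and     edx, 0x3F           ; bits 17..12
        movzx   edx, byte [Base64+edx]
        mov     [edi+1], dl         ; 2nd character

        mov     edx, eax
        shr     edx, 6
        and     edx, 0x3F           ; bits 11..6
        movzx   edx, byte [Base64+edx]
        mov     [edi+2], dl         ; 3rd character

        and     eax, 0x3F           ; bits 5..0
        movzx   eax, byte [Base64+eax]
        mov     [edi+3], al         ; 4th character
        add     edi, 4
```

With eax = 0x7CAA78 the four extracted values are 31, 10, 41 and 56, matching the worked example in the question.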