I'm trying to write a program in assembly that checks whether two strings are equal.
section .data
str1 db 'mystring'
str2 db 'mystring'
output db 'cmp went fine'
outputlen equ $-output
section .text
global main
main:
mov ecx, str1
cmp ecx, str2
je ifBody0
int 80h
mov eax, 1
mov ebx, 0
int 80h
ifBody0:
mov eax, 4
mov ebx, 1
mov ecx, output
mov edx, outputlen
int 80h
The weird thing is that when I use the conditional jump je [label], it doesn't work. But when I change je to jne, it works.
I would like to know what I'm doing wrong here.
Thanks in advance,
Daan
For comparing strings in x86 assembly there is a dedicated instruction named CMPS (Compare Strings). For BYTE strings like yours the relevant form is CMPSB. You use it by setting ESI to the address of the source string and EDI to the address of the destination string. The number of bytes to compare (preferably the length of the longest string) is set in ECX and used together with a repeat prefix. Beware of reading past the end of the shorter string!
So your code could look like this:
section .data
str1 db 'mystring',0
str1len equ $-str1
str2 db 'mystring',0
output db 'cmp went fine',0x0a,0
outputlen equ $-output
output2 db 'cmp went wrong',0x0a,0
output2len equ $-output2
section .text
global main
main:
lea esi, [str1]
lea edi, [str2]
mov ecx, str1len ; selects the length of the first string as maximum for comparison
repe cmpsb ; compares up to ECX bytes, stopping at the first mismatch (rep assembles to the same prefix here)
mov eax, 4 ; does not modify flags
mov ebx, 1 ; does not modify flags
jne ifWrong ; checks ZERO flag
ifRight: ; the two strings do match
mov ecx, output
mov edx, outputlen
int 80h
jmp exit
ifWrong: ; the two strings don't match
mov ecx, output2
mov edx, output2len
int 80h
exit: ; sane shutdown
mov eax, 1
mov ebx, 0
int 80h
Let's start with these two:
str1 db 'mystring'
mov ecx,str1
After assembling this, the raw bytes of machine code look, for example, like this (this becomes the content of memory once the executable is loaded):
6D 79 73 74 72 69 6E 67 mystring
B9 00 00 00 00 ¹....
The last 4 zeroes are the address of the 'm' byte from 'mystring', as I decided it would be assembled at address 0. The first 8 bytes are the string data (ASCII encoded), and B9 is the opcode of the mov ecx,imm32 instruction.
You can't put a string into ecx: ecx is 32 bits wide (4 bytes), while a string can have many bytes. So with ecx you can fetch at most 4 bytes from the string, and that would require mov ecx,DWORD [str1], which would put the value 0x7473796D into ecx (x86 is little endian, so the first byte 6D is the least significant one in the 32-bit DWORD value).
But mov ecx,str1 loads ecx with the str1 symbol, which is the address of the first 'm' byte (0x00000000).
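To make the difference between the address and the content concrete, here is a minimal sketch. It assumes 32-bit Linux and a build along the lines of nasm -f elf32 followed by gcc -m32 -no-pie; the .exit label and the use of the exit status to report the result are my own choices for illustration, not part of the original program.

```nasm
; Sketch: address of a string vs. the bytes stored at that address.
section .data
str1    db 'mystring', 0

section .text
global main
main:
    mov   ecx, str1           ; ecx = address of the 'm' byte (a pointer)
    mov   edx, dword [str1]   ; edx = first 4 bytes of the string ('myst')
    movzx ebx, byte [ecx]     ; ebx = first byte only, 0x6D ('m')

    cmp   edx, 0x7473796D     ; little endian: 6D ends up as the lowest byte
    je    .exit               ; taken on x86, exit status stays 0x6D
    mov   ebx, 1              ; only reached if the loaded DWORD differed
.exit:
    mov   eax, 1              ; sys_exit, status taken from ebx
    int   80h
```

Running it and then checking echo $? should show 109 (0x6D), which is the 'm' byte read through the pointer rather than the pointer itself.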
To compare two strings, you load both addresses into registers, then load the bytes from those addresses and compare them one by one until you find a difference (or the end of the string). There are faster algorithms, but they are more complex and require you to know the length of the strings ahead of time, while a byte-by-byte compare works easily with C-like zero-terminated strings.
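A minimal sketch of that byte-by-byte loop, assuming both strings are zero terminated and the same 32-bit Linux / nasm -f elf32 / gcc -m32 -no-pie setup as the question. The label names and the convention of returning 0 for "equal" and 1 for "different" as the exit status are assumptions made for this example.

```nasm
; Sketch: byte-by-byte comparison of two zero-terminated strings.
section .data
str1    db 'mystring', 0
str2    db 'mystring', 0

section .text
global main
main:
    mov  esi, str1          ; esi = address of the first string
    mov  edi, str2          ; edi = address of the second string
.next:
    mov  al, [esi]          ; load one byte from each string
    mov  dl, [edi]
    cmp  al, dl
    jne  .differ            ; first mismatch: the strings are not equal
    test al, al             ; bytes are equal; was it the terminator?
    jz   .equal             ; yes: both strings ended at the same place
    inc  esi                ; no: advance both pointers and continue
    inc  edi
    jmp  .next
.equal:
    xor  ebx, ebx           ; exit status 0 = strings match
    jmp  .exit
.differ:
    mov  ebx, 1             ; exit status 1 = strings differ
.exit:
    mov  eax, 1             ; sys_exit, status taken from ebx
    int  80h
```

Because the terminator itself takes part in the comparison, a string that is a prefix of the other one is reported as different without ever reading past the shorter string.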
Talking about the length of a string, you have to define it somehow. In C it's common to put a zero after the last character of the string (in that example it would sit right before the B9 byte). In C++, std::string is a structure that holds the length as a value you can fetch or compare directly. Or you can have the length hardcoded in the source, like your outputlen.
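Two of those options can be put side by side in a short sketch: an assemble-time length computed with equ, and a run-time scan for the zero terminator. The names msg and msglen and the exit-status convention are hypothetical, and the build is assumed to be the same 32-bit Linux one as in the question.

```nasm
; Sketch: assemble-time length vs. length found by scanning for the zero byte.
section .data
msg     db 'mystring', 0    ; C-style string, terminated by a zero byte
msglen  equ $ - msg - 1     ; assemble-time length without the terminator

section .text
global main
main:
    mov  esi, msg
    xor  ecx, ecx           ; ecx = number of characters seen so far
.scan:
    cmp  byte [esi + ecx], 0
    je   .done              ; terminator found, ecx holds the length
    inc  ecx
    jmp  .scan
.done:
    xor  ebx, ebx           ; exit status 0 = both lengths agree
    cmp  ecx, msglen
    je   .exit
    mov  ebx, 1             ; exit status 1 = lengths disagree
.exit:
    mov  eax, 1             ; sys_exit, status taken from ebx
    int  80h
```

The equ form costs nothing at run time but only works for strings known when assembling; the scan works for any zero-terminated string you are handed a pointer to.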
When you program in assembly, you should always be aware of how many bits you are processing, and choose the correct register size (or extend the value) and the correct memory buffer size to hold the desired value.
With strings, that means you have to decide on an encoding. ASCII uses 8 bits (1 byte) per character, UTF-8 uses a variable number of bytes per code point, the early version of UTF-16 (UCS-2) used a fixed 2 bytes per character (as Java originally did; current UTF-16 is variable length), and UTF-32 uses a fixed 4 bytes per code point. So with an ASCII-encoded string, fetching its first character means doing mov al,BYTE [str1] (or mov ecx,str1 followed by mov al,[ecx], which gives al = 6Dh = 'm'). With UTF-32, fetching the second character would take mov eax,DWORD [utf32str + 4]. With UTF-8 a single code point takes from 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard, RFC 3629, caps it at 4), so you have to handle it in a fairly complex way to recognize valid UTF-8 sequences and read the correct number of bytes. But if you just want to know whether two UTF-8 strings are bit-equal, you can compare them byte by byte without decoding the characters themselves.
Of course you should be aware of the sizes of the registers, and of how on x86 you can address sub-parts of some registers, e.g. the ax part (lower 16 bits) of the whole eax (32 bits), or how ah:al (high 8 bits : low 8 bits) together form ax.
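The way the sub-registers overlap can be checked with a small sketch. The constants are arbitrary values picked for illustration, and the build is assumed to be the same 32-bit Linux one as in the question.

```nasm
; Sketch: writes to ax, ah and al only change their part of eax.
section .text
global main
main:
    mov  eax, 0x11223344    ; eax = 11223344h
    mov  ax, 0x5566         ; low 16 bits replaced:  eax = 11225566h
    mov  ah, 0x77           ; bits 8..15 replaced:   eax = 11227766h
    mov  al, 0x88           ; bits 0..7 replaced:    eax = 11227788h

    xor  ebx, ebx           ; exit status 0 = eax holds the expected value
    cmp  eax, 0x11227788
    je   .exit
    mov  ebx, 1             ; exit status 1 = unexpected value
.exit:
    mov  eax, 1             ; sys_exit, status taken from ebx
    int  80h
```

Note that the upper 16 bits (1122h) survive every partial write, which is why loading a byte into al does not give you a clean 32-bit value in eax unless you zero or extend it (for example with movzx).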
I hope that after this you will understand that you compared two pointers (str1 vs str2), which will always be unequal because they point to different bytes in memory, instead of comparing the contents of the memory they point to (the strings). That is also why je never jumps while jne always does.