Assembly x86: comparing strings doesn't work

Posted 2019-09-18 12:19

Question:

I'm trying to make a program in Assembly that checks two strings.

section .data
str1 db 'mystring'
str2 db 'mystring'

output db 'cmp went fine'
len equ $-output

section .text
global main

main:
    mov ecx, str1
    cmp ecx, str2
    je ifBody0
    int 80h

    mov eax, 1
    mov ebx, 0
    int 80h

ifBody0:
    mov eax, 4
    mov ebx, 1
    mov ecx, output
    mov edx, outputlen
    int 80h

The weird thing is that when I call the conditional jump: je [label], it doesn't work. But when I change je to jne, it works. I would like to know what I'm doing wrong here.

Thanks in advance, Daan

Answer 1:

x86 assembly has a dedicated instruction for comparing strings: CMPS (Compare Strings). For byte strings like yours, the relevant form is CMPSB. To use it, point ESI at the source string and EDI at the destination string, and put the number of bytes to compare (preferably the length of the longer string) in ECX. With a repeat prefix (REPE, which has the same encoding as REP), CMPSB repeats until two bytes differ or ECX reaches zero. Beware of reading past the end of the shorter string!

So your code could look like this:

section .data
str1 db 'mystring',0
str1len equ $-str1
str2 db 'mystring',0

output db 'cmp went fine',0x0a
outputlen equ $-output
output2 db 'cmp went wrong',0x0a
output2len equ $-output2

section .text
global main

main:
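    cld               ; clear the direction flag so ESI/EDI count upwards
    lea esi, [str1]   ; ESI = address of the first string
    lea edi, [str2]   ; EDI = address of the second string
    mov ecx, str1len  ; selects the length of the first string as maximum for comparison
    repe cmpsb        ; compares byte by byte while equal, at most ECX bytes
    mov eax, 4        ; sys_write; MOV does not modify flags
    mov ebx, 1        ; stdout; MOV does not modify flags
    jne ifWrong       ; ZERO flag clear: the last compared bytes differed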

ifRight:              ; the two strings do match
    mov ecx, output
    mov edx, outputlen
    int 80h
    jmp exit
ifWrong:              ; the two strings don't match
    mov ecx, output2
    mov edx, output2len
    int 80h
exit:                 ; sane shutdown
    mov eax, 1
    mov ebx, 0
    int 80h


Answer 2:

Let's start with these two:

str1    db 'mystring'
        mov ecx,str1

After assembling this, the raw bytes of machine code look, for example, like this (this is what ends up in memory once the executable is loaded):

6D 79 73 74 72 69 6E 67  mystring
B9 00 00 00 00           ¹....

The last four zero bytes are the address of the 'm' byte of 'mystring', since I decided it would be placed at address 0. The first 8 bytes are the string data (ASCII encoded), and B9 is the opcode of the mov ecx,imm32 instruction.

You can't put a string into ecx: ecx is 32 bits wide (4 bytes), while a string can have many bytes. So with ecx you can fetch at most 4 bytes of the string, and that would require mov ecx,DWORD [str1], which puts the value 0x7473796D into ecx (x86 is little endian, so the first byte, 6D, is the least significant byte of the 32-bit DWORD value).

But mov ecx,str1 loads ecx with the value of the str1 symbol, which is the address of the first 'm' byte (0x00000000).
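
The difference between the two loads can be sketched as a small program. This is only an illustration, assuming 32-bit Linux, NASM, and linking through gcc as in the question; the build command in the comment and the use of the exit status to show the result are assumptions, not part of the original code.

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .data
str1 db 'mystring'

section .text
global main

main:
    mov ecx, str1        ; ecx = ADDRESS of the 'm' byte (whatever the linker chose)
    mov edx, [str1]      ; edx = first four BYTES of the string = 0x7473796D
    movzx ebx, dl        ; lowest byte of edx is the first character: 0x6D = 'm'
    mov eax, 1           ; sys_exit, exit status taken from ebx
    int 80h
```

Running it and then `echo $?` should print 109, the decimal value of 'm' (0x6D), which shows that the first character sits in the least significant byte.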

To compare two strings, you load both addresses into registers, then load the bytes from those addresses and compare them one by one until you find a difference (or the end of the string). There are faster algorithms, but they are more complex and require you to know the length of the string ahead of time, while a byte-by-byte compare works easily with C-like zero-terminated strings.
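
A minimal sketch of such a byte-by-byte loop follows, assuming both strings are zero-terminated (the terminators are added here; they are not in the question's data). The local labels .next, .equal, .differ and .exit are made up for this illustration, and the result is reported through the exit status.

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .data
str1 db 'mystring', 0    ; zero terminator added for this sketch
str2 db 'mystring', 0

section .text
global main

main:
    mov esi, str1        ; esi = address of the first string
    mov edi, str2        ; edi = address of the second string
.next:
    mov al, [esi]        ; load one byte from the first string
    cmp al, [edi]        ; compare it with the byte from the second string
    jne .differ          ; different bytes: the strings are not equal
    test al, al          ; equal bytes: was it the terminator?
    je .equal            ; yes, both strings ended together
    inc esi              ; no, advance both pointers
    inc edi
    jmp .next
.equal:
    xor ebx, ebx         ; exit status 0 = strings match
    jmp .exit
.differ:
    mov ebx, 1           ; exit status 1 = strings differ
.exit:
    mov eax, 1           ; sys_exit
    int 80h
```

Because a mismatch is checked before the terminator test, a shorter string ends the loop at its zero byte, so the loop never reads past the end of either string.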

Speaking of string length, you have to define it somehow. In C it's common to put a zero byte after the last character of the string (in the example above, that would be right before B9); in C++, std::string is a structure that holds the length as a value you can fetch and compare directly. Or you can hardcode it in the source, like your outputlen.
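
Both approaches can be sketched side by side. The following is an illustration only: the constant name str1len and the labels .count and .done are assumptions for this example, and the zero terminator is added so the run-time count has something to stop at.

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .data
str1 db 'mystring', 0
str1len equ $ - str1 - 1 ; assemble-time length without the terminator (8)

section .text
global main

main:
    mov esi, str1        ; esi = address of the string
    xor ecx, ecx         ; ecx = number of characters counted so far
.count:
    cmp byte [esi + ecx], 0  ; reached the zero terminator?
    je .done
    inc ecx
    jmp .count
.done:
    mov ebx, ecx         ; run-time length, the same value as str1len
    mov eax, 1           ; sys_exit
    int 80h
```

The exit status (`echo $?`) is 8 here. The equ constant costs nothing at run time but only works for strings defined in the source, while the counting loop works for any zero-terminated string.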

When you program in assembly, you should always be aware of how many bits you are processing, and choose the correct register size (or extend the value) and the correct memory buffer size to hold the desired value.

With strings, that means you have to decide on an encoding. ASCII is 8 bits per character (1 byte), UTF-8 uses a variable number of bytes per character, the early version of UTF-16 (UCS-2) used a fixed 2 bytes per character (as in Java, though current UTF-16 is variable length), and UTF-32 is a fixed 4 bytes per character.

So fetching the first character of an ASCII-encoded string means doing mov al,BYTE [str1] (or mov ecx,str1 followed by mov al,[ecx], giving al = 6Dh = 'm'). With UTF-32, fetching the second character takes mov eax,DWORD [utf32str + 4]. With UTF-8, a single character takes from 1 to 4 bytes (the original design allowed up to 6), so you need fairly complex handling to recognize valid UTF-8 sequences and read the correct number of bytes. But if you just want to know whether two UTF-8 strings are bit-for-bit equal, you can compare them byte by byte without decoding the characters at all.
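
The two fixed-width fetches can be sketched as follows. The utf32str data is made up for this illustration (three code points stored as 32-bit values), and the exit status is used only as a convenient way to observe the result.

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .data
str1     db 'mystring'            ; ASCII: 1 byte per character
utf32str dd 0x6D, 0x79, 0x73      ; hypothetical UTF-32 data: 'm', 'y', 's', 4 bytes each

section .text
global main

main:
    mov al, [str1]           ; ASCII: first character, al = 0x6D = 'm'
    mov edx, [utf32str + 4]  ; UTF-32: second character starts 4 bytes in, edx = 0x79 = 'y'
    mov ebx, edx             ; report the UTF-32 character as the exit status
    mov eax, 1               ; sys_exit (overwrites al, which is no longer needed)
    int 80h
```

The exit status is 121, the decimal value of 'y'. Note how the index is scaled by the character size: character n of a UTF-32 string lives at offset 4*n, while for ASCII it lives at offset n.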

Of course, you should also be aware of the register sizes and of how x86 lets you address parts of some registers, e.g. ax as the lower 16 bits of eax (32 bits), or how ah:al (high 8 bits : low 8 bits) together form ax.
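
A short sketch of how these sub-registers overlap, using an arbitrary test value chosen for this illustration:

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .text
global main

main:
    mov eax, 0x12345678  ; now ax = 0x5678, ah = 0x56, al = 0x78
    mov al, 0xFF         ; only the low 8 bits change:  eax = 0x123456FF
    mov ah, 0x00         ; only bits 8..15 change:      eax = 0x123400FF
    movzx ebx, ax        ; zero-extend the low 16 bits: ebx = 0x000000FF
    mov eax, 1           ; sys_exit
    int 80h
```

The exit status is 255. Writing to al or ah leaves the upper 16 bits of eax untouched, which is why movzx (or an explicit clear) is needed when a full 32-bit value is expected.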


I hope you understand after this that you compared two pointers (str1 vs str2), which will always be unequal because they point to different bytes in memory, instead of comparing the contents of memory (the strings).
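
To make the pointer comparison visible, here is a small sketch that uses the question's data and reports the distance between the two labels as the exit status. It assumes the same 32-bit Linux, NASM and gcc setup as the question.

```nasm
; build (assumption): nasm -f elf32 demo.asm && gcc -m32 -no-pie demo.o -o demo
section .data
str1 db 'mystring'
str2 db 'mystring'       ; same contents, but stored 8 bytes after str1

section .text
global main

main:
    mov ecx, str1        ; ecx = address of str1
    mov ebx, str2        ; ebx = address of str2
    sub ebx, ecx         ; distance between the two addresses = 8
    mov eax, 1           ; sys_exit
    int 80h
```

The exit status is 8: the two symbols are different addresses even though the bytes stored there are identical, so cmp ecx, str2 can never set the zero flag and je is never taken.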