Gcc inline assembly: what's wrong with the dyn

While testing GCC inline assembly, I use the test function below to display a character on the screen in the Bochs emulator. The code runs in 32-bit protected mode:

test() {
    char ch = 'B';
    __asm__ ("mov $0x10, %%ax\n\t" 
                "mov %%ax, %%es\n\t"
                "movl $0xb8000, %%ebx\n\t"
                "mov $0x04, %%ah\n\t" 
                "mov %0, %%al\n\t" 
                "mov %%ax, %%es: ((80 * 3 + 40) * 2)(%%ebx)\n\t" 
                ::"r"(ch):);
}

The result is that the red character on the screen isn't B. However, when I change the input constraint from r to c in the last line of the code above, like this: ::"c"(ch):);, the character 'B' displays normally.

What's the difference? I access the video memory through the data segment directly after the computer enters protected mode.

I traced the assembly code and found that, with the r constraint, the instruction is assembled as mov al, al. At that point ax has already been set to 0x0010, so al is 0x10, which explains the wrong character. But why did the compiler choose the al register? Isn't it supposed to choose a register that hasn't been used before? Adding a clobber list solved the problem.

As @MichaelPetch commented, you can use 32-bit addresses to access whatever memory you want from C. The asm gcc emits will assume a flat memory space, and assume that it can, for example, copy esp to edi and use rep stos to zero some stack memory (this requires that %es has the same base as %ss).

I'd guess that the best solution is not to use any inline asm, but instead just use a global constant pointer to the video memory, e.g.

// pointer is constant, but points to non-const memory
uint16_t *const vga_base = (uint16_t*)0xb8000;   // + whatever was in your segment

// offsets are scaled by 2.  Do some casting if you want the address math to treat offsets as byte offsets
void store_in_flat_memory(unsigned char c, uint32_t offset) {
  vga_base[offset] = 0x0400U | c;            // it matters that c is unsigned, so it zero-extends instead of sign-extending
}
    movzbl  4(%esp), %eax       # c, c
    movl    8(%esp), %edx       # offset, offset
    orb     $4, %ah   #, tmp95         # Super-weird, wtf gcc.  We get this even for -mtune=core2, where it causes a partial-register stall
    movw    %ax, 753664(%edx,%edx)  # tmp95, *_3   # the addressing mode scales the offset by two (sizeof(uint16_t)), by using it as base and index
    ret

The asm is from gcc 6.1 on Godbolt (link below), with -O3 -m32.

Without the const, code like vga_base[10] = 0x4 << 8 | 'A'; would have to load the vga_base global and then offset from it. With the const, &vga_base[10] is a compile-time constant.
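To make the zero-extension point concrete, here is a minimal sketch in plain C. The helper names (vga_cell, vga_cell_signed, store_cell) are made up for illustration, and store_cell takes the base pointer as a parameter only so the logic can be exercised against an ordinary array; on the real target you'd pass (volatile uint16_t *)0xb8000. The volatile qualifier is my assumption about how you'd want to treat memory-mapped video RAM.

```c
#include <stdint.h>

/* Hypothetical helper: combine an attribute byte and a character into one
   16-bit text-mode cell.  ch is unsigned char, so it zero-extends. */
static uint16_t vga_cell(uint8_t attr, unsigned char ch) {
    return (uint16_t)((unsigned)attr << 8 | ch);
}

/* Same computation with a signed char argument, to show what goes wrong:
   a character >= 0x80 sign-extends and overwrites the attribute byte. */
static uint16_t vga_cell_signed(uint8_t attr, signed char ch) {
    return (uint16_t)((unsigned)attr << 8 | (unsigned)(int)ch);
}

/* Store one cell at a cell index (not a byte offset) from base.
   On the real target, base would be (volatile uint16_t *)0xb8000. */
static void store_cell(volatile uint16_t *base, uint32_t index,
                       uint8_t attr, unsigned char ch) {
    base[index] = vga_cell(attr, ch);
}
```

With attribute 0x04, 'B' gives 0x0442 either way, but character 0x80 gives 0x0480 with the unsigned version and 0xFF80 with the signed one.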

If you really want a segment:

Since you can't leave %es modified, you need to save/restore it. This is another reason to avoid using it in the first place. If you really want a special segment for something, set up %fs or %gs once and leave them set, so it doesn't affect the normal operation of any instructions that don't use a segment override.

There is built-in syntax to use %fs or %gs without inline asm, for thread-local variables. You might be able to take advantage of it to avoid inline asm altogether.
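As a rough illustration of that syntax, here is a sketch using GNU C's __thread. The variable and function names are hypothetical. In a normal hosted Linux build, gcc accesses such variables relative to %fs (x86-64) or %gs (i386); whether this is usable in your protected-mode code depends on you setting up that segment base and the TLS layout the compiler expects, which I'm only assuming is feasible.

```c
#include <stdint.h>

/* Hypothetical per-thread cursor position, counted in cells. */
static __thread uint32_t cursor_index;

/* Advance the cursor and return the new position.  No inline asm needed:
   the compiler emits the segment-relative access to cursor_index itself. */
static uint32_t advance_cursor(uint32_t cells) {
    cursor_index += cells;
    return cursor_index;
}
```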

If you're using a custom segment, you could make its base address non-zero, so you don't need to add 0xb8000 yourself. However, Intel CPUs optimize for the flat-memory case, so address generation using a non-zero segment base is a couple cycles slower, IIRC.

I did find a request for gcc to allow segment overrides without inline asm, and a question about adding segment support to gcc. Currently you can't do that.

Doing it manually in asm, with a dedicated segment

To look at the asm output, I put it on Godbolt with the -mx32 ABI, so args are passed in registers, but addresses don't need to be sign-extended to 64 bits. (I wanted to avoid the noise of loading args from the stack for -m32 code. The -m32 asm for protected mode will look similar.)

void store_in_special_segment(unsigned char c, uint32_t offset) {
    char *base = (char*)0xb8000;               // sizeof(char) = 1, so address math isn't scaled by anything

    // let the compiler do the address math at compile time, instead of forcing one 32bit constant into a register, and another into a disp32
    char *dst = base+offset;               // not a real address, because it's relative to a special segment.  We're using a C pointer so gcc can take advantage of whatever addressing mode it wants.
    uint16_t val = (uint32_t)c | 0x0400U;  // it matters that c is unsigned, so it zero-extends

    asm volatile ("movw  %[val], %%fs: %[dest]\n"
         : 
         : [val] "ri" (val),  // register or immediate
           [dest] "m" (*dst)
         : "memory"   // we write to something that isn't an output operand
    );
}
    movzbl  %dil, %edi        # dil is the low 8 of %edi (AMD64-only, but 32bit code prob. wouldn't put a char there in the first place)
    orw     $1024, %di        #, val   # gcc causes an LCP stall, even with -mtune=haswell, and with gcc 6.1
    movw  %di, %fs: 753664(%esi)    # val, *dst_2

void test_const_args(void) {
    uint32_t offset = (80 * 3 + 40) * 2;
    store_in_special_segment('B', offset);
}
    movw  $1090, %fs: 754224        #, MEM[(char *)754224B]

void test_const_offset(char ch) {
    uint32_t offset = (80 * 3 + 40) * 2;
    store_in_special_segment(ch, offset);
}
    movzbl  %dil, %edi  # ch, ch
    orw     $1024, %di        #, val
    movw  %di, %fs: 754224  # val, MEM[(char *)754224B]

void test_const_char(uint32_t offset) {
    store_in_special_segment('B', offset);
}
    movw  $1090, %fs: 753664(%edi)  #, *dst_4

So this code gets gcc to do an excellent job at using an addressing mode to do the address math, and do as much as possible at compile time.
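The constants in that asm output can be checked by hand. Here is a small sketch of the arithmetic, with hypothetical names (VGA_COLS, VGA_TEXT_BASE, cell_byte_offset) and assuming the usual 80-column text mode with 2 bytes per cell:

```c
#include <stdint.h>

enum { VGA_COLS = 80, VGA_BYTES_PER_CELL = 2 };  /* 80-column text mode assumed */
#define VGA_TEXT_BASE 0xb8000u                   /* 753664 in decimal */

/* Byte offset of the cell at (row, col) from the start of video memory. */
static uint32_t cell_byte_offset(uint32_t row, uint32_t col) {
    return (row * VGA_COLS + col) * VGA_BYTES_PER_CELL;
}
```

For row 3, column 40 the offset is (80 * 3 + 40) * 2 = 560, and 753664 + 560 = 754224, which is the absolute address in the test_const_args output. Likewise 0x0400 | 'B' is 1024 + 66 = 1090, the immediate being stored.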

Segment register

If you do want to modify a segment register for every store, keep in mind that it's slow: Agner Fog's insn tables stop including mov sr, r after Nehalem, but on Nehalem it's a 6 uop instruction that includes 3 load uops (from the GDT I assume). It has a throughput of one per 13 cycles. Reading a segment register is fine (e.g. push sr or mov r, sr). pop sr is even a bit slower.

I'm not even going to write code for this, because it's such a bad idea. Make sure you use clobber constraints to let the compiler know about every register you step on, or you will have hard-to-debug errors where surrounding code stops working.
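For the clobber part specifically, here is a minimal sketch of the idea that doesn't touch any segment register. It builds the cell value through %ax like the question's code does, but declares "ax" as clobbered, so the compiler can't pick %al for an input the way it did with "r" in the question. The function name is hypothetical, the "Q" constraint (a, b, c, or d register) is my choice so that the byte operands can be moved into %ah, and the #else branch is only there so the sketch still compiles on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical helper: compute (attr << 8) | ch by way of %ax. */
static uint16_t make_cell_via_ax(uint8_t attr, uint8_t ch) {
#if defined(__i386__) || defined(__x86_64__)
    uint16_t cell;
    __asm__ ("movb %[attr], %%ah\n\t"    /* attribute into the high byte */
             "movb %[ch], %%al\n\t"      /* character into the low byte */
             "movw %%ax, %[cell]"        /* copy the result out of %ax */
             : [cell] "=r" (cell)
             : [attr] "Q" (attr), [ch] "Q" (ch)
             : "ax");                    /* tell gcc we step on %ax */
    return cell;
#else
    /* Plain C fallback for other architectures. */
    return (uint16_t)((unsigned)attr << 8 | ch);
#endif
}
```

Without the "ax" clobber, nothing stops the compiler from handing you %al as the register for ch, and the first instruction has already overwritten %ah by then. The same applies to %ebx and any other register the asm writes.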

See the x86 tag wiki for GNU C inline asm info.