Declaring and indexing an integer array of qwords

I have a question regarding how to initialize an array in assembly. I tried:

.bss
#the array
unsigned:    .skip 10000
.data
#these are the values that I want to put in the array
par4:   .quad 500 
par5:   .quad 10
par6:   .quad 15

That's how I declared my array and the variables that I want to put inside it. This is how I tried to put them into the array:

movq $0 , %r8

movq par4 , %rax
movq %rax , unsigned(%r8)
incq %r8

movq par5 , %rax
movq %rax , unsigned(%r8)
incq %r8

movq par6 , %rax
movq %rax , unsigned(%r8)

I tried printing the elements to check if everything is okay, and only the last one prints okay, the other two have some weird values.

Maybe this is not the way I should declare and work with it?

First of all, unsigned is the name of a type in C, so it's a poor choice of name for an array. Let's call it arr instead.

You want to treat that block of space in the BSS as an array of qword elements, so each element is 8 bytes. That means you need to store to arr+0, arr+8, and arr+16. (The total size of your array is 10000 bytes, which is 10000/8 = 1250 qwords.)

But you're using %r8 as a byte offset, not a scaled index. That's generally a good thing, all else equal; indexed addressing modes are slower in some cases on some CPUs. The problem is that you only increment it by 1 with inc, instead of by 8 with add $8, %r8.

So you're actually storing to arr+0, arr+1, and arr+2, with 8-byte stores that overlap each other. Each later store overwrites all but the lowest byte of the one before it, so only the least-significant byte of each of the first two values survives, followed by the complete last value. x86 is little-endian, so the resulting contents of memory are effectively this, with the rest of the unwritten bytes staying zero:

# static array that matches what you actually stored
arr: .byte 500 & 0xFF, 10, 15, 0, 0, 0, 0, 0, 0, 0, ...
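
To make the overlap concrete, here is a minimal C model of those stores. It is only a sketch for experimenting, not part of the assembly program: the names store_qword, load_qword, and fill are made up for illustration, and the buffer stands in for the zeroed BSS array. The byte order is written out explicitly so that the model behaves like x86 on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ARR_BYTES = 10000 };   /* same size as the .skip 10000 block */

/* Model of an 8-byte little-endian store, like movq %rax, arr(%r8). */
void store_qword(uint8_t *mem, size_t byte_off, uint64_t value)
{
    assert(byte_off + 8 <= ARR_BYTES);
    for (int i = 0; i < 8; i++)
        mem[byte_off + i] = (uint8_t)(value >> (8 * i));
}

/* Model of an 8-byte little-endian load from a byte offset. */
uint64_t load_qword(const uint8_t *mem, size_t byte_off)
{
    assert(byte_off + 8 <= ARR_BYTES);
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)mem[byte_off + i] << (8 * i);
    return v;
}

/* Zero the buffer, then store 500, 10, 15 at byte offsets
   0, stride, and 2*stride (stride 1 = the bug, stride 8 = the fix). */
void fill(uint8_t *mem, size_t stride)
{
    memset(mem, 0, ARR_BYTES);
    store_qword(mem, 0 * stride, 500);
    store_qword(mem, 1 * stride, 10);
    store_qword(mem, 2 * stride, 15);
}
```

With fill(mem, 1), the first bytes are 0xF4, 0x0A, 0x0F, and qword loads from byte offsets 0, 1, and 2 give 985844 (0x0F0AF4), 3850 (0x0F0A), and 15. If your printing code also stepped by 1 byte, that would explain two weird values followed by a correct one. With fill(mem, 8), loads from offsets 0, 8, and 16 give 500, 10, and 15.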

You could of course just use .quad in the .data section to declare a static array with the contents you want. But with only the first 3 elements non-zero, putting an array that large in the BSS makes sense, instead of having the OS page in the zeros from disk.

If you're going to fully unroll instead of using a loop over your 3-element qword array starting at par4, you don't need to increment a register at all. You also don't need the initializers to be in data memory, you can just use immediates because they all fit as 32-bit sign-extended.

  # these are assemble-time constants, not associated with a section
.equ par4, 500
.equ par5, 10
.equ par6, 15

.text  # already the default section but whatever

.globl _start
_start:
    movq    $par4, arr(%rip)            # use RIP-relative addressing when there's no register
    movq    $par5, arr+8(%rip)
    movq    $par6, arr+16(%rip)

    xor %edi, %edi        # exit status = 0 (don't rely on the initial register value)
    mov $60, %eax         # __NR_exit
    syscall               # Linux exit(0)

.bss
    arr:   .skip 10000

You can run that under GDB and examine memory to see what you get. (Build it with gcc -nostdlib -static foo.s.) In GDB, start the program with starti (to stop at the entry point), then single-step with si. Use x /4g &arr to dump the contents of memory at arr as an array of 4 qwords.

Or if you did want to use a register, you might as well increment a pointer instead of an index.

    lea     arr(%rip), %rdi           # or mov $arr, %edi in a non-PIE executable
    movq    $par4, (%rdi)
    add     $8, %rdi                  # advance the pointer 8 bytes = 1 element
    movq    $par5, (%rdi)
    add     $8, %rdi
    movq    $par6, (%rdi)

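Or scaled-index. (The arr(,%rax,8) form uses arr as a 32-bit absolute address, so it only links in a non-PIE executable, such as the -static build above.)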

## Scaled-index addressing
    movq    $par4, arr(%rip)
    mov     $1, %eax
    movq    $par5, arr(,%rax,8)       # [arr + rax*8]
    inc     %eax
    movq    $par6, arr(,%rax,8)

Fun trick: you could just do a byte store instead of a qword store to set the low byte, and leave the rest zero. This would save code-size but if you did a qword load right away, you'd get a store-forwarding stall. (~10 cycles extra latency for the store/reload to merge data from the cache with the store from the store buffer)

Or if you did still want to copy 24 bytes from par4 in .rodata, you could use SSE. x86-64 guarantees that SSE2 is available.

    movaps   par4(%rip), %xmm0
    movaps   %xmm0, arr(%rip)          # copy par4 and par5

    mov      par6(%rip), %rax          # aka par4+16
    mov      %rax, arr+16(%rip)

.section .rodata          # read-only data.
.p2align 4         # align by 2^4 = 16 for movaps
  par4:  .quad 500
  par5:  .quad 10
  par6:  .quad 15

.bss
.p2align 4        # align by 16 for movaps
  arr: .skip 10000
# or use .lcomm arr, 10000  without even switching to .bss

Or with SSE4.1, you can load+expand small constants so you don't need a whole qword for each small number that you're going to copy into the BSS array.

    pmovzxwq   initializers(%rip), %xmm0       # zero-extend 2 words into 2 qwords
    movaps     %xmm0, arr(%rip)
    movzwl     initializers+4(%rip), %eax      # zero-extending word load
    mov        %rax, arr+16(%rip)

.section .rodata
  initializers: .word 500, 10, 15
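
For reference, the effect of those zero-extending loads can be modeled in scalar C. This is only a sketch of what ends up in memory, not the SIMD instruction itself, and the function name widen_u16_to_u64 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-extend n packed 16-bit initializers into 64-bit array elements.
   Scalar model of the widening loads; each source element takes 2 bytes
   of read-only data instead of 8. */
void widen_u16_to_u64(uint64_t *dst, const uint16_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];   /* unsigned-to-unsigned conversion zero-extends */
}
```

Called with the three initializers 500, 10, 15, this fills the first three qwords with 500, 10, and 15 while the constants themselves occupy only 6 bytes.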