I can't figure out a way to move code from one location to another in memory, so I put together something like this, but it doesn't work:
extern _transfer_code_segment
extern _kernel_segment
extern _kernel_reloc
extern _kernel_reloc_segment
extern _kernel_para_size
section .text16
global transfer_to_kernel
transfer_to_kernel:
;cld
;
; Turn off interrupts -- the stack gets destroyed during this routine.
; kernel must set up its own stack.
;
;cli
; stack frame just for this function
push ebp
mov ebp, esp
mov eax, _kernel_segment ; source segment
mov ebx, _kernel_reloc_segment ; dest segment
mov ecx, _kernel_para_size
.loop:
; XXX: Will changing the segment registers this many times have
; acceptable performance?
mov ds, eax ; this is where the error is
mov es, ebx ; this too
xor esi, esi
xor edi, edi
movsd
movsd
movsd
movsd
inc eax
inc ebx
dec ecx
jnz .loop
leave
ret
Do you have any other way to do it, or how can I solve this problem?
That will have horrible performance. Agner Fog says mov sr, r has a throughput of one per 13 cycles on Nehalem, and I'd guess that if anything it's worse on more recent CPUs, since segmentation is obsolete. Agner stopped testing mov to/from segment-register performance after Nehalem.
Are you doing this to let you copy more than 64kiB total? If so, at least copy a full 64kiB before changing a segment register.
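Something along these lines would copy a full 64 KiB between segment-register reloads (a rough, untested sketch; _kernel_64k_chunks is a made-up symbol for however many 64 KiB chunks there are, and it assumes the total size is a multiple of 64 KiB):

    cld
    mov ax, _kernel_segment
    mov bx, _kernel_reloc_segment
    mov dx, _kernel_64k_chunks      ; hypothetical: total size / 64 KiB
.chunk:
    mov ds, ax                      ; only two segment writes per 64 KiB
    mov es, bx
    xor si, si
    xor di, di
    mov cx, 0x4000                  ; 0x4000 dwords = 64 KiB
    rep movsd
    add ax, 0x1000                  ; 0x1000 paragraphs = 64 KiB
    add bx, 0x1000
    dec dx
    jnz .chunk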
I think you can use 32-bit addressing modes to avoid messing with segments, but segments that you set in 16-bit mode implicitly have a "limit" of 64k. (i.e. mov eax, [esi] is encodable in 16-bit mode, with an operand-size and address-size prefix, but with a value in esi of more than 0xFFFF, I think it would fault for violating the ds segment limit.) See the osdev link below for more.
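For instance, this assembles fine in a BITS 16 file (a trivial check, not something you'd want to run as-is):

bits 16
    mov eax, [esi]      ; NASM adds the operand-size (0x66) and address-size
                        ; (0x67) prefixes; in real mode it should fault if esi
                        ; is above the 64 KiB ds limit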
As Cody says, use rep movsd to let the CPU use an optimized microcoded memcpy. (Or rep movsb, but only on CPUs with the ERMSB feature. In practice, most CPUs that support ERMSB give the same performance benefit for rep movsd too, so it's probably easiest to just always use rep movsd, but IvyBridge might not.) It's much faster than separate movsd instructions (which are slower than separate mov loads/stores). A loop with SSE 16B vector loads/stores might go almost as fast as rep movsd on some CPUs, but you can't use AVX for 32B vectors in 16-bit mode.
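Such a loop might look roughly like this (a sketch only; it assumes CR4.OSFXSR has already been set so SSE instructions don't fault, that ds:si and es:di point at the buffers, and that cx counts 64-byte blocks):

.sse_loop:
    movups xmm0, [si]
    movups xmm1, [si+16]
    movups xmm2, [si+32]
    movups xmm3, [si+48]
    movups [es:di], xmm0
    movups [es:di+16], xmm1
    movups [es:di+32], xmm2
    movups [es:di+48], xmm3
    add si, 64
    add di, 64
    dec cx
    jnz .sse_loop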
Another option for big copies: huge unreal mode
In 32-bit protected mode, the values you put in segment registers are selectors, not the actual segment base itself: mov es, ax triggers the CPU to use the value as an index into the GDT or LDT and get the segment base / limit from there.
If you do this in 32-bit mode and then switch back to 16-bit mode, you're in huge unreal mode with segments that can be larger than 64k. The segment base/limit/permissions stay cached until something writes a segment register in 16-bit mode and puts it back to the usual 16*seg with a 64k limit (if I'm describing this correctly). See http://wiki.osdev.org/Unreal_Mode for more.
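A rough sketch of the setup (see the osdev page for a complete, tested version; gdt_desc and its flat 4 GiB data descriptor are assumed to be defined elsewhere, and cs stays a normal 16-bit code segment throughout):

    cli
    lgdt [gdt_desc]                  ; GDT whose entry 1 is a flat 4 GiB data segment
    mov eax, cr0
    or al, 1                         ; set CR0.PE: protected mode
    mov cr0, eax
    mov bx, 0x08                     ; selector 0x08 = GDT entry 1
    mov ds, bx                       ; caches base=0, limit=4 GiB in ds
    mov es, bx                       ; ... and in es
    and al, 0xFE                     ; clear CR0.PE: back to real mode
    mov cr0, eax
    sti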
Then you may be able to use rep movsd in 16-bit mode with operand-size and address-size prefixes so you can copy more than 64kiB in one go.
This works well for ds and es, but interrupts will set cs:ip, so this isn't convenient for a big flat code address space, just data.
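Once ds and es have the 4 GiB limits cached, the copy itself might be as small as this (untested; the linear addresses and the size below are placeholders):

    cld
    mov esi, 0x100000                ; placeholder source linear address (ds base = 0)
    mov edi, 0x200000                ; placeholder destination linear address (es base = 0)
    mov ecx, (256*1024)/4            ; e.g. 256 KiB, counted in dwords
    a32 rep movsd                    ; address-size prefix so ecx/esi/edi are used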
The segment registers are all 16 bits in size. Compare that to the e?x registers, which are 32 bits in size. Obviously, these two things are not the same size, prompting your assembler to generate an "operand size mismatch" error: the sizes of the two operands do not match.
Presumably, you want to initialize the segment register with the lower 16 bits of the register, so you would do something like:
mov ds, ax
mov es, bx
Also, no, you don't actually need to initialize the segment registers on each iteration of the loop. What you're doing now is incrementing the segment and forcing the offset to 0, then copying 4 DWORDs. What you should be doing is leaving the segment alone and just incrementing the offset (which the MOVSD instruction does implicitly).
mov eax, _kernel_segment ; TODO: see why these segment values are not
mov ebx, _kernel_reloc_segment ; already stored as 16 bit values
mov ecx, _kernel_para_size
mov ds, ax
mov es, bx
xor esi, esi
xor edi, edi
.loop:
movsd
movsd
movsd
movsd
dec ecx
jnz .loop
But note that adding the REP prefix to the MOVSD instruction would allow you to do this even more efficiently. This basically does MOVSD a total of ECX times. For example:
mov ds, ax
mov es, bx
xor esi, esi
xor edi, edi
shl ecx, 2 ; multiply the count by 4: one MOVSD per count now, instead of 4 per loop iteration
rep movsd
Somewhat counter-intuitively, if your processor implements the ERMSB optimization (Intel Ivy Bridge and later), REP MOVSB may actually be faster than REP MOVSD, so you could do:
mov ds, ax
mov es, bx
xor esi, esi
xor edi, edi
shl ecx, 4 ; multiply the count by 16: one byte per MOVSB, 16 bytes per paragraph
rep movsb
Finally, although you've commented out the CLD instruction in your code, you do need to have it in order to ensure that the moves happen according to plan. You cannot rely on the direction flag having a particular value; you need to initialize it yourself to the value that you want.
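Putting those pieces together, a minimal sketch of the routine might look like this (untested, and it still assumes the total size fits in one 64 KiB segment, just like your original segment:offset loop does):

transfer_to_kernel:
    push ebp
    mov ebp, esp
    cld                              ; DF=0 so movsd moves forward through memory
    mov eax, _kernel_segment
    mov ebx, _kernel_reloc_segment
    mov ecx, _kernel_para_size
    mov ds, ax                       ; 16-bit moves into the segment registers
    mov es, bx
    xor si, si
    xor di, di
    shl cx, 2                        ; paragraphs -> dwords (16 bytes = 4 dwords);
                                     ; rep in 16-bit code counts with cx
    rep movsd
    leave
    ret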
(Another alternative would be streaming SIMD instructions or even floating-point stores, neither of which care about the direction flag. This has the advantage of increasing memory-copy bandwidth, because you'd be doing 64-bit, 128-bit, or larger copies at a time, but it introduces other disadvantages. In a kernel, I'd stick with MOVSD/MOVSB unless you can prove the copy is a significant bottleneck and/or you want to have optimized paths for different processors.)