Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).
The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are far too slow to sleep/wake on a 10 nanosecond (100 MHz) period.
So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.
I guess the idea would be a long-latency instruction that doesn't use many resources, like `pause`, and that also has a fixed and known latency. That would let us calibrate the "sleep" period so no clock read is even needed: if we want to update with period P and the instruction has latency L, we just issue P/L of these instructions for a calibrated busy-sleep. Well, `pause` doesn't meet that latter criterion, as its latency varies a lot¹.
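Roughly what I have in mind, as a sketch (assuming x86 with GCC/Clang intrinsics; the names are made up, and calibrating `pauses_per_update` is exactly the part that breaks down if `pause` latency isn't fixed):

```cpp
// Sketch only: helper thread "sleeps" by issuing a calibrated number of
// pause instructions, then publishes the current TSC value.
#include <atomic>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc, _mm_pause

alignas(64) std::atomic<uint64_t> shared_timestamp{0};  // keep it on its own cache line

void helper_thread(int pauses_per_update)   // roughly P / L, calibrated at startup
{
    for (;;) {
        for (int i = 0; i < pauses_per_update; ++i)
            _mm_pause();                    // the "cheap" long-latency instruction
        shared_timestamp.store(__rdtsc(), std::memory_order_relaxed);
    }
}
```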
A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do an `rdtsc` or some other clock read (`clock_gettime`, etc.) to see how long we actually slept. Seems like it might slow down the main thread a lot, though.
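Something like this, reusing `shared_timestamp` from the sketch above (again just a sketch; expressing the period in TSC ticks is itself an assumption, since the TSC generally doesn't tick at the core frequency):

```cpp
// Sketch only: "sleep" on a long-latency instruction, then check rdtsc to see
// how long we actually slept, and publish once the period has elapsed.
void helper_thread_checked(uint64_t ticks_per_update)
{
    uint64_t next = __rdtsc() + ticks_per_update;
    for (;;) {
        _mm_pause();                 // stand-in for whatever long-latency instruction
        uint64_t now = __rdtsc();    // did we sleep long enough?
        if (now >= next) {
            shared_timestamp.store(now, std::memory_order_relaxed);
            next = now + ticks_per_update;
        }
    }
}
```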
Any better options?
¹ Also, `pause` has some specific semantics around preventing speculative memory accesses, which may or may not be beneficial in this sibling-thread scenario, since I'm not really in a spin-wait loop.
Some random musings on the subject.
So you want a timestamp updated at a 100 MHz sample rate; on a 4 GHz CPU that means 40 cycles between each update.
The timer thread busily reads the time stamp counter (`rdtsc`?) but can't use the safe, serializing method with `cpuid`, as that takes ~100 cycles. Plain `rdtsc` has a latency of around 25 cycles (and a reciprocal throughput of about 25 cycles); there might be a slightly newer, slightly more accurate variant (presumably `rdtscp`) with slightly more latency (~32 cycles).
In a perfect world the branch predictor will guess right every time; in reality it will mispredict at random, adding 5-14 cycles to the loop's ~26 cycles due to variance in the clock-read timing.
When the sample is written, the other thread will have its instructions cancelled, starting from the first speculative load of this cache line (remember to align the sample position to 64 bytes so no other data is affected). The load of the sample timestamp then starts over after a delay of ~5-14 cycles, depending on where the instructions are re-fetched from: the loop buffer, the micro-op cache, or the I-cache.
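To make the layout point concrete, a sketch of the reader side (names made up; the comment just restates the cost described above):

```cpp
// Sketch only: the shared sample on its own 64-byte cache line, read by the
// main thread with a plain relaxed load.
#include <atomic>
#include <cstdint>

alignas(64) std::atomic<uint64_t> sample;   // nothing else on this cache line

inline uint64_t read_sample()
{
    // When the helper's store to this line lands, this thread's speculative
    // loads of it are cancelled and replayed (the ~5-14 cycle hit above).
    return sample.load(std::memory_order_relaxed);
}
```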
So a minimum of 5-14 cycles out of every 40 will be lost, in addition to half the core's resources being used by the other thread.
On the other hand, reading the time stamp counter in the main thread itself would cost ...
~1/4 cycle; the latency will most likely be covered by other instructions. But then you can't vary the frequency, and the long latency of ~25 cycles could be a problem unless some other long-latency instructions precede it.
Using a CAS-type instruction (a locked `xchg`?) might partly solve the problem, as the loads then shouldn't cause a re-issue of the instructions, but instead result in a delay on all following reads and writes.
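In code that would look something like this (a sketch; on x86 a `std::atomic` exchange compiles to an implicitly locked `xchg`, and the comment restates the trade-off suggested above):

```cpp
// Sketch only: publish the sample with an atomic exchange instead of a plain store.
void publish_sample(std::atomic<uint64_t>& sample, uint64_t now)
{
    // The locked RMW keeps the cost on the writer (it stalls its own following
    // reads and writes) rather than forcing the reader to re-issue its loads.
    sample.exchange(now, std::memory_order_seq_cst);
}
```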