I've written several copy functions in search of a good memory strategy on PowerPC. For large data, using the AltiVec or FP registers with cache hints (dcb*) doubles the performance of a simple byte copy loop. Initially pleased with that, I threw in a regular memcpy to see how it compared... 10x faster than my best! I have no intention of rewriting memcpy, but I do hope to learn from it and accelerate several simple image filters that spend most of their time moving pixels to and from memory.
Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration, the performance advantage of memcpy is still embarrassing. I'm using dcbz to free up bandwidth, while Apple uses nothing, but both versions tend to hesitate on stores. My loop looks like this:
    # prefetch
    dcbt future
    dcbt distant future

    # load stuff
    lvx image
    lvx image + 16
    lvx image + 32
    lvx image + 48
    image += 64

    # prepare to store
    dcbz filtered
    dcbz filtered + 32

    # store stuff
    stvxl filtered
    stvxl filtered + 16
    stvxl filtered + 32
    stvxl filtered + 48
    filtered += 64

    # repeat
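For readers without a PowerPC toolchain, the shape of that loop can be sketched in portable C. This is only an illustration under stated assumptions: `copy64` and `chunk16` are hypothetical names, `__builtin_prefetch` (a GCC/Clang builtin) stands in for dcbt, plain 16-byte `memcpy` calls stand in for lvx/stvxl, the 256-byte prefetch distance is arbitrary, and there is no portable counterpart to dcbz, so that step is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte chunk standing in for one AltiVec register (illustrative only). */
typedef struct { uint64_t lo, hi; } chunk16;

/* Hypothetical helper, not Apple's memcpy: copies n bytes, 64 per iteration,
   as four 16-byte loads followed by four 16-byte stores. */
static void copy64(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 64 == 0);                 /* sketch handles whole blocks only */
    for (size_t i = 0; i < n; i += 64) {
        /* Read hint for a later block; plays the role of dcbt.
           The 256-byte distance is an arbitrary assumption. */
        if (n - i > 256)
            __builtin_prefetch(src + i + 256, 0, 0);

        chunk16 a, b, c, d;
        memcpy(&a, src + i,      16);    /* load stuff */
        memcpy(&b, src + i + 16, 16);
        memcpy(&c, src + i + 32, 16);
        memcpy(&d, src + i + 48, 16);

        memcpy(dst + i,      &a, 16);    /* store stuff */
        memcpy(dst + i + 16, &b, 16);
        memcpy(dst + i + 32, &c, 16);
        memcpy(dst + i + 48, &d, 16);
    }
}
```

Calling `copy64(dst, src, 256)` on two 256-byte buffers should leave `dst` identical to `src`. The sketch shows only the structure being discussed; it says nothing about how the real AltiVec loop performs.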
Does anyone have some ideas on why very similar code has such a dramatic performance gap? I'd love to marinate the real image filters in whatever secret sauce memcpy is using!
Additional info: All data is vector aligned. I'm making filtered copies of the image, not replacing the original. The code runs on PowerPC G4, G5, and Cell PPU. The Cell SPU version is already insanely fast.
Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration
I may be stating the obvious, but since you don't mention the following at all in your question, it may be worth pointing it out:
I would bet that Apple's choice of 4 vector reads followed by 4 vector writes has as much to do with the G5's pipeline and its management of out-of-order instruction execution in "dispatch groups" as it does with 64 bytes being a magically perfect line size. Did you notice the blank lines in Nick Bastin's linked bcopy.s? They mean that the developer thought about how the instruction stream would be consumed by the G5. If you want to reproduce the same performance, it's not enough to read data 64 bytes at a time; you must also make sure your instruction groups are well filled. As far as I remember, a group holds up to five independent instructions, with the first four being non-jump instructions and the fifth only allowed to be a jump. The details are more complicated.
EDIT: you may also be interested in the following paragraph on the same page:
The dcbz instruction still zeros aligned 32 byte segments of memory as per the G4 and G3. However, since that is not a full cacheline on a G5 it will not have the performance benefits that you were likely hoping for. There is a dcbzl instruction newly introduced for the G5 that zeros a full 128-byte cacheline.
I don't know exactly what you're doing, since I can't see your code, but Apple's secret sauce is here.
Maybe it's because of CPU caching. Try running Cachegrind:
Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20-100x slower than normal.
Still not an answer, but did you verify that memcpy is actually moving the data? Maybe the pages were just remapped copy-on-write. You would still see the inner memcpy loop in Shark, because parts of the first and last pages are truly copied.
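One way to check this, sketched here under assumptions, is to include a write to every destination page inside the timed region, so that any lazily remapped page has to be materialized before the clock stops. `dirty_pages` and `timed_copy` are hypothetical helper names, and the 4096-byte page size is an assumption that may not match your system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: rewrites one byte per page through a volatile
   pointer so each destination page receives a real store. The stored
   value is unchanged, so the copied data stays intact. */
static void dirty_pages(uint8_t *p, size_t n, size_t page)
{
    assert(page > 0);
    volatile uint8_t *v = p;
    for (size_t i = 0; i < n; i += page)
        v[i] = v[i];
}

/* Returns CPU seconds spent on memcpy plus dirtying every destination page.
   The 4096-byte page size is an assumption. */
static double timed_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    clock_t t0 = clock();
    memcpy(dst, src, n);
    dirty_pages(dst, n, 4096);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

If the time reported by `timed_copy` is far larger than the time for memcpy alone on the same buffers, that would suggest the plain memcpy measurement was not paying for the full copy. Run the hand-written loop through the same harness so the two are compared fairly.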
As mentioned in another answer, "dcbz", as defined by Apple on the G5, only operates on 32 bytes, so you will lose performance with this instruction on a G5, which has 128-byte cachelines. You need to use "dcbzl" to prevent the destination cacheline from being fetched from memory, a fetch that effectively cuts your useful read memory bandwidth in half.
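The "half" figure follows from simple counting, which can be written down as a deliberately simplified model. `bytes_read` is a hypothetical function, not a measurement: it assumes that every destination line is fetched in full unless a single zeroing instruction covers the whole line, and it ignores write-back traffic and everything else the memory system does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: bytes read from memory in order to copy n bytes.
   line   = cache line size in bytes
   zeroed = bytes established by one zeroing instruction (0 if none is used) */
static size_t bytes_read(size_t n, size_t line, size_t zeroed)
{
    assert(line > 0);
    size_t total = n;      /* the source is always read */
    if (zeroed < line)
        total += n;        /* destination lines are fetched before being overwritten */
    return total;
}
```

Under this model, `bytes_read(1024, 128, 32)` (dcbz on a G5) gives 2048, while `bytes_read(1024, 128, 128)` (dcbzl on a G5) and `bytes_read(1024, 32, 32)` (dcbz on a 32-byte-line G4) both give 1024. In the first case only half of the bytes read are source data.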