I'm trying to perform a memory optimization that should be theoretically possible but that I'm starting to doubt is within arm-elf-gcc's capability. Please show me that I'm wrong.
I have an embedded system with a very small amount of main memory, and an even smaller amount of battery-backed nvram. I am storing checksummed configuration data in the nvram so that on boot I can validate the checksum and continue a previous run or start a new run if the checksum is invalid. During the run, I update various fields of various sizes in this configuration data (and it's okay that this invalidates the checksum until it is later recalculated).
All of this runs in physical address space - the normal sram is mapped at one location and the nvram is mapped at another. Here's the rub - all access to the nvram must be done in 32-bit words; no byte or halfword access is allowed (although it's obviously fine in main memory).
So I can either a) store a working copy of all of my configuration data in main memory, and memcpy it out to the nvram when I recalculate the checksum or b) work with it directly in nvram but somehow convince the compiler that all structs are packed and all accesses must not only be 32-bit aligned, but also 32-bit wide.
Option a) wastes precious main memory, and I would much rather make the runtime tradeoff to save it (although not if the code size ends up wasting more than I save on data size) via option b).
I was hoping that __attribute__ ((packed, aligned(4))) or some variation thereof could help here, but all of the reading and experimenting I have done so far has let me down.
Here's a toy example of the sort of configuration data I'm dealing with:
#include <stdint.h>

#define __packed __attribute__ ((packed))

struct __packed Foo
{
    uint64_t foo;
    struct FooFoo foofoo;
};

struct __packed Bar
{
    uint32_t something;
    uint16_t somethingSmaller;
    uint8_t evenSmaller;
};

struct __packed PersistentData
{
    struct Foo foo;
    struct Bar bar;
    /* ... */
    struct Baz baz;
    uint32_t checksum;
};
You can imagine different threads (one each to perform functions Foo, Bar, and Baz) updating their own structures as appropriate, and synchronizing at some point to declare it time to recalculate the checksum and go to sleep.
You can probably do it if you make everything a bitfield:
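The original code block appears to be missing here; a minimal sketch of the idea, using the Bar struct from the question and field widths chosen so every storage unit is a full 32-bit word, might look like this (bit-field layout is implementation-defined, hence the caveat that follows):

```c
#include <stdint.h>

/* Sketch only: declare every field as a uint32_t bit-field so the
   compiler's underlying storage unit is a 32-bit word. How bit-fields
   are laid out and accessed is implementation-defined, so you would
   have to check the generated assembly. */
struct BarBits {
    uint32_t something        : 32;  /* fills the first word          */
    uint32_t somethingSmaller : 16;  /* packed into the second word   */
    uint32_t evenSmaller      : 8;
    uint32_t                  : 8;   /* explicit padding to word size */
};
```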
However, you might be outsmarted by your compiler; you'd have to check the resulting assembly.
Avoid bitfields: they are a well-known problem area in the C language - unreliable, non-portable, and subject to change in implementation at any time. And they won't help you with this problem anyway.
Unions come to mind as well, but I have been corrected enough times on SO: according to the C standard you cannot use unions to change types. Although, as with the other poster, I have yet to see a case where type punning through a union has not worked - bitfields break constantly, but union memory sharing has so far caused me no pain. In any case, unions won't save you any RAM, so they don't really work here.
Why are you trying to make the compiler do the work? You would need some sort of linker-type script at compile time that instructs the compiler to do 32-bit accesses with masks, shifts, and read-modify-writes for some address spaces, and to use the more natural word, halfword, and byte accesses for others. I have not heard of gcc or the C language having such controls, be it in the syntax, a compiler script, or a definition file of some sort. And if it does exist, it is not used widely enough to be reliable; I would expect compiler bugs and avoid it. I just don't see the compiler doing it, certainly not in a struct kind of manner.
For reads you might get lucky; it depends heavily on the hardware folks. Where is this nvram memory interface - inside the chip, made by your company or by some other company, on the edge of the chip, etc.? A limitation like the one you describe may in part mean that the control signals which distinguish access size or byte lanes are ignored. So an ldrb might look to the nvram like a 32-bit read, and the ARM will grab the correct byte lane because it thinks it is an 8-bit read. I would do some experiments to verify this; there is more than one ARM memory bus, and each has many different types of transfers. Perhaps talk to the hardware folks, or do some HDL simulations if you have that available, to see what the ARM is really doing. If you cannot take this shortcut, a read is going to be an ldr with a possible mask and shift, no matter how you get the compiler to do it.
Writes other than word-sized have to be read-modify-write: ldr, bic, shift, orr, str. No matter who does it, you or the compiler.
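Written out in C, that read-modify-write for a byte store might look like the following sketch (nvram_put8 is a hypothetical name; the comments show which instruction each step roughly corresponds to):

```c
#include <stdint.h>

/* Store one byte into word-only memory by hand. 'base' must be 32-bit
   aligned; 'offset' is a byte offset from base. */
static void nvram_put8(volatile uint32_t *base, uint32_t offset, uint8_t value)
{
    volatile uint32_t *word = base + (offset >> 2);  /* aligned word address */
    uint32_t shift = (offset & 3u) * 8u;             /* byte lane in word    */
    uint32_t w = *word;                              /* ldr                  */
    w &= ~(0xFFu << shift);                          /* bic (clear the lane) */
    w |= (uint32_t)value << shift;                   /* orr (insert value)   */
    *word = w;                                       /* str                  */
}
```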
Just do it yourself; I cannot see how the compiler will do it for you. Compilers, including gcc, have a hard enough time performing the specific access you think you are telling them to perform:
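The original snippet is missing here; presumably it was something along these lines - asking for a word-sized store through a volatile pointer (a sketch, not the original code):

```c
#include <stdint.h>

/* Asking the compiler for a single 32-bit store via a volatile pointer.
   The point being made is that the compiler is not actually obliged to
   emit exactly one str instruction here; you have to check the assembly. */
static void store32(volatile uint32_t *addr, uint32_t value)
{
    *addr = value;
}
```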
My syntax is probably wrong because I gave this up years ago, but it does not always produce an unsigned-int-sized store, and when the compiler doesn't want to, it won't. If it cannot do that reliably, how can you expect it to create one flavor of loads and stores for this variable or struct and another flavor for that variable or struct?
So if you have specific instructions you need the compiler to produce, you will fail; you have to use assembler, period. In particular: ldm, ldrd, ldr, ldrh, ldrb, strd, str, strh, strb, and stm.
I don't know how much nvram you have, but it seems to me the solution to your problem is to make everything in nvram 32 bits in size. You burn a few extra cycles performing the checksum, but your code space and (volatile) ram usage are at a minimum. Very, very little assembly required (or none, if you are comfortable with that).
I also recommend trying other compilers if you are worried about that much optimization. At a minimum try gcc 3.x, gcc 4.x, llvm, and rvct - I think a version of the latter comes with Keil (but I don't know how it compares to the real rvct compiler).
I don't have a feel for how small your binary has to be. If you have to pack stuff into nvram and cannot make it all 32-bit entries, I would recommend several assembler helper functions: one flavor of get32 and put32, two flavors of get16 and put16, and four flavors of get8 and put8. You will know, as you are writing the code, where things are packed, so you can choose directly (or through macros/defines) which flavor of get16 or put8 to call. These functions should only have a single parameter, so there is near-zero code-space cost in using them; the performance cost is a pipe flush on the branch, depending on your flavor of core. What I don't know is: will those 50 or 100 instructions of put and get functions break your code size budget? If so, I wonder if you should be using C at all - in particular gcc.
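As a sketch of the flavored helpers (in C rather than the assembler the answer suggests, and with made-up names), one flavor per byte lane means the caller, who knows the packing, picks the right function and no shift argument is needed:

```c
#include <stdint.h>

/* Hypothetical per-lane flavors: each get/put knows its byte lane, so
   only the word address (and a value, for puts) is passed. All access
   to the underlying memory is word-sized. */
static uint8_t get8_lane0(volatile uint32_t *w) { return (uint8_t)(*w);       }
static uint8_t get8_lane1(volatile uint32_t *w) { return (uint8_t)(*w >> 8);  }
static uint8_t get8_lane2(volatile uint32_t *w) { return (uint8_t)(*w >> 16); }
static uint8_t get8_lane3(volatile uint32_t *w) { return (uint8_t)(*w >> 24); }

static void put8_lane0(volatile uint32_t *w, uint8_t v)
{
    *w = (*w & ~0x000000FFu) | v;                    /* ldr, bic, orr, str */
}
static void put8_lane2(volatile uint32_t *w, uint8_t v)
{
    *w = (*w & ~0x00FF0000u) | ((uint32_t)v << 16);
}
/* ...and likewise put8_lane1/put8_lane3, plus two flavors each of
   get16/put16 for the low and high halfword positions. */
```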
And you probably want to use Thumb instead of ARM if size is that critical - Thumb-2 if you have it.
I don't see how you would get the compiler to do it for you; it would need to be some compiler-specific pragma thing, which is likely to be rarely used and buggy if it exists.
What core are you using? I have been working with something in the ARM11 family with an AXI bus recently, and the ARM does a really good job of turning sequences of ldrs, ldrbs, ldrhs, etc. into individual 32- or 64-bit reads (yes, a few separate instructions may turn into a single memory cycle). You might just get away with tailoring your code to the features of the core, depending on the core and where this ARM-to-nvram memory interface lies. You would have to do lots of sims for this, though; I only know this from looking at the bus, not from any ARM documentation.
Since it's difficult to know what a compiler might do with a bitfield (and sometimes even a union), for safety I'd create some functions that get/set specific-sized data at arbitrary offsets using only aligned 32-bit reads and writes.
Something like the following (untested - not even compiled) code:
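The code referred to here is missing from the post; a plausible reconstruction of the byte get/set pair, using only aligned 32-bit accesses (nvram_get8/nvram_set8 are names I've made up), might be:

```c
#include <stdint.h>

/* Read one byte from word-only memory at an arbitrary byte address,
   using a single aligned 32-bit read. */
static uint8_t nvram_get8(uintptr_t addr)
{
    volatile uint32_t *word = (volatile uint32_t *)(addr & ~(uintptr_t)3);
    uint32_t shift = (addr & 3u) * 8u;
    return (uint8_t)(*word >> shift);
}

/* Write one byte via an aligned 32-bit read-modify-write. */
static void nvram_set8(uintptr_t addr, uint8_t value)
{
    volatile uint32_t *word = (volatile uint32_t *)(addr & ~(uintptr_t)3);
    uint32_t shift = (addr & 3u) * 8u;
    uint32_t w = *word;
    w &= ~(0xFFu << shift);
    w |= (uint32_t)value << shift;
    *word = w;
}
```

Reading myBar.evenSmaller then becomes something like nvram_get8((uintptr_t)&myBar.evenSmaller).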
Now you can read/write something like myBar.evenSmaller (assuming that myBar has been laid out by the linker/loader such that it's in the NVRAM address space) through those get/set functions.
Of course, the functions that deal with larger data types might be more complex, since those could straddle 32-bit boundaries (if you're packing the structs to avoid unused space taken up by padding). If you're not interested in speed, they can build on the functions that read/write single bytes at a time, to help keep those larger functions simple.
In any case, if you have multiple threads/tasks reading or writing the NVRAM concurrently, you'll need to synchronize the accesses to keep the non-atomic writes from getting corrupted or causing corrupted reads.
The simplest thing to do would be to use a union.
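The original code block appears to be missing; a sketch of the union idea, using the question's Bar struct (unpacked here) and a made-up word-array member, might be:

```c
#include <stdint.h>

struct Bar {
    uint32_t something;
    uint16_t somethingSmaller;
    uint8_t  evenSmaller;
};

/* Hypothetical overlay: the "useless" words array is sized to cover the
   real struct, so all NVRAM traffic can go through 32-bit elements. */
union BarNvram {
    struct Bar real;                              /* the useful member  */
    uint32_t words[(sizeof(struct Bar) + 3) / 4]; /* the useless member */
};
```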
The sizing is just an example - you could probably write some arithmetic, evaluated at compile time, to determine the size automatically. Read and write from NVRAM in terms of the useless member, but always access it in main memory in terms of the "real" useful member. This should force the compiler to read and write 32 bits at a time (each 32 bits in the array of the useless member) while still letting you access the real data members easily and type-safely.