Mutual exclusion algorithm using only atomic reads

I am trying to come up with a mutual exclusion algorithm that is based only on atomic reads and atomic writes of shared memory (i.e. no compare-and-swap or similar).

Apart from mutual exclusion and deadlock freedom, it needs to fulfill the following properties:

It needs to be robust in the face of processes dying at any point
It needs to handle an ex-ante unknown number of processes competing for the lock
Every few seconds I need one of the processes to enter the critical section (I don't care which)
It should be as memory-efficient as possible
All my processes share a reasonably synchronized clock-source (sub-second accuracy)

I have come up with a way that looks to me like it could work, but I wanted to run it by you guys to see if I made a mistake somewhere or if there is a more elegant solution.

The idea is to combine a modified version of Szymanski's Three-Bit Linear Wait Algorithm (Figure 1 on page 4 of "Mutual Exclusion Revisited") with a process-naming-scheme based loosely on Moir and Anderson's "Wait-Free Algorithms for Fast, Long-Lived Renaming" (M&A).

As a first step, I define an "order" for incoming processes by assigning them an ID like M&A:

M&A use the following mutex primitive (see Figure 2 on page 5):

where out of n processes entering the block, at most 1 will stop there, at most n - 1 will move right, and at most n - 1 will move down.

These primitives can be linked together into a grid, where I decided to number the boxes differently from M&A, to allow me to calculate the box number without knowing the total number of boxes:

The box number can be calculated using box = (m*(m-1))/2 + r, where m is the number of total moves made (including the move into the top-left box, i.e. m ≥ 1) and r is the number of moves to the right (r ≥ 0).

Any new process will initially be assigned a large, globally unique ID (e.g. timestamp + huge random number), which it will use as the value for p in the mutex primitive algorithm above. It will then start traversing the grid following the rules of the primitive, starting in the top-left corner, until it stops in some box. The number of the box in which it stopped will now be its new process ID.

To keep the number of process IDs smaller (processes might disappear; and also: boxes can be locked forever without being used if all entering processes move either right or down and none stop there), I will replace the boolean variable Y in the primitive above with a time-stamp.

The line Y := true will then be replaced with Y := now, where now is the current time. Similarly, the condition if Y then ... will be replaced with if (now - Y < timeout) then ..., where timeout is the time after which a box will be returned to the available pool.

While a process is still running, it needs to keep setting the Y-value of its box to now to never allow its ID to expire. While smaller values of timeout will lead to smaller IDs and less memory use, it can theoretically be chosen arbitrarily large and thus it should always be possible to find a value for timeout that will guarantee that processes will never lose their ID unless they actually quit or died. Values of minutes, hours, or even days should do the trick.

Now, we can use this process order to implement Szymanski's algorithm:

communication variables:: a, w, s: boolean = false
private variables:: j: 0..n

p1      ai=true;
p2      for(j=0;j<n;j++)
            while(sj);
p3      wi=true;
        ai=false;
p4      while(!si) {
p5          for (j=0;j<n & !aj;j++);
p6          if (j==n) {
                si=true;
p6.1            for (j=0;j<n & !aj;j++);
p6.2            if (j<n)
                    si=false;
p6.3            else {
                    wi=false;
p6.4                for(j=0;j<n;j++)
                        while(wj);
                }
            }
p7          if (j<n)
                for (j=0;j<n & (wj | !sj);j++);
p8          if (j!=i & j<n) {
p8.1            si=true;
                wi=false;
            }
        }
p9      for(j=0;j<i;j++)
            while(wj | sj);

Critical Section

e1      si=false;

The idea behind this algorithm is a little involved and for brevity (if that isn't already a lost cause at this point), I will not try to paraphrase it again here (Wikipedia also discusses a variant of this algorithm).

To make this algorithm work as needed (i.e. in the face of process aborts and an unknown total number of processes) the following changes can be made:

Just like Y above, the booleans a_i, w_i, and s_i will be replaced with timestamps. Except, this time the timeouts need to be short enough to still allow the critical section to be entered as often as needed, in my case every few seconds, so a timeout of 5s might work. At the same time, the timeout needs to be long enough so that it never expires unless a process dies, i.e. it needs to be longer than the worst-case time it takes processes to keep updating the timestamps for flags that need to remain set.

The loops over processes (e.g. for(j=0;j<n;j++)) will be turned into traversals of the process-ID grid from above in the order of the box-numbers. Each box can have 3 states: Never visited by anyone, a process has stopped in it, or all processes moved on and the box is currently blocked. No distinction needs to be made between the latter two, since a blocked box will simply correspond to a process that is currently not trying to enter the critical section (i.e. a_i, w_i, and s_i are false/expired). The loop starts in box 0. If it comes across a box that has been visited (i.e. has its X-value set), it adds the adjacent boxes to its "to-check-list" (i.e. if box 8 was seen, box 12 and 13 will be added to the list). The loop terminates when the "to-check-list" has been processed completely (unless, of course, there is another stopping condition present that fires first (e.g. j < i or & !aj)).

Now to my questions:

Does this look like it would work reliably?
I've never done a formal proof of a framework like this. My gut (and intense pondering) tell me that it should be solid, but I wouldn't mind something a bit more rigorous than that. I'm most worried about the effects of a process dying between p3 and e1 of Szymanski's algorithm and the flags expiring randomly while other processes are in that part of the code.
Is there a better solution (simpler/less resource use)?
One point of attack for an optimization might be the fact that I don't need all processes to eventually make it to the critical section, I just need one (and I don't care which one either).

Sorry for the length of this post and thank you for your endurance reading it! Suggestions for shortening welcome... :)

PS: I also posted this question on the CS stack exchange.

Since I don't really care about fairness or starvation (it doesn't matter which process enters the critical section), I don't really need to use an algorithm as complex as Szymanski's.

I found a quite beautiful alternative: An algorithm by Burns and Lynch:

program Process_i;
    type flag = (down, up);
    shared var F : array [1..N] of flag;
    var j : 1..N;
begin
    while true do begin
        1: F[i] := down;
        2: remainder;     (* remainder region *)
        3: F[i] := down;
        4: for j := 1 to i-1 do
               if F[j] = up then goto 3;
        5: F[i] := up;
        6: for j := 1 to i-1 do
               if F[j] = up then goto 3;
        7: for j := i+1 to N do
               if F[j] = up then goto 7;
        8: critical;      (* critical region *)
    end
end.

You can find it on page 4 (836) of their paper "Mutual Exclusion Using Indivisible Reads and Writes".

It has a couple of major advantages:

It is much simpler
It uses less memory (In fact, they state that their algorithm is optimal in that respect)
All shared memory has only a single writer (but multiple readers, of course)
It is easy to turn the busy-waits into delayed retries (see here)

Aside: The idea of this algorithm can be visualized as follows:

I'm a process and I'm standing somewhere in a row with other people (all other processes).
When I want to enter the critical section, I do the following:

3/4: I look to my left and keep my hand down as long as I see someone with their hand up.
5: If no-one to my left has their hand up, I put mine up.
6: I check again if anyone to my left has meanwhile put their hand up. If so, I put mine back down and start over. Otherwise, I keep my hand up.
7: Everyone to my right goes first, so I look to my right and wait until I don't see any hands up.
8: Once all hands to my right are down, I can enter the critical section.
1: When I'm done, I put my hand back down.

I implemented the whole idea from above (including M&A) with this algorithm and am currently testing the heck out of it and so far it seems very stable.

The implementation is very straight forward. There were really only two additional caveats I had to consider (unless, of course, I missed something (pointers welcome)):

If a process makes it to line 7 in the algorithm, it might end up hitting the goto, in which case, in my implementation (see here), the entry to the critical section is denied and the process tries again later. When the retry happens (possibly with a big delay), I need to refresh the process's flag in such a way that it didn't have even the tiniest moment during which the flag was expired, or, if it did, I need to detect that and jump back to line 3 in the algorithm instead of continuing at line 7.
I added a check whether the critical section needs to be entered at all (i.e. a statement that limits the rate at which the critical section is entered). It is most efficient to perform this check right before a process even tries to enter the critical section. To make 100% sure that no thread can ever exceed the rate limit, though, a second check will need to be performed when a process successfully entered the critical section.

Due to the shared data-structures that my code uses, it's currently not really in a state that makes it meaningful to post here, but with a little bit of work, I could put together a self-contained version. If anyone is interested, please leave a comment and I will do so.