How efficient is a try_lock on a mutex?

How efficient is a try_lock on a mutex? I.e. roughly how many assembler instructions does it involve, and how much time do they take, in both possible cases (the mutex was already locked, or it was free and could be locked)?

In case you have trouble answering the question, here is how to approach it (in case that is really unclear):

If the answer depends a lot on the OS implementation and hardware: please answer it for common OSes (e.g. Linux, Windows, Mac OS X), recent versions of them (in case they differ a lot from earlier versions) and common hardware (x86, amd64, ppc, arm).

If it also depends on the library: take pthread as an example.
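To make the two cases concrete, here is a rough sketch of what I mean, using pthread. The helper names `trylock_status` and `avg_trylock_ns` are made up for illustration, and the sketch assumes a default (non-recursive) mutex, for which POSIX specifies that `pthread_mutex_trylock` returns `EBUSY` when the mutex is already locked, even by the calling thread. The timing loop is only a crude estimate: in the "free" case it also includes the cost of the matching unlock, and it says nothing about contention between threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: return value of one trylock attempt.
 * already_locked == 0: the mutex is free, so 0 is expected.
 * already_locked != 0: the mutex is held, so EBUSY is expected. */
static int trylock_status(int already_locked)
{
    pthread_mutex_t m;
    int rc;

    pthread_mutex_init(&m, NULL);      /* default attributes */
    if (already_locked)
        pthread_mutex_lock(&m);

    rc = pthread_mutex_trylock(&m);

    /* Exactly one unlock is owed in either case: for the trylock that
     * succeeded (free case) or for the initial lock (busy case). */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}

/* Hypothetical helper: average wall-clock nanoseconds per loop iteration.
 * In the free case each iteration is trylock + unlock; in the busy case
 * each iteration is a single failing trylock. */
static double avg_trylock_ns(int already_locked, long iterations)
{
    pthread_mutex_t m;
    struct timespec t0, t1;

    assert(iterations > 0);
    pthread_mutex_init(&m, NULL);
    if (already_locked)
        pthread_mutex_lock(&m);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++) {
        if (pthread_mutex_trylock(&m) == 0)
            pthread_mutex_unlock(&m);  /* acquired: release for next round */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (already_locked)
        pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);

    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9
            + (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iterations;
}
```

Calling `avg_trylock_ns(0, n)` and `avg_trylock_ns(1, n)` for some large `n` gives one number per case; what I am asking is what those numbers typically look like and what instructions are behind them.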

Please also say whether they really differ at all. If they do, please state the differences: what do they do differently? Which common algorithms exist? Are there several different algorithms in use, or do all common systems (common as per the list above, in case that is unclear) implement mutexes in essentially the same way?

As per this Meta discussion, this really should be a separate question.

Also, I have asked this separately from the question about the performance of a lock because I am not sure whether try_lock may behave differently, which may in turn depend on the implementation. Then again, please answer it for common implementations. And this very similar/related question obviously shows that this is an interesting question which can be answered.

A mutex is a logical construct that is independent of any implementation. Operations on mutexes are therefore neither efficient nor inefficient - they are simply defined.

Your question is therefore akin to asking "How efficient is a car?", without reference to what kind of car you might be talking about.

I could implement mutexes in the real world with smoke signals, carrier pigeons or a pencil and paper. I could also implement them on a computer. I could implement a mutex with certain operations on a Cray 1, on an Intel Core 2 Duo, or on the 486 in my basement. I could implement them in hardware. I could implement them in software in the operating system kernel, or in userspace, or using some combination of the two. I might simulate mutexes (but not implement them) using lock-free algorithms that are guaranteed conflict-free within a critical section.
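To illustrate just one point in that space, here is a minimal sketch of a purely userspace try_lock built on a C11 `atomic_flag`. The names `toy_mutex`, `TOY_MUTEX_INIT`, `toy_try_lock` and `toy_unlock` are invented for this example; this is not how pthread or any particular system library implements its mutexes, and it has no blocking lock operation at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy mutex: a single atomic flag, no waiting queue,
 * no owner tracking, no kernel involvement. */
typedef struct {
    atomic_flag locked;
} toy_mutex;

#define TOY_MUTEX_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock without waiting.
 * Returns true if the caller now holds it, false if it was already held. */
static bool toy_try_lock(toy_mutex *m)
{
    assert(m != NULL);
    /* test_and_set atomically sets the flag and returns its previous
     * value; a previous value of false means the lock was free. */
    return !atomic_flag_test_and_set_explicit(&m->locked,
                                              memory_order_acquire);
}

/* Release the lock; only the current holder should call this. */
static void toy_unlock(toy_mutex *m)
{
    assert(m != NULL);
    atomic_flag_clear_explicit(&m->locked, memory_order_release);
}
```

With this particular design both outcomes of the try cost the same single atomic read-modify-write. A different implementation (recursive, error-checking, priority-inheriting, kernel-assisted) would have a different cost profile, which is exactly the point.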

EDIT: Your subsequent edits don't help the situation. "In a low level language (like C or whatever)" is mostly irrelevant, because then we're into measuring language implementation performance, and that's a slippery slope at best. "[F]rom pthread or whatever the native system library provides" is similarly unhelpful, because as I said, there are so many ways that one could implement mutexes in different environments that it's not even a useful comparison to make.

This is why your question is unanswerable.