
Are GPU shaders Turing complete

Published 2020-04-02 08:36

Question:

I understand that a complete GPU is a behemoth of computing, with every stage of calculation and its own memory. So obviously a GPU can compute whatever we want; it's Turing complete.

My question is about a single shader on various GPUs (a "Stream Processor" or "CUDA core"):
Is it Turing complete?
Can I (in theory) compute an arbitrary function over arbitrary inputs using a single shader?
I'm trying to understand at what "scale" of computation shaders live.

Answer 1:

Do you mean a shader as in a program used to compute shading?

On a Wikipedia talk page I found:

(...)Shader models 1.x and 2.0 are indeed not Turing complete, because they lack a generalised iteration capability (they do have some limited looping constructs, but this is effectively unrolled at compile time, so the number of iterations must be constant).

Shader model 3.0, which is used in the latest PC hardware and on Xbox 360, has fully general looping abilities and is Turing complete in the theoretical sense. This rather nicely highlights the difference between theory and practice, though! When people claim a device is Turing complete, what they actually mean is "if this had infinite time and infinite storage, it would be Turing complete". Shader model 3.0 is still extremely limited in register space and program instruction count, so it fails rather badly when put to any real world test.

Note that even shader 1.x can become Turing complete if you allow multiple passes of rendering operations. For instance it is trivial to implement the Game of Life using repeated render-to-texture operations. In this case the input and output textures provide a large amount of storage space, and the repeated render calls fill in for the missing iteration constructs. This is cheating, though, because it is depending on the CPU to issue the render calls!
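The multipass trick described in the quote can be sketched on the CPU. In this illustrative Python sketch (the function names are mine, not from any real graphics API), each "render pass" is a pure per-cell rule with no iteration of its own, like a pixel shader sampling an input texture, while all looping lives in the host code that issues the passes:

```python
# Sketch of Game of Life via repeated "render-to-texture" passes.
# life_pass plays the role of the shader: a pure function from an input
# "texture" (grid) to an output one, evaluated independently per cell.
# run plays the role of the CPU issuing one render call per generation --
# the "cheating" part, since the iteration happens outside the shader.

def life_pass(src):
    """One Game of Life step; acts like a fragment shader sampling src."""
    h, w = len(src), len(src[0])
    dst = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Count live neighbours, sampling with wrap-around (a torus).
            n = sum(src[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            dst[y][x] = 1 if n == 3 or (src[y][x] and n == 2) else 0
    return dst

def run(grid, passes):
    """The CPU-side driver: one 'render call' per generation."""
    for _ in range(passes):
        grid = life_pass(grid)
    return grid

# A glider on a 5x5 torus: after 4 passes it reappears shifted by (1, 1).
glider = [[0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [1, 1, 1, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]
```

Since the Game of Life is itself Turing complete, being able to run it this way shows that textures plus repeated fixed-function passes recover general computation, even though no single pass can loop.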

As an example of non-Turing-complete languages, see the Wikipedia page on non-Turing-complete shaders.

Generally it depends on the shader language (and on your Turing-completeness requirements), but I think most recent shader languages can be called Turing complete (if we ignore the limitations of finite memory), because they can loop and read/write variables.

EDIT:

If I misunderstood your question and you mean a shader as a shader processing unit (like a CUDA core), then I think a single core should not be considered in the category of Turing complete or not complete; a GPU is not built up only of cores. To answer your question: you can program a GPU with any number of CUDA cores to "compute an arbitrary function over arbitrary inputs".