I understand that complete GPUs are behemoths of computing - including every step of calculation, and memory. So obviously a GPU can compute whatever we want - it's Turing complete.
My question is in regard to a single shader on various GPUs ("Stream Processor"/"CUDA Core"):
Is it Turing complete?
Can I (in theory) compute an arbitrary function over arbitrary inputs by using a single shader?
I'm trying to understand at what "scale" of computation shaders live.
Did You mean shader as a program used to compute shading?
On wiki talk I found:
(...)Shader models 1.x and 2.0 are indeed not Turing complete, because
they lack a generalised iteration capability (they do have some
limited looping constructs, but this is effectively unrolled at
compile time, so the number of iterations must be constant).
Shader model 3.0, which is used in the latest PC hardware and on
Xbox 360, has fully general looping abilities and is Turing complete
in the theoretical sense. This rather nicely highlights the difference
between theory and practice, though! When people claim a device is
Turing complete, what they actually mean is "if this had infinite time
and infinite storage, it would be Turing complete". Shader model 3.0
is still extremely limited in register space and program instruction
count, so it fails rather badly when put to any real world test.
Note that even shader 1.x can become Turing complete if you allow
multiple passes of rendering operations. For instance it is trivial to
implement the Game of Life using repeated render-to-texture
operations. In this case the input and output textures provide a large
amount of storage space, and the repeated render calls fills in for
the missing iteration constructs. This is cheating, though, because it
is depending on the CPU to issue the render calls!
as an example of non-turing-complete languages:
Wikipedia Page on non-turing-complete Shaders
Generally it depends on shader language (and Your Turing-complete requirements), but I think that most recent shader languages can be called Turing complete (if we ignore any limitations of finite memory) bacause they can loop and read/write variables.
EDIT:
If I misunderstood Your question and You mean shader as shader processing unit (like Cuda core) then I think that single core should not be considered in category of Turing complete or not complete. GPU is not only built up on cores. Answering Your question you can program GPU with any number of cuda cores to "compute an arbitrary function over arbitrary inputs".