On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16
.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workload that is highly independent use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically what you are doing is supplying GPU with some alternative work it can do (and fill stalls and bubbles and idles with and giving GPU the choice) in the case of same queue family. And there is some potential to better use CPU (e.g. singlethreaded vs one queue per thread).
Using separate transfer queues (or other specialized family) seem to be the recommended approach even.
That is generally speaking. More realistic, empirical, sceptical and practical view was already presented by SW and NB answers. In reality one does have to be bit more cautious as those queues target the same resources, have same limits, and other common restrictions, limiting potential benefits gained from this.
This AMD's Leveraging asynchronous queues for concurrent execution(2016) discusses a bit how it maps to their HW\driver. It shows potential benefits of using separate queue families. It says that although they offer two queues of compute family, they did not observe benefits in apps at that time. They say they have only one graphics queue, and why.
NVIDIA seems to have a similar idea of "asynch compute". Shown in Moving to Vulkan: Asynchronous compute.
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may get actually worse performance than just using one queue.
Note that even if you submit to only one queue an implementation may execute command buffers in parallel and even out-of-order (aka "in-flight"), see details on this in chapter chapter 2.2 of the specs or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and a synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workload in the same queue, and it doesn't seem there is any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.