Welcome to the penultimate post in my series of blog posts on Vulkan. Thanks for following so far; there’s one more to go after this one!
In this post I’ll be doing some analysis of why and how Vulkan is an explicit API, and what exactly that means. There is a lot of mention of Vulkan being a low-level API, and in some ways that’s true, but a lot of work is still abstracted from developers to handle cross-vendor compatibility.
What most people really mean is that it’s an explicit API, in that you have to tell it exactly what you want, rather than relying on defaults or driver heuristics to do the right thing.
By not being explicit, older APIs suffer from a number of subtle issues that are hard to debug, and can be difficult, or even impossible, to fix or work around.
Often when using an application you’ll notice a bit of stuttering, where the application seems to freeze or lag for a period before returning to normal. In most cases this is fairly benign, but it can be grating. In a game, it can break immersion or make you miss that head shot; in a graphical UI, it’s just plain annoying. Probably the worst place it can occur is in VR applications, where stuttering can be the difference between an enjoyable experience and dizziness, headaches and vomiting.
Some of this can be attributed to the behaviour of the graphics API, or more specifically the behaviour of the implementations of that API. Older APIs are expected to protect you from yourself: there’s a lot of wording about how the API should operate, but nothing really makes a binding contract until the point you actually try to observe it.
There’s a lot of language in OpenGL ES, for example, that says implementations should behave as if operations occur a certain way, which leaves implementations a lot of leeway to improve performance or efficiency by rearranging things behind an application’s back. Since most of the dependencies are implicit in OpenGL ES, the driver decides for you what is actually a dependency, which means heuristics are used to determine how best to map something to the hardware.
Three very common examples of this are described below:
Texture Uploads
In OpenGL ES, a call to glTexSubImage2D takes a piece of source color data, and copies it into a texture. This seems fairly straightforward, right?
Question: When does the data upload happen? At the point you call the function? Pipelined between other functions somehow?
Answer: Probably none of the above!
Depending on the implementation you’re running on, there’s a good chance that texture uploads are deferred until they’re actually about to be used (i.e. in a draw call). There are a couple of reasons to do this:
- Batching work: uploading a sprite map is a good example. Creating one usually means lots of small uploads to portions of a texture, and an implementation can get better performance by doing them all at once.
- Improving the performance of badly behaved apps: a surprisingly high number of applications upload data to a texture multiple times in a row, completely overwriting what came before. To combat this, drivers wait until the app actually uses the texture before doing anything.
The trouble with an implicit API like OpenGL ES is that there’s no specific point at which an application says “I need this texture now”. There are a lot of places where the application probably needs it though, such as when binding it and then drawing with that texture bound. This means that unless an application triggers an upload via a dependency, the upload won’t happen until the first draw call that uses it. Draw calls are in your main render loop, and a texture upload is a comparatively slow and expensive operation; running it right before a draw call can lead to a stutter as the data gets transferred.
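As a rough illustration of that behaviour, the following is a toy C++ model of an implementation that queues uploads and only performs them at the first draw. It is not real driver code: `ToyTexture`, `texSubImage` and `draw` are hypothetical names invented for this sketch, and a plain byte vector stands in for GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An upload that has been requested but not yet carried out.
struct PendingUpload {
    std::size_t offset;               // where in the texture the data goes
    std::vector<unsigned char> data;  // driver-side copy of the caller's data
};

// Hypothetical texture object; `storage` stands in for GPU memory.
struct ToyTexture {
    std::vector<unsigned char> storage;
    std::vector<PendingUpload> pending;  // uploads queued but not performed
    int transfers = 0;                   // batched transfers actually carried out

    explicit ToyTexture(std::size_t size) : storage(size, 0) {}
};

// Similar in spirit to glTexSubImage2D: returns at once, having only queued work.
void texSubImage(ToyTexture& tex, std::size_t offset,
                 const std::vector<unsigned char>& src) {
    assert(offset + src.size() <= tex.storage.size());
    tex.pending.push_back({offset, src});
}

// The first draw that uses the texture pays for every queued upload in one go.
// Returns the number of bytes transferred as part of this draw.
std::size_t draw(ToyTexture& tex) {
    std::size_t bytes = 0;
    for (const PendingUpload& up : tex.pending) {
        for (std::size_t i = 0; i < up.data.size(); ++i)
            tex.storage[up.offset + i] = up.data[i];
        bytes += up.data.size();
    }
    if (!tex.pending.empty()) ++tex.transfers;
    tex.pending.clear();
    return bytes;
}
```

In this model any number of `texSubImage` calls return immediately and leave `storage` untouched; the first `draw` then pays for all of them at once, which is where the stutter would show up in a render loop.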
Shader Compilation
Shader compilation is in a very similar situation to texture uploads: it is an expensive operation, though this time the cost is CPU time rather than bandwidth or GPU time. Yet again, calling glCompileShader is unlikely to directly compile your shader; it just adds it to a list of things that should be compiled.
On some implementations, this compilation is farmed out to another thread, allowing the main rendering thread to continue uninhibited. As we saw in the last blog post, it’s quite hard for applications to do threading well, so some implementations just do it for you, bypassing the awkwardness of the spec. After years of drivers doing this behind applications’ backs, it has actually been crystallized into a proper extension: GL_ARB_parallel_shader_compile.
Some implementations may completely defer the work until the point that you actually bind it for use and draw with it – again in the same way as texture uploads, and suffering from the same problems of stuttering, though this time on the CPU. Again this is mostly to deal with historically bad applications – there’s no point compiling a shader if an application doesn’t actually use it, as compilation is a very expensive operation.
Draw Call Submission
Now to the most fundamental operation in the API – draw calls. On most architectures, it’d be a huge performance hit to flush a draw call through the hardware as soon as it is called: modern GPUs are complex, pipelined processors that have an amount of overhead for submitting work to them – not a significant amount if there’s only a few submissions, but it would add up quickly if each draw was a separate submission of work.
In order to avoid this, GPU drivers internally use their own command buffers, which get submitted at some point. The “at some point” is important: again, OpenGL ES is implicit, so a driver has to take cues from any implicit dependencies. For instance, eglSwapBuffers is commonly used as a submission point, and there are others.
Another big implicit issue in OpenGL ES that isn’t entirely obvious from the API is the use of memory. Applications usually have a lot of control over their own allocations, but allocations within an OpenGL ES implementation are broadly opaque. Superficially, it might seem like the biggest memory allocations in the API (textures and buffers) should be reasonably easy to figure out. In actual fact, there’s a whole lot of shenanigans going on in your typical OpenGL ES driver which make that determination very difficult to make.
For instance, as mentioned above, a texture might not get uploaded until the first draw call that uses it. If no draw call ever uses it, it might not even get allocated, and there’s no explicit way to determine this.
If memory consumption gets out of hand, the program may simply terminate. On some operating systems, once an application uses more than a certain amount of memory, it gets terminated with little more than a debug message to let you know what happened. Even if the OS doesn’t terminate the program directly, most applications are unable to recover from an out of memory condition, and will either stop working because they caught the error, or inexplicably crash or cease rendering because they didn’t catch it.
We’ve already covered that the GPU runs asynchronously to the CPU, despite how the API appears, and draw calls are buffered up, rather than being submitted at once. Suppose you have an application that draws with a texture, then uploads a bit of data to that texture, draws again, uploads a bit more, and so on. In order for that to make sense, the texture update has to happen, and the next draw then has to be completely done with that texture before uploading the next portion of data. In other words, the uploads and draws have to be serialised – the problem being that texture uploads are supposed to complete immediately (unless you’re using pixel unpack buffers). We’ve already mentioned that kicking each draw and waiting for it would be really bad for performance, so what to do?
The relatively simple answer is to do the same for uploads as is done for draw calls – perform them asynchronously. But this is complex – there’s a CPU pointer that’s been passed into the function, and the API says that it’s safe to delete that data after the function completes. The only way to really do this is to copy the data being modified into another bit of memory, in a process called ghosting. In extreme cases like the one above, this leads to multiple copies of the texture existing throughout the scene, each consuming more and more memory that’s completely opaque to the application.
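A toy model makes the memory growth easier to see. The sketch below is an illustration under simplifying assumptions, not how any particular driver is written: `GhostingTexture` and its members are hypothetical, and shared pointers stand in for the driver’s tracking of which queued draws still reference which copy of the texture.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Texels = std::vector<unsigned char>;

struct GhostingTexture {
    std::shared_ptr<Texels> current;                // version new draws will read
    std::vector<std::shared_ptr<Texels>> inFlight;  // versions held by queued draws

    explicit GhostingTexture(std::size_t size)
        : current(std::make_shared<Texels>(size, 0)) {}

    // Queue a draw: it keeps the current version alive until the GPU is done.
    void draw() { inFlight.push_back(current); }

    // Update part of the texture. If a queued draw still reads the current
    // version, it cannot be touched, so the whole texture is copied ("ghosted")
    // and the copy is modified instead.
    void update(std::size_t offset, const Texels& src) {
        assert(offset + src.size() <= current->size());
        if (current.use_count() > 1)
            current = std::make_shared<Texels>(*current);
        for (std::size_t i = 0; i < src.size(); ++i)
            (*current)[offset + i] = src[i];
    }

    // The GPU has finished every queued draw, so old versions can be freed.
    void gpuFinished() { inFlight.clear(); }

    // Number of distinct full-size copies of the texture alive right now.
    std::size_t liveCopies() const {
        std::vector<const Texels*> seen{current.get()};
        for (const auto& version : inFlight)
            if (std::find(seen.begin(), seen.end(), version.get()) == seen.end())
                seen.push_back(version.get());
        return seen.size();
    }
};
```

Alternating `draw` and `update` calls adds one full copy of the texture per pair until `gpuFinished` runs, and none of that memory is visible to the application.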
A good write up of how to do texture uploads well in OpenGL ES on our architecture can be found here, and it goes into more detail about how this works than I can really fit here.
Vulkan is explicit
Vulkan attempts to turn as much of the problematic implicitness of older APIs as possible into explicit application choices: there are hardly any implicit dependencies left, and memory allocation is handled entirely by the application. A combination of all the explicit portions of the API is needed in order to fix the problems described above; no one part really solves those issues in isolation.
The method of allocating a resource in Vulkan is significantly more explicit than it was in OpenGL ES. The sequence of commands required to create and allocate an image, for instance, is roughly as follows:
- Create an image object with the desired format, size, etc.
- Query the image object for its memory requirements
- Pick a suitable memory type/heap to allocate from
- Allocate that memory
- Bind the memory to the image
Those steps aren’t entirely trivial either. For instance, you should really be allocating large memory objects and binding portions of them to each resource; naively allocating a lot of small memory objects may have a negative performance impact. On some platforms there’s even a limit on the number of memory objects you can create, which could be hit well before the system is out of memory.
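The less obvious parts of the steps above, picking a memory type and binding resources into portions of one large allocation, can be sketched as follows. To keep the sketch compilable without the Vulkan headers, `MemoryRequirements` and `MemoryType` are simplified stand-ins for `VkMemoryRequirements` and `VkMemoryType`, and `pickMemoryType` and `Block` are hypothetical helpers. The linear sub-allocator never frees anything, which a real application would have to deal with.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for the VkMemoryRequirements fields that matter here.
struct MemoryRequirements {
    std::uint64_t size;
    std::uint64_t alignment;       // always a power of two
    std::uint32_t memoryTypeBits;  // bit i set => memory type i is allowed
};

// Stand-in for VkMemoryType.
struct MemoryType {
    std::uint32_t propertyFlags;
};

// Same values as VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT / HOST_VISIBLE_BIT.
constexpr std::uint32_t kDeviceLocal = 0x1;
constexpr std::uint32_t kHostVisible = 0x2;

// Pick the first memory type the resource allows that has every wanted property.
std::optional<std::uint32_t> pickMemoryType(const std::vector<MemoryType>& types,
                                            std::uint32_t allowedBits,
                                            std::uint32_t wanted) {
    assert(types.size() <= 32);
    for (std::uint32_t i = 0; i < types.size(); ++i) {
        const bool allowed = (allowedBits & (1u << i)) != 0;
        const bool suitable = (types[i].propertyFlags & wanted) == wanted;
        if (allowed && suitable) return i;
    }
    return std::nullopt;
}

// One large memory object, carved up linearly between many resources.
struct Block {
    std::uint64_t capacity;
    std::uint64_t used = 0;

    // Returns the offset to bind the resource at, or nullopt if it does not fit.
    std::optional<std::uint64_t> suballocate(const MemoryRequirements& req) {
        assert(req.alignment != 0 && (req.alignment & (req.alignment - 1)) == 0);
        const std::uint64_t offset =
            (used + req.alignment - 1) & ~(req.alignment - 1);  // round up
        if (offset + req.size > capacity) return std::nullopt;
        used = offset + req.size;
        return offset;
    }
};
```

In real code, `Block` would wrap a single `vkAllocateMemory` call, and the offset returned by `suballocate` is what would be passed to `vkBindImageMemory` or `vkBindBufferMemory`.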
Explicit data transfer
Uploading data to resources is more explicit too. It is now done via either a direct memory map operation for resources with linear memory (buffers, linear images), or a copy from another resource for non-linear memory (device-optimal images). For instance, to upload a texture, you’d likely do the following chain of operations:
- Create a buffer and bind memory to it
- Persistently map that memory to the host CPU
- Read your texture data from disk, directly into that mapped memory
- Submit a command to copy data from the buffer to the image
There’s also a lot of synchronization to handle if you’re doing the sequence above dynamically, as you have both device and host operations which operate completely asynchronously. To avoid copying data out of the buffer whilst you’re still writing it, or overwriting the buffer with new data before an earlier copy has finished, synchronization objects need to be used to coordinate the two. There’s no real scope for ghosting going on behind your back: data hazards and flushing are very explicitly exposed in Vulkan.
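The hazard can be shown with a small toy model. Nothing here calls Vulkan: `StagingUpload` is a hypothetical type, and its `fenceSignaled` flag stands in for a `VkFence` passed to `vkQueueSubmit` alongside a `vkCmdCopyBufferToImage` command. In a real application an unsynchronised write would not fail politely as it does here; it would simply race with the copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;

struct StagingUpload {
    Bytes staging;              // host-visible buffer, persistently mapped
    Bytes image;                // device-optimal image, not host-writable
    bool fenceSignaled = true;  // stands in for a fence guarding the last copy

    explicit StagingUpload(std::size_t size) : staging(size, 0), image(size, 0) {}

    // Host write into the mapped buffer. Refuses to run while a copy is still
    // in flight, which is the check the application is responsible for.
    bool write(const Bytes& src) {
        assert(src.size() <= staging.size());
        if (!fenceSignaled) return false;
        std::copy(src.begin(), src.end(), staging.begin());
        return true;
    }

    // Submit the buffer-to-image copy. Returns at once; nothing is copied yet.
    void submitCopy() { fenceSignaled = false; }

    // The device gets round to the copy at some later time, then signals.
    void deviceExecutes() {
        if (fenceSignaled) return;
        image = staging;
        fenceSignaled = true;
    }
};
```

The point of the model is that the wait belongs to the application: only once the fence has signalled is it safe to reuse the staging buffer for the next upload.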
Explicit dependencies and synchronization
Almost any operation you can think of in Vulkan needs some form of synchronization. Vulkan provides very few implicit ordering guarantees, even between individual commands in a command buffer. Different architectures process work in completely different orders: a Tile-Based Deferred Renderer (TBDR) executes all vertex processing well before any fragment processing, whereas an Immediate Mode Renderer (IMR) pipelines everything together. By explicitly stating that there’s no guaranteed order between many operations, Vulkan allows the implementation to run as fast as possible, whilst still letting the application explicitly ask for a guaranteed order where it really needs one.
This is in contrast to OpenGL ES, where an application could assume that it would get everything in a guaranteed order, and an implementation had to try to pick apart anywhere that could potentially be optimised.
The same is true of CPU to GPU work: Vulkan lets the host wait on events from the GPU, and gives it some ability to directly trigger events for the GPU.
Explicit work submission
My favourite topic! Queue submission is separated from command generation. I cannot stress enough how important this part of the API is for so many reasons. I’m not going to go into details here again, because I’ve covered it in my previous posts, but suffice to say this helps out here too. Vulkan guarantees that no drawing work will begin when you call a draw function; all that happens is that the draw call gets recorded into a command buffer.
Once you have created your command buffers, you control exactly when submission happens by submitting them to a queue. No GPU work can happen on a command buffer before that point, and the application can receive explicit notification of when that work has finished as well.
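The split between recording and submission can be modelled in a few lines. This is a deliberately simplified sketch with hypothetical types (`ToyCommandBuffer`, `ToyQueue`): the toy queue executes work synchronously inside `submit`, whereas a real queue runs asynchronously and reports completion through a fence or similar object.

```cpp
#include <cassert>
#include <vector>

// Recording a draw only appends to a list; no work is started.
struct ToyCommandBuffer {
    std::vector<int> draws;  // vertex count of each recorded draw

    void draw(int vertexCount) {
        assert(vertexCount >= 0);
        draws.push_back(vertexCount);
    }
};

// Work only happens once a command buffer is handed to the queue.
struct ToyQueue {
    long verticesProcessed = 0;
    bool fenceSignaled = false;  // explicit notification that work has finished

    void submit(const ToyCommandBuffer& cmd) {
        for (int count : cmd.draws) verticesProcessed += count;
        fenceSignaled = true;
    }
};
```

However many draws are recorded, `verticesProcessed` stays at zero until `submit` is called, so the application decides where in the frame that cost lands.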
Explicit render pass delimiting
A huge number of implicit flushes in OpenGL ES, particularly for tiling architectures, are caused by operations that happen when a tiler would otherwise be mid-render. Vulkan explicitly delimits where render passes both begin and end, with that information being used by a tile-based GPU to handle transitioning data to and from the framebuffer.
As well as this, Vulkan forbids any operations inside such a section that would cause a flush in the middle of a render pass. This is useful for the simple fact that it makes application developers consider what they’re doing when they introduce a dependency. Perhaps more importantly, it means there’s only one point at which an application specifies tile load/store operations, so a tile-based GPU has all the information it needs in order to perform this transition as efficiently as possible, rather than being forced to load and store everything like it would in OpenGL ES.
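To give a feel for why those load/store choices matter, here is a back-of-the-envelope cost model. It assumes an idealised tile-based GPU where framebuffer memory is only touched when an attachment is loaded into or stored out of tile memory. The enums mirror the options of Vulkan’s `VkAttachmentLoadOp` and `VkAttachmentStoreOp`, but `Attachment` and `framebufferTraffic` are hypothetical names, and real hardware will not match these numbers exactly.

```cpp
#include <cassert>
#include <cstdint>

enum class LoadOp { Load, Clear, DontCare };  // what happens at render pass begin
enum class StoreOp { Store, DontCare };       // what happens at render pass end

struct Attachment {
    std::uint64_t width;
    std::uint64_t height;
    std::uint64_t bytesPerPixel;
    LoadOp loadOp;
    StoreOp storeOp;
};

// Bytes moved between framebuffer memory and tile memory for one render pass,
// under the idealised assumptions described above.
std::uint64_t framebufferTraffic(const Attachment& a) {
    assert(a.bytesPerPixel > 0);
    const std::uint64_t size = a.width * a.height * a.bytesPerPixel;
    std::uint64_t bytes = 0;
    if (a.loadOp == LoadOp::Load) bytes += size;     // read old contents in
    if (a.storeOp == StoreOp::Store) bytes += size;  // write results out
    return bytes;
}
```

Under this model a depth buffer that is cleared at the start of the pass and never needed afterwards costs no framebuffer bandwidth at all, while an attachment that is both loaded and stored costs twice its size every pass.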
That’s only a fairly broad overview, and it might not make complete sense unless you’re familiar with tile-based architectures. Trying to go into depth here would require a whole blog post – which is conveniently what I’ll be doing in the final blog post of this series, where I will take a detailed look at the render pass structure, as well as other architecture-positive changes to the API.
Vulkan provides a number of explicit mechanisms for many operations that were either hidden or very coarsely specified in OpenGL ES. This can be a double-edged sword though:
- On the one hand it allows applications to be more expressive. In turn, good use of the API will result in performance speedups not previously possible with older APIs.
- However, there’s no option not to do this: it forces the application to really consider how it expresses itself. If an application gets it wrong, it either gets incorrect results because it didn’t express all of its dependencies, or poor performance because it was over-zealous.
Yet again it falls to applications to be very careful with what they do here. There’s potential to squeeze more performance out of a GPU, but an almost equal amount of potential to slow it down; performance tuning tools like PVRTune will be absolutely critical to writing a well-performing application.
In case you missed them, also check out my other blog posts and webinars on Vulkan.