Flow Control on PowerVR – Optimising Shaders


We’re back again with another excerpt from our new documentation website. Today, we’ll be looking at flow control and branching in shaders. This series has already covered a range of topics from mipmapping to balancing GPU workloads and we’re really glad you’ve been enjoying these little titbits so far. Judging by the traffic stats, you’re taking the time to have a read of the site afterwards, so that’s fantastic to see.

For those of you who are unaware, docs.imgtec.com is packed with plenty of information for graphics developers, both new and more experienced.

Those of you who are developing for PowerVR hardware are obviously going to get the most out of this site but everybody is sure to find something to interest them, so why not take a look?

With that preamble out of the way, let’s take a quick look at another one of our PowerVR performance recommendations: flow control in shaders.


So we’ll start with the good news: PowerVR hardware supports flow control in both vertex and fragment shaders by default, i.e. without having to explicitly enable any extensions.

That’s great, but what is flow control?

Well, flow control is simply controlling the execution path through a shader by branching or looping, using statements such as if, else, and for. This can often lead to multiple branching paths within a shader, which are executed based on some kind of condition. Flow control is a very basic concept when programming for a CPU, but it's slightly more complicated in shaders because of the highly parallelised nature of GPUs.

An example of this branching can be seen in the fragment shader of one of our PowerVR SDK examples, GaussianBlur.

mediump float imageCoordX = gl_FragCoord.x - 0.5;
mediump float windowWidth = config.x;
mediump float xPosition = imageCoordX / windowWidth;
mediump vec3 col = vec3(0.0);
if (xPosition > 0.497 && xPosition < 0.503)
    col = vec3(1.0); // draw a thin white divider line down the middle
else if (xPosition < 0.5)
    col = texture(sOriginalTexture, vTexCoords[NumGaussianWeightsAndOffsets]).rgb;
else
    col = Weights[0] * texture(sTexture, vTexCoords[0]).rgb +
          Weights[1] * texture(sTexture, vTexCoords[1]).rgb +
          Weights[2] * texture(sTexture, vTexCoords[2]).rgb +
          Weights[3] * texture(sTexture, vTexCoords[3]).rgb +
          Weights[4] * texture(sTexture, vTexCoords[4]).rgb +
          Weights[4] * texture(sTexture, vTexCoords[5]).rgb +
          Weights[3] * texture(sTexture, vTexCoords[6]).rgb +
          Weights[2] * texture(sTexture, vTexCoords[7]).rgb +
          Weights[1] * texture(sTexture, vTexCoords[8]).rgb +
          Weights[0] * texture(sTexture, vTexCoords[9]).rgb;
oColor = vec4(col, 1.0);

In the example above, the execution path depends on the value of xPosition. xPosition measures the horizontal position along the screen, so conditional branching can be used to perform different processing on each half of the image. The result of this branching can be seen clearly in the image of this example below.

Screenshot demonstrating how branching is used in a PowerVR SDK example, GaussianBlur

In general, when we’re talking about flow control in shaders, we’re usually referring to one of two things:

Static Flow Control

This is a case where a shader has two or more branching paths in code which are conditionally selected depending on the value of some uniform variable. Uniform variables are the same across all vertices/fragments, so the same shader path is executed across all vertices/fragments in a single draw call.
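A minimal sketch of a static branch, assuming a hypothetical uniform uRenderMode that selects the path for the whole draw call:

```glsl
// uRenderMode is a hypothetical uniform: it has the same value for every
// fragment in the draw call, so every invocation takes the same branch.
uniform mediump int uRenderMode;

out mediump vec4 oColor;

void main()
{
    if (uRenderMode == 0)
        oColor = vec4(1.0, 0.0, 0.0, 1.0);   // e.g. a debug visualisation path
    else
        oColor = vec4(0.0, 0.0, 1.0, 1.0);   // e.g. the normal shading path
}
```

Because every invocation agrees on the condition, the hardware only ever walks one of the two paths for the whole draw call.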

Static flow control is sometimes used to combine many smaller shaders into one large shader (an uber shader!). The shader that is going to be executed is then conditionally selected during runtime. However, often a better solution is just to use preprocessor directives to generate multiple shaders from the uber shader during compilation. This means you can create many shaders from a single source file.
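As a sketch of the preprocessor alternative, the same source file can be compiled twice, once with and once without a define (ENABLE_FOG, uFogColour, and vFogFactor are assumed names here), with the define prepended by the application before compilation:

```glsl
// Compiled once with "#define ENABLE_FOG" prepended and once without,
// this single source yields two specialised shaders with no runtime branch.
mediump vec3 col = texture(sTexture, vTexCoord).rgb;
#ifdef ENABLE_FOG
col = mix(col, uFogColour.rgb, vFogFactor);   // fog inputs are assumed
#endif
oColor = vec4(col, 1.0);
```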

Dynamic Flow Control

This one’s a bit more tricky. Again, a shader with dynamic flow control has multiple branching execution paths, but this time the condition which controls the branching changes on a per-vertex or per-fragment basis, often based on data such as texture samples or interpolated vertex attributes. This means the shader could potentially have to execute a different path from one vertex or fragment to the next.

So why is this a problem? Well, a graphics core uses a single instruction, multiple data (SIMD) architecture, which means all processors in the core must execute the same instruction at the same time. If a graphics core is executing a group of shader invocations (for example, when a fragment shader is processing a set of fragments), then all of the invocations must follow the same path. If the invocations disagree at a branch, the group has to step through both paths in turn, with each invocation throwing away the results of the path it did not take. This makes dynamic flow control's effect on performance much less predictable than that of static flow control.
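For contrast with the static case, here is a sketch of a dynamic branch, where the condition comes from a per-fragment texture sample (the mask texture and threshold are illustrative):

```glsl
// The condition varies per fragment, so invocations within one SIMD group
// may disagree, and the group then steps through both paths.
mediump float mask = texture(sMaskTexture, vTexCoord).r;
mediump vec3 col;
if (mask > 0.5)
    col = texture(sDetailTexture, vTexCoord).rgb;   // expensive path
else
    col = vec3(0.0);                                // cheap path
```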

Recommendations for PowerVR GPUs

So now we’ve covered a little about flow control, here are a few of our recommendations:

Avoid using discard in conditional branches

It is usually best to avoid branching to discard when developing for PowerVR devices. Using discard in the fragment shader negates some of the key benefits of PowerVR’s TBDR architecture.

This mainly affects hidden surface removal (HSR), as this operation assumes all of the fragments of an opaque object are going to be drawn, occluding anything behind them. If fragments can potentially be discarded in the fragment shader the hardware can no longer assume this, meaning the GPU has to wait until the fragment shader has finished before determining which fragments are visible. This invalidates the “deferred” part of the tile-based deferred rendering (TBDR) and can reduce the performance of an application on PowerVR.

Our advice is to use alpha blending instead.
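A sketch of that substitution, with an assumed alpha threshold of 0.1; the blended version requires alpha blending (for example SRC_ALPHA / ONE_MINUS_SRC_ALPHA) to be enabled by the application:

```glsl
mediump vec4 texel = texture(sTexture, vTexCoord);

// Avoid on PowerVR: branching to discard prevents the hardware from
// resolving visibility before fragment shading.
// if (texel.a < 0.1)
//     discard;

// Prefer: write the alpha out and let the blend stage hide the fragment.
oColor = texel;
```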

Avoid sampling textures in conditional branches (PowerVR Series5 and Series5XT only)

When developing for PowerVR Series5 and Series5XT, avoid branching to a texture read, as using a sampler in a dynamic branch qualifies as a dependent texture read.

A dependent texture read occurs when the coordinates used to sample the texture depend on some calculation in the shader, rather than coming directly from a varying. In a normal texture read, the hardware can fetch texture data before the fragment shader starts, hiding the latency of sampling. With a dependent texture read, the texture coordinates can’t be predicted ahead of time, so texture data can’t be pre-fetched, leading to greater latency and stalls. This can have a really noticeable effect on performance.
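To illustrate the distinction (the scale and offset uniforms are assumed names):

```glsl
// Non-dependent read: the coordinates come straight from a varying, so
// the hardware can pre-fetch the texel before the shader runs.
mediump vec3 direct = texture(sTexture, vTexCoord).rgb;

// Dependent read: the coordinates are computed in the shader, so the
// texel cannot be pre-fetched; on Series5/5XT this can stall the shader.
mediump vec3 dependent = texture(sTexture, vTexCoord * uScale + uOffset).rgb;
```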

From PowerVR Series6 onwards, dependent texture reads are much more efficient, so this is less important on those architectures, but every little performance boost helps when you’re trying to optimise your application.

Try to use branching to skip unnecessary operations

Finally, it is a good idea to use conditional branching to skip unnecessary operations. This has the greatest impact on performance when a significant number of invocations can actually take the cheap path, and ideally when whole groups of invocations agree on the condition.
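A sketch of this kind of skip, using an assumed distance-based falloff and assumed lighting terms:

```glsl
// If the light's attenuation is effectively zero, skip the lighting maths
// entirely; this pays off when many fragments qualify for the cheap path.
mediump float dist  = length(uLightPos - vWorldPos);          // assumed inputs
mediump float atten = uLightIntensity / (1.0 + dist * dist);  // assumed falloff
mediump vec3 col = uAmbientColour;
if (atten > 0.001)
    col += atten * (vDiffuse + vSpecular);   // assumed precomputed terms
```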

Optimising shaders for OpenGL ES 3.0

If you’re using OpenGL ES 3.0 and want to optimise any branching in your shaders, it might be worth taking a look at the extension GL_EXT_shader_group_vote.

To illustrate how this extension works, consider some basic branching like this:

if (condition)
    result = do_fast_path();
else
    result = do_general_path();

As mentioned before, sets of shader invocations in a graphics core must all execute the same code path. In the example above, if the condition is true for even a single invocation in the group, then do_fast_path() is executed for that invocation while the rest of the invocations sit idle. Once do_fast_path() returns, the invocations that need do_general_path() execute it while the others wait in turn.

This is a bit of a pain because the shader is wasting resources by executing both the fast and the general path. Instead, we can modify the above code using the new built-in functions from this extension:

if (allInvocationsEXT(condition))
    result = do_fast_path();
else
    result = do_general_path();

The function allInvocationsEXT() only returns true if the given condition is met across the entire set of invocations. This is really useful because the result is the same for every invocation in the group, so the group as a whole executes either do_fast_path() or do_general_path(), but never both.

GL_EXT_shader_group_vote also provides two other built-in functions which, like allInvocationsEXT(), return the same value across all invocations in the same group.

These are:

  • anyInvocationEXT(bool value) – This returns true if value is true for at least one of the invocations in the group.
  • allInvocationsEqualEXT(bool value) – This returns true if value is the same for all invocations in the group.
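For instance, anyInvocationEXT() can gate an expensive refinement so that a group skips it entirely when no invocation needs it (needsRefinement and the refinement function are illustrative names):

```glsl
// If no invocation in the group needs refining, the whole group skips the
// block; otherwise the group steps through it together.
if (anyInvocationEXT(needsRefinement))
{
    // The inner test is still needed: the vote is true for the whole
    // group even when only some invocations need the work.
    if (needsRefinement)
        result = do_expensive_refinement();
}
```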

And finally…

For more PowerVR performance recommendations, and other useful developer information, take a look at our regularly-updated website at docs.imgtec.com.

Do feel free to leave feedback through our usual forums or ticketing systems.

You can also follow @tom_devtech on Twitter for developer technology-related news, or @powervrinsider for the latest on PowerVR!

Tom Lewis

Tom Lewis is a graduate technical author in the PowerVR Developer Technology team at Imagination. He is responsible for producing documentation to support the PowerVR SDK and Tools, including user manuals and guides. Outside of this, you will probably find him cycling up a hill that is far too steep or catching up on the latest PC game releases.

2 thoughts on “Flow Control on PowerVR – Optimising Shaders”

  1. Is it possible to avoid using group_vote but still skip a branch if all the invocations/threads in a group/wavefront/warp happen to have the same dynamic (not uniform) condition?

    For example, there can be an EXEC thread mask and an EXECZ (indicating EXEC bits are all 0), s_cbranch_execz will skip instructions of a branch based on EXECZ.

  2. “Avoid using discard in conditional branches”

I think the original advice is to avoid “discard” as much as possible; in any case, if “discard” is used, it is always inside a branch.

