Why GPUs don’t like to share – a guide to improve your renderer on PowerVR-based platforms

Most modern GPUs don’t like to share data they own – daring to poke the bear will invariably incur a cost. This includes the PowerVR Tile Based Deferred Rendering (TBDR) architecture, which keeps a number of frames in-flight so that vertex, fragment and GPU compute tasks can be dynamically scheduled for execution. For an application to be optimal, a renderer must be written with this parallelization in mind.

The key question is – what happens when an application updates data that is still being used by the GPU? The following sections detail the common pitfalls when dealing with dynamic texture and vertex data on platforms with PowerVR Series5 and Series5XT GPUs, and describe how these pitfalls can be avoided.

Texture updates

When an application updates a texture, the driver will check if there are any unresolved renders that require the existing data.

Figure: Texture ghosting

In most cases where the texture is still needed, the driver will take an optimized path and cache the modified data until outstanding renders complete. If the driver is unable to take this path (i.e. the cache is already full), a copy will be created so the original texture can be updated while the duplicate is used by incomplete renders (thus allowing the GPU to continue without interruption). The process of duplicating textures is referred to as “ghosting”.

How does ghosting affect me?

The aim of ghosting is to avoid the expensive stalls that would occur if a render had to be kicked every time a texture is updated. However, a significant amount of texture ghosting can cause the driver to run out of memory. Applications that hit this issue tend to update texture atlases regularly, for example map data in navigation software, or glyphs for languages with large character sets (such as Chinese).

In a worst case scenario, an application may be updating a texture, using it, updating it and using it repeatedly within a single frame. When an application does this, the driver will most likely create a ghost for every version of that texture that is required by a render, which means a single texture has the potential to cause a GL_OUT_OF_MEMORY error by itself!

When ghosting occurs, the entire texture will be duplicated. This is done for performance reasons, as tracking modifications and sampling textures correctly would have a significant overhead.
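To get a feel for the numbers, the worst case above can be estimated with some simple arithmetic. The sketch below is only a rough model, not a description of what the driver actually does: it assumes every update produces a full ghost, that the optimized caching path is never taken, and that ghosts live until all in-flight frames have drained. The function names, the 2048x2048 RGBA8 atlas and the figure of three frames in flight are illustrative assumptions.

```python
def texture_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed texture level, ignoring mipmaps and padding."""
    return width * height * bytes_per_pixel


def worst_case_ghost_bytes(width, height, updates_per_frame, frames_in_flight,
                           bytes_per_pixel=4):
    """Pessimistic upper bound on ghost memory for a single texture.

    Assumption: every update creates a full copy of the texture, and the
    copies are kept until frames_in_flight frames have completed.
    """
    ghosts = updates_per_frame * frames_in_flight
    return ghosts * texture_bytes(width, height, bytes_per_pixel)


if __name__ == "__main__":
    size = worst_case_ghost_bytes(2048, 2048, updates_per_frame=4,
                                  frames_in_flight=3)
    print(f"{size / (1024 * 1024):.0f} MiB of ghosts for one 2048x2048 RGBA8 atlas")
```

Under these assumptions a single 16 MiB atlas updated four times per frame could account for 192 MiB of ghosts, which is why one texture can be enough to exhaust memory.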

So…what can I do about this?

Rule of thumb: Avoid texture updates

To retain as much parallelization as possible, textures uploaded to the GPU should be seen as read-only blocks of data. With this in mind, it’s possible to refactor many common use cases to avoid the cost of ghosting.

Optimizing a sprite renderer

A common case that can be optimized is a sprite renderer in a 2D game.

Figure: Unoptimized texture atlas

Many developers will use a large texture atlas that contains all of the sprites they currently need (like in the figure above). This seems like a sensible idea at first, as using the smallest possible number of texture atlases enables efficient batching to keep the number of draw calls to a minimum. However, when a region of the texture atlas is updated, the driver will have to ghost the entire texture. If this is happening frequently within a small number of frames, the memory required for ghosted textures may become a problem.

Figure: Optimized texture atlases

To combat this, large atlases can be broken down into a number of smaller atlases (see above). The contents of each small atlas can then be grouped by association in such a way that the atlases will be touched as infrequently as possible and, when they are updated, less memory will be required for ghosting. As an example, a 2D game could place persistent characters into one atlas, level-specific sprites into another and all other sprites into a miscellaneous atlas.

In all cases, sprite updates should be batched so each atlas is touched as little as possible.
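One way to combine the two ideas (atlases grouped by association, and batched updates) is sketched below. This is a minimal illustration rather than a recommended implementation: the atlas names, the lifetime categories and the AtlasUpdateBatcher class are all hypothetical, and the upload callback stands in for wherever a real renderer would issue its texture upload.

```python
from collections import defaultdict

# Hypothetical grouping by how often sprites change (after the 2D game example).
ATLAS_FOR_LIFETIME = {
    "persistent": "characters_atlas",
    "level": "level_atlas",
}
DEFAULT_ATLAS = "misc_atlas"


class AtlasUpdateBatcher:
    """Collects sprite updates and flushes them once per atlas."""

    def __init__(self):
        self._pending = defaultdict(dict)  # atlas name -> {sprite name: pixels}

    def queue(self, sprite, lifetime, pixels):
        atlas = ATLAS_FOR_LIFETIME.get(lifetime, DEFAULT_ATLAS)
        # A later update to the same sprite replaces the earlier one.
        self._pending[atlas][sprite] = pixels

    def flush(self, upload):
        """Call upload(atlas, sprites) once per dirty atlas; return atlases touched."""
        touched = []
        for atlas, sprites in self._pending.items():
            upload(atlas, sprites)
            touched.append(atlas)
        self._pending.clear()
        return touched


if __name__ == "__main__":
    batcher = AtlasUpdateBatcher()
    batcher.queue("coin", "level", b"v1")
    batcher.queue("coin", "level", b"v2")   # replaces v1 before any upload
    batcher.queue("spark", "effect", b"v1")  # unknown lifetime -> misc atlas
    batcher.flush(lambda atlas, sprites: print(atlas, sorted(sprites)))
```

Calling flush once per frame (or less often) means each atlas is modified at most once between renders, however many sprites changed, and untouched atlases are never candidates for ghosting.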

If an application is still running out of memory, its frame rate can be limited. Doing so reduces the likelihood of ghosting, as there is less chance of an unresolved render still needing the data when an update arrives.
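A frame-rate cap can be as simple as sleeping for whatever is left of each frame's time budget. The sketch below assumes a CPU-side limiter with a hypothetical render_frame callback and a 30 fps target. On EGL-based platforms, setting the swap interval with eglSwapInterval is often a more natural way to achieve a similar effect.

```python
import time


def time_to_sleep(frame_start, now, target_fps):
    """Seconds to wait so that the frame lasts at least 1 / target_fps."""
    frame_budget = 1.0 / target_fps
    elapsed = now - frame_start
    return max(0.0, frame_budget - elapsed)


def run_frames(render_frame, frame_count, target_fps=30):
    """Call render_frame() frame_count times, capped at target_fps."""
    for _ in range(frame_count):
        start = time.monotonic()
        render_frame()  # hypothetical callback: submit one frame of work
        time.sleep(time_to_sleep(start, time.monotonic(), target_fps))
```

Frames that already take longer than the budget are not delayed further, so the cap only affects frames that finish early.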

If this does not solve the problem, an application can force the render to serialize. This will remove ghosting entirely as there will be no outstanding renders, but this approach is likely to cause noticeable performance degradation. For this reason, serialization should only be considered as a last resort.

VBO updates

When VBOs are updated, the driver will kick any outstanding vertex processing (Tile Accelerator, or TA) tasks that rely on the existing data. When the task is kicked, the driver will block the application’s GL thread until the vertex processing has completed and it’s safe to update the contents of the VBO.

Figure: VBO update blocking

The reason a different approach is taken for VBOs is that kicking a TA task and waiting for it to complete is much cheaper than the equivalent approach for textures. An additional benefit of this solution is that the GPU does not get interrupted, so it can continue to process its workload as fast as possible.

How can I avoid the stall?

Rule of thumb: Avoid VBO updates

Similar to the advice for textures, it’s best to think of VBOs as read-only blocks of data. For this reason, VBOs should only be used for static attribute data. Dynamic attributes that change on a per-frame basis should be uploaded directly to GL instead of modifying VBOs. Doing so will avoid the stall.

Figure: VBO circular buffer

In situations where attributes need to be updated but may be reused for a number of frames, a circular buffer of VBOs can be used (see figure above). A circular buffer consisting of n VBOs (where n is the number of frames in flight) should be sufficient to pair a VBO with each in-flight frame, and thus avoid blocking (as VBOs will only be updated when they are not being accessed by the GPU). Although this approach avoids the stall, it will increase the memory requirements of your application. If you are already approaching GL_OUT_OF_MEMORY territory, the slight overhead of the stall may be a more efficient option than an out-of-memory fallback.
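The bookkeeping for such a circular buffer is small. The sketch below models only the round-robin selection: the VboRing class and the buffer handles are hypothetical, and a real renderer would create the handles with glGenBuffers and bind, fill and draw from the selected buffer each frame. It assumes the ring holds at least as many buffers as there are frames in flight, a number that depends on the platform.

```python
class VboRing:
    """Round-robin selection over n buffer handles, one per in-flight frame."""

    def __init__(self, buffer_ids):
        if not buffer_ids:
            raise ValueError("at least one buffer is required")
        self._buffers = list(buffer_ids)
        self._index = 0

    def current(self):
        """Buffer to update and draw from during this frame."""
        return self._buffers[self._index]

    def advance(self):
        """Move to the next buffer; call once at the end of each frame."""
        self._index = (self._index + 1) % len(self._buffers)
        return self.current()


if __name__ == "__main__":
    ring = VboRing([101, 102, 103])  # hypothetical handles
    for frame in range(5):
        vbo = ring.current()
        # A real renderer would bind vbo, upload this frame's attributes, and draw.
        print(f"frame {frame}: update and draw with VBO {vbo}")
        ring.advance()
```

With three buffers, a given VBO is only written again three frames after it was last used, by which point the vertex processing that read it should have completed, provided the assumption about frames in flight holds.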


To ensure your application’s 3D graphics are as efficient as possible, rendering code should be designed in such a way that the GPU will not be disturbed. This will give great performance, and it will also lead to a well-designed solution that is easier to maintain and port to new platforms.

If you’d like to learn more about the PowerVR architecture and best practices when writing graphics applications, check out our Performance Recommendations and PowerVR Series5 Architecture Guide for Developers documents. If you have any questions about this tutorial, you can contact our DevTech support team on Imagination’s PowerVR Insider dedicated forum.

Joe Davis

Joe Davis leads the PowerVR Graphics developer support team. He and his team support a wide variety of graphics developers including those writing games, middleware, UIs, navigation systems, operating systems and web browsers. Joe regularly attends and presents at developer conferences to help graphics developers get the most out of PowerVR GPUs. You can follow him on Twitter @joedavisdev.
