Why GPUs don’t like to share – a guide to improve your renderer on PowerVR-based platforms

Most modern GPUs don’t like to share data they own – daring to poke the bear will invariably incur a cost. This includes the PowerVR Tile Based Deferred Rendering (TBDR) architecture, which keeps a number of frames in-flight so that vertex, fragment and GPU compute tasks can be dynamically scheduled for execution. For an application to be optimal, a renderer must be written with this parallelization in mind.

The key question is – what happens when an application updates data that is still being used by the GPU? The following sections detail the common pitfalls when dealing with dynamic texture and vertex data on platforms with PowerVR Series5 and Series5XT GPUs, and describe how these pitfalls can be avoided.

Texture updates

When an application updates a texture, the driver will check if there are any unresolved renders that require the existing data.

Figure: Texture ghosting

In most cases where the texture is still needed, the driver will take an optimized path and cache the modified data until outstanding renders complete. If the driver is unable to take this path (for example, because the cache is already full), it creates a copy so that the original texture can be updated while incomplete renders use the duplicate, allowing the GPU to continue without interruption. This process of duplicating textures is referred to as “ghosting”.

How does ghosting affect me?

The aim of ghosting is to avoid the expensive stalls that would occur if a render had to be kicked every time a texture is updated. However, a significant amount of texture ghosting can cause the driver to run out of memory. Applications that hit this issue tend to update texture atlases regularly, for example map data in navigation software, or glyph caches, particularly for languages with large character sets (such as Chinese).

In a worst case scenario, an application may be updating a texture, using it, updating it and using it repeatedly within a single frame. When an application does this, the driver will most likely create a ghost for every version of that texture that is required by a render, which means a single texture has the potential to cause a GL_OUT_OF_MEMORY error by itself!

When ghosting occurs, the entire texture will be duplicated. This is done for performance reasons, as tracking modifications and sampling textures correctly would have a significant overhead.
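To get a feel for how quickly whole-texture ghosts add up, here is a rough back-of-the-envelope sketch. It assumes an uncompressed 4-bytes-per-texel texture with no mipmaps and a driver that creates one full-size ghost per update; the function names are made up for illustration and the real driver overhead will differ.

```python
def texture_bytes(width, height, bytes_per_texel=4):
    """Size of one uncompressed texture level (assumes RGBA8888, no mipmaps)."""
    return width * height * bytes_per_texel


def ghosting_bytes(width, height, ghosts, bytes_per_texel=4):
    """Memory held by the live texture plus `ghosts` full-size copies."""
    return texture_bytes(width, height, bytes_per_texel) * (1 + ghosts)


# A 2048x2048 atlas updated and reused 8 times within a frame,
# assuming (worst case) that every update produces a ghost.
total = ghosting_bytes(2048, 2048, ghosts=8)
print(total // (1024 * 1024), "MiB")  # prints "144 MiB"
```

Under those assumptions a single 16 MiB atlas ends up holding 144 MiB, even if each update only touched a handful of texels.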

So…what can I do about this?

Rule of thumb: Avoid texture updates

To retain as much parallelization as possible, textures uploaded to the GPU should be seen as read-only blocks of data. With this in mind, it’s possible to refactor many common use cases to avoid the cost of ghosting.

Optimizing a sprite renderer

A common case that can be optimized is a sprite renderer in a 2D game.

Figure: Unoptimized texture atlas

Many developers will use a large texture atlas that contains all of the sprites they currently need (like in the figure above). This seems like a sensible idea at first, as using the smallest possible number of texture atlases enables efficient batching to keep the number of draw calls to a minimum. However, when a region of the texture atlas is updated, the driver will have to ghost the entire texture. If this is happening frequently within a small number of frames, the memory required for ghosted textures may become a problem.

Figure: Optimized texture atlases

To combat this, large atlases can be broken down into a number of smaller atlases (see above). The contents of each small atlas can then be grouped by association in such a way that the atlases will be touched as infrequently as possible and, when they are updated, less memory will be required for ghosting. As an example, a 2D game could place persistent characters into one atlas, level-specific sprites into another and all other sprites into a miscellaneous atlas.

In all cases, sprite updates should be batched so each atlas is touched as little as possible.
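One way to batch updates is to queue them as they arrive and flush them once per atlas. The sketch below is only an illustration: `AtlasUpdateQueue` and its `upload` callback are hypothetical names, and the callback stands in for whatever texture upload calls your renderer actually makes (such as `glTexSubImage2D`).

```python
from collections import defaultdict


class AtlasUpdateQueue:
    """Collects sprite updates and flushes them one atlas at a time."""

    def __init__(self, upload):
        # `upload(atlas, regions)` is a hypothetical callback that would
        # issue the real texture uploads for a single atlas.
        self._upload = upload
        self._pending = defaultdict(list)

    def queue(self, atlas, region, pixels):
        """Remember an update instead of touching the atlas immediately."""
        self._pending[atlas].append((region, pixels))

    def flush(self):
        """Touch each dirty atlas once; return how many atlases were touched."""
        touched = 0
        for atlas, regions in self._pending.items():
            self._upload(atlas, regions)
            touched += 1
        self._pending.clear()
        return touched
```

A reasonable place to call `flush()` is once per frame, before the first draw call that samples any of the atlases, so that each atlas is modified at most once between renders.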

If an application is still running out of memory, the frame-rate can be limited. Doing so will reduce the likelihood of ghosting, as there is less chance of an unresolved render needing the data.
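A frame-rate cap can be as simple as sleeping for whatever is left of the frame budget. The following is a minimal sketch, not a production frame pacer: `limit_frame_rate` is a hypothetical helper, and the `now` and `sleep` parameters exist only so the clock can be swapped out. On EGL platforms, setting a swap interval with `eglSwapInterval` is another way to cap the presentation rate.

```python
import time


def limit_frame_rate(frame_start, target_fps, now=time.monotonic, sleep=time.sleep):
    """Sleep for the remainder of the frame budget; return seconds slept."""
    budget = 1.0 / target_fps
    remaining = budget - (now() - frame_start)
    if remaining > 0:
        sleep(remaining)
        return remaining
    # The frame already overran its budget, so do not wait at all.
    return 0.0
```

In a render loop, record `frame_start = time.monotonic()` before building the frame and call `limit_frame_rate(frame_start, 30)` after submitting it.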

If this does not solve the problem, an application can force renders to serialize. This removes ghosting entirely, as there will be no outstanding renders, but it is likely to cause noticeable performance degradation. For this reason, serialization should only be considered as a last resort.

VBO updates

When VBOs are updated, the driver will kick any outstanding vertex processing (TA) tasks that rely on the existing data. When the task is kicked, the driver will block the application’s GL thread until the vertex processing has completed and it’s safe to update the contents of the VBO.

Figure: VBO update blocking

A different approach is taken for VBOs because kicking a TA task and waiting for it to complete is much cheaper than the equivalent approach for textures. An additional benefit of this solution is that the GPU does not get interrupted, so it can continue to process its workload as fast as possible.

How can I avoid the stall?

Rule of thumb: Avoid VBO updates

Similar to the advice for textures, it’s best to think of VBOs as read-only blocks of data. For this reason, VBOs should only be used for static attribute data. Dynamic attributes that change on a per-frame basis should be uploaded directly to GL instead of modifying VBOs. Doing so will avoid the stall.

Figure: VBO circular buffer

In situations where attributes need to be updated but may be reused for a number of frames, a circular buffer of VBOs can be used (see figure above). A circular buffer of n VBOs (where n is the number of frames in flight) should be sufficient to pair a VBO with each in-flight frame, and thus avoid blocking, because a VBO will only be updated when it is not being accessed by the GPU. Although this approach avoids the stall, it will increase the memory requirements of your application. If you are already approaching GL_OUT_OF_MEMORY territory, the slight overhead of the stall may be a more efficient option than an out-of-memory fallback.
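The rotation itself is just a counter taken modulo n. Here is a minimal sketch of the bookkeeping, with the GL side left out: `VboRing` is a hypothetical name, the buffer handles are placeholders for VBOs you would create up front, and the right value of n depends on how many frames your platform keeps in flight.

```python
class VboRing:
    """Hands out one buffer per in-flight frame, in rotation."""

    def __init__(self, buffers):
        # `buffers` would hold VBO handles created once at start-up;
        # any placeholder values work for this sketch.
        if not buffers:
            raise ValueError("need at least one buffer")
        self._buffers = list(buffers)
        self._frame = 0

    def begin_frame(self):
        """Return the buffer to update for the new frame, then advance."""
        buf = self._buffers[self._frame % len(self._buffers)]
        self._frame += 1
        return buf
```

Each frame, call `begin_frame()` once, write the new attribute data into the returned buffer, and draw from it. With n buffers, a given buffer is only written again n frames later, by which point (assuming n matches the number of frames in flight) the GPU should have finished with it.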

Conclusion

To ensure your application’s 3D graphics are as efficient as possible, rendering code should be designed in such a way that the GPU will not be disturbed. This will give great performance, and it will also lead to a well-designed solution that is easier to maintain and port to new platforms.

If you’d like to learn more about the PowerVR architecture and best practices when writing graphics applications, check out our Performance Recommendations and PowerVR Series5 Architecture Guide for Developers documents. If you have any questions about this tutorial, you can contact our DevTech support team on Imagination’s dedicated PowerVR Insider forum. Remember to follow us on Twitter (@ImaginationTech and @PowerVRInsider) and subscribe to our blog.
