PowerVR 'Rogue': Designing an optimal architecture for graphics and GPU compute


When designing our PowerVR ‘Rogue’ architecture, we reworked and optimised every component for efficiency (more on this in a future article). Part of that effort came from a deeper consideration of GPU compute. In this article, I will therefore focus on two key features of the PowerVR Series6 GPU design that are linked to this effort.

PowerVR ‘Rogue’ GPUs feature scalar ALUs for highest compute resources utilization and easy programming

The PowerVR ‘Rogue’ architecture is built around scalar processing engines rather than the vector engines used in older GPU designs. Moving to a scalar processing architecture has numerous benefits, most notably that it is easier to write software that performs optimally. This ease of development benefits our compiler teams, who no longer need complex and expensive vectorisation efforts at the compiler level. It is equally far easier for developers, since it no longer matters whether they vectorise their algorithms or not. This benefit is illustrated in the graph below:

GPU compute on PowerVR 'Rogue': ALU utilization scalar vs vector architecture

As the graph shows, the scalar architecture does not care whether the algorithm is written with scalar ops (R), vec2 ops (RG), vec3 ops (RGB) or full vec4 ops (RGBA), whereas the vector-based architecture is highly sensitive to vectorisation. With vector-based architectures, the problem of efficiency is shifted to the software developer, rather than being tackled directly through a modern architecture with a scalar design.
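The relationship described above can be sketched as a small idealised model. This is only an illustration, not a description of the actual hardware: it assumes a vector ALU that is four lanes wide and issues all lanes on every instruction, and the function names are hypothetical.

```python
# Idealised model of ALU lane utilisation. Assumption: the vector ALU is
# four lanes wide (vec4 / RGBA) and issues all lanes on every instruction.
VECTOR_WIDTH = 4


def vector_alu_utilisation(components: int, width: int = VECTOR_WIDTH) -> float:
    """Fraction of vector lanes doing useful work for a `components`-wide op."""
    if not 1 <= components <= width:
        raise ValueError("components must be between 1 and the vector width")
    return components / width


def scalar_alu_utilisation(components: int) -> float:
    """A scalar ALU issues one component per op, so no issued lane is wasted."""
    if components < 1:
        raise ValueError("components must be at least 1")
    return 1.0


if __name__ == "__main__":
    for name, n in [("R", 1), ("RG", 2), ("RGB", 3), ("RGBA", 4)]:
        vec = vector_alu_utilisation(n)
        sca = scalar_alu_utilisation(n)
        print(f"{name:<5} vector: {vec:.0%}  scalar: {sca:.0%}")
```

In this model only the full vec4 case keeps the vector ALU fully busy, while the scalar figure is independent of how the algorithm is written.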

This architectural efficiency is essential for optimising image processing algorithms. Image processing is one of the most popular and sensible uses of GPU compute in the mobile segment, and many such algorithms discard colour information as a first step and limit processing to intensity information only. On a scalar architecture, such an approach is no problem at all.

On a poorly implemented vector architecture, however, the developer faces a choice: accept 25% of peak performance, because only one of the four vector lanes does useful work, or take the expensive option of vectorising the entire algorithm to process multiple intensity values in one go (e.g. 4 pixels).
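To make the 25% figure and the cost of manual vectorisation concrete, here is a minimal sketch of an intensity threshold written two ways for an assumed four-lane vector ALU. It only counts issued vector instructions; the function names, the pixel values and the cutoff are made up for illustration.

```python
LANES = 4  # assumption: a vec4 ALU, as in the RGBA case


def threshold_one_per_op(intensities, cutoff):
    """One intensity value per vector instruction: three of four lanes idle."""
    out, issued = [], 0
    for value in intensities:
        out.append(1 if value >= cutoff else 0)
        issued += 1  # one vector instruction per pixel
    return out, issued


def threshold_packed(intensities, cutoff):
    """Manually vectorised: up to four intensity values per vector instruction."""
    out, issued = [], 0
    for start in range(0, len(intensities), LANES):
        group = intensities[start:start + LANES]  # last group may be partial
        out.extend(1 if value >= cutoff else 0 for value in group)
        issued += 1  # one vector instruction per group of up to four pixels
    return out, issued


if __name__ == "__main__":
    pixels = [10, 200, 35, 90, 255, 0, 128, 64, 77]
    naive, naive_ops = threshold_one_per_op(pixels, 128)
    packed, packed_ops = threshold_packed(pixels, 128)
    print(naive == packed, naive_ops, packed_ops)
```

Both versions produce the same result, but the packed version needs the developer to regroup the data and to handle the partial final group, which is the kind of extra work a scalar design avoids.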

Vectorisation may be manageable for simple algorithms, but it quickly becomes far more complex because algorithms commonly mix vector widths, which significantly complicates the effort. Developers typically focus on optimising for the dominant architecture and, given current market share ratios, it seems extremely likely that scalar-based architectures such as PowerVR ‘Rogue’ will be dominant in the mobile market (not a surprise given the gain in efficiency). The PC market further strengthens this developer focus, since compute architectures there also use scalar pipelines. Algorithms ported from the PC market to mobile will already have been optimised for scalar rather than vector-based GPU designs.

PowerVR ‘Rogue’ GPUs have improved support for local memory

Compute APIs can expose different memory types, which, depending on the implementation, may provide different performance levels. Typically these are referred to as local memory (fast memory) and global memory.

When writing algorithms that use only global memory, you address data as you normally would, and accesses go to system memory through a standard cache infrastructure. With local memory, however, algorithms can be rewritten to pre-load data into the local memory (a kind of on-chip cache). The algorithm then accesses this fast local memory store during its compute processing and, at the end, the results are burst-written back to system memory.

It should already be clear that the latter approach is far more bandwidth- and power-efficient, as data is fetched into local memory once and all subsequent accesses stay on-chip. This is unlike the first approach, where it is all left up to chance (any cache implementation depends on luck: if you are lucky, the data is still in the cache; if you are unlucky, the data has already been evicted by other data accesses and you need to re-fetch it).
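The difference between the two approaches can be sketched with a simple counting model. The example below is an illustration in plain Python rather than real OpenCL: it applies a three-tap sum filter to a 1D array and counts reads from system memory. It assumes the worst case for the global-only version (the cache never hits), and the group size, radius and function names are hypothetical.

```python
def clamp(index, length):
    """Clamp an index to the valid range so border reads repeat the edge value."""
    return min(max(index, 0), length - 1)


def filter_global_only(data, radius=1):
    """Every tap is read from system memory (worst case: no cache hits)."""
    out, reads = [], 0
    for i in range(len(data)):
        total = 0
        for k in range(i - radius, i + radius + 1):
            total += data[clamp(k, len(data))]
            reads += 1
        out.append(total)
    return out, reads


def filter_with_local_tile(data, group_size=4, radius=1):
    """Each group pre-loads its tile plus a halo once, then works on-chip."""
    out, reads = [], 0
    for start in range(0, len(data), group_size):
        end = min(start + group_size, len(data))
        # Stage the tile and its halo into the modelled local memory.
        tile = []
        for k in range(start - radius, end + radius):
            tile.append(data[clamp(k, len(data))])
            reads += 1
        # All filter taps now come from the tile, not from system memory.
        for i in range(start, end):
            first = i - start  # tile index of element i - radius
            out.append(sum(tile[first:first + 2 * radius + 1]))
    return out, reads


if __name__ == "__main__":
    samples = list(range(16))
    a, global_reads = filter_global_only(samples)
    b, tiled_reads = filter_with_local_tile(samples)
    print(a == b, global_reads, tiled_reads)
```

In this model the tiled version halves the system memory reads for 16 samples (24 instead of 48), and the saving grows with the filter radius. A real cache would absorb some of the repeated reads in the global-only version, but only when the data happens to still be resident.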

If you remember our graphics approach of Tile Based Deferred Rendering (TBDR) from other posts, you will recall that our tile sorting ensures that caches become 100% effective (see link). It should come as no surprise, then, that Imagination has implemented the equivalent concept for compute, exploiting the efficiency of fast on-chip memory.

GPU compute on PowerVR 'Rogue': memory hierarchy in OpenCL

Within the PowerVR ‘Rogue’ architecture, there are numerous optimisations linked to compute usage scenarios. We also continue to make our architecture more efficient and effective by studying actual practical mobile compute use cases as they come to market from third parties.

If you have any questions or feedback about Imagination’s graphics IP, please use the comments box below. To keep up to date with the latest developments on PowerVR, follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationTech) and subscribe to our blog feed.

Kristof Beets

Kristof Beets is Senior Business Development Manager for PowerVR Graphics at Imagination Technologies where he leads the in-house demo development team and works on product messaging. He has a background in electrical engineering and received a master's degree in artificial intelligence. Prior to joining the Business Development Group he worked on SDKs and tools for both PC and mobile products as a member of the PowerVR Developer Relations Team. Previous work has been published in ShaderX2, X5 & X6, ARM IQ Magazine, and online by the Khronos Group, Beyond3D and 3Dfx Interactive. Kristof has spoken at GDC, SIGGRAPH, Embedded Technology, MWC and too many other conferences to remember.

1 thought on “PowerVR 'Rogue': Designing an optimal architecture for graphics and GPU compute”

  1. Wow! This was a fantastic read, and I learned something about Scalar vs. Vector ALUs (which makes complete sense in hindsight). Imgtec continues to lead the charge in terms of efficiency, and I simply can’t wait to see what Rogue is capable of doing. Thank you so much for this.
