PowerVR 'Rogue': Designing an optimal architecture for graphics and GPU compute


When designing our PowerVR ‘Rogue’ architecture, we reworked and optimised every component for efficiency (more on this in a future article). Part of that effort came from a deeper consideration of GPU compute. In this article, I will therefore focus on two key features of the PowerVR Series6 GPU design that are linked to this effort.

PowerVR ‘Rogue’ GPUs feature scalar ALUs for high compute resource utilisation and easy programming

The PowerVR ‘Rogue’ architecture is built around scalar processing engines rather than the vector engines used in older GPU designs. There are numerous benefits to a scalar processing architecture – most notably easier optimal software development. This ease of development benefits our compiler teams (no need for complex and expensive vectorisation efforts at the compiler level), and it is equally a gain for developers, since it no longer matters whether or not they vectorise their algorithms. This benefit is illustrated in the graph below:

GPU compute on PowerVR 'Rogue': ALU utilization scalar vs vector architecture

As can be seen in the graph, the scalar architecture does not care whether the algorithm is written with scalar ops (R), vec2 ops (RG), vec3 ops (RGB) or full vec4 ops (RGBA), whereas the vector-based architecture is highly sensitive to the degree of vectorisation. With vector-based architectures, the efficiency problem is shifted onto the software developer, rather than being tackled directly by a modern architecture with a scalar design.

This architectural efficiency is essential for optimising image processing algorithms. This is one of the most popular and sensible usages of GPU compute in the mobile segment, where many algorithms reject colour information as a first step, and limit processing to intensity information only. Such an approach on a scalar architecture is no problem at all.
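As a concrete illustration, here is a minimal sketch of such an intensity-only first step, written in plain C for readability (on a device this would be an OpenCL kernel); the function names and the Rec. 601 luma weights are illustrative choices, not taken from the PowerVR SDK:

```c
#include <stdint.h>

/* Hypothetical greyscale step: each pixel's RGB is reduced to a single
 * intensity value. On a scalar ALU every multiply-add keeps the pipeline
 * fully busy; a 4-wide vector ALU would idle on the unused lanes once the
 * algorithm works on single intensity values. */
static uint8_t luminance(uint8_t r, uint8_t g, uint8_t b)
{
    /* integer approximation of 0.299*R + 0.587*G + 0.114*B (Rec. 601) */
    return (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
}

void to_intensity(const uint8_t *rgb, uint8_t *out, int num_pixels)
{
    for (int i = 0; i < num_pixels; ++i)
        out[i] = luminance(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}
```

On a scalar architecture this loop body maps one-to-one onto the hardware with no wasted lanes.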

On a poorly-implemented vector architecture however, the developer is faced with 25% of peak performance or the expensive option of vectorising the entire algorithm to process multiple intensity values in one go (e.g. 4 pixels).
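To show what that "expensive option" looks like, below is a hedged sketch of the same intensity step manually vectorised to fill four lanes; it is an illustration of the extra gather/scatter bookkeeping, not code from any real PowerVR toolchain:

```c
#include <stdint.h>

/* Manually vectorised sketch: to keep a 4-wide vector ALU busy, four
 * pixels' worth of R, G and B must first be gathered into lanes,
 * processed, then written back, plus a scalar tail for leftover pixels.
 * The gather/scatter shuffle is overhead the scalar version never pays. */
void to_intensity_vec4(const uint8_t *rgb, uint8_t *out, int num_pixels)
{
    int i = 0;
    for (; i + 4 <= num_pixels; i += 4) {
        int r[4], g[4], b[4];
        for (int lane = 0; lane < 4; ++lane) {      /* gather into lanes */
            r[lane] = rgb[3 * (i + lane)];
            g[lane] = rgb[3 * (i + lane) + 1];
            b[lane] = rgb[3 * (i + lane) + 2];
        }
        for (int lane = 0; lane < 4; ++lane)        /* one vec4-style MAD */
            out[i + lane] =
                (uint8_t)((77 * r[lane] + 150 * g[lane] + 29 * b[lane]) >> 8);
    }
    for (; i < num_pixels; ++i)                     /* scalar tail */
        out[i] = (uint8_t)((77 * rgb[3 * i] + 150 * rgb[3 * i + 1]
                            + 29 * rgb[3 * i + 2]) >> 8);
}
```

Even in this trivial case the vectorised version roughly doubles in size and gains a tail-handling path; real algorithms that mix vector widths fare considerably worse.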

Vectorising may be manageable for simple algorithms, but it quickly becomes far more complex when algorithms mix vector widths, as they commonly do. Typically, developers focus on optimising for the dominant architecture and, given current market share ratios, it seems extremely likely that scalar-based architectures, like PowerVR ‘Rogue’, will be dominant in the mobile market (not a surprise given the gain in efficiency). Further strengthening this developer focus is the PC market, where compute architectures also use scalar pipelines. Algorithms ported from this market to mobile will already have been optimised for scalar rather than vector-based GPU designs.

PowerVR ‘Rogue’ GPUs have improved support for local memory

Compute APIs can expose different memory types which, depending on the implementation, may offer different performance levels. Typically these are referred to as local memory (fast memory) and global memory.

When writing algorithms using only global memory, you address data as you normally would, and accesses go to system memory through a standard cache infrastructure. With local memory, however, algorithms can be rewritten to pre-load data into the local memory (a kind of on-chip cache). The algorithm then accesses this fast local memory store during its compute processing, and at the end, the results are burst-written back to system memory.
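The load/process/write-back pattern described above can be sketched as follows. This is plain C standing in for an OpenCL kernel, with the tile size and the brighten-and-clamp step chosen purely for illustration; in real OpenCL the scratch buffer would be `__local` memory, with barriers between the phases:

```c
#include <stddef.h>
#include <stdint.h>

#define TILE 16  /* hypothetical work-group tile size */

/* Staged processing pattern: load a tile from "global" memory into a
 * small "local" scratch buffer once, do all the compute work against
 * the on-chip copy, then burst the results back out. */
void process_tile(const uint8_t *global_in, uint8_t *global_out, size_t offset)
{
    uint8_t local_buf[TILE];

    /* Phase 1: one burst read from global into local */
    for (int i = 0; i < TILE; ++i)
        local_buf[i] = global_in[offset + i];

    /* Phase 2: all compute accesses hit the on-chip copy
       (a trivial brighten-and-clamp step as a stand-in) */
    for (int i = 0; i < TILE; ++i)
        local_buf[i] = local_buf[i] > 205 ? 255 : (uint8_t)(local_buf[i] + 50);

    /* Phase 3: one burst write back to global */
    for (int i = 0; i < TILE; ++i)
        global_out[offset + i] = local_buf[i];
}
```

However complex phase 2 becomes, global memory is touched exactly twice per tile: once to read, once to write.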

It should already be clear that the latter approach is far more bandwidth- and power-efficient: data is fetched into local memory once, and all subsequent accesses stay on-chip. The first approach, by contrast, leaves everything to chance – any cache depends on luck: if you are lucky, the data is still in the cache; if you are unlucky, it has already been evicted by other data accesses and must be re-fetched.

If you remember our graphics approach of Tile Based Deferred Rendering (TBDR) from other posts, you will recall that our tile sorting ensures that caches become 100% effective (see link). It comes as no surprise, then, that Imagination has implemented the equivalent concept for compute, using the efficiency of fast on-chip memory.

GPU compute on PowerVR 'Rogue': memory hierarchy in OpenCL

Within the PowerVR ‘Rogue’ architecture, there are numerous optimisations linked to compute usage scenarios. We also continue to make our architecture more efficient and effective by studying actual practical mobile compute use cases as they come to market from third parties.

If you have any questions or feedback about Imagination’s graphics IP, please use the comments box below. To keep up to date with the latest developments on PowerVR, follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationTech) and subscribe to our blog feed.

Kristof Beets


Kristof Beets is the senior director of product management for PowerVR Graphics at Imagination Technologies, where he drives the product roadmaps to ensure alignment with market requirements. Prior to this, he was part of the business development group, and before that he led the in-house demo development team and the competitive analysis team. His engineering background includes work on SDKs and tools for both PC and mobile products as a member of the PowerVR Developer Relations Team. His work has been published in many guides for game and graphics programmers, such as ShaderX2, ShaderX5 and ShaderX6, ARM IQ Magazine, and online by the Khronos Group, Beyond3D.com and 3dfx Interactive. Kristof has a background in electrical engineering and received a Master's degree in artificial intelligence. He has spoken at GDC, SIGGRAPH, Embedded Technology, MWC and many other conferences.


