As discussed in my previous blog posts and articles, Imagination’s engineering force is dedicated to efficiency. The PowerVR ‘Rogue’ architecture is our first architecture designed for compute. Of course, efficiency and common sense were liberally applied to the design process of this functionality. This approach helps us to deliver solutions that match true market requirements, while keeping in mind that efficiency and power consumption are the two most critical factors to consider as we design our PowerVR mobile GPU family.

Balancing features with performance and power

Blindly implementing every possible feature is a simplistic approach to hardware design, as it is all too easy to forget about the cost impact (e.g. silicon area) or, even more importantly for mobile designs, power consumption. Hence it is critical to find the right balance between feature set – linked to industry standard APIs (more on this later) – and power consumption. The same is true for performance. Designing a GPU with stellar performance but little care about cost or power consumption leads to a product that fails to match essential market requirements.

In our GPU IP design flow, every effort is made to provide the best balance, investing where required, but holding back where a feature is likely to go unused (which would end up as an area and power consumption cost).

PowerVR Series6 Rogue GPU block diagram

As our architects looked at the existing GPU compute APIs for the workstation PC market, several features jumped out as poor matches for mobile battery-driven systems. This insight was fed back into compute API design efforts, resulting in the creation of optimal profiles for mobile systems. A good example of such an optimised feature set is the OpenCL Embedded Profile (EP) API from the Khronos Group. These mobile optimised API variants skip niche features which are only of use for a very small set of usage scenarios (if any) or which are only applicable to non-mobile usage scenarios such as scientific and academic compute (as discussed before, mobile compute must be practical compute).

Here are just a couple of examples to illustrate this and the link to the ever critical power consumption:

First questionable feature: 64-bit floating point (FP64) support. The most common floating point representation is 32-bit wide (FP32) and is known as single precision. For more extreme usage scenarios which require very large precise numbers, a double precision 64-bit format exists. It should come as no surprise that moving from 32-bit to 64-bit arithmetic units is not something which is free in terms of silicon area or power consumption. As a result, this raises red flags in terms of a mobile focussed GPU design – do we really need 64 bits? In the case of FP64, the answer is actually simple.

FP64 is an optional feature in every compute API, even those for high-end, power-guzzling PC systems. As a result, adding dedicated hardware to enable FP64 in a mobile design is clearly a flawed decision. After all, why add something to a power-sensitive product when it is even optional for desktop products?
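To make the FP32 versus FP64 trade-off concrete, here is a minimal Python sketch that is not part of the original article. It uses only the standard `struct` module, and the helper name `to_fp32` is hypothetical. It rounds a value to single precision and shows where the 24-bit significand of FP32 runs out while the 53-bit significand of FP64 still holds the value exactly.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 2**24 + 1 needs 25 significant bits, one more than FP32 provides.
value = float(2**24 + 1)

print("FP64 keeps:", value)           # stored exactly in double precision
print("FP32 keeps:", to_fp32(value))  # rounded to a neighbouring FP32 value
print("bytes per value:", struct.calcsize("<f"), "vs", struct.calcsize("<d"))
```

The doubled storage width is only the visible part of the cost; the wider arithmetic units are what the area and power argument above is about.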

A second questionable feature: precise rounding mode support. Rounding modes control the behaviour of the last bit of precision in your number representation. It’s a bit like rounding a number to fewer decimal places. If we have 12.75 as an example value, and our numerical representation only holds one decimal place, do we round this to 12.7 or to 12.8? And an additional question: do we even need to worry about whether the result is 12.7 or 12.8? Obviously the answer depends very much on the usage scenario. For example, if we are looking at money, the difference between 12.7 and 12.8 million matters far more than the difference between 12.7 and 12.8 cents.

Hence this is where power consumption impact and practicality of the algorithms come into play. Rounding is not as trivial as it may sound. Adding more complexity costs more in terms of silicon area and is also linked to higher power consumption.
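The 12.75 example can be sketched in a few lines of Python using the standard `decimal` module. This is a decimal analogy only, added for illustration: GPU hardware rounds binary significands rather than decimal digits, and the helper name `round_to_one_decimal` is hypothetical. The modes shown loosely mirror the IEEE 754 choices of rounding to nearest, towards zero and towards positive infinity.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_HALF_EVEN


def round_to_one_decimal(text: str, mode: str) -> Decimal:
    """Round a decimal string to one fractional digit using the given mode."""
    return Decimal(text).quantize(Decimal("0.1"), rounding=mode)


for mode in (ROUND_DOWN, ROUND_HALF_EVEN, ROUND_CEILING):
    # ROUND_DOWN truncates towards zero; the other two modes round 12.75 upwards.
    print(mode, round_to_one_decimal("12.75", mode))
```

Each extra mode is a different rule for the same last digit, which is the added logic that every ALU would have to carry.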

Additionally, if we already have a large number of representable values, does that last bit of fractional rounding really matter? For example, if a colour channel gets the value 232 or 233, can we even really tell? Looking at numerous desktop compute applications with a link to practical usage scenarios for mobile devices, our teams found that rounding makes little or no difference to the practical results of algorithms. A more practical indication: image and video processing typically brings in 8 bits of colour data per channel, and this data must be processed to generate an output image that also has 8 bits per channel. Doing this processing in 32-bit floating point offers far more precision than the input or output data needs. Investing in the accuracy of the last rounded bit is therefore a waste of silicon area and, worse, a waste of valued power budget.
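As a rough illustration of the 8-bit argument, the Python sketch below scales every 8-bit input level in FP32, deliberately perturbs the last bit of the intermediate result in both directions, and counts how many 8-bit outputs change. It is an assumption-laden toy and not the analysis our teams ran: the gain value, the round-half-up requantisation and all helper names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def nudge_last_bit(x: float, ulps: int) -> float:
    """Shift a positive FP32 value by `ulps` units in the last place."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + ulps))[0]


def scale_pixel(pixel: int, gain: float, ulps: int = 0) -> int:
    """Scale an 8-bit level in FP32, perturb the last bit, requantise to 8 bits."""
    value = nudge_last_bit(to_fp32(pixel * gain), ulps)
    return max(0, min(255, int(value + 0.5)))


def count_changed(gain: float) -> int:
    """Count input levels whose 8-bit output depends on the last FP32 bit."""
    return sum(
        1
        for pixel in range(1, 256)  # skip 0: it has no smaller positive neighbour
        if len({scale_pixel(pixel, gain, u) for u in (-1, 0, 1)}) > 1
    )


print(count_changed(0.8))
```

With a gain of 0.8 no output level depends on the last bit. A gain such as 0.9 lands some inputs exactly on a rounding boundary, where the last bit can move the output by one level, which is the 232 versus 233 situation described above.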

While small differences are possible due to rounding, we generally find that numerical accuracy in operations always carries variable errors anyway. We actually found this out the hard way, as we were running through compute API conformance tests on different systems. Our GPU passed conformance on an Intel x86-based system, but when running on an ARM-based system, that same GPU failed conformance. Odd or not?

Actually not odd at all, as the reason is small differences in numerical precision between the Intel x86 implementation and the ARM implementation. So if two of the leading CPU architectures cannot even return the same reference values, why do we need to bother about the last bit of rounding precision in general? In most situations, depending on the last grain of precision puts you in an unstable area of the algorithm, as you need every last bit of precision to generate the correct result. This borderline type of calculation leads to instability, which often shows up as flickering and jittering in images or animations. It is something you want to stay away from, and not depend on. Given all of the above, on the subject of rounding modes, going for lower power is clearly the better choice for mobile GPUs.
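The article does not say which mechanism caused the x86 and ARM reference values to differ. One common source of last-bit differences between implementations is simply the order in which operations are evaluated, because floating-point addition is not associative. The Python sketch below, with hypothetical helper names, shows this effect and why comparisons are better made with a tolerance than bit for bit.

```python
import math


def sum_left_to_right(values):
    """Accumulate in the given order, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate the same values in the opposite order."""
    return sum_left_to_right(list(reversed(values)))


values = [0.1, 0.2, 0.3]
left = sum_left_to_right(values)
right = sum_right_to_left(values)

print(left == right)                             # bit-exact comparison
print(math.isclose(left, right, rel_tol=1e-12))  # tolerance-based comparison
```

An algorithm whose visible result flips on the outcome of the first comparison is in exactly the unstable area described above.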

Just to clarify even further, from a design point of view, GPUs have tens if not hundreds of compute engines (ALUs). Anything you add to these units in terms of functionality makes them more complex. This translates to extra silicon area and extra power consumption, which gets multiplied by the number of engines in the overall GPU. Adding FP64 support, adding more advanced rounding modes, adding exception handling, adding multi-precision, etc. – every bit of complexity you add ends up costing you. ALU design is definitely an area where “keep it clean and simple” (or “lean and mean”) applies in terms of silicon area, and also, critically, applies in terms of power consumption.

Looking across the overall feature set of compute APIs, there are however numerous other features which can be implemented with minimal or no increase in silicon area and/or power consumption. Many of these are still optional in mobile compute APIs, but for ease of porting and application compatibility, these optional features should be supported, especially as some can actually lead to reduced power consumption. This is illustrated in the table below:

GPU compute_OpenCL optional features

Through our extensive ecosystem, as well as an early access program, Imagination’s feature set choices are being confirmed as the right choices. Even in third party benchmark applications that claim to target the desktop profile, we note that none of the effects implemented actually depend on niche desktop features or on extreme precision, as all tests execute and run well on multiple embedded profile implementations. Internal tests also confirm this; we have developed several practical usage scenarios for mobile, focussing on image processing. We have also checked compatibility with our power streamlined feature set on alternative usage scenarios such as fluid dynamics calculations, cloth and rigid body physics calculations, and even video decoders. None of these has proven to have any requirement for desktop profile feature sets, as shown below:

OpenCL demos

Stay tuned for my next article where I will explain how we’ve designed PowerVR ‘Rogue’ to be an optimal architecture for GPU compute.
If you have any questions or feedback about Imagination’s graphics IP, please use the comments box below. To keep up to date with the latest developments on PowerVR, follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationPR) and subscribe to our blog feed.

Comments

  • verdantchile

    I think the decisions about which features are supported seem quite reasonable there.

    Since power/heat tends to be the limiting factor in designs more often than die area these days, I wonder if increasingly higher feature support will be an industry trend over time (beyond the obvious advancement with supporting newer APIs, of course) considering the area saved per ALU by not supporting a future feature might allow for more (slightly simpler) ALUs to be included but might not free up the power budget to actually drive those extra ALUs. It’s all a balancing act of course, and ImgTec has been especially good at that obviously.

    • Hi,

      You are making a valid point. When it comes to compute, we might see some desktop-level features making their way into mobile, just like the desire for console-like quality in mobile has pushed the industry to offer better and better graphics hardware with each generation.

      The trick is to understand when these things really matter and implement them in a way which is reasonable and balanced from a power efficiency point of view. GPU compute and heterogeneous processing are new areas for mobile, therefore the aim should always be to offer the best possible performance for the power envelope of a device that you carry in your back pocket.

      Regards,
      Alex.

  • sikulas

    Whether we need this or that feature also depends on the silicon area occupied by the other parts of the GPU. I think you have tried to reduce the chip area of the GPU/PowerVR 6. You used clusters instead of cores etc. But… still, the area is so LARGE that you have to give up some features (FP64).

    Maybe I’m wrong, but that is my impression. Again, I would like to see practical figures/numbers instead of words/letters.

    • Hi,

      There are a few aspects that I think you are overlooking here. The article tries to point to features that are overkill in mobile. Obviously silicon area is something we take into account when designing our GPU cores. To address this, we’ve significantly improved scalability (there are six different versions of PowerVR Series6 GPUs that map to different market requirements), created area optimized versions of the ‘Rogue’ architecture (PowerVR G6100, G6200 and G6400) and are working closely with companies like Synopsys and TSMC to optimise for power, performance and area.

      The recently released Design Optimisation Kits (DOKs) deliver substantial silicon PPA gains while reducing design cycle times. Since you are asking for numbers, customers leveraging the DOK for PowerVR G6100 can achieve up to a 25% reduction in dynamic power and up to 10% area savings, as well as up to 30% improvement in implementation turnaround time through a tuned design flow.

      Best regards,
      Alex.

      • sikulas

        1. Thanks for the numbers (percentages). I read them earlier in the news:
        https://www.imgtec.com/news/Release/index.asp?NewsID=788
        But these percentages mean nothing in reality, because we do not know the numbers that represent 100%. OK, if the numbers for the new DOK are “under NDA”, can you give the numbers for the first DOK (G6200, for example)? The real numbers are power in mW, area in mm^2 etc. (and not %). Or can you say when and where the real numbers for PowerVR 6 will be officially announced (without unnecessary letters/words)?

        2. I understand what is written in the article. But the article does not mean that these features (like FP64) will never be in PowerVR/Imgtec products, am I right?
        So the important point is that these features cannot be used effectively in the mobile segment only today (and maybe only by Imgtec 🙂) and can be used later (at 10-14nm, for example).

  • Again, I appreciate your interest in finding out more about our PowerVR Series6 GPUs, but PPA numbers are disclosed only under NDA because the information is of a highly competitive nature for our customers.

    The article makes the case for what is relevant for GPU compute now and why the ‘Rogue’ architecture is the best choice for these parallel workloads. In terms of supporting more functionality, the future always remains open.

    Regards,
    Alex.

  • Sean Lumly

    I really am amused by the extremely direct language that compares Imgtec technologies against the intimate aspects of their competitors’ implementations (their incredibly small pool of serious competitors, I might add), although the names are suspiciously missing. 😀

    In any event, the trade-offs make a lot of sense. Less area spent on features that will not be used (and can be emulated), means either smaller, less expensive dies, or comparatively more area dedicated to those features that will be frequently used. It possibly means both.

    But I have a question: Is it true to assume that having more area dedicated to features that will be frequently used will also result in more silicon being ‘lit’ at any one time and thus larger power requirements? Alternately can it be said that same area could be used for more efficient computation? Is this true or am I a few degrees off of north?

    • Hi Sean,

      Here’s one simple example that goes against your assumption: the PowerVR G6x30 GPUs add incremental area for features such as lossless image compression. For render targets, this provides a typical 2:1 compression rate, but it can be much higher, depending on the frame being compressed. The idea of adding more silicon in this case is to actually save on power consumption by reducing memory bandwidth.

      Kristof’s point in this article is that sometimes silicon is added where it shouldn’t be, resulting in higher area and power consumption.

      Best regards,
      Alex.

      • Sean Lumly

        Got it! I was actually alluding to the potential efficiency that you mentioned with the “alternatively” statement, though it could have been articulated with more clarity. Thanks!

        Last question: How many cycles does a sqrt, div, or trig function typically consume on the Rogue GPU with a 3D API (e.g. OpenGL) and during GPU compute with OpenCL/RenderScript? I have searched for this info, but sadly, there is little that I have found.

        • Hi Sean,

          I believe some of that information will be available in the programming guidelines for PowerVR Series5XT GPUs. For the ‘Rogue’ architecture, things might be a little different.

          I’ll double check and come back with an answer soon.

          Regards,
          Alex.

  • TWatcher

    It’s naive to think that IMG or any IP provider is going to state the area used by its IP implementations. Even if there was a single figure per IP (and there won’t be because each licensee will implement differently), it is one of the key competitive metrics, along with performance per watt, and it won’t be in the public domain. All we can do is wait for a licensee to implement, and then for someone like Chipworks to do its stuff, X-ray the chip, identify the graphics IP block and come out with an estimate.

    • sikulas

      TWatcher,

      Why not??? When we get the finished chip, we can open the lid of the chip and see the real area occupied by the PowerVR 6. So why hide it now?

      The only reason I can see is that the real area of the “DOK” (even for G6100) is really quite LARGE now.

      Besides, I think the real area occupied by the PowerVR 6 in the different chips (like MediaTek’s or Samsung’s etc.) can NOT be very different, because the process technology is similar now (28nm), and the “concept” is similar too (6 series).

      • I don’t know where and how you’ve heard these rumors but they are false. I understand your concerns, but rest assured we remain highly competitive in many aspects, including area.

        Going beyond the NDA argument, providing exact area figures to end-users is difficult because the RTL has various parameters and possible configurations. Giving estimates would create confusion because, without seeing the actual RTL, understanding these numbers can be challenging.

        We usually sit down with our customers and help them decide what works best for them and PowerVR Series6 has had a record number of licensees. More on this here:

        https://www.imgtec.com/News/Release/index.asp?NewsID=734

        Regards,
        Alex.

        • Rogue_lover

          Whilst I buy the point that Rogue has had a record number of licensees, not all of these will be able to ship in the volumes originally anticipated: TI, ST-E and Renesas come to mind. This may be construed as bad news; however, the reality to me appears to be that you have dominant positioning with MediaTek and of course Apple. My point is: will you see the full range of current IP (6100-6630) exploited via the relationships that matter, or will it be just one or two Rogue designs that ship in volume, e.g. 6400 and 6100 perhaps? Your investor info states approx. 20 SoCs in design? I also read that you have over 10 Rogue licensees, so I may be simple in my thinking, as you have unannounced licences which hopefully we can hear more about soon?

          Also, when can we expect to hear more on Series 6XT? Will it be optimised to work with MIPS CPUs? I assume this is where you plan to get close to 1 TFLOPS performance?

  • Roninja

    Interesting article, Kristof. I just wonder if those infamous benchmarks will expose the lack of certain features to the detriment of IMG? Is that a valid concern, or something you are actively addressing? Clearly I would like to see Rogue re-establish PowerVR at the top of the benchmark leaderboard vs. Qualcomm and even ARM, Nvidia….

    • Hi,

      We are working with benchmark suppliers as well as analysts and the press to educate them on how to interpret benchmarking data correctly.

      For example, you would expect the result of a benchmark run showing ~28 fps to be better than one showing ~20 fps. In fact, the two can have a very different behavior; that ~28fps number might be an average of 1/2 time spent at 40fps and 1/2 time at 16fps, which clearly provides an inferior user experience compared to a smooth 24fps throughout the run.

      Benchmarking is a tricky subject and a careful analysis of the meaning behind the number should always be taken into account.

      Regards,
      Alex.