GPU performance analysis on Fantasy Warrior 3D with PVRTune

In my previous post I gave a brief introduction to our target platform (Cocos2d-x on PowerVR GPUs) and a set of profiling rules and tools. In this post I will demonstrate how to use PVRTune to identify performance bottlenecks.

Setup environment

If you are interested in exploring the Fantasy Warrior 3D game, you can find a video recording here and download the game code here.

1-Fantasy Warrior 3D - Cocos2d-x

The first step is to build the game by following the instructions file. After that, we can use PVRHub and PVRTune to record a performance analysis file; I’ve uploaded my recording files here. These files were recorded in the following environment:

  • Hardware information
    • Device name: Onda V989 tablet (Allwinner A80 chip, PowerVR Series6 G6230 GPU)
  • Software information
    • Android version: 4.4.2
    • Driver info: Version Rogue_DDK_Android_RSCompute rogueddk 1.4@3234138 (release) sunxi_android
    • PVRTuneDeveloper: v14.111.1 (SDK build 3.5@3530647)
    • PVRPerfServerDeveloper: v14.111.1 (SDK build 3.5@3533642)
    • PVRTrace recording libraries: v20 (SDK build 3.5@3533642)

Identifying performance bottlenecks

We can use PVRTune to identify bottlenecks in this game. I picked a representative frame (2338) in the recording file (*.pvrtune). The sections below explain how to recognize each kind of bottleneck in PVRTune. Bottlenecks usually fall into one of five categories:

  • CPU limited
  • Vertex limited
  • V-sync limited
  • Fragment limited
  • Bandwidth limited

CPU limited

A CPU limited application is often identifiable as an application suffering from poor performance or frame rate even though the graphics core usage is not high. In PVRTune this can be very easily identified since CPU limited applications have a CPU load that is at or near one hundred percent (a).

2-Fantasy Warrior 3D - PVRTune

Other identifying factors include gaps in the shader load, caused by the PowerVR hardware going to sleep while waiting for CPU operations to complete (b) or the GPU waiting for the next vsync interval.

For this game we captured the following data:

3-Fantasy Warrior 3D - PVRTune (1)

3-Fantasy Warrior 3D - PVRTune (2)

The CPU load sits at just 12.0%, yet there are many large gaps in the Tiler and Renderer timing blocks: the PowerVR hardware has to sleep while it waits for work from the render thread (10612). The root cause is that the Cocos2d-x engine does not have a dedicated rendering thread, so in every frame the render task must wait for the game logic to finish. You can see the resulting large gaps between graphics API calls in the third row of timing blocks. So although the overall CPU load is low, the GPU is being starved by the CPU side, which means we are in a CPU limited scenario.
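The "gaps" in a timing row can also be expressed as a number. The following is a minimal sketch that assumes the task start and end times have been read off the PVRTune timeline by hand; the `idle_fraction` helper and the sample intervals are hypothetical and are not taken from the recording or from any PVRTune export format.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms) of one GPU task


def idle_fraction(tasks: List[Interval], frame_start: float, frame_end: float) -> float:
    """Return the share of the frame during which no task was running."""
    if frame_end <= frame_start:
        raise ValueError("frame_end must be greater than frame_start")
    busy = 0.0
    cursor = frame_start
    for start, end in sorted(tasks):
        # Clip each task to the frame and skip time that was already counted,
        # so overlapping tasks are not counted twice.
        start = max(start, cursor)
        end = min(end, frame_end)
        if end > start:
            busy += end - start
            cursor = end
    return 1.0 - busy / (frame_end - frame_start)


# Hypothetical Renderer tasks inside a 34 ms frame (illustrative values only).
renderer_tasks = [(2.0, 6.0), (14.0, 19.0), (27.0, 30.0)]
renderer_idle = idle_fraction(renderer_tasks, 0.0, 34.0)
print(f"Renderer idle: {renderer_idle:.0%}")
```

A high idle fraction in both the Tiler and Renderer rows, combined with gaps between API calls, points at the CPU side rather than at GPU workload.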

Vertex limited

Vertex limited applications are applications where the bottleneck comes from processing either large amounts of vertices per frame, or from the use of a complex vertex shader, or both. This can be identified by large gaps between Renderer tasks (a) while there is little or no gap between Tiler tasks (b).

5-Fantasy Warrior 3D - Vertex processing

Further information can be gained from the Processing load: Vertex and Tiler active counters. If Tiler active is high (c) but Processing load: Vertex is not, then the scene has too many vertices in it and the cost is coming from the tiling process. On the other hand, if Processing load: Vertex is high (d) but Tiler active is not, then the bottleneck is likely to be in the vertex shader.
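The two-counter rule above can be written as a small decision function. This is only a sketch: the `vertex_cost_source` name and the 70% cut-off for "high" are assumptions made for illustration, since the article gives no numeric threshold.

```python
def vertex_cost_source(tiler_active: float, vertex_load: float, high: float = 70.0) -> str:
    """Classify where vertex cost comes from, given two counter averages in percent.

    `high` is an assumed threshold for what counts as a "high" counter value.
    """
    tiler_high = tiler_active >= high
    vertex_high = vertex_load >= high
    if tiler_high and not vertex_high:
        return "too many vertices (tiling cost)"
    if vertex_high and not tiler_high:
        return "vertex shader cost"
    if tiler_high and vertex_high:
        return "both vertex count and vertex shader"
    return "neither counter is high"


# Hypothetical counter values, in percent.
print(vertex_cost_source(tiler_active=85.0, vertex_load=20.0))
print(vertex_cost_source(tiler_active=25.0, vertex_load=90.0))
```

The counters only make sense alongside the timeline view: a "neither counter is high" result says nothing about CPU or bandwidth limits.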

Here is the data we got for Fantasy Warrior 3D:

6-Fantasy Warrior 3D - Vertex processing

There are many gaps in both the Renderer tasks and the Tiler tasks. The average Processing load: Vertex is 1.6% (peaking at 14.4%), and the average Tiler active is 10%. Both frame averages are very low, so Fantasy Warrior 3D is not vertex limited, although the vertex shader could still be optimized with PVRShaderEditor.

V-sync limited

Vertical synchronization (V-sync) is a display option that forces an application to synchronize graphical updates with the update rate of the screen. This causes some frames to be slightly delayed and enforces a maximum refresh rate, but reduces screen tearing and can save power. V-sync limited applications are often characterized by intermittent gaps between frames in the graph view, and the frame rate appears to be limited at a set maximum value. If possible, v-sync should be disabled when profiling an application as it adds noise to the PVRTune output and this makes it more difficult to diagnose where optimization work could be beneficial or if completed optimization has been successful.

If we analyze the data for Fantasy Warrior 3D, we see that the gaps between frames are very stable (1-2 ms), and the frame rate for this frame is 29.3 FPS. Since the frame rate does not appear to be capped at a set maximum value, the game is not V-sync limited.
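One rough way to test for "limited at a set maximum value" is to check how many per-frame FPS samples sit right at the highest observed value. The sketch below assumes the FPS samples have been collected separately; the `looks_vsync_capped` helper, its tolerance and share parameters, and both sample lists are hypothetical.

```python
from typing import Sequence


def looks_vsync_capped(
    fps_samples: Sequence[float],
    tolerance: float = 0.5,
    min_share: float = 0.8,
) -> bool:
    """Return True when most samples sit within `tolerance` FPS of the highest sample."""
    if not fps_samples:
        raise ValueError("fps_samples must not be empty")
    ceiling = max(fps_samples)
    at_ceiling = sum(1 for fps in fps_samples if ceiling - fps <= tolerance)
    return at_ceiling / len(fps_samples) >= min_share


# Hypothetical samples: one run pinned near 60 FPS, one that drifts freely.
capped_run = [60.0, 59.8, 60.0, 59.9, 60.0, 45.2, 60.0, 59.9, 60.0, 60.0]
free_run = [29.3, 24.1, 33.7, 27.5, 31.0]
print(looks_vsync_capped(capped_run), looks_vsync_capped(free_run))
```

This is a crude heuristic: an application with a naturally steady frame rate would also be flagged, so the result should be checked against the gaps between frames in the graph view.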

Fragment limited

Fragment limited applications are very common and occur in most scenes that have fewer vertices than the number of pixels in the framebuffer. They can be identified by the absence of gaps between Renderer tasks (a), large gaps between Tiler tasks (b), or a high value of Processing load: Pixel (c).

7-Fantasy Warrior 3D - Fragment processing

For this game, however, we got the following data:

8-Fantasy Warrior 3D - Fragment processing

Processing load: Pixel is 46.3% and there are always large gaps between Renderer tasks, so Fantasy Warrior 3D is not fragment limited.

Bandwidth limited

Bandwidth limited applications are hard both to visualize and to identify, as they may appear as other bottlenecks. A program may be bandwidth limited if:

  • Timeline shows the application to be fragment limited but the Processing load: Pixel is low.
  • Timeline shows the application to be vertex limited but the Processing load: Vertex and Tiler active are low.
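The two rules above compare what the timeline suggests with what the load counters report. A minimal sketch of that comparison follows; the `may_be_bandwidth_limited` name, the string labels for the timeline verdict, and the 30% cut-off for "low" are all assumptions, as the article does not define a numeric threshold.

```python
def may_be_bandwidth_limited(
    timeline_verdict: str,
    pixel_load: float,
    vertex_load: float,
    tiler_active: float,
    low: float = 30.0,
) -> bool:
    """Flag a possible bandwidth limit from a timeline reading plus counter averages.

    `timeline_verdict` is the reader's own judgement of the timeline
    ("fragment", "vertex", or anything else); `low` is an assumed
    percentage below which a counter counts as "low".
    """
    if timeline_verdict == "fragment":
        return pixel_load < low
    if timeline_verdict == "vertex":
        return vertex_load < low and tiler_active < low
    return False


# Hypothetical case: the timeline looks fragment limited, but the pixel load is low.
print(may_be_bandwidth_limited("fragment", pixel_load=12.0, vertex_load=5.0, tiler_active=8.0))
```

A True result is only a hint to investigate memory traffic further, since bandwidth pressure from other parts of the SoC is not visible in these counters.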

Other instances of bandwidth limitation may occur. For example, bandwidth in System-on-Chip (SoC) devices is shared among all components of the chip. Non-graphics processor areas of the chip (the CPU, for example) using large amounts of bandwidth may still cause application graphics to be bandwidth limited. This is platform specific and, as such, there is no counter to record it. As a rule of thumb, action should always be taken to reduce bandwidth use whenever possible through the correct use of texture compression, mesh optimization, and by avoiding unnecessary texture reads, etc.

Based on the conclusions of the Vertex limited and Fragment limited sections of this article, this game is not bandwidth limited.


This game is a typical CPU limited case. As discussed in the CPU limited section of this article, moving the OpenGL ES call submission to a dedicated CPU thread will keep the GPU busy and should improve the framerate on many devices. In the next post, I will discuss how the advanced features of PVRTune can be used to isolate the specific causes of performance bottlenecks in Fantasy Warrior 3D.
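The idea of a dedicated submission thread can be illustrated with a producer/consumer queue. This sketch is not Cocos2d-x code and performs no rendering: it is written in Python purely to show the structure, the `run_frames` function is hypothetical, and the placeholders stand in for the game logic and for the OpenGL ES calls.

```python
import queue
import threading
from typing import List


def run_frames(frame_count: int) -> List[int]:
    """Run game logic on the calling thread and 'submit' frames on a worker thread."""
    # Bounded queue, so the game logic cannot run arbitrarily far ahead of rendering.
    commands: "queue.Queue" = queue.Queue(maxsize=2)
    submitted: List[int] = []

    def render_worker() -> None:
        while True:
            frame = commands.get()
            if frame is None:  # sentinel: no more frames
                break
            # Placeholder for issuing this frame's OpenGL ES calls.
            submitted.append(frame)

    worker = threading.Thread(target=render_worker)
    worker.start()
    for frame in range(frame_count):
        # Placeholder for game logic; it hands the frame over and moves on
        # to the next frame while the worker submits this one.
        commands.put(frame)
    commands.put(None)
    worker.join()
    return submitted


print(run_frames(5))
```

In a real engine the thread that issues the graphics calls also needs the rendering context to be current on it, and any data shared between the two threads must be synchronized.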

Sun Kevin

Kevin Sun is a leading PowerVR developer technology engineer for Imagination Technologies.