Building Heterogeneous Systems with PowerVR: OpenCL Programmer’s Reference

1. Introduction
1.1. Example use case and SoC bandwidth constraints
1.2. PowerVR Imaging Framework for Android
1.3. PowerVR Imaging Framework SDK
2. Guide to writing OpenCL kernels for Rogue
2.1. OpenCL overview
2.2. OpenCL execution on the Rogue architecture
2.3. Kernel programming guidelines
2.4. Case study: Image convolution filtering
2.5. Caching frequently-used data in the common store
3. Measuring performance
3.1. Utilization
3.2. Occupancy
3.3. Parallelism
4. Advanced programming usage
4.1. Maximizing execution speed
4.2. Workgroup size considerations
4.3. Occupancy considerations
4.4. System memory access
4.5. Unified store access
4.6. Common store access

Appendix A. Supported zero-copy flows
A.1. GPU sampling of native YUV semi-planar image
A.2. GPU sampling of native YUV planar image
A.3. GPU sampling of YUV image for processing as RGB
A.4. GPU sampling from Android Camera HAL
A.5. Sharing OpenCL buffer between CPU and GPU

Appendix B. Measuring Rogue peak performance

Appendix C. Mathematical functions accuracy
C.1. Standard functions
C.2. Native functions

Appendix D. Rogue compute capabilities
D.1. Features and technical specifications

Appendix E. Terminology

List of Figures
Figure 1: Components of a modern application processor SoC
Figure 2: Abstract heterogeneous architecture comprising CPU and GPU
Figure 3: Example vision software pipeline implemented on top of hardware
Figure 4: Example zero-copy flow between ISP and GPU
Figure 5: Zero-copy transfer between a camera and display
Figure 6: Suggested flow for developers to benefit from the PowerVR imaging framework
Figure 7: Example NDRange in 2 dimensions comprising 512 work-items
Figure 8: Rogue architecture and data flow overview
Figure 9: A single multiprocessor with 16 residency slots. Six slots are occupied by warps, of which one is active (running on the execution unit), four are ready to be executed, and one is blocked on a memory or barrier operation
Figure 10: Parallel execution: All of the work-items in the active warp execute the statement ‘w_i = a_i * b_i’ together (for i = 0..31) using variables private to their local context
Figure 11: Serial execution of a warp due to divergence: The work-items in the active warp each execute a different case of the switch statement, causing execution of the warp to be fully serialized
Figure 12: A blocked warp on a multiprocessor execution unit
Figure 13: Scheduling of four warps over time, fully hiding memory latency
Figure 14: Scheduling of four warps over time, unable to hide all memory latency
Figure 15: Example of image filtering by means of convolution
Figure 16: A 3x3 image filter example showing overlap between adjacent sampled values. The 32 work-items in a workgroup require a total of 60 input pixels
Figure 17: Three OpenCL kernels executing on the GPU. Related CPU commands are shown as arrows on the timeline
Figure 18: Example scheduling of a kernel comprising 4 warps with 75% utilization
Figure 19: Occupancy graphs
Figure 20: Parallel versus serial execution of a statement in a warp
Figure 21: Common Store organization
Figure 22: Zero-copy YUV semi-planar native flow between hardware and GPU
Figure 23: Zero-copy YUV planar native flow between hardware and GPU
Figure 24: Zero-copy YUV-to-RGB flow
Figure 25: Integrating zero-copy flow with Android Camera HAL
Figure 26: OpenCL buffer allocation with CL_MEM_USE_HOST_PTR
Figure 27: OpenCL buffer allocation, access buffer data
Figure 28: OpenCL buffer allocation, access to buffers shared by map and unmap
