Synthetic GPU benchmarking, like other performance benchmarking, has been a point of contention in the embedded industry for as long as the practice has existed. Historically, GPU benchmarks have drawn attention to isolated subsections of performance while claiming to describe the holistic performance of a GPU. Customers deserve more than that, especially given the weight these numbers carry in buying decisions.

Today’s problems with benchmarking can be summed up in a simple question: When investing in any form of technology, would you rather know how it performs in a real-world example, or are you comfortable relying on a theoretical scenario? Knowing how readily a GPU renders a user’s favorite game, and how long that game can be played at a suitable frame rate, is useful at both the consumer and OEM level.

Mobile GPUs are in the midst of an evolution similar to the one desktop GPU benchmarking went through:

· Phase 1: Benchmarking consists of extremely theoretical and somewhat confused comparisons of architectural triangles per second and pixels per second rates.

· Phase 2: Earlier benchmarks are developed into actual applications that supposedly measure triangle and pixel rates by rendering arbitrary spinning objects.

· Phase 3: Benchmarks consist of synthetic game scenes designed specifically to test a GPU’s maximum compute capacity. This is where mobile GPU benchmarking is now.

· Phase 4: Benchmarks expand to cover the comparison of metrics garnered by running actual content and assessing each GPU’s merits based on that.

Case study: Real application versus synthetic benchmarking

Examining benchmarks frame by frame adds some color to the situation. Current popular benchmarks claim to stress-test GPUs to discover the maximum number of frames they can deliver in a certain time period. Some audiences may be interested to know that one device can deliver 300+ frames of particular content in a fixed time period while another can deliver only 250+, but that is not the information consumers really need. It is arbitrary at best and does not correspond to any experience they might have of the device.

ARM has been internally running benchmark tests with more than one million frames of real content from top OpenGL ES-enabled games on the app store, analyzing multiple performance areas. We are using popular games like Angry Birds and analyzing CPU load, frames per second, microarchitecture (uArch) data, and a large amount of GPU-agnostic API usage and render flow composition data. The data gathered in this analysis yields some very interesting results. For instance, the imagery in Asphalt 7 and other HD games on the same GPU appears to show similar levels of graphical user experience. This would lead a user to believe that they are constructed from a broadly similar level of workload, but this is not the case (Figure 1).

**Figure 1:** The test results appear to show similar levels of user experience and therefore a similar level of workload, but this isn’t the case.

When comparing data from popular benchmarks with data from real applications, the fragment count for benchmarks is similar to that of popular games, while the vertex count goes through the roof. Globally, the average primitive to fragment ratio for Benchmark C at 1080p is 1:13.1. However, examining the content draw call by draw call, 50 percent of Benchmark C draw calls have a primitive to fragment ratio of less than 1:1, and an additional 24 percent have a ratio of less than 1:10, which runs directly against the recommended guideline of more than 1:10, that is, more than 10 fragments per primitive (Figure 2). As a result, rather than giving a feel for overall performance, the benchmark effectively becomes a micro-benchmark of a single aspect of performance, one that is rarely a factor given the more balanced workloads in real applications.

**Figure 2:** When examining real application data, Figure 1’s Benchmark C draw calls give misleading results for performance.
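
To make the gap between a global average and the per-draw-call picture concrete, here is a minimal Python sketch. It is not the tooling used for the analysis above: the `DrawCall` type, the band names, and the sample frame are hypothetical and exist only for illustration. Ratios are expressed as fragments per primitive, so a primitive to fragment ratio of 1:10 becomes 10, and every draw call is assumed to submit at least one primitive.

```python
from dataclasses import dataclass


@dataclass
class DrawCall:
    """Hypothetical per-draw-call counters, as a capture tool might report them."""
    primitives: int  # triangles submitted by the draw call (assumed > 0)
    fragments: int   # fragments generated by those triangles


def fragments_per_primitive(call: DrawCall) -> float:
    """Fragments generated per primitive for a single draw call."""
    return call.fragments / call.primitives


def global_ratio(calls: list[DrawCall]) -> float:
    """Frame-wide fragments per primitive: total fragments over total primitives."""
    return sum(c.fragments for c in calls) / sum(c.primitives for c in calls)


def band_shares(calls: list[DrawCall]) -> dict[str, float]:
    """Share of draw calls falling in each fragments-per-primitive band."""
    counts = {"below_1": 0, "1_to_10": 0, "10_or_more": 0}
    for call in calls:
        ratio = fragments_per_primitive(call)
        if ratio < 1:
            counts["below_1"] += 1
        elif ratio < 10:
            counts["1_to_10"] += 1
        else:
            counts["10_or_more"] += 1
    return {name: n / len(calls) for name, n in counts.items()}


# Made-up frame, not measured data: two micro-triangle-heavy draw calls,
# one borderline draw call, and one large, fragment-heavy draw call.
frame = [
    DrawCall(primitives=20_000, fragments=10_000),   # 0.5 fragments per primitive
    DrawCall(primitives=10_000, fragments=5_000),    # 0.5
    DrawCall(primitives=5_000, fragments=25_000),    # 5
    DrawCall(primitives=1_000, fragments=500_000),   # 500
]

overall = global_ratio(frame)
shares = band_shares(frame)
```

With these made-up numbers the frame-wide figure is 15 fragments per primitive, which looks healthy, even though half of the draw calls generate less than one fragment per primitive. A single fragment-heavy draw call is enough to hide the imbalance in the average.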

Actual games are far more balanced and consistent, with fewer micro-triangles and the majority of draw calls handling more than 10 fragments per triangle. Benchmark providers admit they use high vertex counts to stress GPUs, claiming this gives users “realistic” feedback on how their GPU will respond to future content. However, such stress testing is clearly not realistic, as it does not reflect the balance of fragment and geometry work in the applications consumers use daily.

Geometry-heavy benchmarks also fail to take into account the most limiting factor in mobile device performance: bandwidth (Figure 3).

**Figure 3:** Relative bandwidths in benchmark tests compared to real applications.

Real-world applications are much more consistent in how bandwidth is balanced across rendering. In the benchmarks, 3-8x more bandwidth is used for geometry, meaning less bandwidth is available for fragment generation, which is what the user will actually see. This again generates a false impression of capability by focusing on a microscopic effect of an architectural choice rather than the macroscopic performance effect. Effectively, the supposed differences highlighted by these benchmarks will never be perceptible to the end user in real use cases, yet they fuel an arms race that pushes silicon footprint and power envelopes to support production of a bigger number.
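
The arithmetic behind this squeeze can be sketched in a few lines of Python. This is an illustration under stated assumptions, not measured data: it assumes a fixed per-frame bandwidth budget, and the budget and geometry figures below are hypothetical values picked to make the effect visible.

```python
def fragment_budget_mb(frame_budget_mb: float, geometry_mb: float) -> float:
    """Bandwidth left for fragment work once geometry traffic is paid for."""
    return max(frame_budget_mb - geometry_mb, 0.0)


# Hypothetical numbers: a 100 MB per-frame budget and 10 MB of geometry
# traffic for a balanced, game-like workload.
FRAME_BUDGET_MB = 100.0
GAME_GEOMETRY_MB = 10.0

# Scale geometry traffic by 1x (game-like), then by the 3x and 8x ends of
# the range discussed above, and see what remains for fragments.
remaining = {
    multiplier: fragment_budget_mb(FRAME_BUDGET_MB, GAME_GEOMETRY_MB * multiplier)
    for multiplier in (1, 3, 8)
}
```

Under these assumptions the bandwidth left for fragment work drops from 90 MB to 70 MB at 3x geometry traffic and to 20 MB at 8x, which is the sense in which heavy geometry starves the part of the frame the user actually sees.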

Five steps for change

As synthetic benchmarks aren’t going away, they should at least follow these rules:

· Follow Moore’s Law: Moore’s Law (popularly stated as compute potential doubling every 18 months) applies to GPUs as much as it does to CPUs. Year on year, the average workload represented in a benchmark should not exceed double the previous year’s, and it should remain balanced. This way, companies don’t attempt to outstrip Moore’s Law.

· GPU over bandwidth test: The raw bandwidth per frame at 60 fps should not exceed the available bandwidth. The baseline for bandwidth should be set at that of a typical mobile device over the next 24 months. Make the test’s objective as independent as possible of whether the device has high bandwidth capacity.

· Use recognized techniques: Techniques should be aligned with current best practice and appropriate for the type of scene. These techniques should also be relevant to the mobile market (see bandwidth rule).

· Excessive geometry is not an acceptable proxy for workload: The primitive to fragment ratio per draw call should be balanced. Many current benchmarks have far too much geometry. The rule of 10 fragments per primitive should be the minimum here.

· Overdraw is not an acceptable proxy for workload: An overdraw average in excess of 2x on any surface is not representative. Instead, add a feature that offers a visual return on investment for the user (something they can actually see).
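
The numeric rules above lend themselves to a simple automated check. The following Python sketch is one possible reading of them, not an established tool: the `BenchmarkProfile` fields, the way workload is measured, and the baseline bandwidth figure are all assumptions made for illustration. The rule about recognized techniques is a matter of judgment and is left out.

```python
from dataclasses import dataclass

TARGET_FPS = 60                   # frame rate named in the bandwidth rule
MAX_YEARLY_GROWTH = 2.0           # workload should not exceed double last year's
MIN_FRAGMENTS_PER_PRIMITIVE = 10  # the 10 fragments per primitive rule
MAX_AVERAGE_OVERDRAW = 2.0        # per-surface average overdraw ceiling


@dataclass
class BenchmarkProfile:
    """Hypothetical summary of one benchmark; the field choices are assumptions."""
    workload: float                     # any consistent workload measure, this year
    previous_workload: float            # the same measure for last year's benchmark
    bytes_per_frame: float              # raw bandwidth consumed per frame
    min_fragments_per_primitive: float  # lowest per-draw-call ratio observed
    max_average_overdraw: float         # highest per-surface average overdraw


def check_rules(profile: BenchmarkProfile, baseline_bytes_per_second: float) -> list[str]:
    """Return the names of the numeric rules the profile breaks."""
    violations = []
    if profile.workload > MAX_YEARLY_GROWTH * profile.previous_workload:
        violations.append("workload growth")
    if profile.bytes_per_frame * TARGET_FPS > baseline_bytes_per_second:
        violations.append("bandwidth")
    if profile.min_fragments_per_primitive < MIN_FRAGMENTS_PER_PRIMITIVE:
        violations.append("geometry")
    if profile.max_average_overdraw > MAX_AVERAGE_OVERDRAW:
        violations.append("overdraw")
    return violations


# Hypothetical baseline for a "typical" device; not a figure from any product.
BASELINE_BYTES_PER_SECOND = 12.8e9

balanced = BenchmarkProfile(
    workload=1.8, previous_workload=1.0, bytes_per_frame=100e6,
    min_fragments_per_primitive=12.0, max_average_overdraw=1.8,
)
geometry_heavy = BenchmarkProfile(
    workload=3.0, previous_workload=1.0, bytes_per_frame=300e6,
    min_fragments_per_primitive=0.5, max_average_overdraw=3.5,
)

balanced_violations = check_rules(balanced, BASELINE_BYTES_PER_SECOND)
heavy_violations = check_rules(geometry_heavy, BASELINE_BYTES_PER_SECOND)
```

With these made-up profiles, the balanced benchmark passes every check, while the geometry-heavy one breaks all four. Checking the lowest per-draw-call ratio rather than the frame-wide average is a deliberate choice here, since an average can hide draw calls dominated by micro-triangles.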

GPU benchmarks still have a long way to go; however, adoption of the above rules will at least bring synthetic benchmarking a little closer to something representative of real content.

The world of mobile content is itself dynamic and constantly evolving. Eventually, to cope with this, the industry will have to arrive at a place similar to that of desktop, where real application workloads become the benchmark, allowing for a far more well-rounded view of the GPU.

Ed Plowman is Director of Solutions Architecture at ARM. Contact Ed via LinkedIn or the ARM Community.

ARM

www.arm.com

Twitter: @ARMEmbedded

LinkedIn: http://www.linkedin.com/company/arm

Google+: https://plus.google.com/+arm/

YouTube: http://www.youtube.com/ARMflix/

Blog: http://community.arm.com/
