How to Use Objective Quality Measurement Tools

Every compressed file involves dozens of configuration-related decisions, including resolution, data rate, H.264 profile, VBR or CBR, entropy coding technique, x264 preset, B-frames, and reference frames; the list goes on and on. Most encoding professionals simply use configurations gleaned from presets supplied with their encoding tools, or perhaps from recipes found on the web. But how can you be sure that you’re squeezing the last bit of quality out of the selected data rate, or that your videos are optimally bandwidth-efficient? How can you tell how much additional quality a 1080p@7.5Mbps stream delivers over the 5.5Mbps stream?

Basically, you have three options: ignore the issue and hope for the best, implement time-consuming and expensive subjective testing, or use objective quality metrics, which are less expensive and consume less time, but still require investments of both money and effort. Over the past 18 months, I’ve adopted the last alternative. In this article, I’ll introduce you to two objective quality measurement tools, and describe how I use them to make better-informed compression-related decisions. But let’s start with a brief description of what objective quality benchmarks actually are.

What Are Quality Metrics?

Without question, the gold standard for assessing video quality is a controlled subjective test, which, as previously mentioned, can be time-consuming and expensive to run. Objective quality benchmarks are algorithms that compare the compressed video with the source and render a value that predicts how the compressed file would fare in subjective tests. There are multiple algorithms, all rated according to how well they correspond with actual subjective evaluations. None are perfect, but some perform better than others.

I use two tools to compute these scores: the Moscow University Visual Quality Comparison Tool (VQMT, $995) and the SSIMWave Video Quality-of-Experience Monitor (SQM, ~$2,400). Both run in GUI and batch mode, which is a lifesaver for most projects.

Briefly, VQMT is an algorithm-agnostic tool that lets you run more than 20 different quality algorithms, or versions of algorithms, including the familiar Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). For various reasons, I’ve standardized on the VQM metric, where lower scores indicate superior quality. Still, the ability to compute PSNR and SSIM is often useful for clients or supervisors who are familiar with those metrics and want to see the results.
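To make the idea of comparing a compressed file against its source concrete, here is a minimal sketch of PSNR, the simplest of the metrics mentioned above. It assumes 8-bit samples passed in as flat lists of pixel values; the function name and the toy frames are illustrative and say nothing about how VQMT implements the metric.

```python
import math

def psnr(reference, compressed, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(compressed) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error between the source and compressed samples.
    mse = sum((r - c) ** 2 for r, c in zip(reference, compressed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

# Toy 2x2 "frames" (hypothetical values): every compressed pixel is off by 1.
source_frame = [52, 55, 61, 59]
encoded_frame = [53, 54, 62, 58]
print(f"{psnr(source_frame, encoded_frame):.2f} dB")
```

Note that PSNR runs in the opposite direction from VQM: higher PSNR values indicate a closer match to the source.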

From a usability standpoint, operation is simple in both batch and GUI modes. The GUI can process two files simultaneously (Figure 1), which is amazingly convenient when you’re comparing different encoding alternatives and want to view the differences in the actual frames, which the VQMT interface facilitates. The primary limitation is that you can only compare the quality of files at the same resolution as the source. This prevents analysis in the manner discussed below, where you’re trying to find the best resolution for a file at a given bitrate. Beyond this limitation, VQMT is very useful, and there’s a free trial version you can download that processes files up to, but not including, 720p in resolution. You can find information about the product and trial version, read my review of the product, and watch a short demo on YouTube.

Figure 1. VQMT can compare two files at once and presents this visualization, which lets you scan through the tested file(s). Click Show frame to view the actual frame.

The SSIMWave SQM tool offers a different value proposition. Specifically, the tool is built around the company’s SSIMplus algorithm, which was co-invented by Zhou Wang, the company’s co-founder and a co-inventor of the SSIM algorithm, which recently won an Emmy from the Television Academy. According to tests performed by company researchers, SSIMplus scores matched actual subjective ratings more closely than those of any other tested algorithm, including SSIM and VQM, the algorithm I use with VQMT. Today, the SQM tool is the only way to access the SSIMplus algorithm.

Unlike VQM scores, SQM ratings map directly to subjective rating levels, so a score of 80 to 100 predicts that live viewers will find the video excellent in quality; 60 to 80 predicts that viewers will rate the video good in quality, and so on down to zero. In contrast, the VQM rating can tell you which video has higher quality, but it doesn’t map to any specific level of viewer perception.
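The banding described above can be sketched as a simple lookup. Only the 80 to 100 (excellent) and 60 to 80 (good) bands are stated here; the labels for the three lower bands, and the choice to put a boundary score such as 80 in the higher band, are assumptions made for illustration.

```python
def sqm_band(score):
    """Map a 0-100 score to a quality label using 20-point bands."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # "excellent" and "good" come from the article; "fair", "poor", and "bad"
    # are assumed labels for the remaining bands.
    bands = [(80, "excellent"), (60, "good"), (40, "fair"), (20, "poor"), (0, "bad")]
    for floor, label in bands:
        if score >= floor:
            return label

for example_score in (97, 72, 35):
    print(example_score, sqm_band(example_score))
```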

Beyond this, SQM offers two key features not available in VQMT. First, you can select a device-specific profile and SQM will render a score that predicts how viewers watching on those devices will rate the video. This is important, because what looks good on a smartphone doesn’t necessarily look good on a 65" 4K TV set. Second, SQM can predict scores at resolutions different from the source resolution. This enables the second analysis presented below, where you want to find the optimal resolution for a file at a specific bitrate.

When I wrote my review of SQM, the product was very competent, but lacked the visualization tools VQMT provides. As shown in Figure 2, SSIMWave has added these, bringing the tool up to par with VQMT in this very important regard.

Figure 2. SQM’s new QoE Analyzer, a very useful visualization tool for SQM

How do I use the two tools? After months of working with both, I’ve found VQM to be a more sensitive canary in a coal mine than SQM, and better at identifying small differences between files. As you’ll see in Table 4, where VQM found a 6.8 percent difference between the 5Mbps and 6.5Mbps files, SQM found a 0.12 percent difference. Of course, sometimes the differences don’t add up to anything perceptible, as the SQM scores suggest, but since VQMT makes these differences very easy to spot, I still find it very convenient. Besides, sometimes lots of little differences add up to a big difference, and VQMT reveals the individual components of the big difference.

Of course, SQM provides a very useful counterpoint. If VQMT says the sky is falling, so to speak, and SQM says relax, I tend to relax. Moreover, SQM provides the multiple-resolution (and soon, multiple-frame rate) analysis, and device-specific profiles that VQMT doesn’t offer. I find both tools invaluable in their separate roles.

My Test Files

Let’s spend a couple of moments describing the test files. As you’ll see, different types of videos respond differently to various compression options. For this reason, if you’re working with different types of videos, you should create short test files and test each type. Here are the files that I tested in the examples below.

- Tears of Steel—the Blender Foundation movie; mix of animation and live action video (mostly live action)

- Sintel—Another Blender Foundation movie; all animation, but very lifelike rather than cartoonish

- Big Buck Bunny—Yet another Blender Foundation movie; all animation, but more cartoonish than Sintel

- Screencam—a screencam from the VQMT YouTube demo referenced above

- Tutorial—a PowerPoint presentation with talking head video grabbed from a Udemy course on Multiple Screen Delivery

- Talking Head—a simple talking head video of yours truly in my office

- Freedom—Multicam concert footage (HDV/AVCHD) of the fabulous Josiah Weaver at the Greensboro Coliseum

- Haunted—footage from a trailer I shot with a DSLR for the Haunted Graham Mansion

    Let’s jump into our tests.

    Custom Encoding or All Files the Same?

    If you work with more than one file type, the first question you have to address is whether to encode them all using the same ABR group. This first test indicates that the answer is probably not. For this test, I encoded the eight 720p test files in HandBrake using constant rate factor (CRF) encoding with a value of 19. Briefly, CRF encoding adjusts the data rate of the file to maintain a constant quality level. As you can see in the SQM column at the far right of Table 1, all of the videos range in quality from 95 to 99, which predicts that viewers would rate these videos as excellent. However, the screencam and tutorial videos achieved a score of 99 at 11 percent and 8 percent of the maximum data rate recorded in this test. In other words, you can encode these types of files at roughly 10 percent of the data rate of real-world video and achieve the same quality level. Interestingly, with most encoders, once you choose a target data rate for these types of files, the encoder will deliver that rate, even though it could deliver the same quality at a much lower data rate.
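The encodes for this test were produced in Handbrake. As a rough stand-in for readers who script their encodes, the sketch below builds an ffmpeg command that uses the x264 encoder’s CRF mode with the same value of 19. ffmpeg is a swapped-in tool here, not the one used for the test, and the file names and helper function are placeholders.

```python
import shlex

def crf_command(source, output, crf=19, height=720):
    """Build (but do not run) an ffmpeg command for a CRF encode with libx264."""
    return [
        "ffmpeg", "-i", source,
        "-c:v", "libx264",             # H.264 via x264
        "-crf", str(crf),              # constant rate factor: data rate floats, quality is held
        "-vf", f"scale=-2:{height}",   # scale to the target height, keeping aspect ratio
        "-an",                         # drop audio so only video data rate is measured
        output,
    ]

# Placeholder file names for illustration.
cmd = crf_command("talking_head_source.mp4", "talking_head_crf19.mp4")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute the encode; the resulting file’s data rate then shows how much data that clip needs to hold the chosen quality level.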

    Table 1. Data rates required for specified CRF levels.

    Note that Tears of Steel and Sintel were both produced and encoded at 24 frames per second. To compare their data rates to the other 30 fps files in the test, you’d have to add 20 percent to their data rates, which boosts their comparable data rates to around 4,800Kbps. This compares to 2,559Kbps for Big Buck Bunny, which was produced at 30 fps. The takeaway here is that simulated real world animations, such as Sintel, encode like live action videos, while more cartoonish animations, such as Big Buck Bunny, are a different class that might be able to support a much lower data rate and still achieve the same quality level.

    Interestingly, the “Talking Head” video showed only a 17 percent reduction in data rate as compared to the highest data rate files. This differential may increase for your hardest to encode sports or other high-motion videos, but at 17 percent, I wouldn’t recommend a different adaptive group for talking head videos than the other live action videos in this group.

    Before leaving, let’s observe that in this example the VQM results aren’t that helpful because the numbers don’t correspond with any subjective quality level. While the lower numbers for the screencam and tutorial videos indicate higher quality than the others, there’s no correlation with any subjective evaluation.

    Configuration of the 2,500Kbps File

    Table 2 presents another reason not to encode screencam and tutorial videos the way you encode live action videos; simply stated, they look worse when subsampled to lower resolutions. But let’s back up.

    If you look at the adaptive group recommended in Apple Technical Note TN2224, you’ll note a significant gap between the 640x360@1,200Kbps file and the next largest file at 960x540@3,500Kbps. If you’re concerned about mobile viewers getting the best experience, and you should be, you might want to fill in that gap with a file at around 2,500Kbps. The obvious question is, what’s the best resolution for that file? That’s what we test in Table 2.

    Table 2. Choosing the best resolution for our 2,500Kbps file

    To complete the table, I encoded all 1080p source files at the various resolutions to 2,500Kbps, and computed their SQM scores for the iPhone 6 Plus and iPad Air 2, which are shown in the table. As you probably suspect, the worst scores have a red background, while the highest scores have a green background. The Delta column shows the difference between the highest and lowest score, and I’ve highlighted the results for the tutorial and screencam video.

    Intuitively, we know that these various options trade detail for more data per pixel, which translates to higher-quality pixels. That is, the 480p video has the lowest detail, but the highest quality per pixel, while the 1080p video has the most detail, but the lowest quality per pixel. In all cases, the 480p video delivered the worst quality on the two tested devices. However, the results were substantially worse on both devices for the screencam and tutorial videos, which contain lots of detail that blurs or otherwise loses quality when subsampled to lower resolutions.
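The data-per-pixel tradeoff is easy to put numbers on. The sketch below computes the average bits available per pixel at the 2,500Kbps target for each resolution discussed here. The 16:9 frame sizes and the 30 fps frame rate are assumptions; only the data rate comes from the test.

```python
def bits_per_pixel(kbps, width, height, fps):
    """Average encoded bits available per pixel per frame."""
    return (kbps * 1000) / (width * height * fps)

# Assumed 16:9 frame sizes and an assumed 30 fps frame rate.
RESOLUTIONS = {
    "480p": (854, 480),
    "540p": (960, 540),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

for name, (width, height) in RESOLUTIONS.items():
    print(f"{name}: {bits_per_pixel(2500, width, height, 30):.3f} bits per pixel")
```

Under these assumptions, the 480p file gets roughly five times as many bits per pixel as the 1080p file, which is the "higher-quality pixels" side of the trade. The metric scores then tell you whether the lost detail outweighs that gain.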

    Given this data, I would avoid the 480p and 540p resolutions for the 2,500Kbps file in all cases, and might consider scratching the 540p resolution suggested by Apple for the 3,500Kbps file. It seems that jumping directly to 720p for both might be the best option.

    For “Screencam” and “Tutorial” videos, which look great at very low data rates, it might make more sense to eschew an adaptive group altogether, and simply deliver a single 1080p or 720p file at 300 to 400Kbps. Given the SQM results, I would think long and hard before delivering these files at less than 720p.

    CBR vs. VBR

    The next issue involves constant bitrate encoding (CBR) versus variable bitrate (VBR) encoding. Within the context of an adaptive group, many experts recommend using CBR to avoid spurious stream changes that relate to data rate fluctuations rather than actual changing bandwidth conditions. The big question is, how much quality are you losing with CBR? Quite a bit, it turns out, in two critical ways.

    To test the quality differential, I encoded the test files at 720p@2,500Kbps using single- and dual-pass CBR encoding, and 125 percent, 150 percent, and 200 percent constrained VBR encoding. The results are shown in Table 3 using the VQM metric, again, where lower scores are better.
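For reference, the constrained VBR settings translate into peak data rates as sketched below. This assumes the usual reading of “N percent constrained VBR,” namely that the encoder may peak at N percent of the target data rate while still averaging the target; the helper name is illustrative.

```python
def vbr_peak_kbps(target_kbps, constraint_percent):
    """Peak data rate allowed when VBR is constrained to a percentage of the target."""
    return target_kbps * constraint_percent / 100

TARGET_KBPS = 2500  # the 720p target used in this test

for percent in (100, 125, 150, 200):  # 100 percent stands in for CBR
    peak = vbr_peak_kbps(TARGET_KBPS, percent)
    print(f"{percent} percent constrained: peak {peak:,.0f}Kbps")
```

At 200 percent, a 2,500Kbps stream can momentarily demand 5,000Kbps, which is why lower constraints are friendlier to adaptive switching.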

    Table 3. CBR vs. VBR encoding

    Cells with red backgrounds indicate the worst quality (highest values with the VQM metric), while cells with a green background indicate the highest quality. The total difference between the highest and lowest quality is shown in the column to the far right.
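The red, green, and difference cells can be reproduced for any row of scores with a few lines of code. The sketch below treats lower scores as better, as with VQM, and reports the difference as a simple absolute value, which may not be how the table expresses it. The scores are hypothetical placeholders, not values from Table 3.

```python
def summarize(scores, lower_is_better=True):
    """Return (best_label, worst_label, delta) for one row of metric scores."""
    pick_best = min if lower_is_better else max
    pick_worst = max if lower_is_better else min
    best = pick_best(scores, key=scores.get)
    worst = pick_worst(scores, key=scores.get)
    # Absolute gap between the worst and best scores in the row.
    delta = abs(scores[worst] - scores[best])
    return best, worst, delta

# Hypothetical VQM scores for a single clip (not taken from Table 3).
row = {
    "CBR 1-pass": 1.40,
    "CBR 2-pass": 1.32,
    "125% VBR": 1.21,
    "150% VBR": 1.20,
    "200% VBR": 1.19,
}
best, worst, delta = summarize(row)
print(f"best: {best}, worst: {worst}, delta: {delta:.2f}")
```

Setting `lower_is_better=False` makes the same helper work for metrics such as SQM or PSNR, where higher scores are better.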

    The table reveals two interesting points. First, CBR always produces the lowest quality, though you can mitigate that in most cases by using two-pass rather than one-pass encoding. Second, 125 percent constrained VBR encoding delivers almost the same quality in all tests as 200 percent constrained VBR, while producing a much more ABR-friendly file. If you do opt for VBR, 125 percent constrained might be the best option.

    Beyond the numbers, CBR files often exhibit several short sections of very low-quality frames, produced when the encoder has to encode high-motion regions at very low data rates to meet the requirements of the restrictive encoding scheme. An example of this is shown back in Figure 1. If you study the figure, you’ll note that the playhead is placed at a section where the quality curves of the CBR and VBR files differ substantially. If you looked at the frames in the video files at this point, you’d see a substantial quality difference between them, even though the average difference might be modest. So not only is the overall quality lower with CBR, but many files will also exhibit one or two areas where the quality differential is substantial and very noticeable, albeit for only a few frames.

    Does that mean you should switch to VBR encoding for your ABR schemes? It’s certainly worth trying, especially since many producers have long used up to 200 percent constrained VBR in their adaptive groups without problems.

    How High Is High Enough?

    The last issue is how high is high enough. Specifically, when you’re producing an adaptive group of files, what’s the optimal maximum target data rate at the top of the group? Obviously, this number is key because it’s the highest bandwidth file and costs the most to deliver.

    Table 4 shows the results of tests performed for a recent consulting client whose adaptive group included three 1080p files, encoded at 5,000Kbps, 6,500Kbps, and 7,500Kbps. Looking at the SQM results, all three rate in the excellent range, with about 0.25 percent difference among them. The VQM scores showed a greater differential, but comparing the files in the Results Visualization screen showed very minor differences throughout. By either measure, the 7,500Kbps file costs 50 percent more to deliver than the 5,000Kbps file but provides no real quality improvement. Perhaps in a premium service the difference might be worth it, but for most other services, 5,000Kbps should be sufficient.
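The cost side of this comparison is simple arithmetic, sketched below. It assumes delivery cost scales linearly with data rate, which is the premise behind the 50 percent figure; the helper name is illustrative.

```python
def percent_change(baseline, value):
    """Percentage change from baseline to value."""
    return (value - baseline) / baseline * 100

BASELINE_KBPS = 5000
for rate_kbps in (6500, 7500):
    extra = percent_change(BASELINE_KBPS, rate_kbps)
    print(f"{rate_kbps:,}Kbps costs {extra:.0f} percent more to deliver than {BASELINE_KBPS:,}Kbps")
```

Running the same helper on the metric scores for each pair of files puts the quality gain in the same units as the bandwidth cost, which makes the mismatch between the two easy to see.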

    Table 4. Choosing the top data rates for your 1080p files

    Final Thoughts

    The basic message is that for most important compression-related decisions, objective quality metrics provide useful data at a fraction of the cost of subjective testing. There are some caveats, of course. First, take the results shown above as examples, not fact. Your results will vary by codec and encoding tool, as well as test clips. Second, never take the numbers completely at face value; before making any key decision, always play and watch the clips in real time.

    Third, choose your test clips carefully. In my view they should represent a good range of both simple and challenging clips. Test clips that are too challenging could push your data rates to levels required only by 1 percent or 2 percent of your video. Certainly you want to know when you’re pushing the quality envelope, but try to test for your entire library, not just the hardest sequences.

    Along these lines, it’s good to articulate the quality of video that you expect to deliver, particularly at the high end. If your goal is to avoid noticeable quality problems in all streams, that’s achievable, but only at very high data rates. Perhaps a better goal is the one that most producers seem to have adopted: accepting that even satellite and cable TV streams get ugly on occasion, and that it’s okay if your video does as well.

    Whatever quality level you seek, there’s no reason to fly blind any longer when it comes to how you tweak your encoding parameters and configure the files in your adaptive group. Objective quality benchmarks provide exceptionally useful data for any producer wanting to substitute fact for untested opinions.

    This article appears in the 2016 Streaming Media Industry Sourcebook.


    Comments (4)

    Andy Hickman
    Said this on 07/01/2016 At 07:30 am

    Jan,

    Thanks for the great article. I've read a fair bit recently about optimising encoding ladders according to content type and codec. What I haven't seen is whether larger companies are optimising encoding ladders according to (a) national/ISP variation in network characteristics for average bandwidth, bandwidth variation and error connections, or (b) the specific device population being targeted (e.g. Android vs iOs vs desktop vs Smart TV) which varies considerably according to the service or region. For (a), I guess this gets done at a crude level - i.e. set your mid profile at a bitrate that is sustainable for a typical broadband connection - but you could imagine using known network statistics to optimise the encoding profile in more sophisticated ways than this, and also varying your encoding ladder from country to country.

    Have you seen any guidance or research on those topics?

    Andy

    Said this on 07/01/2016 At 01:07 pm
    Andy:

    Thanks for your kind words. I think the bulk of the new research has related to customizing the encoding ladder for content, and I think most large publishers already customize by device.

    Not seen any regional ladders, but I'm guessing that Netflix and other international distributors encode differently in each region (but that's a total guess).

    I think most producers assume that the largest sweet spot for sustainable delivery is around 3 mbps, and try to have a high quality stream at that level.

    Hope this helps.

    Jan
    Said this on 07/05/2016 At 03:46 am

    Very interesting info Jan. Much thanks for your dedication and examples. Matt

    Said this on 07/05/2016 At 06:16 am
    Matt:

    I'm glad you found it useful; thanks for sharing.

    Jan
