Feb 23 2016

After [compiling] and running the x265 HEVC encoder, and after [looking at its quality] for animated content, here’s another little piece of information about my experiments with H.265/HEVC. And this time it’s the decoding part. Playing H.265-encoded videos on the PC is relatively easy. On Windows I tend to use [MPC-HC] for this, and on Linux/UNIX you can use [mplayer] or [VLC]. The newest versions of those players are all linked against a modern libav or ffmpeg library collection, so they can decode anything any H.265 encoder can throw at them.

The questions are: At what price? And: What about mobile devices?

H.265/HEVC is costly in terms of computation, and not just in the encoding stage. Decoding this stuff is hard as well. So I looked at two older Core 2 processors to see how they fare when decoding regular 10-bit H.264/AVC and the same content encoded as 10-bit H.265/HEVC, both times at the same bitrate of 3Mbit/s ABR. Again, the marvelous Anime movie “The Garden of Words” was used for this. The video player of my choice for playback was [MPC-HC] v1.7.10, rendering to a VMR9 surface.

On top of that, I can also provide some insight on how relatively modern Android devices will handle this (devices partly without an H.265 hardware decoder chip however!), all thanks to [Umlüx], who’s been willing to install the necessary apps and run some tests! On Android, [MX Player] was used.

For the record, the encoding settings were like this for x264 (pass 1 & pass 2), …

--fps 24000/1001 --preset veryslow --tune animation --open-gop --b-adapt 2 --b-pyramid normal -f -2:0
--bitrate 3000 --aq-mode 1 -p 1 --slow-firstpass --stats v.stats -t 2 --no-fast-pskip --cqm flat

--fps 24000/1001 --preset veryslow --tune animation --open-gop --b-adapt 2 --b-pyramid normal -f -2:0
--bitrate 3000 --aq-mode 1 -p 2 --stats v.stats -t 2 --no-fast-pskip --cqm flat --non-deterministic

…and this for x265:

--y4m -D 10 --fps 24000/1001 -p veryslow --open-gop --bframes 16 --b-pyramid --bitrate 3000 --rect
--amp --aq-mode 3 --no-sao --qcomp 0.75 --no-strong-intra-smoothing --psy-rd 1.6 --psy-rdoq 5.0
--rdoq-level 1 --tu-inter-depth 4 --tu-intra-depth 4 --ctu 32 --max-tu-size 16 --pass 1
--slow-firstpass --stats v.stats --sar 1 --range full

--y4m -D 10 --fps 24000/1001 -p veryslow --open-gop --bframes 16 --b-pyramid --bitrate 3000 --rect
--amp --aq-mode 3 --no-sao --qcomp 0.75 --no-strong-intra-smoothing --psy-rd 1.6 --psy-rdoq 5.0
--rdoq-level 1 --tu-inter-depth 4 --tu-intra-depth 4 --ctu 32 --max-tu-size 16 --pass 2
--stats v.stats --sar 1 --range full
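
If you look closely, the two x265 passes differ only in the pass number and in the extra --slow-firstpass switch of pass 1. Here is a minimal Python sketch that makes this explicit by building both option lists from one shared set. The function name is made up for illustration, and input and output files are left out on purpose, since they aren't part of the settings shown above:

```python
# Options shared by both x265 passes, copied from the settings above.
COMMON = [
    "--y4m", "-D", "10", "--fps", "24000/1001", "-p", "veryslow",
    "--open-gop", "--bframes", "16", "--b-pyramid", "--bitrate", "3000",
    "--rect", "--amp", "--aq-mode", "3", "--no-sao", "--qcomp", "0.75",
    "--no-strong-intra-smoothing", "--psy-rd", "1.6", "--psy-rdoq", "5.0",
    "--rdoq-level", "1", "--tu-inter-depth", "4", "--tu-intra-depth", "4",
    "--ctu", "32", "--max-tu-size", "16",
]

# Options that follow the pass-specific part in both passes.
TAIL = ["--stats", "v.stats", "--sar", "1", "--range", "full"]


def x265_pass_args(pass_number):
    """Return the x265 option list for pass 1 or pass 2 (hypothetical helper)."""
    if pass_number not in (1, 2):
        raise ValueError("only a two-pass encode is modelled here")
    args = COMMON + ["--pass", str(pass_number)]
    if pass_number == 1:
        # Only the first pass carries --slow-firstpass in the settings above.
        args.append("--slow-firstpass")
    return args + TAIL


if __name__ == "__main__":
    for n in (1, 2):
        print(" ".join(x265_pass_args(n)))
```

Both passes read and write the same v.stats file, which is how the second pass gets at the analysis data of the first.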

PC first:

Contender #1 is my old Sony TT45X subnotebook, and this is the processor inside:

CPU-Z on a Core 2 Duo SU9600

A Core 2 Duo SU9600 Penryn, which is a ULV processor at 1.6GHz, shown under all-core load, where it doesn’t boost to 1.83GHz any longer (and yeah, this is still WinXP+POSReady2009).

Contender #2 is my secondary workstation at work, which I recently upgraded with an SSD and a better processor we had lying around. It’s using this chip now:

CPU-Z on a Core 2 Quad Q9505

A Core 2 Quad Q9505 Yorkfield, which has been overclocked to a rock solid 3.4GHz.

Let’s throw some video at them! First, we’ll try some good old H.264 on the slow-clocked Core 2 Duo mobile chip:

Core 2 Duo playing the beginning of "The Garden of Words" as 3Mbit H.264/AVC

The Core 2 Duo SU9600 playing the beginning of “The Garden of Words” as 3Mbit H.264/AVC (on a German version of Windows).

We can see some high load there. Mind you, since this is 10-bit H.264, there is no GPU acceleration whatsoever, minus maybe the bicubic scaler, which is implemented as Pixel Shader 2.0 code. Everything else has to be done by the CPU and its SSE extensions. Let’s take a look at the new H.265 then:

Core 2 Duo playing the beginning of "The Garden of Words" as 3Mbit H.265/HEVC

Same thing as before, but with 3Mbit H.265/HEVC.

The beginning of the movie, which shows only a moving logo on a mostly uniform background and then some static text, plays just fine. You can really see what’s happening on screen when looking at the usage curve above. As soon as the first serious video frame with lots of movement starts to fade in, the bitrate rises and the machine just falls on its face. Realtime playback can no longer be achieved, A/V synchronization is lost and a ton of frames are dropped, resulting in a horrible experience.

Thing is, even with all the frame drops, it’s still losing sync, and even the frames that do get decoded aren’t delivered with the proper timing. It’s just abysmal.
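
To put a number on “realtime”: at 24000/1001 fps the decoder has to finish a frame roughly every 41.7 ms on average, and at 3Mbit/s ABR an average frame carries about 125 kbit. A small Python sketch of that arithmetic follows. It assumes the x264/x265 bitrate value is given in kbit/s with 1 kbit = 1000 bit, and it ignores any buffering in the player, so treat it as a rough budget rather than a hard limit:

```python
from fractions import Fraction

FPS = Fraction(24000, 1001)   # --fps 24000/1001 from the encoder settings
BITRATE_KBIT = 3000           # --bitrate 3000, i.e. 3 Mbit/s on average


def frame_time_budget_ms(fps):
    """Average time available to decode and present one frame, in ms."""
    return float(1000 / fps)


def average_bits_per_frame(bitrate_kbit, fps):
    """Average size of one coded frame in bits at the given average bitrate."""
    return float(bitrate_kbit * 1000 / fps)


if __name__ == "__main__":
    print(f"time budget per frame: {frame_time_budget_ms(FPS):.2f} ms")
    print(f"average bits per frame: {average_bits_per_frame(BITRATE_KBIT, FPS):.0f}")
```

Since this is an average bitrate, busy scenes get far more bits per frame than that, which fits the observation that the load spikes exactly when there is a lot of movement on screen.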

So let’s change our environment to a CPU with twice the cores and roughly twice the clock speed, while staying with the same architecture / instruction set, H.264 first:

Core 2 Quad playing the beginning of "The Garden of Words" as 3Mbit H.264/AVC

The quad-core Q9505 hardly has any trouble with H.264 at all. It’s completely smooth sailing.

And with the harder stuff:

Core 2 Quad playing the beginning of "The Garden of Words" as 3Mbit H.265/HEVC

We can see that the load has roughly doubled with H.265.

Now, H.265 is clearly putting some load even on the quad-core. I’m assuming that with higher bitrates as we would use for 4K/UHD material, this processor might be in trouble. For 1080p and bitrates up to maybe 6Mbit/s it should still be ok however. Of course, with the most modern graphics cards you can give even a Core 2 a companion device which can do H.265 decoding in hardware, like an NVIDIA GPU that can provide you with [PureVideo] feature sets E or even F, or maybe an AMD Radeon with [UVD] level 6. But even then, 10-bit content might not be accelerated, so you need to be careful when choosing your GPU. Intel has recently added 10-bit support to some of their onboard GPUs (HD Graphics 5500 & 6000, Iris Graphics 6100) with driver v15.36.14.4080, NVIDIA added it starting with the Maxwell generation and AMD Radeons still don’t have it as far as I know.

Now, what about mobile devices? A fast enough PC can do H.265/HEVC even without hardware assist, at the very least in 1080p, and likely 4K as well, when we look at modern Core i3/i5/i7 and somewhat comparable AMD Athlon FX chips. How about multicore ARMv7/v8 chips with NEON instruction set extensions?

Let’s look at two Android devices, a Sony Xperia Tablet Z2 (without an H.265 hardware decoder) and a Sony Xperia Z5 phone:

The Tablet features the following CPU:

The Xperia Tablet Z2 uses a Qualcomm Snapdragon 801 quad-core.[1]

That’s not exactly a slow chip, but an out-of-order execution pipeline with NEON extensions at a decent clock rate. And the phone:

The Xperia Z5 has a Qualcomm Snapdragon 810 in big.LITTLE configuration.[1]

So the phone has a very modern 64-bit ARMv8 big.LITTLE CPU setup with four faster out-of-order cores, and four slower, energy-efficient in-order cores. Optimally, all of them should be used as much as possible when throwing a seriously demanding task at the device. Let’s look at how it goes, but first on the older tablet with H.264 for starters:

The Xperia Tablet Z2 playing H.264/AVC.

The Z2 manages to play the H.264/AVC version without any stuttering, but just barely. It has to boost its clock speed all the way to the top to manage.[1]

You’re probably thinking “There’s no way this can work with H.265”, right? Well, you’d be correct:

With H.265, the Xperia Tablet Z2 stumbles.

With 3-Mbit H.265, the Xperia Tablet Z2 stumbles. Full clock speed, all cores under load, no chance to play demanding scenes without massive problems.[1]

It just bombs. Frame drops and stuttering, no way you’d want to watch anything when it goes like that. Some of the calmer scenes (=lower bitrate) still work, but lots of rain all over the frame and the Z2 is done for. So let’s move to the more modern hybrid-core processor of the Sony Xperia Z5 smartphone:

The Z5 playing the 10-bit H.264 on CPU only. The octa-core big.LITTLE processor seems to load its fat cores mostly, which makes sense with demanding workloads.[1]

The big Cortex-A57 cores seem to be doing most of the work here, clocking at their maximum speed. Given we’re over 50% load with the most difficult scenes however, some threads seem to be pushed over to the smaller Cortex-A53 cores as well. In any case, the end result is that H.264 works smoothly throughout the movie. But still, load is high, and our only headroom left are some free, slow in-order cores… So, H.265?
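
The “over 50%” reasoning can be spelled out with a bit of arithmetic. The sketch below assumes that the displayed overall load is a plain, unweighted average across all eight cores (which is my assumption, the monitoring app may well compute it differently). Under that assumption, four fully loaded big cores account for exactly 50%, and anything above that has to land on the little cores:

```python
def min_little_core_load(overall_load, big_cores=4, little_cores=4):
    """Lower bound for the average load on the little cores.

    overall_load is the total load as a fraction (0.0 to 1.0), assumed to be
    an unweighted average over all cores. The big cores can carry at most
    100% each, so whatever exceeds that must run on the little cores.
    """
    total_cores = big_cores + little_cores
    busy_core_equivalents = overall_load * total_cores
    spill = max(0.0, busy_core_equivalents - big_cores)
    return spill / little_cores


if __name__ == "__main__":
    for load in (0.5, 0.6, 0.75):
        share = min_little_core_load(load)
        print(f"overall {load:.0%} -> little cores at least {share:.0%} busy")
```

At 60% overall load, for example, the four Cortex-A53 cores would have to be at least 20% busy on average, even with all four Cortex-A57 cores completely maxed out.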

The Z5 fails with 10-bit H.265 too. Where is my hardware decoder?!

The Z5 fails as well. While it can cope a little better, scenes like the above are just too much. Where is my hardware decoder?![1]

Something strange is happening now: H.265/HEVC hardware decoders should support this 10-bit H.265 file just fine. But for some reason, MX Player still falls back to software decoding, despite the player being up-to-date – something even a CPU as powerful as a Snapdragon 810 cannot handle for all parts of “The Garden of Words” at the given settings. The really demanding scenes will still fail to play back decently. CPU clock is also lowered a bit now, probably because of power and/or heat management.

My assumption would be that MX Player simply doesn’t support the HW decoder in this phone yet, which is a shame if it’s true. Another reason might be that I used some parameters and/or features of H.265 in this encode, that are not implemented in this chip. Whichever the case may be, the Snapdragon 810 alone cannot handle it either!

Update: After further research it turns out that almost no hardware accelerator supports the Main 10 profile of H.265 (the counterpart to H.264’s Hi10P). In other words: No 10-bit decoding for H.265! It’s possible on NVIDIA’s Tegra K1 & X1 due to some CUDA hacks in MX Player, but nowhere else, it seems. The upcoming Snapdragon 820 should support it however, and devices based on it should become available around March 2016:

Snapdragon 820s' Advancements

This is what Snapdragon 820 (MSM8996) should give us over Snapdragon 810 (MSM8994), including 10-bit H.265/HEVC decoding in hardware ([source], in Russian)!

And this concludes my little performance analysis, after which I can say that either you need a relatively ok PC to use H.265, or a hardware decoder chip that works with your files and your software, if you’re targeting other platforms like hardware players or smartphones and tablets!

PS.: Thanks fly out to Umlüx for doing the Android tests!

[1] Images are © 2016 Umlüx, used with express permission.

Jan 05 2015

DirectX logo[1]

Quite some time ago – I was probably playing around with some DirectShow audio and video codec packs on my Windows system – I hit a wall in my favorite media player, which is [Mediaplayer Classic Home Cinema], in its 64-Bit version to be precise. I love the player, because it has its own built-in splitters and decoders, it’s small, light-weight, and it can actually use DXVA1 video acceleration on Windows XP / XP x64 with AMD/ATi and nVidia graphics cards. So yeah, Blu-Ray playback with very little CPU load is possible as long as you deal with the AACS encryption layer properly. Or decode files from your hard disk instead. But I’m wandering from the subject here.

One day, I launched my 64-Bit MPC-HC, tried to decode a new Blu-Ray movie I got, and all of a sudden, this:

MPC-HC "Failed to query the needed interfaces for playback"

MPC-HC failing to render any file – whether video or audio, showing only a cryptic error message

I tried to get to the bottom of this for weeks. Months later I tried again, but I just couldn’t solve it. A lack of usable debug modes and log files didn’t help either. Also, I failed to properly understand the error message “Failed to query the needed interfaces for playback”. The main reason for my failure was that I thought MPC-HC had it all built in – container splitters, A/V decoders, etc. But still, the 64-Bit version failed. Interestingly, the 32-Bit version still worked fine, on XP x64 in this specific case. Today, while trying to help another guy on the web who had issues with his A/V decoding using the K-Lite codec pack, I launched Microsoft’s excellent [GraphEdit] tool to build a filter graph to show him how to debug codec problems with Microsoft’s DirectShow system. You can download the tool easily [here]. It can visualize the entire stack of system-wide DirectShow splitters and decoders on Windows, and can thus help you understand how this shit really works. And debug it.

Naturally, I launched the 32-Bit version, as I’ve been working with 32-Bit A/V tools exclusively since that little incident above – minus the x264 encoder maybe, which has its own 64-Bit libav built in. Out of curiosity, I started the 64-Bit version of GraphEdit, and was greeted with this:

GraphEdit x64 failure due to broken DirectShow core

“DirectShow core components failed to initialize.”, eh? Now this piqued my interest. Immediately the MPC-HC problem from a few years ago came to my mind, and I am still using the very same system today. So I had an additional piece of information now, which I used to search the web for solutions. Interestingly, I found that this is linked to the entire DirectShow subsystem being de-registered and thus disabled on the system. I had mostly found people who had this problem for the 32-Bit DirectShow core on 64-Bit Windows 7. Also, I learned the name of the DirectShow core library: quartz.dll.

On any 64-Bit system, the 32-Bit version of this library would sit in %WINDIR%\SysWOW64\ with its 64-Bit sibling residing in %WINDIR%\system32\. I thought: What if I just tried to register the core and see what happens? So, with my 64-Bit DirectShow core broken, I just opened a shell with an administrative account, went to %WINDIR%\system32\ and ran regsvr32.exe quartz.dll. And indeed, the library wasn’t registered/loaded. See here:

Re-registering quartz.dll

Re-registering the 64-Bit version of quartz.dll
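
Since the folder names are a bit counter-intuitive (64-Bit stuff lives in system32, 32-Bit stuff in SysWOW64), here is a small Python sketch that just builds the matching regsvr32 command for each bitness without running anything. The default Windows directory and the function name are assumptions for illustration only. Each command pairs the regsvr32.exe and the quartz.dll from the same folder, and would have to be run from an administrative shell on a 64-Bit Windows:

```python
import ntpath  # Windows path handling, available on every platform


def quartz_register_commands(windir=r"C:\Windows"):
    """Build (not run) the regsvr32 commands for both quartz.dll variants.

    windir defaults to a typical install location, which is an assumption;
    on a real system it would come from the %WINDIR% environment variable.
    """
    commands = {}
    for bitness, subdir in (("64-bit", "system32"), ("32-bit", "SysWOW64")):
        folder = ntpath.join(windir, subdir)
        commands[bitness] = [
            ntpath.join(folder, "regsvr32.exe"),  # registrar of that bitness
            ntpath.join(folder, "quartz.dll"),    # DirectShow core library
        ]
    return commands


if __name__ == "__main__":
    for bitness, command in quartz_register_commands().items():
        print(bitness, "->", " ".join(command))
```

In my case only the 64-Bit variant was broken, so only the system32 command was needed.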

Fascinating, I thought. Now I don’t know what kind of shit software would disable my entire 64-Bit DirectShow subsystem. Maybe one of those smart little codec packs that usually bring more problems than solutions with them? Maybe it was something else I did to my system? I wouldn’t know what, but it’s not like I can remember everything I did to my system’s DLLs. Now, let’s try to launch the 64-Bit version of GraphEdit again, with a 64-Bit version of [LAVfilters] installed. That’s basically [libav] on Windows with a DirectShow layer wrapped around it and a nice installer shipped with it. YEAH, it’s a codec pack alright. But in my opinion, libav and ffmpeg are the ones to be trusted, just like on Linux and UNIX too. And Android. And iOS. And OSX. Blah. Here we go:

64-Bit GraphEdit being fed the Open Source movie “Big Buck Bunny”

And all of a sudden, GraphEdit launches just fine, and presents us with a properly working filter graph after having loaded the movie [Big Buck Bunny] in its 4k/UHD version. We can see the container being split, video and audio streams being picked up by the pins of the respective libav decoders, which feed the streams to a video renderer and a DirectSound output device – which happens to be an X-Fi card in this case. All pins are connected just fine, so this looks good. So what does it look like in MPC-HC now? Like this (warning – large image, 5MB+, this will take time to load from my server!):

64-Bit MPC-HC playing the 4k/UHD version of “Big Buck Bunny”

So there we go. It seems MPC-HC does rely on DirectShow after all, at least in that it tries to initialize the subsystem. It can after all also use external filters, or in other words system-wide DirectShow codecs too, where its internal codec suite wouldn’t suffice, if that’s ever the case. So MPC-HC seems to want to talk to DirectShow at play time in any case, even if you haven’t even allowed it to use any external filters/codecs. Maybe its internal codec pack even is DirectShow-based too? And if DirectShow is simply not there, it won’t play anything. At all. And I failed to solve this for years. And today I launch one program in a different bitness than usual, and 5 minutes later, everything works again. For all it takes is something like this:

regsvr32.exe %WINDIR%\system32\quartz.dll

Things can be so easy sometimes, if only you know what the fuck really happened…

The importance and significance of error handling and reporting to the user can never be overstated!

But hey! Where there is a shell, there is a way, right? Even on Windows. ;)

[1] © Siegel, K. “DirectX 9 logo design“. Kamal Siegel’s Artwork.