Jan 05 2015

DirectX logo[1]

Quite some time ago – I was probably playing around with some DirectShow audio and video codec packs on my Windows system – I hit a wall in my favorite media player, which is [Mediaplayer Classic Home Cinema], in its 64-Bit version to be precise. I love the player because it has its own built-in splitters and decoders, it’s small and light-weight, and it can actually use DXVA1 video acceleration on Windows XP / XP x64 with AMD/ATi and nVidia graphics cards. So yeah, Blu-Ray playback with very little CPU load is possible as long as you deal with the AACS encryption layer properly. Or decode files from your hard disk instead. But I’m wandering from the subject here.

One day, I launched my 64-Bit MPC-HC, tried to decode a new Blu-Ray movie I got, and all of a sudden, this:

MPC-HC "Failed to query the needed interfaces for playback"

MPC-HC failing to render any file – whether video or audio, showing only a cryptic error message

I tried to get to the bottom of this for weeks. Months later I tried again, but I just couldn’t solve it. A lack of usable debug modes and log files didn’t help either. Also, I failed to properly understand the error message “Failed to query the needed interfaces for playback”. The main reason for my failure was that I thought MPC-HC had it all built in – container splitters, A/V decoders, etc. But still, the 64-Bit version failed. Interestingly, the 32-Bit version still worked fine, on XP x64 in this specific case. Today, while trying to help another guy on the web who had issues with his A/V decoding using the K-Lite codec pack, I launched Microsoft’s excellent [GraphEdit] tool to build a filter graph to show him how to debug codec problems with Microsoft’s DirectShow system. You can download the tool easily [here]. It can visualize the entire stack of system-wide DirectShow splitters and decoders on Windows, and can thus help you understand how this shit really works. And debug it.

Naturally, I launched the 32-Bit version, as I’ve been working with 32-Bit A/V tools exclusively since that little incident above – minus the x264 encoder maybe, which has its own 64-Bit libav built in. Out of curiosity, I started the 64-Bit version of GraphEdit, and was greeted with this:

GraphEdit x64 failure due to broken DirectShow core

“DirectShow core components failed to initialize.”, eh? Now this piqued my interest. Immediately the MPC-HC problem from a few years ago came to my mind, and I am still using the very same system today. So I now had an additional piece of information to search the web for solutions with. Interestingly, I found that this is linked to the entire DirectShow subsystem being de-registered and thus disabled on the system. I had mostly found people who had this problem with the 32-Bit DirectShow core on 64-Bit Windows 7. Also, I learned the name of the DirectShow core library: quartz.dll.


On any 64-Bit system, the 32-Bit version of this library would sit in %WINDIR%\SysWOW64\ with its 64-Bit sibling residing in %WINDIR%\system32\. I thought: What if I just tried to register the core and see what happens? So, with my 64-Bit DirectShow core broken, I just opened a shell with an administrative account, went to %WINDIR%\system32\ and ran regsvr32.exe quartz.dll. And indeed, the library wasn’t registered/loaded. See here:

Re-registering quartz.dll

Re-registering the 64-Bit version of quartz.dll (click to enlarge)

Fascinating, I thought. Now I don’t know what kind of shit software would disable my entire 64-Bit DirectShow subsystem. Maybe one of those smart little codec packs that usually bring more problems than solutions with them? Maybe it was something else I did to my system? I wouldn’t know what, but it’s not like I can remember everything I did to my systems’ DLLs. Now, let’s try to launch the 64-Bit version of GraphEdit again, with a 64-Bit version of [LAVfilters] installed. That’s basically [libav] on Windows with a DirectShow layer wrapped around and a nice installer shipped with it. YEAH, it’s a codec pack alright. But in my opinion, libav and ffmpeg are the ones to be trusted, just like on Linux and UNIX too. And Android. And iOS. And OSX. Blah. Here we go:

64-Bit GraphEdit being fed the Open Source movie “Big Buck Bunny” (click to enlarge)

And all of a sudden, GraphEdit launches just fine, and presents us a properly working filter graph after having loaded the movie [Big Buck Bunny] in its 4k/UHD version. We can see the container being split, video and audio streams being picked up by the pins of the respective libav decoders, which feed the streams to a video renderer and a DirectSound output device – which happens to be an X-Fi card in this case. All pins are connected just fine, so this looks good. So what does it look like in MPC-HC now? Like this (warning – large image, 5MB+, this will take time to load from my server!):

64-Bit MPC-HC playing the 4k/UHD version of “Big Buck Bunny” (click to enlarge)

So there we go. It seems MPC-HC does rely on DirectShow after all, at least in that it tries to initialize the subsystem. It can after all also use external filters, or in other words system-wide DirectShow codecs too, where its internal codec suite wouldn’t suffice, if that’s ever the case. So MPC-HC seems to want to talk to DirectShow at play time in any case, even if you haven’t even allowed it to use any external filters/codecs. Maybe its internal codec pack even is DirectShow-based too? And if DirectShow is simply not there, it won’t play anything. At all. And I failed to solve this for years. And today I launch one program in a different bitness than usual, and 5 minutes later, everything works again. For all it takes is something like this:

regsvr32.exe %WINDIR%\system32\quartz.dll
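
Since the 32-Bit and the 64-Bit DirectShow core can each be broken on their own, it may be handy to have both registration commands at hand. Here is a minimal Python sketch that only builds and prints them. The function name is made up for illustration, and the folder layout is the one described above (64-Bit files in system32, 32-Bit files in SysWOW64). The printed commands would still have to be run from an administrative shell.

```python
import ntpath  # Windows path handling that also works when run on other systems


def quartz_register_commands(windir):
    """Return one regsvr32 command per bitness for quartz.dll.

    Hypothetical helper for illustration: on a 64-Bit Windows the 64-Bit
    files live in System32 and the 32-Bit files in SysWOW64, and each
    folder carries its own regsvr32.exe.
    """
    commands = []
    for subdir in ("System32", "SysWOW64"):
        folder = ntpath.join(windir, subdir)
        commands.append([
            ntpath.join(folder, "regsvr32.exe"),
            ntpath.join(folder, "quartz.dll"),
        ])
    return commands


# Print the commands instead of running them, so nothing is changed by accident.
for command in quartz_register_commands(r"C:\Windows"):
    print(" ".join(command))
```

Printing instead of executing is a deliberate choice here: registering system DLLs is something you want to do consciously, one bitness at a time.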

Things can be so easy sometimes, if only you know what the fuck really happened…

The importance and significance of error handling and reporting to the user can never be overstated!

But hey! Where there is a shell, there is a way, right? Even on Windows. ;)

[1] © Siegel, K. “DirectX 9 logo design“. Kamal Siegel’s Artwork.

Jan 23 2014

Tulsa logo

Over the past few years, my [x264 benchmark] has been honored to accept results from many an exotic system. Amongst these are some of the weirder x86 CPUs like a Transmeta Efficēon, a cacheless Intel Celeron that only exists in Asia, and even my good old 486 DX4-S/100 which needed almost nine months to complete what modern boxes do in 1-2 hours. Plus the more exotic ones like the VLIW architecture Intel Itanium² or some ARM RISC chips, one of them sitting on a Raspberry Pi. Also, PowerPC, a MIPS-style Chinese 龙芯, or Loongson-2f as we call it, and so on and so forth.

There is however one chip that we’ve been hunting for years now, and never got a hold of. The Intel TULSA. A behemoth, just like the [golden driller] standing in the city that gave the chip its name. Sure, the Pentium 4 / Netburst era wasn’t the best for Intel, and the architecture was the laughingstock of all AMD users of that time. Some of the cores weren’t actually that bad though, and Tulsa is a particularly mad piece of technology.

Tulisa Contostavlos

Tulisa? That you?

Ehm… I said Tulsa, not Tulisa, come on guys, stay focused here! A processor, silicon and stuff (not silicone, fellas).

Xeon 7140M "Tulsa"

An Intel Xeon 7140M “Tulsa” (photograph kindly provided by Thomsen-XE)

Now that’s more like it right there! People seem to agree that the first native x86 dual core was built by Intel and that it was the Core 2. Which is wrong. It wasn’t. It was a hilarious 150W TDP Netburst Monster weighing almost 1.33 billion transistors with up to 16MB of Level 3 cache, Hyperthreading and an unusually high clock speed for a top-end server processor. The FSB800 16MB L3 Xeon MP 7140M part we’re seeing here clocks at 3.4GHz, which is pretty high even for a single core desktop Pentium 4. There also was an FSB667 part called Xeon MP 7150N clocking at 3.5GHz. Only that here we have 2 cores with HT and a metric ton of cache!

These things can run on quad sockets. Meaning a total of 8 cores and 16 threads, as seen on some models of the HP ProLiant DL580 G4. Plus, they’re x86_64 chips too, so they can run 64-Bit operating systems.

Tulsa die shot

Best Tulsa die shot I could find. To the right you can see the massive 16MB L3 cache. There is also 2 x 1MB L2.

And the core point: They’re rare. Extremely rare, especially in the maxed-out configuration of four processors. And I want them tested, as real results are scarce and almost nowhere to be found. Also, Thomsen-XE (who took that photograph of a 7140M up there) wants to see them show off! We have been searching for so long, and missed two guys with corresponding machines by such a narrow margin already!

We want the mightiest of all Netbursts and Intel’s first native dual core processor to finally show its teeth and prove that with enough brute force, it can even kill the Core 2 micro-architecture (as long as you have your own power plant, that is)!

So now, I’m asking you to please tell us in the comments whether you have or have access to such a machine and if you would agree to run the completely free x264 benchmark on that system. Windows would be nice for a reference x264 result, but don’t mind the operating system too much. Linux and most flavors of UNIX will do the job too! Guides for multiple operating systems are readily available at the bottom of the results list in [English] as well as [German].

If anyone can help us out, that’d be awesome! Your result will of course be published under your name, and there will be a big thank you here for you!

And don’t forget to say bye bye to Tulisa:

Tulisa Contostavlos #1

Well, thanks for your visit, Miss Contostavlos, but TULSA is the #1 we seek today!

Update: According to a [comment] by Sjaak Trekhaak, my statements that Tulsa was Intel’s first native dual core were false. There were others with release dates before Tulsa, like the first Core Duo or the smaller Netburst-based Xeons with Paxville DP core, as you can also see in my reply to Sjaak’s comment. Thus, the strike-through parts in the above text.

Jun 13 2013

Wine Logo

Sure there are ways to compile the components of my x264 benchmark on [almost any platform]. But you never get the “reference” version of it. The one originally published for Microsoft Windows and the one really usable for direct comparisons. A while back I tried to run that Windows version on Linux using [Wine], but it wouldn’t work because it needs a shell. It never occurred to me that I could maybe just copy over a real cmd.exe from an actual Windows. A colleague looked it up in the Wine AppDB, and it seems cmd.exe only has [bronze support status] as of Wine version 1.3.35, suggesting some major problems with the shell.

Nevertheless, I just tried using my Wine 1.4.1 on CentOS 6.3 Linux, and it seems support has improved drastically. All cmd.exe shell builtins seem to work nicely. It was just a few tools that didn’t like Wine’s userspace Windows API, especially timethis.exe, which also had problems talking to ReactOS. I guess it wants something from the Windows NT kernel API that Wine cannot provide in its userspace reimplementation.

But: You can make cmd.exe just run one subcommand and then terminate using the following syntax:

cmd.exe /c <command to run including switches>

Just prepend the Unix time command plus the wine invocation and you’ll get a single Windows command (or batch script) run within cmd.exe on Wine, and get the runtime out of it at the end. Somewhat like this:

time wine cmd.exe /c <command to run including switches>
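
If you want the runtime as a number inside a script instead of reading it off the terminal, the same idea can be sketched in Python. This is only an illustration under assumptions: the helper names are made up, and it measures wall clock time only, whereas the Unix time command also reports user and system time.

```python
import subprocess
import sys
import time


def build_wine_command(windows_command):
    """Wrap a Windows command (list of arguments) into 'wine cmd.exe /c ...'."""
    return ["wine", "cmd.exe", "/c", *windows_command]


def timed_run(argv):
    """Run a command and return (exit code, wall clock seconds).

    Hypothetical helper: it mimics only the 'real' line of the time command.
    """
    start = time.perf_counter()
    result = subprocess.run(argv)
    elapsed = time.perf_counter() - start
    # Like time, report the measurement on stderr so it does not mix with stdout.
    print(f"real {elapsed:.3f}s", file=sys.stderr)
    return result.returncode, elapsed


# Usage, assuming Wine is installed and benchmark.bat is your batch script:
#   timed_run(build_wine_command(["benchmark.bat"]))
```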

Easy enough, right? So does this work with the Win32 version of x264? Look for yourself:

So as you can see it does work. It runs, it detects all instruction set extensions (SSE…) just as if it was 100% native, and as you can see from the htop and Linux system monitor screens, it utilizes all four CPU cores or all eight threads / logical CPUs to be more precise. By now this runs at around 3fps+ on a Core i7 950, so I assume it’s slower than on native Windows.

Actually, the benchmark publication itself currently knows several flags for making results “not reference / not comparable”. One is the flag for custom x264 versions / compilations, one is for virtualized systems and one for systems below minimum specifications. The Wine on Linux setup wouldn’t fit into any of those. Definitely not a custom version, running on a machine that satisfies my minimum system specs, leaving the VM stuff to debate. Wine is by definition a runtime environment, not an emulator, not a VM hypervisor or paravirtualizer. It just reimplements the Win32/64 API, mapping certain function calls to real Linux libraries or (where the user configures it as such) to real Microsoft or 3rd party DLLs copied over. That’s not emulation. But it’s not quite the same as running on native Windows either.

I haven’t fully decided yet, but I think I will mark those results as “green” in the [results list], extending the meaning of that flag from virtual machines to virtual machines AND Wine, otherwise it doesn’t quite seem right.


Mar 22 2013

AnyDVD logo

Since I’m very much into Blu-Ray processing/transcoding, I have been using [AnyDVD HD] from Slysoft (they became famous for their CloneCD product). At first I just tried the software, but liked it enough to actually buy a lifetime license. Since then, support for the product has been great, with regular updates bringing the latest AACS keys and support for various other “standards” in the industry like BD+, CSS, Sony ARccOS and so forth.

Also I like this product, because it actually comes with support for a wide range of Windows operating systems including my beloved Windows XP Professional x64 Edition. This is quite nice considering that AnyDVD HD actually requires a kernel driver, so it supports NT 5.1 (XP), NT5.2 (XP x64 / Server 2003) and also the more modern NT 6.0 (Vista / Server 2008), NT 6.1 (Win7 / Server 2008 R2) and NT 6.2 (Win8 / Server 2012).

But with the latest version, which just came out, they really blew my mind. See the release notes for yourself, I’ve already marked the important part for you (2013-03-22):
– New (Blu-ray): Support for new discs
– New (DVD): Support for new discs
– New: Added Cinavia fix for PowerDVD 12.0.2625.57
– New: Rip to image sparse file creation is now optional
– New: Added dialog, if settings change require a restart
– Change: Restored Windows 2000 compatibility
– Fix: Disabling Cinavia detection didn’t work with ArcSoft TMT
– Fix: Some compatibility problems with disabling Cinavia detection
– Fix: Setup hung, if machine was running on battery power
– Fix (Blu-ray): Hang with some discs during logfile creation
– Fix (Blu-ray): Incorrect handling of some discs
– Updated languages
– Some minor fixes and improvements

That’s right, it’s fucking 2013 and SlySoft is bringing back NT 5.0 (Windows 2000) support for AnyDVD HD! Without having any negative impact on the product on more modern Windows operating systems of course. Now THAT’S how I expect good software development to work! Good job guys. That’s exactly the stuff that’ll not just make me continue to use AnyDVD HD, but which is also going to make me recommend it to other people, as I already have in the past. I rarely choose to actually buy commercial software instead of just using free alternatives, but this particular piece has been so worth it!

Instead of discontinuing legacy operating system support, Slysoft is actively working towards supporting as many NT systems as they possibly can. Good job, I say!

Mar 06 2013

Stereoscopy logo

Now if you decide to read this post, you’ll be going through a lot of technical crap again, so to make it easier for you, click on the post’s logo to the left. There you get an interesting picture to look at if you get bored by me talking, and if you apply the cross-eye technique (although you’ll have to cross your eyes up to a level where it might be a bit of a strain, just try it until you get a “lock” on the “middle” picture), you’ll see the image in stereoscopic 3D. Yay, boobies in stereoscopic 3D, not real ones though, very unfortunate. Better than nothing though, eh?

Now, where was I? Ah yes, stereoscopy. These days it seems to be a big thing, not just in movie theatres, but also for the home user with 3D Blu-Rays and 3D HDTVs. I’m quite into transcoding Blu-Rays using the x264 encoder as well as a bunch of other helper tools, and I’m archiving all my discs to nice little MKV files, which I prefer to having to actually insert the BD and deal with all the DRM crap like AACS encryption and BD+. Ew.

But, with 3D, there is an entirely new challenge to master. On a Blu-Ray, 3D information is stored as a so-called MVC stream, which is an extension to H.264/AVC. That also means there is no Microsoft codec involved anymore, so there is no VC-1 3D. Only H.264. This extension is for the right eye and works roughly like a diff stream, or like the lossless extension that DTS-HD Master Audio stacks on top of its lossy DTS core. So while the full 2D left eye stream may be 10GB in size, the right eye one (as it only contains parts that are different from the left eye stream) will be maybe around 5GB. Usually, stereoscopic 3D works by giving you one image for your left and one for your right eye, while filtering the left eye stream for the right eye (so the right eye never sees the left eye stream) and vice-versa. There are slight differences in the streams, which trick your brain into interpreting the information presented to it as 3D depth.

A cheap version of “filtering” is the cross-eye technique, which works best for plain images like the logo above, which is a so-called side-by-side 3D image. Side-by-side is cheap, easy, and for movies it’s very compatible, because pretty much any 3D screen supports it. Also, there are free tools available to work with SBS 3D. But before SBS 3D can be created, the format used on the Blu-Ray (which is not SBS but “frame packing”, see my description of MVC above) needs to be decoded first. Since the MVC right eye stream is a diff and not a complete stream, the complete SBS right eye stream needs to be constructed from the MVC and the left-eye H.264/AVC first. That can be done using certain plugins in the AviSynth frameserver under Windows. One of those plugins is H264StereoSource, which offers the function H264StereoSource().

Using this, I successfully transcoded Prometheus 3D. However, that was the only success ever. I tried it again as I got new 3D Blu-Rays, but all of a sudden, this (click to enlarge):

x264 3D Crash

As you can see in the shell window, there is lots of weird crap instead of regular x264 output. At some point, rather sooner than later, x264 would just crash. Taking whatever output it has generated by that time, you get something that looks like this (click to enlarge):

H264StereoSource() failure

Clearly, this is bad. As an SBS, the right frame should look awfully familiar, almost mirroring the left one. As you can see, this looks awfully not very much like anything. Fragments of the shapes of the left eye stream can be recognized in the right eye one when the movie plays, but it’s totally garbled and a bit too.. well… green. Whatever. H264StereoSource() fucked up big time, so I had to look into another solution. Luckily, [Sharc], a [Doom9] user pointed me in the right direction by mentioning [BD3D2MK3D].

This is yet another GUI with some of the tools I already use in the background plus some additional ones, most significantly another AviSynth plugin called SSIFSource2.dll, offering the function ssifSource2(). That one reads an SSIF container (which is H.264/AVC and MVC linked together) directly, and can not only decode it, but also generate proper left/right eye output from it directly.

Now first, this is what my old AviSynth script looked like, the now broken one:

  LoadPlugin("C:\Program Files (x86)\VFX\AviSynth 2.5\plugins\DGAVCDecode.dll")
  LoadPlugin("C:\Program Files (x86)\VFX\BDtoAVCHD\H264StereoSource\H264StereoSource.dll")
  left = AVCSource("D:\left.dga")
  right = H264StereoSource("D:\crash.cfg", 132566).AssumeFPS(24000,1001)
  return StackHorizontal(HorizontalReduceBy2(left), HorizontalReduceBy2(right)).Trim(0, 132516)

This loads the H.264/AVC decoder and the stereo plugin, reads the left-eye stream index file (a dga created by DGAVCIndex), and loads a configuration file that defines both streams as they were demuxed using eac3to. Then it returns a shrunk version of the output (compressing the 3840×1080 full SBS stream to 1920×1080 half SBS, which most TVs can handle), trimming the last 50 frames away to work around a frame alignment bug between decoder and encoder. The cfg file looks somewhat like this:

  InputFile = "D:\left.h264"
  InputFile2 = "D:\right.h264"
  FileFormat = 0
  POCScale = 1
  DisplayDecParams = 1
  ConcealMode = 0
  RefPOCGap = 2
  POCGap = 2
  IntraProfileDeblocking = 1
  DecFrmNum = 0

By playing around with it I found out that it was truly the right-eye MVC decoding that no longer worked. H264StereoSource() was just broken, so ssifSource2() then! BD3D2MK3D will allow you to use its GUI to analyze a decrypted 3D Blu-Ray folder structure, and generate an AviSynth script for you. There was only one part that did not work for me, and that was the determination of the frame count. But there’s an easy fix. The AviSynth script that BD3D2MK3D will generate will look somewhat like this:

  # Avisynth script generated Wed Mar 06 18:07:23 CET 2013 by BD3D2MK3D v0.13
  # to convert "D:\0_ADVDWORKDIR\My Movie\3D\MYMOVIE\BDMV\STREAM\SSIF\00001.ssif"
  # (referenced by playlist 00001.mpls)
  # to Half Side by Side, Left first.
  # Movie title: My Movie 3D
  #
  # Source MPLS information:
  # MPLS file: 00001.mpls
  # 1: Chapters, 16 chapters
  # 2: h264/AVC  (left eye), 1080p, fps: 24 /1.001 (16:9)
  # 3: h264/AVC (right eye), 1080p, fps: 24 /1.001 (16:9)
  # 4: DTS Master Audio, English, 7.1 channels, 16 bits, 48kHz
  # (core: DTS, 5.1 channels, 16 bits, 1509kbps, 48kHz)
  # 5: AC3, English, 5.1 channels, 640kbps, 48kHz, dialnorm: 28dB
  # 6: DTS Master Audio, German, 5.1 channels, 16 bits, 48kHz
  # (core: DTS, 5.1 channels, 16 bits, 1509kbps, 48kHz)
  # 7: AC3, Turkish, 5.1 channels, 640kbps, 48kHz
  # 8: Subtitle (PGS), English
  # 9: Subtitle (PGS), German
  # 10: Subtitle (PGS), Turkish

  #LoadPlugin("C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe\DirectShowMVCSource.dll")
  ## Alt method using ssifSource2 (for SBS or T&B only):
  LoadPlugin("C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe\ssifSource2.dll")

  SIDEBYSIDE_L     = 14 # Half Side by Side, Left first
  SIDEBYSIDE_R     = 13 # Half Side by Side, Right first
  OVERUNDER_L      = 16 # Half Top/Bottom, Left first
  OVERUNDER_R      = 15 # Half Top/Bottom, Right first
  SIDEBYSIDE_L     = 14 # Full Side by Side, Left first
  SIDEBYSIDE_R     = 13 # Full Side by Side, Right first
  OVERUNDER_L      = 16 # Full Top/Bottom, Left first
  OVERUNDER_R      = 15 # Full Top/Bottom, Right first
  ROWINTERLEAVED_L = 18 # Row interleaved, Left first
  ROWINTERLEAVED_R = 17 # Row interleaved, Right first
  COLINTERLEAVED_L = 20 # Column interleaved, Left first
  COLINTERLEAVED_R = 19 # Column interleaved, Right first
  OPTANA_RC        = 45 # Anaglyph: Optimised, Red - Cyan
  OPTANA_CR        = 46 # Anaglyph: Optimised, Cyan - Red
  OPTANA_YB        = 47 # Anaglyph: Optimised, Yellow - Blue
  OPTANA_BY        = 48 # Anaglyph: Optimised, Blue - Yellow
  OPTANA_GM        = 49 # Anaglyph: Optimised, Green - Magenta
  OPTANA_MG        = 50 # Anaglyph: Optimised, Magenta - Green
  COLORANA_RC      = 39 # Anaglyph: Colour, Red - Cyan
  COLORANA_CR      = 40 # Anaglyph: Colour, Cyan - Red
  COLORANA_YB      = 41 # Anaglyph: Colour, Yellow - Blue
  COLORANA_BY      = 42 # Anaglyph: Colour, Blue - Yellow
  COLORANA_GM      = 43 # Anaglyph: Colour, Green - Magenta
  COLORANA_MG      = 44 # Anaglyph: Colour, Magenta - Green
  SHADEANA_RC      = 33 # Anaglyph: Half-colour, Red - Cyan
  SHADEANA_CR      = 34 # Anaglyph: Half-colour, Cyan - Red
  SHADEANA_YB      = 35 # Anaglyph: Half-colour, Yellow - Blue
  SHADEANA_BY      = 36 # Anaglyph: Half-colour, Blue - Yellow
  SHADEANA_GM      = 37 # Anaglyph: Half-colour, Green - Magenta
  SHADEANA_MG      = 38 # Anaglyph: Half-colour, Magenta - Green
  GREYANA_RC       = 27 # Anaglyph: Grey, Red - Cyan
  GREYANA_CR       = 28 # Anaglyph: Grey, Cyan - Red
  GREYANA_YB       = 29 # Anaglyph: Grey, Yellow - Blue
  GREYANA_BY       = 30 # Anaglyph: Grey, Blue - Yellow
  GREYANA_GM       = 31 # Anaglyph: Grey, Green - Magenta
  GREYANA_MG       = 32 # Anaglyph: Grey, Magenta - Green
  PUREANA_RB       = 23 # Anaglyph: Pure, Red - Blue
  PUREANA_BR       = 24 # Anaglyph: Pure, Blue - Red
  PUREANA_RG       = 25 # Anaglyph: Pure, Red - Green
  PUREANA_GR       = 26 # Anaglyph: Pure, Green - Red
  MONOSCOPIC_L     = 1 # Monoscopic 2D: Left only
  MONOSCOPIC_R     = 2 # Monoscopic 2D: Right only

  # Load main video (0 frames)
  #DirectShowMVCSource("D:\0_ADVDWORKDIR\My Movie\3D\MYMOVIE\BDMV\STREAM\SSIF\00001.ssif", seek=false, seekzero=true, stf=SIDEBYSIDE_L)
  ## Alt method using ssifSource2
  ssifSource2("D:\0_ADVDWORKDIR\My Movie\3D\MYMOVIE\BDMV\STREAM\SSIF\00001.ssif", 153734, left_view = true, right_view = true, horizontal_stack = true)

  AssumeFPS(24000,1001)
  # Resize is necessary only for Half-SBS or Half-Top/Bottom,
  # or when the option to Resize to 720p is on.
  LanczosResize(1920, 1080)
  # Anaglyph and Interleaved modes are in RGB32, and must be converted to YV12.
  #ConvertToYV12(matrix="PC.709")

So here, when ssifSource2() is being called, there is a number “153734”, where initially there will only be “0”. I got the run time and frame rate of the movie from the file eac3to_demux.log that BD3D2MK3D generates. Using that, you can easily calculate your total frame number to enter there (run time in seconds multiplied by the frame rate), like I have already done in this example. Also enter the frame rate in the AssumeFPS() function. Here, 24000,1001 means 24000/1001 fps, or in other words roughly 23.976fps, which is the NTSC film frame rate. Now, the tricky part: For this function to operate, the libraries and binaries of BD3D2MK3D must be in the system’s search path and our x264 encoder must sit in the same directory as the libs. You will also want to make sure you’re using a 32-Bit version of x264. In our example, the installation path of BD3D2MK3D is C:\Program Files (x86)\VFX\BD3D2MK3D\, and the tools and libs are in C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe\. So, open a cmd shell, copy your preferred 32-bit x264.exe to C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe\ and then add the necessary paths to the search path:

set PATH=C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe;C:\Program Files (x86)\VFX\BD3D2MK3D\toolset;%PATH%

Of course, you’ll need to adjust the paths to reflect your local BD3D2MK3D installation folder. Now, you can call x264 using the generated AviSynth script as its input (usually called _ENCODE_3D_MOVIE.avs), and it should work! Just make sure you call x264.exe from that folder C:\Program Files (x86)\VFX\BD3D2MK3D\toolset\stereoplayer.exe\, and not from somewhere else on the system, or you’ll get this: avs [error]: Can't create the graph!
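
As for the frame count that ssifSource2() wants, the calculation from run time and frame rate can be sketched in a few lines of Python. The function name and the run time of 1:46:52 are made up for illustration (that run time just happens to produce the 153734 frames used in the script above). Take the real values from your own eac3to_demux.log.

```python
from fractions import Fraction


def frame_count(hours, minutes, seconds, fps=Fraction(24000, 1001)):
    """Total number of frames for a given run time.

    The default frame rate is 24000/1001 (roughly 23.976fps). Fractions
    are used so that the result does not suffer from float rounding.
    """
    total_seconds = Fraction(hours * 3600 + minutes * 60) + Fraction(str(seconds))
    return round(total_seconds * fps)


# Hypothetical run time of 1h 46min 52s at 24000/1001 fps
print(frame_count(1, 46, 52))  # 153734
```

If the frame count from your log and the calculated one differ by a frame or two, that is most likely just rounding of the run time in the log.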

For me, x264 using ssifSource2() sometimes talks about adding tons of duplicate frames while transcoding. I am not sure what this means and how it may affect things, but for now output seems to work just fine regardless. When it does happen, it looks like this:

ssifSource2() frame duplicates

Nonetheless, it creates half-SBS output, and using MKVtoolnix you can multiplex that into an MKV/MK3D container setting the proper stereoscopy flags (usually half SBS, left eye first) and everything will work just fine.
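
For the multiplexing step, a command line for mkvmerge (part of MKVtoolnix) could be assembled roughly like this. Treat it as a sketch under assumptions: the helper name and file names are hypothetical, and the --stereo-mode option with the keyword side_by_side_left_first is taken from my reading of the mkvmerge documentation, so please check it against the version you have installed. The code only builds the argument list and does not run anything.

```python
def mkvmerge_half_sbs_command(video, output, track_id=0):
    """Build an mkvmerge call that flags a video track as SBS, left eye first.

    Hypothetical helper. Assumption: --stereo-mode takes TRACKID:MODE and
    accepts the keyword side_by_side_left_first.
    """
    return [
        "mkvmerge",
        "-o", output,
        "--stereo-mode", f"{track_id}:side_by_side_left_first",
        video,
    ]


# File names are placeholders for your own encode
print(" ".join(mkvmerge_half_sbs_command("movie_3d.264", "movie_3d.mk3d")))
```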

My method using half-SBS will cost half the horizontal resolution unfortunately, but few TVs support full SBS, plus it’s costly because files would become far larger at similar quality settings. So far I’ve been pretty surprised how good, sharp and detailed it still looks, despite the halved resolution. It’s the only real possibility anyway: full-SBS (using two 1920×1080 frames beside each other) is not very widespread and is large and bandwidth-hungry, and for the frame packed mode that Blu-Rays use, there are simply no free tools available so far, so direct MVC transcoding is not an option. :(

But, at least it works now, thanks to the [help from the Doom9 forums]!

Update: It seems the frame duplicates are a really weird bug, that makes the output very jittery and no longer as smooth as you would expect. Even stranger is the fact, that it is not deterministic. Sometimes it happens a lot, adding tens of thousands of frame duplicates. Then, abort, restart the process and the problem is magically gone. Sometimes you might have to restart 10-20 times and all of a sudden it works. How is this even possible? I do not know. Luckily, BD3D2MK3D also comes with a second Avisynth plugin that you can use instead of ssifSource2.dll, this one is called DirectShowMVCSource.dll and gives you the corresponding function DirectShowMVCSource().

For DirectShowMVCSource() you will also no longer need to determine a frame count, it will be able to do that all by itself. Also, BD3D2MK3D will automatically add it to the Avisynth script that it generates, it’s just commented out. So, uncomment it, comment out the ssifSource2 lines instead, and you’re good to go. So far, the error rate seems to be far lower with DirectShowMVCSource(), although I had a problem once with a very long movie / high frame count, where a 2nd pass encode would just stall at 0% CPU usage.

But other than that it seems to be more solid, even if it has to suboptimally filter stuff through DirectShow. There is another more significant drawback though: DirectShowMVCSource() is a lot slower than ssifSource2(). I would say, it’s a factor of 3-4 even. But at least it’s giving us another option!

Oct 25 2012

Sun Grid Engine Logo

Ok ok, I guess whoever is reading this (probably nobody anyway) will most likely already be tired of all this x264 stuff. But this one I need as documentation for myself anyway, because the experiment might be repeated later. So, the [chair for simulation and modelling of metallurgic processes] here at my university has allowed me to try and play with a distributed grid-style Linux cluster built by Supermicro. It’s basically one full rack cabinet with one Pentium 4 3.2GHz processor and 1GB RAM per node, with Hyper-Threading disabled because it slowed down the simulation jobs that were originally being run on the cluster. The operating system for the nodes was OpenSuSE 10.3. Also, the head node was very similar to the compute nodes, which made it easy to compile libav and x264 on the head node and let the compute nodes just use those binaries.

The software installed for using it is the so-called Sun Grid Engine. I have once already set up my own distributed cluster based on an OpenPBS style system called Torque, together with the Maui scheduler. When I was introduced to this Sun Grid Engine, most of the stuff seemed awfully familiar, even the job submission tools and scripting system were quite the same actually. So this system uses tools like qsub, qdel, qstat plus some additional ones not found in the open source Torque system, like sns.

Now since x264 is not cluster-aware and not MPI capable, how DO we actually distribute the work across several physical machines? Lacking any more advanced approach, I chose a very crude way to do it: I just cut the input video into n slices, where n is the number of cluster nodes. Since the cluster nodes all have access to the same storage backend via NFS, including the user’s home directory, there was no need to send the files to the nodes.
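The exact slicing command is not shown in this post, so the following is only a minimal sketch of how such a crude byte-wise cut could be done with dd. The function name split_into_slices and its arguments are hypothetical; only the slice1.264 .. slice19.264 naming is taken from the text.

```shell
#!/bin/sh
# Sketch only: cut a file into n byte-wise slices with dd.
# split_into_slices and its argument order are hypothetical; the output names
# follow the slice<N>.264 pattern used in this post.
split_into_slices() (
  input=$1
  n=$2
  outdir=${3:-.}
  size=$(wc -c < "$input")            # total input size in bytes
  chunk=$(( (size + n - 1) / n ))     # bytes per slice, rounded up
  i=1
  while [ "$i" -le "$n" ]; do
    # Copy one block of at most $chunk bytes, starting at block number i-1.
    dd if="$input" of="$outdir/slice$i.264" bs="$chunk" skip=$((i - 1)) count=1 2>/dev/null
    i=$((i + 1))
  done
)
```

A call like `split_into_slices elephantsdream_source.264 19` would then produce 19 slices in the current directory. Note that this cuts at arbitrary byte offsets, not at frame boundaries, and that using one big block per slice makes dd buffer a whole slice in memory.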

Now, to make the job easier, all slices were numbered serially, and I wrote a qsub job array script, where the array id would be used to specify the input file. So node[2] would get file[2], node[15] would get file[15] to encode etc. The job array script would then invoke the actual worker script. This is what the sliced input files look like before starting the computation:

Sliced input file

And here is the qsub job array script that I sent to the cluster, called benchmark-qsub.sh. The directory /SAS/home/autumnf is the user’s home directory:

#$ -N x264benchmark
#$ -t 1-19
export PATH=$PATH:/SAS/home/autumnf/usr/bin
echo $PATH
cd /SAS/home/autumnf/x264benchmark
time transcode.sh

And the actual worker script, transcode.sh:

# Pass 1:
x264 --preset veryslow --tune film --b-adapt 2 --b-pyramid normal -r 3 -f -2:0 --bitrate 10000 --aq-mode 1 -p 1 --slow-firstpass --stats benchmark_slice$SGE_TASK_ID.stats -t 2 --no-fast-pskip --cqm flat slice$SGE_TASK_ID.264 -o benchmark_1stpass_slice$SGE_TASK_ID.264
# Pass 2:
x264 --preset veryslow --tune film --b-adapt 2 --b-pyramid normal -r 3 -f -2:0 --bitrate 10000 --aq-mode 1 -p 2 --stats benchmark_slice$SGE_TASK_ID.stats -t 2 --no-fast-pskip --cqm flat slice$SGE_TASK_ID.264 -o benchmark_2ndpass_slice$SGE_TASK_ID.264

As you can see, the worker is using the environment variable $SGE_TASK_ID as a part of the input and output file names. This variable contains the job array id passed down from the job submission system of the Sun GRID engine. The job submission script contains the line #$ -t 1-19, which tells the system that the job array consists of 19 tasks. The cluster had 19 working nodes left; the rest were already dead, as the cluster was pretty much out of service and hence unmaintained. Let’s see how the Sun tool sns reports the current status of the grid cluster:
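The mapping from array id to file names can be sketched as a small helper. This is an illustration only: slice_names is a hypothetical function, and the guard relies on the assumption that the grid engine sets SGE_TASK_ID to a non-numeric placeholder when a script is not run as an array task.

```shell
#!/bin/sh
# Sketch only: derive the per-task file names from the array task id.
# slice_names is a hypothetical helper; SGE_TASK_ID is provided by the grid
# engine, the file name patterns are the ones used in transcode.sh.
slice_names() (
  id=${1:-$SGE_TASK_ID}
  case $id in
    ''|*[!0-9]*)
      # Empty or non-numeric id: refuse to build file names from it.
      echo "no numeric task id given" >&2
      exit 1
      ;;
  esac
  echo "slice$id.264 benchmark_slice$id.stats benchmark_2ndpass_slice$id.264"
)
```

Checking the id up front avoids a worker silently looking for a file with a bogus name when the script is started by hand outside of a job array.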

Empty Grid Cluster

So, some nodes are in “au” or “E” status. While I do not know the exact meaning of the status abbreviations, that basically means that those nodes are non-functional. Taking the broken nodes into account, we have 19 working ones left. Now every node invokes its own x264 job and hands it the proper input file, from slice1.264 to slice19.264, writing correspondingly named outputs for both passes. Now let’s send the script to the Sun GRID engine using qsub ./benchmark-qsub.sh and check what sns has to say about this afterwards:

Sns reporting a grid cluster under load

Hurray! Now if you’re more used to OpenPBS style tools, we can also use qstat to report the current job status on the cluster:

Qstat showing a cluster under load

As you can see, qstat also reports a “ja-task-ID”, which is essentially our job array id, or in other words $SGE_TASK_ID. So that’s basically one job with one job id, but 19 “daughter” processes, each having its own array id. Using tools like qdel or qalter you can either modify the entire job, or only subprocesses on specific nodes. Pretty handy. Now the Pentium 4 processor might suck ass, but 19 of them are still pretty damn powerful when combined. At the moment of writing you can find the cluster in [place #4 on the results list]! Here is the Voodooalert style result, just under one hour:

0:58:01.600 | SMMP | 19/1/1 | Intel Pentium 4 (no HT) 3.20GHz | 1GB DDR-I/266 (per node) | SuperMicro/SGE GRID Cluster | OpenSuSE 10.3 Linux (Custom GCC Build)

To ensure that this is actually working, I have recombined the output slices of pass 2, and tried to play that file. To my surprise it worked and would also allow seeking. Pretty nice considering that there is quite some bogus data in the file, like multiple H.264/AVC headers or cut up frames. I originally tried to split the input file into slices at keyframes in a clean fashion using ffmpeg, but that just wouldn’t work for that type of input, so I had to use dd, resulting in some frames being cut up (and hence dropped), and slices 2-19 having no headers. That required very specific versions of libav and x264, as not all versions can accept garbled files like this.

Also, the output files have been recombined using dd. Luckily, mplayer using libav/ffmpeg would play that stuff nicely, but there’s simply no guarantee that every player and/or decoder would. So that’s why it cannot be considered a clean solution. Also, since motion estimation is less efficient for this setup at the cutting points, it’s not directly comparable to a non-clustered run. So there are some drawbacks. But if you would cluster x264 for productive work, you’d still do it kind of like that. Here, the final output, already containing the final concatenated file, quite a mess of files right there:
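For the recombination step, the one thing that is easy to get wrong is the order of the slices: a shell glob sorts slice10 before slice2. The following sketch uses cat in an explicit numeric loop instead of the dd approach mentioned above; join_slices and its arguments are hypothetical names, the file name pattern is the one written by pass 2.

```shell
#!/bin/sh
# Sketch only: concatenate the pass 2 slices in numeric order.
# join_slices is a hypothetical helper. cat is used here instead of dd.
join_slices() (
  n=$1
  out=$2
  dir=${3:-.}
  : > "$out"                          # start with an empty output file
  i=1
  while [ "$i" -le "$n" ]; do
    # Append slice i; abort if a slice is missing.
    cat "$dir/benchmark_2ndpass_slice$i.264" >> "$out" || exit 1
    i=$((i + 1))
  done
)
```

Something like `join_slices 19 benchmark_joined.264` would then build the final file, with benchmark_joined.264 being a made-up name. This does nothing about duplicate headers or damaged frames at the joints, it only guarantees the order.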

Clusterrun done

So this is it. The clustered x264. I hope to be able to test this approach on another cluster at the Metallurgy chair in the next months, a Nehalem-based machine with far more cores, so that’d be really massive. Also, access to a Sandy Bridge-E cluster is possible, although not really probable. But we’ll see. If you’re interested in using x264 in a similar approach, you might want to check out the software versions that I used, these should be able to cope with rudely cut up slices quite well:

Also, if you require some guidance building that source code on Linux, please check out my guide:

If anybody knows a better way to slice up H.264/AVC elementary video streams, by all means, let me know! I would love to be able to have slices cut at proper keyframe positions including their own header, and I would also like to be able to reconcatenate slices to one file that is clean, having only one header at the beginning of the file and no damaged / to be dropped frames at their joints. So if you know how to do that – preferably using Linux command line tools – just tell me, I’d be happy to learn that!

Edit: Thanks to [LoRd_MuldeR] from the [Doom9 Forums] I now have a way of splitting the input stream cleanly at GOP (group of pictures) boundaries, as the Elephants Dream movie is luckily using closed GOPs. Basically, it involves the widely-used [MKVToolNix]. With that tool, you can just take the stream, split it into n MKV slices, and then either use those directly or extract the H.264/AVC streams from those slices, maybe using [tsMuxer]. Just make as many slices as your cluster has compute nodes, and you’re done!

By the way, both MKVToolNix and tsMuxer are available for Windows and Linux, and also for Mac OS X.

This is clean, safe and proper, unlike my dirty previous approach!
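As a rough sketch of what that could look like on the command line, the helper below only prints the commands instead of running them. Several things here are assumptions: the total duration has to be known in seconds, the file names are made up, mkvextract (part of MKVToolNix) is used for the extraction instead of tsMuxer, and the video is assumed to be track 0. Since mkvmerge only splits at keyframes, the number of slices can also end up differing from n.

```shell
#!/bin/sh
# Sketch only: print mkvmerge/mkvextract commands for n slices of roughly
# equal duration. print_split_commands is a hypothetical helper, and the
# slice-NNN.mkv names assume mkvmerge's default numbering of split outputs.
print_split_commands() (
  input=$1
  total_seconds=$2
  n=$3
  per_slice=$(( (total_seconds + n - 1) / n ))   # seconds per slice, rounded up
  echo "mkvmerge -o slice.mkv --split duration:${per_slice}s $input"
  i=1
  while [ "$i" -le "$n" ]; do
    # Pull the video track (assumed to be track 0) out of each MKV slice.
    printf 'mkvextract tracks slice-%03d.mkv 0:slice%d.264\n' "$i" "$i"
    i=$((i + 1))
  done
)
```

Printing first and running later makes it easy to check the split duration and the file names before touching a large source file.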

Aug 25 2012

Intel 486 logo

So he finally did it. That ancient Intel 486 processor, running Debian 4.0 Linux, completed the [x264 benchmark] in an exceptional marathon of 8 months, 16 days, 21 hours, 29 minutes and approximately 59 seconds. He never crashed and showed no anomalies. You can decide for yourself whether that’s despite or because of its age. In the HHHH:MM:SS.mmm format that’s a runtime of 6261:29:58.872. Of course that is all pure C/C++ code, as there simply are no optimized assembler code paths for any 486. The chip doesn’t even have any prominent instruction set extension; even MMX was far in the future when this chip was released.
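To relate the hour count to the calendar figure, the hours can simply be broken down into days and remaining hours. A tiny sketch, with hours_to_days being a made-up helper name:

```shell
#!/bin/sh
# Sketch only: break a runtime given in whole hours down into days and hours.
# hours_to_days is a hypothetical helper name.
hours_to_days() (
  hours=$1
  echo "$(( hours / 24 )) days $(( hours % 24 )) hours"
)

hours_to_days 6261   # prints "260 days 21 hours"
```

260 days and 21 hours fits the 8 months, 16 days and 21 hours quoted above, depending on how long the months in question are.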

There is one cool thing about it though, and that’s that it’s an SL-enhanced processor, hence the DX4-S/100 instead of just DX4/100. SL-enhanced processors were amongst the very first with power management features, so this chip can actually power down to a few milliwatts of power consumption when idle. That’s cool.

So here are the specifications of the system, just for reference:

  • Intel i486DX4-S/100 100MHz
  • ASUS PCI/I486SP3G v1.8, i423TX i420ZX (423TX+424ZX) Rev.4 Saturn-II chipset, 256kB L2 cache
  • 128MB EDO-DRAM (64MB cached)
  • S3 ViRGE/DX
  • Adaptec AHA-29160N SCSI controller
  • Seagate Cheetah 10,000rpm 36GB SCSI drive
  • 3Com Etherlink III ISA network card
  • FDD

Voodooalert style result line:

6261:29:58.872 | 0.023x | GAT | 1/1/1 | Intel i486 DX4-S/100 100MHz | 128MB EDO-DRAM (64M cached) | ASUS PCI/I486SP3G v1.8 | Debian 4.0 Linux (Custom GCC Build)

And here the final output screen of the webcam, that was monitoring the remote shell on another machine:

Final x264 output on the i486

And in case you want to see what the machine actually looks like, here you go:

So that’s it then! Next thing to do is to dist-upgrade the machine from Debian 4.0 Etch to 5.0 Lenny and finally to the last version with 486 kernel & userspace, the infamous Debian 6.0 Squeeze. But that’s another story. :)

Aug 23 2012

x264 Logo

I always had the idea of actually visualizing the frametimes during the x264 benchmark, to show where it’s fast and where it’s not when transcoding the open source movie “Elephants Dream”. I have now finally succeeded in writing a Perl script that grabs a slightly modified x264 output. x264 can be told to output a small line of statistics for each frame transcoded. It writes that to the STDERR stream, which I decided to redirect to STDOUT and pipe into my Perl script. The script in turn makes use of modern x86 systems’ HPET (High Precision Event Timer) through the Time::HiRes Perl module to fetch a UNIX epoch timestamp for each successfully transcoded frame, and then calculates the time difference between the current and the previous frame. Here are the modified x264 calls for both passes, writing the frametimes to plain text files:

# Pass 1:
x264 --preset veryslow --tune film --b-adapt 2 --b-pyramid normal -r 3 -f -2:0 --bitrate 10000 --aq-mode 1 -p 1 --slow-firstpass --stats benchmark.stats -t 2 --no-fast-pskip --cqm flat -v elephantsdream_source.264 -o /dev/null 2>&1 | ./writestamps.pl ./frametimes-pass1.txt
# Pass 2:
x264 --preset veryslow --tune film --b-adapt 2 --b-pyramid normal -r 3 -f -2:0 --bitrate 10000 --aq-mode 1 -p 2 --stats benchmark.stats -t 2 --no-fast-pskip --cqm flat -v elephantsdream_source.264 -o /dev/null 2>&1 | ./writestamps.pl ./frametimes-pass2.txt

And here is the source code for that “writestamps.pl” script. It’s not exactly optimal; it would probably be better to first collect the time diffs in an array and only flush to disk at the end, but I didn’t have time to do that today. Even as it is now, it already gives a good idea of the frame times. Please note that the usage of HPET with Time::HiRes does not work on Windows, it’s simply not implemented there. So this is UNIX/Linux/BSD only I guess, with no nanosecond precision on Windows, at least not with Time::HiRes. Now, the source:

#!/usr/bin/env perl
use strict;        # Perl code is strict, no dodgy stuff!
use Time::HiRes;   # Highres Time functions (nanosecond precision)
my $localtime = Time::HiRes::clock_gettime(); # UNIX epoch timestamp
my $oldtime = $localtime;                     # Variable for timestamp of last frame
my $stampfile = $ARGV[0];                     # File name to write timestamp differences to
my $diff;                                     # Time difference between frames
# Opening timestamp file:
open(STAMPFILE, ">:encoding(UTF-8)", $stampfile) || die "$stampfile could not be opened! Error: $!";
while (my $line = <STDIN>) {                 # As long as STDIN is coming from x264..
  if ($line =~ m/bytes/) {                     # Check if line is a per-frame statistic line. If so..
    $localtime = Time::HiRes::clock_gettime();   # Read current UNIX epoch timestamp
    $diff = $localtime - $oldtime;               # Calculate time difference between this frame and last frame
    $oldtime = $localtime;                       # Current frame is becoming old frame
    printf(STAMPFILE "%f", $diff);               # Printing time diff to file in fixed-point format
    printf(STAMPFILE "\n");                      # Printing UNIX line break to file
  }
}
close(STAMPFILE);                            # Closing timestamp statistics file.

And this is what we get from it:

Pass 1:

x264 benchmark frametimes pass 1

Pass 2:

x264 benchmark frametimes pass 2

So, I have just updated the code to no longer do I/O in every iteration for each line that comes piped from x264. Now, the data is collected in an array and only flushed to disk at the very end, which should make the data more representative, without any I/O disturbing the measurements. Graphs will be updated when the run based on this new code is completed:

#!/usr/bin/env perl
use strict;           # Perl code is strict, no dodgy stuff!
use Time::HiRes;      # Use HPET timer for nanosecond precision timestamps.
my $localtime = Time::HiRes::clock_gettime(); # Read current UNIX epoch timestamp.
my $oldtime = $localtime;                     # Define stamp of previous frame.
my $stampfile = $ARGV[0];                     # Fetch filename of timestamp file from cli.
my $diff;                                     # Time difference between frames.
my @diffs = ();                               # Array that holds time diffs before final flush to disk.
while (my $line = <STDIN>) {                # As long as there is STDIN coming from x264..
  if ($line =~ m/bytes/) {                    # Check if line is indeed a per-frame output line.
    $localtime = Time::HiRes::clock_gettime();  # Read UNIX epoch timestamp for current frame.
    $diff = $localtime - $oldtime;              # Calculate diff between current and previous frame.
    push (@diffs, $diff);                       # Push difference into array for later extraction.
    $oldtime = $localtime;                      # Current frame becomes old frame.
  }
}
# Open timestamp file:
open(STAMPFILE, ">:encoding(UTF-8)", $stampfile) || die "$stampfile could not be opened! Error: $!";
foreach (@diffs) {             # Iterate through array of time differences per frames.
  printf(STAMPFILE "%f", $_);    # Print time diffs to file in fixed-point format, so we don't
  printf(STAMPFILE "\n");        # accidentally get scientific notation. Then print UNIX line break.
}
close(STAMPFILE);              # Close timestamp file.

In addition to the graphs seen above, user “Elianda” from [Voodooalert] (German) has smoothed the data of both passes down to 500 data points instead of the full 15691, and gave me a software called “Igor 6”, which I used to create some more readable plots. The data here is already based on the new version of my Perl script, which is no longer affected by disk I/O:

Pass 1:

x264 benchmark frametimes smoothed pass 1

Pass 2:

x264 benchmark frametimes smoothed pass 2
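How exactly the data was smoothed is not described here, so the following is just one simple way to thin out such a frametime file: averaging consecutive bins of lines. smooth_frametimes is a hypothetical helper, and plain bin averaging is an assumption, not necessarily the method used for the plots above.

```shell
#!/bin/sh
# Sketch only: reduce a frametime file (one value per line) to at most
# $points values by averaging consecutive bins of lines.
# smooth_frametimes is a hypothetical helper name.
smooth_frametimes() (
  infile=$1
  points=$2
  total=$(wc -l < "$infile")                  # number of data points
  bin=$(( (total + points - 1) / points ))    # lines per bin, rounded up
  LC_ALL=C awk -v bin="$bin" '
    { sum += $1; count++ }
    count == bin { printf "%f\n", sum / count; sum = 0; count = 0 }
    END { if (count > 0) printf "%f\n", sum / count }   # last, partial bin
  ' "$infile"
)
```

For example, `smooth_frametimes ./frametimes-pass1.txt 500 > smoothed-pass1.txt`, where smoothed-pass1.txt is a made-up name. Because the bin size is rounded up to a whole number of lines, 15691 input values end up in bins of 32 and give 491 output values rather than exactly 500.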

So much for visualizing that frametime data!

May 10 2012

I have tested an sgi Altix 350 shared memory cluster with 10 Intel Itanium² CPUs once, running the [x264 benchmark] on it, and here’s another. Prof. Ludwig and Mr. Otto from the [Chair of Simulation and Modelling of Metallurgic Processes] (SMMP) at the University of Leoben have agreed to let me benchmark their even larger Altix 350 with 16 processors. Now while I have already done that successfully using the GNU C compiler (GCC), performance was a little bit sub-par and rather unsatisfactory, only slightly faster than its smaller counterpart; see [the filtered result]. On the previous sgi Altix machine I had access to the rather hairy ICC 10.1, the Intel C/C++ compiler. It did give a performance boost of around 10% back then, after I finally managed to build x264 with it, so I wanted to try that once again.

Unfortunately, it’s not so easy to get access to ICC. The newest version 12.0 is not available for the IA64 (Itanium) architecture anymore, and you can’t even get to a proper trial license generator for Linux on Itanium. The latest version for Itanium is 11.1.080. Even the download is almost impossible to locate by regular means on Intel’s website; you can get it after logging in to [registrationcenter.intel.com] if you know what you’re looking for and have an ICC 12.0 trial license tied to that account already. So after some support emailing I registered at [premier.intel.com] (you get a link for that when downloading the regular ICC 12.0 trial for x86). There I opened a support ticket to get a proper trial license.

The support person did his best to generate licenses for me, but the compiler installer just wouldn’t accept them. In the end he found that he does not have the proper tools anymore to generate valid IA64 licenses, so he forwarded the issue to internal registration and license management within Intel. He said they should still have the proper setup to generate a valid trial serial number / license for Linux on Itanium.

There you see, the Itanium ship seems to really be sinking if you can’t even get any Intel software trials for it anymore. I still hope I can get the 11.1 compiler working, as that would probably be the best Itanium² result we’re ever going to see from that kind of IA64 shared memory cluster platform.

Feb 09 2012

Silicon Graphics Altix 350

After doing some testing on the Intel C compiler (ICC) and also compilation of libav/x264 without root privileges, I finally got access to the real deal! For that, my thanks go to Prof. Supancic and Mr. Flicker from the [Institute for Structural and Functional Ceramics] at the University of Leoben.

So, the machine: A Silicon Graphics Altix 350, equipped with 5 modules with 2 processors each. That makes for a total of 10 Intel “Madison” Itanium² processors clocked at 1.5GHz and packed with 4MB cache each. The memory subsystem consists of 56GB Reg. ECC DDR-I/333 memory and the storage backend is SCSI. The entire machine consumes roughly 5000W of power and runs on SuSE Linux Enterprise Server or SLES, version 10.

So much for the specification mumbo jumbo. For an idea of what a fully packed Altix 350 half-height rack looks like, see the thumbnail picture. Yes, it’s big. So, was it easy to get x264 to work? Well, nah-ah, it wasn’t.