For years, we’ve been hearing about the promise of GPGPU (general-purpose GPU) computing, the process of having operations normally run by the CPU crunched in the GPU. These are usually tasks that lend themselves particularly well to floating point-intensive parallelization. There needs to be a large amount of arithmetic done per memory fetch to truly harness GPU-based processing, and the application needs to perform such operations on multiple pieces of data simultaneously. Essentially, GPU shaders can take on the role of floating-point units and perform these parallel computations simultaneously, with dozens or hundreds of mini-processors working in tandem.

This idea took root in the scientific and engineering fields, where high-end servers had memory bandwidth that could put GPGPU operations to fruitful use. Apps for geological modeling and molecular visualization, when coded to take advantage of GPGPU functionality, started seeing orders-of-magnitude performance improvements. There was no reason that signal processing couldn’t also fill the bill, so researchers optimized the SETI@home and Folding@home distributed computing applications to make use of the GPGPU potential in consumer-level systems.

Modeling proteins on your PC is a worthy pursuit in the advancement of health and science, but it doesn’t really let you use GPGPU technology in your everyday life. Where was the promise of 10X acceleration for everyday apps? Last year, we started to get an answer, when Nvidia pushed Elemental Technologies’ Badaboom video converter into the spotlight in order to showcase CUDA for the masses. AMD soon followed suit when it exposed ATI Stream GPGPU capabilities through its Catalyst 8.12 driver update in December and, with it, the ATI Avivo Video Converter. Imaging and video rendering fit well with GPGPU’s capabilities.
This is why nearly all of the GPGPU-optimized apps you see in the consumer market are either video transcoders or video editors, which can render effects and then transcode during output to a final video file. A CPU might be able to dedicate eight threads to this task, but a modern GPU can have practically every one of its programmable shaders crunching away on separate operations. Given that GPUs now have hundreds of such shaders, or stream processors, the efficiency gains vs. a traditional CPU in specific tasks can be massive. Why tie your system up for hours on a movie transcode job when it could take only minutes?

The GPGPU fortunes of AMD and Nvidia have traveled somewhat different paths since last winter. When the Catalyst 8.12 launch bombed due to fundamental quality flaws in the transcoding process, Nvidia moved to push its fledgling lead. But with AMD's recent release of its Catalyst 9.5 drivers, the promised cure for these past quality issues is here. Today, we have both vendors enabling GPGPU-accelerated applications. We can’t lie and tell you there are aisles filled with such apps. In fact, there are probably fewer than a dozen today. Still, it’s a good start, and if you want to dip your hand into the rushing waters of GPGPU optimization, consider the following titles the best places to begin.

ArcSoft TotalMedia Theatre 3 with SimHD Plug-In

TotalMedia Theatre 3 with SimHD Plug-In
$89.99 (TMT3 Platinum); $19.95 (SimHD)
ArcSoft
www.arcsoft.com
CPUs: 2

TMT3 is a universal player, much in the vein of WinDVD, PowerDVD, and a host of others. The software supports just about any SD or HD format you could ask for, including Blu-ray Disc. For the sake of GPGPU discussion, the item of interest here is ArcSoft’s SimHD plug-in. This is an upconversion tool designed to take your standard-def content and upconvert it to 720p or 1080p, exactly what you want for a home-theater system displaying on a delicious flat-panel TV.
Most Blu-ray players now come with this capability, but you might expect a dedicated PC program to provide better results. In this case, you would be wrong. We tested SimHD with a DVD copy of “Matrix Revolutions” as well as some MPEG-2 720p content. The results were similar. As you can see in the capture of our standard-def MPEG-2 content to 1080p, SimHD delivers a mixed bag. On one hand, the plug-in boosts contrast and color richness. Colors pop more. Whites look whiter, and darks look darker. Because of this, SimHD tends to make detail in shadowy areas more pronounced. This is good stuff. In most instances, we didn’t perceive that SimHD was overprocessing the image and leaving it looking artificially “boosted.” However, although ArcSoft may be increasing each frame’s pixel count, SimHD’s antialiasing appears to be non-existent, even though we only selected the default medium level of image sharpening. Jaggies become much more noticeable, and the amount of artifact noise introduced into scenes is jarring. This is a function of ArcSoft, not the underlying Stream or CUDA driver: We observed the same problems with and without GPU acceleration. Admittedly, running on the CPU, SimHD yielded about 60% CPU utilization while hitting only 13% or so when GPU acceleration was selected. Clearly, Stream and CUDA are shouldering a significant amount of work. The unfortunate thing is that the work they’re being given looks terrible. If that wasn’t enough, we observed multiple playback glitches under SimHD, and the on-screen capture tool was inexplicably disabled during SimHD playback. We don’t doubt that future versions of SimHD, as well as competing titles along the same line, will yield better results. Today, though, SimHD merely shows GPGPU’s promise, not its rewards.  
CyberLink MediaShow Espresso

MediaShow Espresso
$39.95
CyberLink
www.cyberlink.com
CPUs: 4

We already reviewed CyberLink’s PowerDirector 7 in our May 2009 issue (see page 80), so we won’t delve into detail here other than to use PD7 as an illustration that not all GPGPU capabilities are implemented uniformly by software developers. In this case, PD7 uses CUDA to accelerate both video encoding (producing final video output) and the rendering of effects/filters during editing. As of this writing, PD7 only uses Stream to accelerate encoding. The additional decode and scaling capabilities that were half of the point behind Stream’s Catalyst 9.5 update in May have not yet been implemented. So if you’re a PD7 fan, CUDA will currently get you a broader set of accelerated features.

Now, the encoding engine built into PD7 is also the core of CyberLink’s new transcoder app, MediaShow Espresso, which also supports both Stream and CUDA. While not quite as slick-looking as Badaboom, Espresso delivers more functionality. You can browse for files and folders or simply drop your source content straight into Espresso’s main project area. Then, select a general profile (Apple, Sony, or Microsoft devices; YouTube; or Custom) and a target resolution from the following menu. You can amass a long list of transcode jobs and launch them as a batch. Espresso can crunch up to four jobs simultaneously.

In some ways, the true mark of a great transcoder is that it can do whatever a power user asks but remain accessible to novices. MediaShow Espresso is exactly this. For newbies, the hardest part of using this program will be remembering to pick the right folder for target files, assuming they want someplace besides CyberLink’s default location. Experts will find the profile customization tools for bit rates and codecs more than sufficient.
Given the agnostic adoption of Stream and CUDA, not to mention 8-way thread support for Core i7 owners, this may be the best transcoder app available today.

Pande Group Folding@home

Folding@home
Free
Pande Group
folding.stanford.edu

There are two distributed computing apps with support for ATI Stream and CUDA: Folding@home and SETI@home. Folding@home will no doubt strike a deeper chord with those who’ve had loved ones afflicted with diseases such as Alzheimer’s, cystic fibrosis, or any malady believed to involve the misfolding of proteins. It's hoped that having a better understanding of how proteins fold will help lead to treatments and possibly cures of these afflictions. Because each new workload downloaded from the Folding@home server is different, it’s very difficult to benchmark the application. However, we’ve seen GPU-accelerated improvements with SETI@home of roughly 6X to 9X vs. CPU-only processing. Folding@home supports both ATI Stream and CUDA, but make sure you download the High Performance client version (folding.stanford.edu/English/DownloadWinOther) for your GPU.

LoiLo Super LoiLoScope HD

Super LoiLoScope HD
$69
LoiLo
loilo.tv
CPUs: 3

In terms of general approach and model, if you’ve seen one video editor, you’ve seen them all. And then there’s LoiLoScope. Sure, LoiLoScope does what other editors do; you can easily trim clips, amass clips into video projects, add effects, place items on a timeline, add titles, and export to a wide range of profiles, including YouTube HD. As in Badaboom, H.264 encoding can be CUDA-accelerated. What makes Super LoiLoScope HD different is its interface: not just that it looks like a game export from Konami or Nintendo, but the way it conceptually represents project space. The environment is essentially an infinite plane you can zoom into and out of. Imagine a cutting room floor where you’re actually working on the floor, only there are no walls.
You can throw clips, photos, and audio files into loose piles, then use “magnets” to bind them into more defined projects. We don’t recommend LoiLoScope because it does a better job leveraging CUDA than, say, PowerDirector 7. It doesn’t. In fact, CyberLink uses CUDA for accelerating filter renders, and LoiLo does not. We recommend LoiLoScope as a style option. Not all creative people lean toward the workflow-oriented paradigm of traditional video editing. LoiLo takes a looser, more “right-brain” approach, if you will. If that’s your bag, then at least give the free trial a spin. ATI Stream support is expected soon, perhaps even by the time you read this.

Nero Move it

Move it
$39.99
Nero
www.nero.com
CPUs: 3.5

With Move it, Nero takes its own spin on the transcoder concept. The idea here is that some people care more about synchronizing media among their devices than they do about codecs, resolutions, and all the other minutiae of transcoding. Nero gives you that level of control in its options menus, but the main UI uses a bisected design in which you have your source device on the left and your target device on the right. The PC doesn’t have to be the focus; you could transfer from a digital camera to a PSP, for example. If you are syncing the PC, you tell Move it which folders you want to monitor, and thumbnails of the media files in these folders appear in the main area on the left. You can sync or transcode files individually or in groups, either batch selected or by folder. Move it will assemble a queue list for transcoding and can encode up to four files simultaneously, definitely one of its best features.

As a cross-device media organizer, Move it is a great application. We’re not in love with its interface, although you might be. Functionally, we have no complaints. It would be cool if somehow future versions could work on a ring or spoke/hub concept, so multiple devices could cross-sync.
We can’t put our fingers on exactly what bugged us about the UI, but somehow it felt cumbersome; the order of events didn’t always flow smoothly from point to point. Still, just having a good transcoder with batch capabilities and CUDA acceleration for $39.99 as a direct download makes a lot of amends.

Real-World Results

The burning question, of course, is whether Stream and CUDA really deliver substantial acceleration. To find out, we ran a handful of tests that sought to determine two things. First, how much faster is encoding with GPU acceleration vs. without? Second, how does Stream stack up against CUDA?

On the surface, the first question is much easier to address. You simply need to take a source file and transcode it into a different format. Any GPGPU-enabled transcoder or video editor should do this. However, there are several variables to consider. If you apply effects to the video, such as sharpening, tone changes, blurring, etc., these may be accelerated under one platform and not another, as we see in PowerDirector 7. Next, the application may not offer a simple GPU acceleration on/off switch. Badaboom is a good example. A common test scenario with Badaboom is to transcode high-def content into a device-specific portable format, such as iPod video. But Badaboom won’t let you turn off GPU acceleration, so you have to try to find a similar test for the same transcode operation under a non-accelerated app. The usual choice here is iTunes.

We tried this comparison with a GeForce GTX 260 and an 11.5MB MP4 file pulled from YouTube with the KEEPVID bookmarklet. iTunes reported the H.264 (AVC) file as having a 124Kbps audio bit rate, low complexity, stereo, and a total bit rate of 624Kbps at 480 x 270. We then converted it to iPod format in iTunes. The resulting file had these stats: MP4, 25MB, 124Kbps audio, low complexity, 1,371Kbps total bit rate, and 480 x 270 H.264, marking an actual upconversion from the source file.
Conversion time was 1:16 (minutes:seconds). We then used Badaboom 1.1.1 with advanced settings to set a target resolution of 480 x 272, audio mix of auto, and bit rate to 1,370Kbps. The resulting file specs show: MP4, 26.7MB, 124Kbps audio, low complexity, stereo, 1,474Kbps total bit rate, and 480 x 270 H.264. Conversion took 26 seconds with GPU acceleration in Badaboom, compared to 1:16 using CPU-only in iTunes. It’s pretty clear that GPU acceleration pays big dividends, but it’s not an exact apples-to-apples scenario. For a better comparison, we turned first to PowerDirector 7, which supports encoding acceleration for both Stream and CUDA. But even this turned out to be problematic. It seems that the current version only enables Stream acceleration under two profile modes called Video Preservation - High Definition (AVC, 1080p, 14.8Mbps) and Video Preservation - Standard Definition (AVC, 720 x 480, 6Mbps). These modes don’t appear when you’re running an Nvidia-based card. Worse yet, if you check the “Use SVRT To Save Rendering Time” box, Stream support won’t be utilized. Fortunately, PD7 features a sort of profile cloning function. If you pick a Video Preservation profile and click the New button, it will keep all of the settings contained in the current profile and allow you to save them under a new name. This cloned profile will appear when you switch to CUDA. We used three source files, a 22MB YouTube HD video (MPEG-4, 1,280 x 720), the 115MB WMV HD “Terminator 2” trailer, and a 48MB, 720 x 480 MPEG-2 clip. We output the first two to our Video Preservation standard and high-def PD7 profiles and the third to an iPod profile, just to have a different profile in the mix. We ran these tests on an AMD Phenom II X4 955 platform with 2GB of RAM and, in turn, a Radeon HD 4890 or GeForce GTX 280. Across these three tests, we see Stream and CUDA performing very similarly when encoding to lower resolution targets. 
But when we move up to the high-def profile, CUDA pulls away from Stream by a sizable margin of greater than 30%. ATI and CyberLink both cautioned us about using PD7, but as the encoding engine within the latest Stream drivers remains unchanged from the December Stream launch, we’re confident that this is a fair test of current capabilities.

Next, we used Espresso to look specifically at the benefits of GPGPU over CPU processing when upconverting our MPEG-2 camcorder footage into 1,440 x 1,080 MPEG-4. Stream cuts the encode time in half, but CUDA slices it by two-thirds. Point for Nvidia. But going the other direction, downconverting from YouTube HD MPEG-4 to the iPod Touch 480 x 270 profile, CUDA shows about 30% improvement to the output time (less than we expected), while Stream finished the task 2.7 times faster than our CPU. The answer likely lies in acceleration offered by Nvidia’s PureVideo HD engine, not CUDA. AMD points out that interpolating from lower resolution to a higher one is not a typical real-world scenario, as there’s no filtering happening to improve the image quality, only an increase in pixels that will result in a lower quality output. Hence, AMD spent no effort in trying to accelerate something no one would realistically want.

Finally, we ran several tests through Nero Move it, which only supports CUDA at present. We were surprised at how little Nero utilized GPU acceleration in our tests. This doesn’t mesh with results we’ve seen in other CUDA apps, and we have to wonder if Nero needs to issue a patch in order to improve its utilization.

by William Van Winkle
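For readers who want to turn the timings in Real-World Results into speedup factors, here is a minimal Python sketch. It uses only figures quoted in the tests (1:16 in iTunes vs. 26 seconds in Badaboom, and Espresso's encode time cut "in half" and "by two-thirds"). The function names are our own illustrative choices, not part of any vendor tool, and treating "cut by X" as a fraction of the CPU-only time is an assumption about how those results were measured.

```python
# Rough speedup arithmetic for timings quoted in the article.
# All function names are hypothetical helpers for illustration only.

def to_seconds(clock: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:16' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


def speedup_from_time_saved(fraction_saved: float) -> float:
    """Convert 'encode time cut by X' (0 <= X < 1) into a speedup factor.

    Assumes X is the fraction of the CPU-only time that was eliminated.
    """
    if not 0.0 <= fraction_saved < 1.0:
        raise ValueError("fraction_saved must be in [0, 1)")
    return 1.0 / (1.0 - fraction_saved)


# iTunes (CPU only) took 1:16; Badaboom (CUDA) took 26 seconds.
itunes_s = to_seconds("1:16")
badaboom_s = 26
badaboom_speedup = speedup(itunes_s, badaboom_s)

# Espresso upconversion: Stream halved the time, CUDA cut it by two-thirds.
stream_upconvert = speedup_from_time_saved(1 / 2)
cuda_upconvert = speedup_from_time_saved(2 / 3)

print(f"Badaboom vs. iTunes: {badaboom_speedup:.2f}x")
print(f"Stream upconversion: {stream_upconvert:.2f}x")
print(f"CUDA upconversion:   {cuda_upconvert:.2f}x")
```

On these numbers, the Badaboom run works out to a little under 3X. Because the two apps produced slightly different output files, treat that figure as approximate rather than an apples-to-apples result.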
Quick CUDA Q&A With Michael Steele, General Manager Of Digital Consumer Solutions at Nvidia

CPU: Once and for all, is CUDA proprietary?

MS: CUDA is almost like x86. It’s an architecture. OpenCL is just another approach to run on top of all of our Nvidia GPUs. In fact, we were the first to submit a working spec to the OpenCL committee. There’s been some interest from others who confuse the market into believing we’re somehow proprietary and don’t support OpenCL, but it would be crazy for us not to. Also, Microsoft’s DirectX Compute is coming. Microsoft is a very big proponent of GPU computing, and we were the first to provide a working driver for GPU computing supporting DirectX Compute.

CPU: What more could we soon see GPU computing doing for consumers that isn’t on the market already?

MS: There’s more you could do with playback, like streaming video. Whether it’s coming through Silverlight, Flash, or anything else, there’s a lot of compute potential there. There’s room for video enhancement, effects, upscaling, and convergence that can all be applied to streaming. You can even improve video conferencing. The digital photo potential is huge. Everybody has thousands of photos on their PC. They want to edit them, clean them up, and flat out find them. Face recognition fits perfectly there. Apple and Google have face recognition as part of their applications today. I’m pretty sure you’re going to see GPUs accelerating face recognition in the next six to 12 months. Finding a face and then comparing it to a database is a tremendous application for parallel processing. If you look at all the papers written about CUDA, there are more written on [face and gesture recognition] than just about any other application. When you do heavy lifting on your PC with these kinds of tasks, it often chews up 100% of your CPU, and even then it may not be giving you the results you want. The GPU can usually do it much better and faster, and it can help lower the CPU utilization.
CPU: So why am I seeing the CPU usage still hammered, still well over 80%, during these GPU-accelerated processes?

MS: It depends on the specific application and also the approach the [software developer] takes. Some ISVs prefer to do almost everything on the CPU. Others take more of a hybrid approach. There may be a pipeline of activity involved and they want to keep those resources as low as possible, and sometimes they’ll max out everything. In general, though, I think you’ll see a lot of offloading. [Media player apps are an example.] PowerDVD on Ion can max out the CPU, but turn on GPU acceleration and you’ll drop to about 30%.