Working Set

Tam Lin · May 28, 2016

In an old thread, now deleted, Blizzard outlined the best active system. He thought it took a 12-core Xeon processor with one CPU core dedicated to each of eight audio channels and the remaining four cores dedicated to a real-time OS. Based on my experience programming high-performance computers, I thought his proposed configuration was absurd and said so. However, at the time, I didn’t have any numbers to support my contention, so I let it slide.

I recently studied CPU affinity in detail and gathered some numbers. The software is a foobar DSP plugin that resamples stereo PCM to 512Fs. I timed three configurations on two different CPUs: a 3.0 GHz Q9650 (Core 2) and a 4.0 GHz I7-6700K. The configurations are:

A. Any available core chosen by the OS.
B. One dedicated core per channel.
C. Either of two cores, selected by function.

Configuration A yielded CPU utilizations of 35% and 12% for the Core2 and I7, respectively.
Configuration B yielded CPU utilizations of 37% and 14%.
Configuration C yielded CPU utilizations of 30% and 8%.

It does not surprise me that configuration B is the slowest, because a dedicated core per channel keeps the scheduler from making the best use of the available system resources. Configuration A suffers because, with only two channels, it’s hard to utilize more than two cores at once. Configuration C performs best because it has the smallest per-thread working set. Unfortunately, working set is not part of an audiophile’s vocabulary or understanding. Without understanding, audiophiles are doomed to make bad choices.
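For readers who want to see what the three configurations might look like as affinity settings, here is a rough Python sketch. It is not the plugin's code. The thread names, the split of each channel into two stages, and the choice of cores 0 and 1 are all assumptions made for illustration; the post only says that configuration C picks one of two cores by function.

```python
from typing import Dict, FrozenSet

CHANNELS = ("left", "right")  # stereo, as in the benchmark


def plan_affinity(config: str, n_cores: int) -> Dict[str, FrozenSet[int]]:
    """Return, for each (hypothetical) worker thread, the cores it may use."""
    all_cores = frozenset(range(n_cores))
    if config == "A":
        # A: no restriction, the OS scheduler picks any core.
        return {ch: all_cores for ch in CHANNELS}
    if config == "B":
        # B: one dedicated core per channel.
        return {ch: frozenset({i}) for i, ch in enumerate(CHANNELS)}
    if config == "C":
        # C (assumed reading): the core is chosen by what the thread does,
        # not by which channel it serves. Stage names are made up.
        plan = {}
        for ch in CHANNELS:
            plan[ch + ".stage1"] = frozenset({0})
            plan[ch + ".stage2"] = frozenset({1})
        return plan
    raise ValueError("unknown configuration: " + config)


def to_mask(cores) -> int:
    """Convert a set of core indices to a bit mask (bit n = core n)."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask
```

A bit mask like the one returned by to_mask is the form the Windows SetThreadAffinityMask call expects; on Linux, the core set itself could be passed to os.sched_setaffinity. In the assumed reading of C, both channels share the same code on each core, which is one way the per-core working set could end up smaller than in B.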

Steve Williams · May 28, 2016

That's valuable information Tam. Thanks for taking the time to do those tests. Very interesting results.

Geardaddy · May 28, 2016

So what explicit recommendations can you make and why?

Tam Lin · May 28, 2016

Study the flow of the code and choose affinity to minimize the size of the working set. That will reduce the number of cache and TLB misses. With modern CPUs, cache and TLB misses are a huge penalty because they can stall the entire execution pipeline. Reworking the code in this way also increases the opportunities for parallelization and reduces serialization bottlenecks.
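As a crude illustration of why the size of the working set matters, the sketch below checks which cache level a given working set would fit in. The cache sizes are the published per-core L1 data and L2 sizes and the shared L3 size of the I7-6700K; the example working-set sizes in the usage note are hypothetical, and a real cache also holds code and is limited by associativity, so treat this as a first approximation only.

```python
KIB = 1024
MIB = 1024 * KIB

# I7-6700K cache sizes (not from the post): L1d and L2 are per core, L3 is shared.
CACHES = [("L1d", 32 * KIB), ("L2", 256 * KIB), ("L3", 8 * MIB)]


def smallest_fitting_level(working_set_bytes: int, caches=CACHES) -> str:
    """Return the first (fastest) cache level that can hold the working set."""
    for name, size in caches:
        if working_set_bytes <= size:
            return name
    return "DRAM"  # spills to main memory, so misses become frequent
```

For example, smallest_fitting_level(24 * KIB) gives "L1d", while a thread touching 16 MiB of data gets "DRAM" and will pay for main-memory accesses on most misses.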

Tam Lin · Jun 27, 2016

Working set was an important metric in the early days of virtual memory. Back then, computer memory was measured in kilobytes and every byte mattered. Over the years as memory sizes increased, working set became less important. However, with today’s multi-core CPUs, total working set, including caches, must be considered for high performance systems.

Consider Blizzard’s recommended system. It sounds good to the macho, overkill, audiophile mindset: 12-core Xeon with a dedicated CPU core for each audio channel. However, with good programming practice and attention to total working set, better performance can be achieved with an I5.

In Blizzard’s system, the dedicated cores will be operating in lockstep, in parallel. Each core will have an identical copy of the operating code and data resident in its cache at all times. Each time a thread is blocked, by a cache or TLB miss, for example, the associated core will stop. And so will each of the other cores, because they are running an identical program execution path. Therefore, each cache miss results in eight cores stopping and eight cache lines being loaded. But that’s not all. Each audio sample is a 32-bit floating-point number and each 8-channel sample frame is 32 bytes. Thus, there are two sample frames per 64-byte cache line. For each audio sample cache miss, eight cache lines will be loaded but only two audio frames will be processed. That is a very poor utilization of available bandwidth.
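The cache-line arithmetic above can be checked with a few lines of Python. The sketch assumes the samples are stored as interleaved frames and that each dedicated core touches only its own channel's samples; neither detail is stated outright in the post.

```python
BYTES_PER_SAMPLE = 4   # 32-bit float
CHANNELS = 8
CACHE_LINE = 64        # bytes

FRAME_BYTES = BYTES_PER_SAMPLE * CHANNELS      # 32 bytes per 8-channel frame
FRAMES_PER_LINE = CACHE_LINE // FRAME_BYTES    # 2 frames per cache line


def line_traffic(cores: int, channels_per_core: int):
    """Bytes loaded vs. bytes actually used when each core pulls in one line."""
    loaded = cores * CACHE_LINE
    used = cores * channels_per_core * FRAMES_PER_LINE * BYTES_PER_SAMPLE
    return loaded, used, used / loaded


# Eight dedicated cores, one channel each: 512 bytes loaded, 64 used.
dedicated = line_traffic(cores=8, channels_per_core=1)
# One core handling all eight channels: 64 bytes loaded, 64 used.
shared = line_traffic(cores=1, channels_per_core=8)
```

Under these assumptions the dedicated-core layout uses one eighth of the bytes it loads, while a single core working through all eight channels uses every byte of the line.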

What I just described is apparent from the benchmark numbers I posted. However, those numbers were for two cores; eight cores will be four times worse. The smart way is either one thread per channel, all running on the same core, or two sequential threads per channel, pipelined across two cores.
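The second option, two sequential threads per channel pipelined across two cores, has roughly the following shape. This is a structural sketch only: the stage functions in the usage note are placeholders, not the plugin's processing, and CPython threads do not run CPU-bound work in parallel, so a real implementation would be native code with each stage pinned to its own core.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def _stage(func, inbox, outbox):
    """Apply func to every block from inbox and pass the result downstream."""
    while True:
        block = inbox.get()
        if block is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(block))


def run_pipeline(blocks, first, second):
    """Run blocks through two stages, each stage in its own thread."""
    q_in = queue.Queue(maxsize=4)    # small bounded queues keep few blocks in flight
    q_mid = queue.Queue(maxsize=4)
    q_out = queue.Queue()            # unbounded so the stages never block on output
    workers = [
        threading.Thread(target=_stage, args=(first, q_in, q_mid)),
        threading.Thread(target=_stage, args=(second, q_mid, q_out)),
    ]
    for w in workers:
        w.start()
    for block in blocks:
        q_in.put(block)
    q_in.put(_DONE)
    results = []
    while True:
        item = q_out.get()
        if item is _DONE:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

As a toy usage, run_pipeline([[1, 2], [3]], repeat_twice, halve) with a first stage that repeats each sample twice and a second that halves it returns the blocks in their original order. Each stage only ever touches its own code and tables, which is the point of splitting the work by function.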

In case you missed the details, the benchmark I used resamples a stereo PCM audio file to 24,576,000 samples per second, and it runs just fine on a 1.7 GHz I5-4210U. Although Intel calls that CPU an I5, it has two cores with hyper-threading, like an I3, and turbo-boost, like an I7.
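For anyone checking the figure, 24,576,000 is 512 times a 48 kHz base rate. The short sketch below reproduces it and also estimates the output data rate; the 32-bit float output format is an assumption, since the post does not say what sample format the benchmark writes.

```python
OVERSAMPLING = 512  # "512Fs"


def output_rate(base_rate_hz: int) -> int:
    """Samples per second, per channel, after resampling to 512Fs."""
    return OVERSAMPLING * base_rate_hz


rate_48k = output_rate(48_000)    # 24,576,000
rate_44k1 = output_rate(44_100)   # 22,579,200

# Assumed: stereo output, 4 bytes (32-bit float) per sample.
bytes_per_second = rate_48k * 2 * 4   # 196,608,000
```

At the assumed sample format that is a bit under 200 MB of output per second, which gives a sense of how much data the resampler has to move.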

Tam Lin · Feb 19, 2017

I recently observed a vivid demonstration of the effects of working set. I was optimizing an experimental foobar plugin when I noticed the CPU utilization seemed to follow the volume envelope of the music. When the music was loud, the CPU utilization was high. When the music was quiet, the CPU utilization was low. The CPU utilization was lowest during inter-track silence. Of course, the CPU utilization was zero when the music stopped.

See the attached screenshot. The music starts just after the first vertical grid line. The track is “Prayer After the Canon” from Arvo Pärt’s “Kanon Pokajanen.” That segues to “Smooth” with Rob Thomas and Carlos Santana, at grid line 29. Both the CPU utilization and the VU meter are pegged for the duration. After a couple of minutes, at grid line 33, I turned off a feature in the plugin and the CPU utilization dropped to a more reasonable level. After another minute or two, at grid line 38, I turned the music off.

What is going on? Every sample passes through the same CPU instruction stream. The feature I turned off changes just one CPU instruction, out of hundreds, per sample. Foobar reads the music file from the disk. After FLAC decoding, the samples go to my plugin. There they are resampled to 705.6 kHz or 768 kHz, depending on the base sample rate, and replicated 32 times. Then each copy goes through a different 16 MB lookup table. The DAC uses 32 PCM1704K chips per channel. Adjusting the sample data fed to each of the parallel chips corrects linearity errors in the summed output current.

It is the lookup that expands the plugin’s working set to as much as 1 GB! In current Intel CPUs, each core has its own dedicated L1 and L2 caches, but all cores share a common L3. The lookup thrashes L3 such that nearly every table access involves reading from main memory, which is very slow.
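The 1 GB figure follows from the table sizes. The sketch below works it out, assuming two channels with a separate set of 32 tables each and taking 1 MB as 2^20 bytes; the 8 MB L3 used for comparison is the I7-6700K's and is also an assumption, since this post does not name the CPU.

```python
MIB = 1024 * 1024

TABLES_PER_CHANNEL = 32    # one lookup table per PCM1704K chip
TABLE_BYTES = 16 * MIB     # 16 MB per table
CHANNELS = 2               # assumed: stereo, separate tables per channel
L3_BYTES = 8 * MIB         # assumed: I7-6700K shared L3

per_channel = TABLES_PER_CHANNEL * TABLE_BYTES   # 512 MB
total = per_channel * CHANNELS                   # 1 GB
times_larger_than_l3 = total // L3_BYTES         # 128
```

With the tables 128 times larger than the assumed L3, lookups that land all over the tables have little chance of hitting in cache.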