Thursday, 15 November 2012

CPU Virtualization - Prerequisites


Before getting into the details, I thought I would start with CPU concepts and technologies, to make sure we are all starting from the same baseline.

PS: You will find that many of my notes and observations about CPU virtualization are based on Frank Denneman's posts, as I used them as a reference in addition to VMware docs and technical papers. Many thanks to Frank :)

Thread
It is an ordered sequence of instructions that tells the computer what to do. Let's examine a machine with four cores.
[Screenshot: Task Manager showing the WMI Provider Host process with 10 threads on a 4-core machine]
Each process running in the OS initiates threads that are executed by the processors. In the screenshot above, the WMI Provider Host service is running 10 threads at the moment, and those threads are divided among the 4 cores for execution. Also notice that the value under the CPU column is 04: those 10 threads have consumed 4% of the CPU's execution cycles since the last sampling interval (the cumulative total is listed under the CPU Time column).
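To make this concrete, here is a minimal Python sketch (all names are illustrative) of one process starting 10 threads; the OS scheduler is free to spread them across the available cores. Note that CPython's GIL serializes Python bytecode, so this illustrates thread creation and scheduling rather than a parallel speed-up:

import os
import threading

def busy_work(n):
    # A small CPU-bound loop; the OS decides which core runs it.
    total = 0
    for i in range(n):
        total += i * i
    return total

print("This machine reports", os.cpu_count(), "logical CPUs")

# One process initiating 10 threads, like WMI Provider Host above.
threads = [threading.Thread(target=busy_work, args=(1000000,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("All 10 threads finished")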
You can use CPU affinity to pin a process to one or more cores; the other cores will not execute threads from that process.
[Screenshot: setting CPU affinity for a process in Task Manager]
Similarly, you can raise the priority of a process so that it is given more execution cycles when contention takes place. Without contention, priority has no effect.
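As a rough illustration of both knobs, here is a sketch that pins the current process to two cores and lowers its priority. os.sched_setaffinity and os.nice are Linux-only APIs (an assumption here; on Windows, Task Manager exposes the same controls through the process context menu):

import os

pid = os.getpid()

# Pin this process to cores 0 and 1; the other cores will not
# execute its threads (CPU affinity).
os.sched_setaffinity(pid, {0, 1})
print("Allowed cores:", os.sched_getaffinity(pid))

# Increase niceness by 5 (a higher nice value means a lower
# priority). Priority only matters when the cores are contended.
os.nice(5)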

Symmetric Multi-Processing (SMP)

We mentioned in the previous section that each process initiates threads which can be executed by any of the cores; in our example, the WMI Provider Host process was running 10 threads spread across 4 cores. This is called Symmetric Multi-Processing.

SMP (symmetric multiprocessing) is the processing of programs by multiple processors that share a common OS and memory. In SMP, the processors share memory and the I/O bus or data path. A single copy of the OS is in charge of all the processors.

Windows balances threads across the CPUs as evenly as possible among all running programs.
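As an example, here is a small sketch of the SMP idea from the application side: a single OS image dispatching work to every processor. Python's multiprocessing is used here because it gives one worker per core:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # The single OS copy schedules these workers across all cores,
    # balancing them as evenly as possible.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(square, range(100))
    print(results[:5])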

Ordinarily, a limitation of SMP is that as cores are added, the shared bus or data path to memory gets overloaded and becomes a performance bottleneck. In other words, what is the benefit of adding more processors and increasing processing speed if memory can't keep up due to memory or bus limitations?

Non-Uniform Memory Access (NUMA)

The main purpose of NUMA is to overcome this limitation of SMP.

NUMA creates clusters of processors, each sharing a local memory unit, so that not all data accesses have to travel on the main bus.
A typical cluster consists of four cores interconnected on a local bus to a shared memory (called an L3 cache), with the whole cluster, including all its components, residing on a single card. Such units can be combined with similar units to form a NUMA SMP system, in which a common SMP bus interconnects all of the clusters. To an application program running on the system, all the individual processor memories look like a single memory; the application cannot distinguish a NUMA SMP system from a plain SMP system.

Here is a more detailed look at a NUMA SMP system with 8 cores and Hyper-Threading (HT) enabled (HT is covered in the next section).
When a processor looks for data at a certain memory address, it first looks in the L1 cache on the core itself, then in the somewhat larger L2 cache near the core, and then in the third level of cache that the NUMA configuration provides, before seeking the data in the "remote memory" located near the other cores.

Therefore, even in a NUMA system there is still a possibility that a core reads from remote memory, which adds latency due to the extra path. The OS can play a vital role here by making sure that the memory allocated for a thread is local to the core that will execute it, avoiding load on the main SMP bus.
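To see this layout on a real machine, here is a sketch that reads the NUMA topology Linux exposes under /sys (assuming a Linux host; on a non-NUMA box you will only see node0):

import glob
import os

# Each node directory represents one NUMA cluster: a set of cores
# plus the memory that is local to them.
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    print(os.path.basename(node), "-> CPUs", cpus)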

Note: Intel's Nehalem and AMD's veteran Opteron are NUMA architectures.


Simultaneous Multi-Threading (SMT)

It's also called Hyper-Threading (HT). It's a technology that presents a single physical core as two logical processors, which can execute instructions from two independent threads simultaneously.

As we all know, a single core can't truly execute two threads at the same instant, which is why we measure each thread's CPU utilization in terms of CPU time and execution cycles.

Using SMT or HT, this becomes possible.

One point to make clear: HT won't double the performance of the system, because both logical processors share most of the core's resources, such as the memory caches and functional units. However, you will often see a performance increase due to better utilization of otherwise idle resources, leading to greater throughput. On the other hand, some applications might suffer performance degradation because of those shared resources.
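A quick way to check whether SMT/HT is active is to compare logical and physical core counts. This sketch assumes the third-party psutil package is installed:

import psutil

logical = psutil.cpu_count(logical=True)    # logical processors (HT threads)
physical = psutil.cpu_count(logical=False)  # physical cores

print(physical, "physical cores,", logical, "logical processors")
if logical and physical and logical > physical:
    print("SMT/Hyper-Threading appears to be enabled")
else:
    print("SMT/Hyper-Threading appears to be disabled")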
