Before getting into the details, I thought to start with CPU concepts and technologies. This is to make sure that we are all at the same baseline.
PS: You will find that many of my notes and my understanding of CPU virtualization are based on Frank Denneman's posts, which I used as a reference in addition to VMware docs and technical papers. Many thanks to Frank :)
Thread
A thread is an ordered sequence of instructions that tells the computer what to do. Let's examine a machine with four cores.
Each process running in the OS initiates threads that are executed by the processors. In our example below, the WMI Provider Host service is initiating 10 threads at the moment; those are divided among the 4 cores for execution. Also, you can notice the value under the CPU column is 04. This means those 10 threads are occupying 4% of the CPU since the last sampling time (they were given 4% of the execution cycles, which is the amount listed under the CPU Time column).
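If you want to observe this yourself, here is a minimal sketch in Python using the third-party psutil package (pip install psutil). It spawns a few busy worker threads and reports the process's thread count and accumulated CPU time; the worker count and durations are arbitrary values chosen for illustration (note that CPython's GIL keeps pure-Python threads from truly running in parallel, but they still show up as OS threads):

import threading
import time

import psutil  # third-party: pip install psutil

def busy_worker(stop_at):
    # Burn CPU cycles until the deadline so the thread shows up as load.
    while time.time() < stop_at:
        pass

deadline = time.time() + 3          # let the workers run for ~3 seconds
workers = [threading.Thread(target=busy_worker, args=(deadline,))
           for _ in range(4)]
for w in workers:
    w.start()

me = psutil.Process()               # the current process
time.sleep(1)
print("thread count:", me.num_threads())
print("CPU times:", me.cpu_times())         # accumulated user/system time
print("CPU percent:", me.cpu_percent(0.5))  # % over a 0.5s sampling window

for w in workers:
    w.join()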
You can use CPU affinity to pin a process to one or more cores; the other cores won't execute threads from this process.
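In code, it can look like this minimal sketch with psutil (the PID 1234 is a hypothetical placeholder for whatever process you want to pin; cpu_affinity() is available on Windows and Linux):

import psutil

p = psutil.Process(1234)                       # hypothetical PID of the target process
print("current affinity:", p.cpu_affinity())   # e.g. [0, 1, 2, 3] on a four-core box
p.cpu_affinity([0, 1])                         # restrict the process to cores 0 and 1
print("new affinity:", p.cpu_affinity())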
Similarly, you can increase the priority of a process so that it is given more execution cycles when contention takes place. Without contention, priority has no effect.
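Raising priority works the same way; a sketch for Windows, where psutil exposes the priority classes (on Linux you would pass a niceness value such as -5 instead):

import psutil

p = psutil.Process()                    # the current process
p.nice(psutil.HIGH_PRIORITY_CLASS)      # Windows priority class constant
print("priority:", p.nice())
# Remember: this only matters under contention; an idle system
# gives every runnable thread the cycles it asks for anyway.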
Symmetric Multi-Processing (SMP)
We mentioned in the previous section that each process initiates threads which can be executed by all cores. Also, we mentioned in our example that the WMI Provider Host process was initiating 10 threads to be executed by the 4 cores. This is called Symmetric Multi-Processing.
SMP (symmetric
multiprocessing) is the processing of programs by multiple processors that
share a common OS and memory. In SMP, the processors share memory and the I/O
bus or data path. A single copy of the OS is in charge of all the processors.
Windows balances the threads across the CPUs as evenly as possible among all programs.
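You can watch this balancing from user space; a small sketch that samples per-CPU utilization over one second with psutil (the interval is an arbitrary choice):

import psutil

# Utilization of each logical CPU over a one-second sampling window.
per_cpu = psutil.cpu_percent(interval=1, percpu=True)
for cpu, pct in enumerate(per_cpu):
    print(f"CPU {cpu}: {pct:.1f}%")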
Ordinarily, a limitation of SMP is that as cores are added, the shared bus or data path to memory gets overloaded and becomes a performance bottleneck. In other words, what is the benefit of adding more processors and increasing processing speed if memory can't keep up due to memory or bus limitations?
Non-Uniform Memory Access (NUMA)
The main purpose of NUMA is to solve this limitation of SMP. NUMA creates a cluster of processors sharing a local memory unit, so that not every data access has to travel on the main bus.
A typical cluster consists of four cores interconnected on a local bus to a shared memory (called an L3 cache), with the whole cluster residing on a single card. This unit can be added to similar units to form a NUMA SMP system, in which a common SMP bus interconnects all of the clusters. To an application program running on such a system, all the individual processor memories look like a single memory, i.e. the application can't distinguish whether it is running on a NUMA SMP or a plain SMP system.
Here is a more detailed look at a NUMA SMP system with 8 cores and Hyper-Threading (HT) enabled (HT will be covered in the next section).
When a processor looks for data at a certain memory address, it first looks in the L1 cache on the core itself, then in a somewhat larger L2 cache near the core, and then in the third level of cache that the NUMA configuration provides, before seeking the data in the "remote memory" located near the other cores.
Therefore, with a NUMA system there is still a possibility that a core reads from remote memory (the shared L3 cache). This adds latency due to the extra path to the remote memory. The OS can play a vital role here by making sure that the memory it allocates is local to the core that is going to execute the thread, to avoid loading the main SMP bus.
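To see how the OS exposes this topology, here is a sketch that lists each NUMA node and its local CPUs; it assumes a Linux box, where the kernel publishes this under /sys/devices/system/node (Windows exposes the same information through its own APIs):

import glob
import os

# Each node* directory is one NUMA node; its cpulist file names the local CPUs.
for node in sorted(glob.glob("/sys/devices/system/node/node*")):
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    print(f"{os.path.basename(node)}: CPUs {cpus}")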
Note: Intel's Nehalem and AMD's veteran Opteron are NUMA architectures.
Simultaneous Multi-Threading (SMT)
It's also called Hyper-Threading (HT). It's a technology that divides a physical core into two logical processors, which can execute instructions from two independent threads simultaneously.
As we all know, a single core can't normally execute two threads simultaneously, which is very logical; this is the reason why we measure the CPU utilization of each thread based on CPU time and execution cycles. With SMT (HT), this becomes doable.
One point to make clear: HT won't double the performance of the system, because both logical processors share most of the core's resources, such as the memory caches and functional units. However, you will see an increase in performance due to better utilization of otherwise idle resources, leading to greater throughput. On the other hand, some applications might face performance degradation because of those shared resources.
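A quick way to check whether SMT/HT is enabled on a box is to compare logical and physical core counts; a minimal sketch with psutil:

import psutil

logical = psutil.cpu_count(logical=True)     # logical processors (HT threads)
physical = psutil.cpu_count(logical=False)   # physical cores
print(f"{physical} physical cores, {logical} logical processors")
if logical and physical and logical > physical:
    print("SMT/HT appears to be enabled.")
else:
    print("SMT/HT appears to be disabled or unsupported.")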