Summary: Linux Scheduler: A decade of wasted cores - Part 1 - What is NUMA?

I have been reading Hacker News (HN) for the past 3 years and I love it for its amazing content and discussions.

Last month, a research paper titled 'The Linux Scheduler: a Decade of Wasted Cores' was trending on the front page of HN. As someone who is interested in systems, I thought it would be a good idea to read this 16-page paper. I spent a good amount of time learning about the different topics it touches on. This is the first post in a series in which I will try to summarize the paper.

What is NUMA?

A CPU socket is the physical connector that connects a CPU chip to the motherboard. Most laptops and desktops have only one CPU socket, but that is not the case with modern servers. Most modern server configurations have 2 or more CPU sockets. Take this SeaMicro system as an example. It has 64 quad-core processors, which means that it has 64 sockets, each holding a quad-core processor.

AMD 20 socket test board

NUMA stands for Non-Uniform Memory Access. NUMA is used in a symmetric multiprocessing (SMP) system. An SMP system is a "tightly-coupled," "share everything" system in which multiple processors working under a single operating system access each other's memory over a common bus or "interconnect" path. Ordinarily, a limitation of SMP is that as microprocessors are added, the shared bus or data path gets overloaded and becomes a performance bottleneck. NUMA adds an intermediate level of memory shared among a few microprocessors so that all data accesses don't have to travel on the main bus.

That means each CPU can directly access some part of the memory, which is its local memory (in addition to its L1, L2 and L3 caches). It can also access the local memory of another processor, which is remote memory for it, but that involves some extra latency. This latency depends on the number of hops it takes to reach the remote memory. Take this configuration as an example.

If Core 1 in CPU 1 has to access memory that is local to CPU 2, it takes 1 hop, but if it has to access memory that is local to CPU 4, it has to go through 2 hops.
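To make the hop counting concrete, here is a minimal sketch in Python. The `LINKS` table is my own assumption about how the four CPUs are wired (each CPU linked directly to two others, with no direct link between CPU 1 and CPU 4); it is chosen only to match the 1-hop and 2-hop cases above, and a real interconnect may look different. The `hops` function is a plain breadth-first search, not anything taken from the kernel.

```python
from collections import deque

# Assumed interconnect, for illustration only: CPU -> directly linked CPUs.
# CPU 1 and CPU 4 are not directly linked, so traffic between them
# has to pass through CPU 2 or CPU 3.
LINKS = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4],
    4: [2, 3],
}


def hops(links, src, dst):
    """Return the minimum number of interconnect hops from src to dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        cpu, dist = queue.popleft()
        if cpu == dst:
            return dist
        for neighbour in links[cpu]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError(f"no path from CPU {src} to CPU {dst}")


if __name__ == "__main__":
    for target in (1, 2, 4):
        print(f"CPU 1 -> memory of CPU {target}: {hops(LINKS, 1, target)} hop(s)")
```

A result of 0 hops means the access is local; anything above that is a remote access, and more hops generally means more latency.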

So if you are hosting a VM on this server and the server has 64 GB of RAM, it would be a good idea to limit the VM to what a single CPU provides: 4 cores and 64/4 = 16 GB of RAM, so that all the memory allocated to the VM is local to the cores it runs on.
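The same arithmetic can be written as a small helper. This is only a sketch: the function name `numa_local_vm_limit` is made up for this post, and it assumes the RAM is split evenly across the sockets, which is not guaranteed on real hardware.

```python
def numa_local_vm_limit(total_ram_gb, sockets, cores_per_socket):
    """Return (cores, ram_gb) for the largest VM that fits on one socket.

    Assumes RAM is divided evenly between sockets (an illustration-only
    simplification).
    """
    if sockets <= 0 or cores_per_socket <= 0:
        raise ValueError("sockets and cores_per_socket must be positive")
    return cores_per_socket, total_ram_gb / sockets


if __name__ == "__main__":
    cores, ram_gb = numa_local_vm_limit(total_ram_gb=64, sockets=4, cores_per_socket=4)
    print(f"Keep the VM within {cores} cores and {ram_gb:g} GB of RAM")
```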

We call a group of processors that share the same local memory a NUMA node. On a system with NUMA nodes, the kernel has to schedule multi-threaded applications intelligently so that memory access latency is minimized. With the increasing number of NUMA nodes and the complexity of their interconnects, scheduling multi-threaded applications is a complex problem, and the implementation is susceptible to bugs creeping in.
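If you want to see the NUMA nodes on your own Linux machine, the kernel exposes them under `/sys/devices/system/node`, with one `nodeN` directory per node and a `cpulist` file inside each. The sketch below reads those files; the helper names `parse_cpulist` and `numa_nodes` are my own, and the parser assumes the simple comma-and-range format (for example `0-3,8-11`). On a machine without that directory it just returns an empty mapping.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a CPU list such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list, e.g. a node with memory but no CPUs
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(base="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids; {} if no node info is found."""
    root = Path(base)
    nodes = {}
    if not root.is_dir():
        return nodes
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node{node_id}: CPUs {cpus}")
```

A laptop or desktop with a single socket will typically show just one node, which is why NUMA effects mostly matter on multi-socket servers.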

I will write about the bugs in the Linux scheduler in the next part of the series.

Discussions and questions are encouraged in the comments, on Hacker News, and on /r/linux.

You can subscribe to this blog via RSS or by using an RSS-to-email service.

