The same topics that were treated for single processors in our elementary
system architecture lectures are treated here for multicore processors,
namely processors, compilers, and operating system kernels. The following
complications arise and are addressed in the lectures.
1) The manufacturers' documentation of the instruction sets of high-end
processors runs to thousands of pages, and any document with so many pages is
almost necessarily contradictory and incomplete. We have to boil this down
to a manageable size. Where the manuals do not provide all necessary details,
we have to make educated guesses by reverse engineering parts of the
processor's hardware. This concerns in particular the memory system.
2) In the full instruction set, many parts of the machine's hardware are
visible to the programmer that one would prefer to be hidden. This concerns in
particular: caches, the states of the cache coherence protocols used, store
buffers, and translation look-aside buffers (TLBs) of memory management units
(MMUs). Moreover, in general a multicore processor does NOT interleave
instructions of the sequential instruction set architecture (ISA); if it is
not properly configured and used, it interleaves a sort of microinstructions.
We will show how to configure a multicore processor such that it simultaneously
- runs in translated mode
- interleaves instructions of the ISA working on a sequentially consistent
shared memory.
3) We have to specify the semantics of parallel C. In general, compiled C
code running on a multicore machine does NOT interleave small-step semantics
steps of sequential C threads. We formalize the subset of parallel C known as
'structured parallel C' and show how to compile it such that it runs on the
machine model exhibited in 2). In particular, we have to consider here
volatile variables and their treatment by optimizing compilers.
4) In order to give semantics to assembler portions of C programs, we have to
specify the behavior of optimizing compilers for multicore machines, at least
as far as allocation functions are concerned. We give specifications which
guarantee that, when a C variable X is accessed by assembler code, its
current value (a nontrivial concept) is stored at &X (with an optimizing
compiler this is in general not the case).
5) We specify hypervisors for multicore machines. Hypervisors are operating
system (OS) kernels whose user processes are so-called partitions. These are
simply virtual multicore processors which are allowed to run in system mode,
i.e. in translated mode; thus each partition of a hypervisor can run its own
OS. The specification of hypervisors is very CVM-like (CVM is the generic OS
kernel from the elementary architecture lectures).
6) Hypervisors have to compose the translation of the host machine with the
translation of the guest partitions. On many processors there is
'virtualization support' for this in the form of 'nested page tables'; this
is hardware supporting the composition of two translations. If we have such
hardware and if the guest partitions do not run hypervisors but only
operating systems (i.e. they have only a single level of translation), then
hypervisor construction is very similar to the elementary case.
7) One can also construct hypervisors on processors without nested page
tables; after all, the composition of two translations is a translation. This
requires implementing page tables for the composed translation in the form
of a C data structure called 'shadow page tables' and redirecting the MMUs
to walk these data structures (which are simultaneously updated by other
cores). This leads to two very exciting subjects:
7.1. the design and correctness proof of a parallel page table algorithm and
7.2. the semantics of the parallel programming model in which this algorithm
is programmed. In this model we have
- C portions
- assembly portions
- user visible translations (hypervisors have to inspect guest page tables;
this happens to be implemented by memory relocation)
- MMUs traversing (and even writing) the shadow page tables (a C data
structure) in parallel with the C threads.
We plan to produce lecture notes. They will be ready one to two weeks
after the lectures.
Prerequisites: ideally computer architecture and system architecture as read by the author. If you have heard neither of these lectures, prepare to do some serious additional reading; we will point you to the appropriate texts.