# A CASE FOR TRANSFORMING PARALLEL RUNTIMES INTO OS KERNELS



Kyle Hale

Peter Dinda

halek.co v3vee.org presciencelab.org xstack.sandia.gov/hobbes










# THE CURRENT OS/RUNTIME MODEL



# THIS MODEL HAS SOME ISSUES

# ARE PROVIDED KERNEL ABSTRACTIONS THE RIGHT ONES?




# **RESTRICTED ACCESS TO HARDWARE**




#### What are the consequences?

# WORKAROUNDS & COMPROMISES

# **DUPLICATED FUNCTIONALITY**

## If the runtime had

# **FULL HARDWARE ACCESS**

# **CONTROL OVER KERNEL ABSTRACTIONS**

#### we could mitigate these issues


| user mode   |                |
|-------------|----------------|
| kernel mode | parallel app   |
|             | hybrid runtime |
|             | node HW        |

- The runtime IS the kernel, built within a kernel framework
- Everything is in kernel space
- HRT has full access to the hardware
- HRT can control HW access
- HRT can pick its own abstractions





#### NAUTILUS

- We ported an existing, complex parallel runtime: LEGION (legion.stanford.edu)
- We ported our framework to cutting-edge many-core hardware: XEON PHI
- We evaluated our port on a standard HPC benchmark: HPCG

#### XEON PHI + NAUTILUS + LEGION + HPCG

# NAUTILUS

#### user mode

#### kernel mode

| parallel app  |       |        |        |         |           |        |      |         |
|---------------|-------|--------|--------|---------|-----------|--------|------|---------|
| runtime (HRT) |       |        |        |         |           |        |      |         |
| threads       | sync. | paging | events | HW info | bootstrap | timers | IRQs | console |
| Hardware      |       |        |        |         |           |        |      |         |

Kernel: Nautilus primitives & utilities (HRT can use or not use any of them)

#### **MINIMAL LIGHTWEIGHT PRIMITIVES**

#### **FULL HARDWARE ACCESS**

#### **VERY FAST BOOT TIMES**








very simple modification: give the runtime control over interrupts in its task scheduler

#### → modest speedups

**MUCH more to come here**

#### in addition to Legion, we have 2 other high-level, parallel runtimes running as HRTs

**NESL**: VCODE interpreter running as HRT

NDPC: home-grown, **co-designed** HRT

#### **INTEGRATING THE HRT WITH A LEGACY OS**

## THE HYBRID VIRTUAL MACHINE



## **LINUX FORK + EXEC** ~ 714µs

## **HVM + HRT CORE BOOT** ~ 61µs

## HRT boot is CHEAP!
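For a rough feel of the fork + exec side of this comparison, the sketch below times a fork, exec, and wait cycle on a Linux host. It is only an illustration and not the method behind the numbers above: the function name `fork_exec_latency_us` is hypothetical, the measurement includes the child's run time and the parent's wait, and Python adds interpreter overhead, so its output should not be compared directly with the 714µs figure.

```python
import os
import shutil
import time


def fork_exec_latency_us(program, iterations=50):
    """Mean wall-clock microseconds to fork, exec `program`, and reap the child."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            # Child: replace the process image; exit hard if exec fails.
            try:
                os.execv(program, [program])
            finally:
                os._exit(127)
        os.waitpid(pid, 0)  # Parent: wait for the child to finish.
        total += time.perf_counter() - start
    return total / iterations * 1e6


if __name__ == "__main__":
    # Assumption: a trivial `true` binary is available on this system.
    prog = shutil.which("true") or "/bin/true"
    print(f"fork+exec+wait: {fork_exec_latency_us(prog):.0f} us per iteration")
```

A C version calling `fork` and `execve` directly would remove the interpreter overhead and give a tighter baseline.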

## **NAUTILUS + XEON PHI**

| RANK | SITE | SYSTEM | CORES | RMAX (TFLOP/S) | RPEAK (TFLOP/S) | POWER (KW) |
|------|------|--------|-------|----------------|-----------------|------------|
| 1    | National Super Computer Center in Guangzhou, China | **Tianhe-2 (MilkyWay-2)** - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT | 3,120,000 | 33,862.7 | 54,902.4 | 17,808 |
| 7    | Texas Advanced Computing Center/Univ. of Texas, United States | Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P, Dell | 462,462 | 5,168.1 | 8,520.1 | 4,510 |
| 18   | DOE/SC/Pacific Northwest National | cascade - Atipa Visione IF442 Blade Server, | 194,616 | 2,539.1 | 3,388.0 | 1,384 |
| 33   | United States | E5-2670 8C 2.600GHz, Infiniband FDR, Intel Xeon Phi 5110P, Hewlett-Packard | 77,520 | 770.0 | 1,341.1 | 510 |
| 64   | Tulip Trading, Australia | **C01N** - SuperBlade SBI-7127RG-E, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, Intel Xeon Phi 7120P, Supermicro | 160,600 | 798.3 | 3,164.5 | 619 |
| 69   | Intel, United States | Endeavor - Intel Cluster, Intel Xeon E5-2697v2 12C 2.700GHz, Infiniband FDR, Intel | 51,392 | 758.9 | 933.5 | 54 **387.2** |

```
[root@v-test-t620 nautilus]# philix -d -b weever -k nautilus.bin
```




#### A CASE FOR TRANSFORMING PARALLEL RUNTIMES INTO OS KERNELS

my website halek.co



our lab presciencelab.org

#### follow us here for:

- experience report on building an OS for the Xeon Phi
- philix release (soon)

## the Hobbes project **xstack.sandia.gov/hobbes**


## BACKUPS





| user mode   |                |  |
|-------------|----------------|--|
| kernel mode | parallel app   |  |
|             | hybrid runtime |  |
|             | node HW        |  |













#### **HPCG IN LEGION ON XEON PHI**




#### port of NESL

- nested data parallel language aimed at vector machines
- we can run unmodified NESL programs in our kernel-mode VCODE interpreter

#### the first co-designed HRT: NDPC

- Nested Data Parallelism in C/C++
- subset of NESL
- fork/join parallelism over flattened vector processing
- allows us to explore runtime/kernel co-design
  - e.g. smart kernel-mode thread fork
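To make "flattened vector processing" concrete, here is a minimal sketch of the general flattening idea behind nested data parallelism: a nested vector becomes one flat value vector plus a vector of segment lengths, and per-segment operations then run over the flat data. The names `flatten` and `segmented_sum` are hypothetical and chosen for illustration; this is not NESL's VCODE or the NDPC implementation, and a real runtime would process the segments in parallel instead of in a sequential loop.

```python
def flatten(nested):
    """Turn a list of lists into a flat value vector plus a segment-length vector."""
    values = [x for segment in nested for x in segment]
    lengths = [len(segment) for segment in nested]
    return values, lengths


def segmented_sum(values, lengths):
    """Sum each segment of the flat vector; one result per original inner list."""
    sums = []
    start = 0
    for n in lengths:
        # Each segment is independent, so a parallel runtime could fork one
        # task per segment here and join on the results.
        sums.append(sum(values[start:start + n]))
        start += n
    return sums


if __name__ == "__main__":
    values, lengths = flatten([[1, 2, 3], [], [4, 5]])
    print(values, lengths)                 # [1, 2, 3, 4, 5] [3, 0, 2]
    print(segmented_sum(values, lengths))  # [6, 0, 9]
```

Because the irregular nesting is carried only by the length vector, the work on `values` stays regular, which is what makes this representation a good fit for vector hardware.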



- follow our blog
- use our tool (philix) to boot it and leverage the MPSS stack

## find out more @ haltloop.com