# A Case for Random Shortcut Topologies for HPC Interconnects

M. Koibuchi (National Institute of Informatics, JP)



H. Matsutani, H. Amano (Keio University, JP)



D. F. Hsu (Fordham University)



H. Casanova (University of Hawaii at Manoa)



## Highlight

Objective: design a low-latency topology for HPC networks

- Switch delay dominates network delay (>100 ns/hop)
  - So reduce the average path hops and the diameter

Idea: a classical topology augmented with random shortcut links



- Motivation
- Graph analysis of random shortcut topology
- Simulation evaluation of random shortcut topology
- Discussion of limitations
- Conclusions



# Motivation to Reduce Hop Counts

System Interconnects [Tomkins, 2008]

|                                                | 32,768 32,768<br>F 32 200 |      | 5               | 2019  |                       |      |
|------------------------------------------------|---------------------------|------|-----------------|-------|-----------------------|------|
| System Size<br>Sockets<br>Peak PF<br>TF/Socket |                           |      | 200             |       | 32,768<br>800<br>25.0 |      |
|                                                | Expect                    | Want | Expect          | Want  | Expect                | Want |
| NIC B/W (B/F)                                  | 0.01 - 0.1                | 1.0  | 0.005 -<br>0.03 | 1.0   | 0.025 -<br>0.25       | 1.0  |
| Link B/W (B/F)                                 | 0.01 - 0.1                | 1.0  | 0.005 -<br>0.03 | 1.0   | 0.025 -<br>0.25       | 1.0  |
| MPI Latency (ns)                               | 750 - 1500                | 500  | 500 - 1000      | 400   | 400 - 750             | 300  |
| MPI Throughput<br>(M Msg/s)                    | 20                        | 50   | 80              | 300   | 300                   | 1200 |
| Load/Store<br>(M Msg/s)                        | 75                        | 400  | 150             | 1,600 | 300                   | 6400 |
| Load/Store<br>Latency (ns)                     | 300                       | 100  | 300             | 100   | 300                   | 100  |

1 µs latency across the system [Hemmert, 2008]



Switch delay: >100 ns; link delay: 5 ns/m
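These delays put a ceiling on how many hops a path can take. A minimal sketch of that arithmetic, assuming the 1 µs system-wide figure must cover switch and cable delay only (NIC and software overhead are ignored) and treating the >100 ns switch delay as exactly 100 ns. The function name `max_hops` and the 2 m cable per hop are illustrative assumptions, not values from the slides.

```python
def max_hops(budget_ns, switch_ns, link_ns_per_m, cable_m):
    """Largest hop count whose switch + cable delay fits in the latency budget."""
    per_hop_ns = switch_ns + link_ns_per_m * cable_m
    return int(budget_ns // per_hop_ns)

# 1 us budget, 100 ns per switch (a lower bound in the slides), 5 ns/m links,
# and an assumed 2 m of cable per hop: 110 ns per hop.
print(max_hops(1000, 100, 5, 2))  # -> 9
```

Under these assumptions a path may cross at most 9 switches, which is why both the average hop count and the diameter matter.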



# **Existing HPC topology**

| Company | System<br>[Network] Name            | Max.<br>number<br>of nodes<br>[x # CPUs] | Basic network topology                                      | Injection<br>[Recept'n]<br>node BW<br>in<br>MBytes/s | # of data<br>bits per<br>link per<br>direction | Raw<br>network link<br>BW per<br>direction in<br>Mbytes/sec | Raw<br>network<br>bisection<br>BW (bidir)<br>in Gbytes/s |
|---------|-------------------------------------|------------------------------------------|-------------------------------------------------------------|------------------------------------------------------|------------------------------------------------|-------------------------------------------------------------|----------------------------------------------------------|
| Intel   | ASCI Red<br>Paragon                 | 4,510<br>[x 2]                           | 2-D mesh<br>64 x 64                                         | 400<br>[400]                                         | 16 bits                                        | 400                                                         | 51.2                                                     |
| IBM     | ASCI White<br>SP Power3<br>[Colony] | 512<br>[x 16]                            | BMIN w/8-port<br>bidirect. switches (fat-<br>tree or Omega) | 500<br>[500]                                         | 8 bits (+1<br>bit of<br>control)               | 500                                                         | 256                                                      |
| Intel   | Thunder<br>Itanium2                 | 1,024                                    | fat tree w/8-port<br>bidirectional                          | 928                                                  | 8 bits (+2 control for                         | 1,333                                                       | 1,365                                                    |

Mesh, torus ...

#### Are such non-random topologies latency-sensitive?

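| Company | System<br>[Network] Name                   | Max.<br>number<br>of nodes<br>[x # CPUs] | Basic network topology                                     | Injection<br>[Recept'n]<br>node BW<br>in<br>MBytes/s | # of data<br>bits per<br>link per<br>direction | Raw<br>network link<br>BW per<br>direction in<br>MBytes/s | Raw<br>network<br>bisection<br>BW (bidir)<br>in GBytes/s |
|---------|--------------------------------------------|------------------------------------------|------------------------------------------------------------|------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------|----------------------------------------------------------|
|         |                                            |                                          | with express links                                         |                                                      |                                                |                                                           |                                                          |
| IBM     | ASC Purple<br>pSeries 575<br>[Federation]  | >1,280<br>[x 8]                          | BMIN w/8-port<br>bidirect. switches<br>(fat-tree or Omega) | 2,000<br>[2,000]                                     | 8 bits (+2<br>bits of<br>control)              | 2,000                                                     | 2,560                                                    |
| IBM     | Blue Gene/L<br>eServer Sol.<br>[Torus Net] | 65,536<br>[x 2]                          | 3-D torus<br>32 x 32 x 64                                  | 612.5<br>[1,050]                                     | 1 bit (bit<br>serial)                          | 175                                                       | 358.4                                                    |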

Timothy Pinkston and José Duato, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, 2006

## **Topology Design**

- Latency-sensitive traffic with packets smaller than 3 KB [Hemmert, 2007]
  - Reduce diameter and avg. shortest path hops
    - Switch delay >> link delay
- Enabling high-radix switches
  - Dozens of ports per switch
- Enabling user-defined routing paths
  - By updating routing tables (e.g., InfiniBand, Ethernet)



Myricom

**High-radix Network** 


#### Randomness Makes Graph Shorter [6]

Vertex: person / PC / airport



# Small-world phenomenon

- Social network
- P2P network
- Airport distribution



Its use for HPC interconnects

- Relatively high radix
- More uniform switch degrees
- Rack layout must be considered

#### **Average Shortest Path Hops**



Random links reduce the average shortest path hops

- They also reduce the diameter
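The effect can be reproduced on a small graph. The sketch below is not the authors' generator: it assumes a ring baseline and adds shortcuts as random pairings of switches (one pairing per round, so degrees stay nearly uniform), then measures average shortest path hops and diameter by breadth-first search. The names `ring_with_shortcuts` and `hop_stats` are illustrative.

```python
import random
from collections import deque

def ring_with_shortcuts(n, shortcut_rounds, seed=0):
    """Ring of n switches plus random shortcut links, as a list of adjacency sets."""
    rng = random.Random(seed)
    adj = [{(i - 1) % n, (i + 1) % n} for i in range(n)]
    for _ in range(shortcut_rounds):
        # One random pairing per round: each switch gains at most one shortcut.
        nodes = list(range(n))
        rng.shuffle(nodes)
        for a, b in zip(nodes[::2], nodes[1::2]):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def hop_stats(adj):
    """Return (average shortest path hops, diameter) via BFS from every switch."""
    n = len(adj)
    total, diameter = 0, 0
    for src in range(n):
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
        diameter = max(diameter, max(dist))
    return total / (n * (n - 1)), diameter

if __name__ == "__main__":
    for rounds in (0, 1, 2, 4):
        avg, diam = hop_stats(ring_with_shortcuts(256, rounds))
        print(f"shortcut rounds={rounds}: avg hops={avg:.2f}, diameter={diam}")
```

Running it for increasing `shortcut_rounds` shows how quickly a handful of random links per switch pulls both metrics down from the plain ring.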

# **Topology Scalability**



Randomness is increasingly beneficial as network size increases

## **Choice of Baseline Topologies**



A ring is the best baseline because it leaves room for the largest number of shortcuts
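One way to read this result: at a fixed switch degree, a baseline that consumes fewer ports leaves more ports for random shortcuts. The sketch below does that port accounting, assuming the degree counts switch-to-switch ports only and that the switch count is a power of two for the hypercube. The helper name `shortcut_ports` and the example sizes are illustrative.

```python
def shortcut_ports(degree, n_switches):
    """Ports left for random shortcuts after wiring each baseline topology."""
    baseline_ports = {
        "ring": 2,                                   # two ring neighbours
        "2-D torus": 4,                              # +/- in two dimensions
        "3-D torus": 6,                              # +/- in three dimensions
        "hypercube": n_switches.bit_length() - 1,    # log2(N) for N a power of two
    }
    return {name: max(degree - used, 0) for name, used in baseline_ports.items()}

if __name__ == "__main__":
    for name, free in shortcut_ports(degree=10, n_switches=1024).items():
        print(f"{name}: {free} ports available for shortcuts")
```

With 1,024 switches of degree 10, the ring keeps 8 ports for shortcuts while the hypercube keeps none.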


#### Simulation Environment



**Table 1: Switch & network parameters** 

| Parameter           | Value                                          |
|---------------------|------------------------------------------------|
| Packet length       | 33 flits (1 flit = 256 bits)                   |
| Switching technique | Virtual cut-through                            |
| Traffic pattern     | Uniform, matrix transpose, or bit reversal     |
| Number of VCs       | 2                                              |
| Switch delay        | > 100 ns                                       |
| Link delay          | 20 ns                                          |
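The synthetic traffic patterns in Table 1 (uniform, matrix transpose, bit reversal) map each source to a destination by a simple rule on the node address. A sketch using the textbook definitions follows; the slides do not spell out the exact variants used in the simulator, so the function names and the handling of an odd number of address bits are assumptions.

```python
import random

def bit_reversal(src, bits):
    """Destination whose address is the source address with its bits reversed."""
    dst = 0
    for i in range(bits):
        if src & (1 << i):
            dst |= 1 << (bits - 1 - i)
    return dst

def matrix_transpose(src, bits):
    """Destination obtained by swapping the upper and lower parts of the address.

    For an odd number of bits the lower part is the shorter one (an assumption).
    """
    half = bits // 2
    low = src & ((1 << half) - 1)
    high = src >> half
    return (low << (bits - half)) | high

def uniform(src, n, rng):
    """Uniformly random destination other than the source."""
    dst = rng.randrange(n - 1)
    return dst if dst < src else dst + 1

if __name__ == "__main__":
    rng = random.Random(0)
    src = 0b00010010
    print(bit_reversal(src, 8), matrix_transpose(src, 8), uniform(src, 256, rng))
```

Bit reversal and matrix transpose are permutations, so every node sends to exactly one fixed partner, which stresses particular links far more than uniform traffic does.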

#### **Topology & Routing**

| Topology        | Routing   |
|-----------------|-----------|
| Mesh, Hypercube | Duato     |
| Torus           | DOR       |
| Ring + Random   | irregular |

#### **Accepted Traffic vs Latency**





(a) 256 switches, bit reversal

(b) 512 switches, matrix transpose

- (1) Random shortcuts improve latency by up to 18%
- (2) The benefit grows as the number of shortcuts increases

#### Performance Variability



(a) Fault Tolerance (b) 20 different random instances

A high-radix network makes the random topology robust to faulty links and to variability across randomly generated instances
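A graph-level version of this experiment is easy to sketch. The code below is not the simulator behind the figures: it assumes a ring plus random pairings as the topology, uses average shortest path hops as a stand-in for latency, and removes a random fraction of links to model faults. All function names and default parameters are illustrative.

```python
import random
import statistics
from collections import deque

def build(n, degree, rng):
    """Ring of n switches plus (degree - 2) rounds of random pairings, as an edge set."""
    edges = {frozenset((i, (i + 1) % n)) for i in range(n)}
    for _ in range(degree - 2):
        nodes = list(range(n))
        rng.shuffle(nodes)
        edges.update(frozenset(p) for p in zip(nodes[::2], nodes[1::2]))
    return edges

def avg_hops(n, edges):
    """Average shortest path hops, or None if the graph is disconnected."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n:
            return None
        total += sum(dist.values())
    return total / (n * (n - 1))

def spread(n, degree, instances=20, fault_fraction=0.0, seed=0):
    """Mean and standard deviation of average hops over random instances."""
    rng = random.Random(seed)
    samples = []
    for _ in range(instances):
        edges = sorted(tuple(sorted(e)) for e in build(n, degree, rng))
        faulty = set(rng.sample(edges, int(len(edges) * fault_fraction)))
        hops = avg_hops(n, [e for e in edges if e not in faulty])
        if hops is not None:  # skip instances disconnected by faults
            samples.append(hops)
    return statistics.mean(samples), statistics.pstdev(samples)

if __name__ == "__main__":
    for degree in (4, 8, 16):
        mean, sd = spread(256, degree)
        f_mean, _ = spread(256, degree, fault_fraction=0.05)
        print(f"degree {degree}: {mean:.2f} +/- {sd:.3f} hops, {f_mean:.2f} with 5% faulty links")
```

Comparing the standard deviation and the faulty-link mean across degrees gives a rough feel for why higher radix dampens both effects.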


#### Routing Scalability Issues

- Address and routing-table size at switch
  - InfiniBand LID: 48k
  - General issue regardless of topology
- Computational cost of path search
  - Topology-agnostic deadlock-free routing [Flich, TPDS2012]
    - O(N<sup>2</sup>) or higher
    - Only when initially deploying the system



# Physical Cable Length



InfiniBand link length

- Passive copper: 10 m
- Active copper: 40 m
- Optical: 100 m or more



Random topologies can use the same media. Wiring cost does not increase much.



Parameters (Cray BlackWidow) [Kim, ISCA07]

- 128 nodes/cabinet
- Cabinet footprint: 0.57 m x 1.44 m
- 2 m cable overhead
- Density: 75 nodes/m^2
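These parameters are enough for a rough cable-length estimate. The sketch below assumes cabinets sit on a regular grid with no aisles, that cables follow Manhattan (axis-aligned) paths between cabinets, and that each cable pays the 2 m overhead once. It then picks the shortest-reach InfiniBand medium that covers the distance. The grid model and function names are assumptions, not the layout used for the figure.

```python
CABINET_W_M = 0.57   # cabinet footprint width from the slide
CABINET_D_M = 1.44   # cabinet footprint depth from the slide
OVERHEAD_M = 2.0     # per-cable overhead from the slide

def cable_length_m(cab_a, cab_b):
    """Estimated cable length between cabinets at grid positions (col, row)."""
    dx = abs(cab_a[0] - cab_b[0]) * CABINET_W_M
    dy = abs(cab_a[1] - cab_b[1]) * CABINET_D_M
    return dx + dy + OVERHEAD_M

def medium(length_m):
    """Shortest-reach InfiniBand medium from the slide that covers length_m."""
    if length_m <= 10:
        return "passive copper"
    if length_m <= 40:
        return "active copper"
    return "optical"

if __name__ == "__main__":
    for a, b in [((0, 0), (1, 0)), ((0, 0), (10, 5)), ((0, 0), (40, 20))]:
        length = cable_length_m(a, b)
        print(f"{a} -> {b}: {length:.1f} m, {medium(length)}")
```

A random shortcut between distant cabinets is longer than a neighbour link, but under this model most links in a moderately sized machine room still fall within copper reach.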

#### Conclusions

- Use of random shortcuts in HPC interconnects
  - Ring + random shortcuts is best
  - Advantage of high-radix networks
    - Little variability across random instances and in performance
- A random shortcut topology imposes no constraints on the number of switches or links



Hypercube (Non-random topology)

Random Shortcut Topology (Ring + random shortcuts)