#### Keio University



# Accelerator Design for Various NOSQL Databases

Hiroki Matsutani

Dept. of ICS, Keio University

http://www.arc.ics.keio.ac.jp/~matutani

July 12th, 2016

International Forum on MPSoC for Software-Defined Hardware (MPSoC'16)

### Two competing trends in ICT



**Observation:** Without more energy-efficient solutions, adding more computers for big data becomes harder

**Limitation:** Computers are already very efficient

- Thousands of low-end commodity servers optimized for cost-performance and energy efficiency

We need architectural innovations (rather than relying on Moore's law)

#### The best solution changes depending on I/O intensity

[Figure: four approaches placed on an axis from **I/O intensive** to **compute intensive**]

#### Storage & Virtual Machine (VM) migration

- Big data transfer between servers
- Several GByte to TByte
- → Dynamic 40GbE link w/ free-space optics

#### **NOSQL** accelerator

- Simple & high scalability
- Many memory accesses but little computation
- → NOSQL HW cache using FPGA & 40GbE

#### **Customizable SiP for IoT**

- 3D integration of CPU, memory, sensor, database
- → Wireless 3D stacking

#### In-GPU DB (Graph, Doc)

- Graph DB & document DB (regex search)
- Many GPUs over 10+10Gbps Ethernet

### Structured storages (NOSQLs)

Structured storages (NOSQLs) have good horizontal scalability, but each one is specialized for a specific purpose



### Polyglot Persistence: Mixture of NOSQLs



Different NOSQLs are suitable for tackling different problems

### Multilevel NOSQL cache: FPGA NIC
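
A minimal software sketch of a multilevel lookup, assuming an in-NIC cache sits in front of an in-kernel cache, which in turn sits in front of the user-space NOSQL server. The LRU eviction, the promote-on-hit policy, and the names `LRUCache` and `lookup` are illustrative assumptions, not details taken from the slides.

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU cache standing in for one cache level (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def lookup(key, nic_cache, kernel_cache, database):
    """Return (value, level), where level names the tier that answered."""
    value = nic_cache.get(key)
    if value is not None:
        return value, "nic"
    value = kernel_cache.get(key)
    if value is not None:
        nic_cache.put(key, value)  # promote to the faster level
        return value, "kernel"
    value = database[key]  # slowest path: user-space NOSQL server
    kernel_cache.put(key, value)
    nic_cache.put(key, value)
    return value, "database"


database = {"user:1": "alice", "user:2": "bob"}
nic, kernel = LRUCache(1), LRUCache(2)
keys = ["user:1", "user:1", "user:2", "user:1"]
trace = [lookup(k, nic, kernel, database)[1] for k in keys]
```

With a one-entry NIC cache, the last request for `user:1` misses in the NIC level but is still answered by the kernel level without reaching the database.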





### **10GbE outlier filtering FPGA NIC**

#### **Sensor Data Explosion**



Machine learning algorithms:

- Mahalanobis distance
- Local Outlier Factor (LOF)
- K-Nearest Neighbor (KNN)
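
As an illustration of the first algorithm, here is a minimal software sketch of Mahalanobis-distance filtering on 2-D samples. It assumes the filter forwards only outliers to the host; the threshold of 3.0, the training data, and all function names are hypothetical, and the FPGA implementation is not shown in the slides.

```python
import math


def fit_normal_model(samples):
    """Estimate the mean and 2x2 covariance of 2-D 'normal' samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(point, mean, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # d^2 = [dx dy] * inv(cov) * [dx dy]^T
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def filter_outliers(stream, mean, cov, threshold=3.0):
    """Keep only samples whose distance exceeds the (assumed) threshold."""
    return [p for p in stream if mahalanobis(p, mean, cov) > threshold]


training = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.1), (1.1, 2.2), (0.9, 1.8)]
mean, cov = fit_normal_model(training)
forwarded = filter_outliers([(1.0, 2.0), (5.0, -3.0), (1.1, 2.1)], mean, cov)
```

Only the sample far from the training cluster, `(5.0, -3.0)`, passes the filter; the two samples near the mean are dropped.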



**Issue:** Software must periodically peek at the NIC so that it does not forget what is "normal"

**Result:** 14M samples/sec (95.8% of 10GbE line rate) [HEART'15 (Best paper award)]
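
The reported figure can be sanity-checked with a short calculation. This sketch assumes one sample per minimum-size (64-byte) Ethernet frame, which the slide does not state; the 8-byte preamble/SFD and 12-byte inter-frame gap are standard Ethernet overheads.

```python
LINE_RATE_BPS = 10e9        # 10GbE
MIN_FRAME_BYTES = 64        # minimum Ethernet frame (assumed: one sample per frame)
PREAMBLE_SFD_BYTES = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames


def max_frames_per_sec(line_rate_bps, frame_bytes):
    """Peak frame rate once per-frame wire overhead is included."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits


peak = max_frames_per_sec(LINE_RATE_BPS, MIN_FRAME_BYTES)  # about 14.88M frames/sec
achieved = 0.958 * peak                                    # about 14.26M frames/sec
```

Under that assumption, 95.8% of the peak rate is roughly 14.26M frames/sec, which is consistent with the 14M samples/sec stated above.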





### **NOSQL** cache with GPUs



### In-GPU distributed DB w/ ExpEther

To exploit more GPUs → In-GPU databases

- In-GPU distributed DBs over NEC ExpEther
  - GPU's device memory is used as a cache of the DB
  - Many GPUs are directly connected via 10GbE switch
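
The scatter-and-gather pattern behind a distributed in-GPU cache can be sketched in plain Python. This is only a CPU stand-in: each list plays the role of one GPU's device memory, the round-robin partitioning is an assumption, and `partition` and `regex_search` are hypothetical names.

```python
import re


def partition(documents, num_devices):
    """Distribute documents round-robin over the per-device caches."""
    caches = [[] for _ in range(num_devices)]
    for i, doc in enumerate(documents):
        caches[i % num_devices].append(doc)
    return caches


def regex_search(caches, field, pattern):
    """Scan every device cache and gather the matching documents."""
    compiled = re.compile(pattern)
    hits = []
    for cache in caches:  # on real hardware each cache would be scanned in parallel
        hits.extend(doc for doc in cache if compiled.search(doc.get(field, "")))
    return hits


docs = [
    {"id": 1, "title": "FPGA NIC"},
    {"id": 2, "title": "GPU cache"},
    {"id": 3, "title": "GPU regex"},
    {"id": 4, "title": "FSO link"},
]
caches = partition(docs, num_devices=2)
found = sorted(d["id"] for d in regex_search(caches, "title", r"^GPU"))
```

Adding devices shrinks each partition, which is the motivation for connecting many GPUs to one DB server.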



Many GPUs are directly connected to the DB server via NEC ExpEther (20Gbps)

[Figure: GPUs connected through a 10GbE switch to a PCIe card inserted in the DB server, over 10G + 10G links]


### Wireless 3D chip stacking for IoT

- System-in-Package (SiP) for sensor nodes
  - Required chips are selected and stacked in a package
  - E.g., CPU chip, memory chip, sensor chip, ...
- Wireless inductive coupling for vertical links
  - Not electrically connected
  - Add, remove, and swap chips for given applications



#### We can change the number & types of chips in a package

[Figure: stacked package with a host CPU and 3 accelerator chips]

In addition, we've implemented a "KVS memory chip" in which intermediate data or computation results of processors are stored as key-value pairs for reuse. The next version of the KVS chip will be taped out on July 15.
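
In software terms, storing computation results as key-value pairs for reuse resembles memoization. The sketch below is a hypothetical stand-in for that idea only; the `KVSMemory` class and its interface are assumptions and say nothing about the chip's actual design.

```python
class KVSMemory:
    """Software stand-in for a key-value memory shared by processors."""

    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_or_compute(self, key, compute):
        """Return a stored result if present, otherwise compute and store it."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        value = compute()
        self.store[key] = value
        return value


def expensive_square(x):
    """Placeholder for a costly computation whose result is worth reusing."""
    return x * x


kvs = KVSMemory()
first = kvs.get_or_compute(("square", 12), lambda: expensive_square(12))
second = kvs.get_or_compute(("square", 12), lambda: expensive_square(12))
```

The second call is served from the store, so the computation runs only once.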


### Dynamic 40G shortcut links w/ FSO

- Emergent big data transfers in Datacenter NW
  - Virtual machine (VM) migration
  - Storage migration and DB streaming
  - E.g., Several minutes for VM migration w/ 1GbE
- Dynamic shortcut links using "40Gbps beam"
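
A back-of-the-envelope comparison shows why the link speed matters. The sketch assumes an ideal link with no protocol overhead and no re-sending of pages dirtied during migration; the 64 GByte image size is a hypothetical example.

```python
def transfer_seconds(size_gbytes, link_gbps):
    """Ideal transfer time in seconds for a given size and link speed."""
    return size_gbytes * 8 / link_gbps


VM_IMAGE_GBYTES = 64  # hypothetical VM memory plus storage footprint

t_1g = transfer_seconds(VM_IMAGE_GBYTES, 1)    # 512 s, i.e. several minutes
t_40g = transfer_seconds(VM_IMAGE_GBYTES, 40)  # 12.8 s
speedup = t_1g / t_40g
```

Under these assumptions a transfer that takes more than eight minutes on 1GbE finishes in under 13 seconds on a 40Gbps link.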




# "VM Highway" using 40G FSO

Dynamic 40GbE links for virtual machine (VM) migration: the direction of a collimator lens connected to 40GbE LR4 (1300nm wavelength) is adjusted so as to make a direct 40Gbps link between two racks.





### References (1/3)

- Key-value store accelerators
  - Yuta Tokusashi, et al., "A Multilevel NOSQL Cache Design Combining In-NIC and In-Kernel Caches", Hot Interconnects 2016.
  - Yuta Tokusashi, et al., "NOSQL Hardware Appliance with Multiple Data Structures", Hot Chips 2016 (Poster).
  - Korechika Tamura, et al., "An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC", FPGA4GPC 2016.
- Machine learning accelerator
  - Ami Hayashi, et al., "A Line Rate Outlier Filtering FPGA NIC using 10GbE Interface", ACM SIGARCH CAN (2015).

### References (2/3)

- GPU-based accelerations of NOSQLs
  - Shin Morishima, et al., "Distributed In-GPU Data Cache for Document-Oriented Data Store via PCIe over 10Gbit Ethernet", HeteroPar 2016.
  - Shin Morishima, et al., "Performance Evaluations of Document-Oriented Databases using GPU and Cache Structure", ISPA 2015.
  - Shin Morishima, et al., "Performance Evaluations of Graph Database using CUDA and OpenMP-Compatible Libraries", ACM SIGARCH CAN (2014).
- Free-space optics (FSO) for data centers
  - Ikki Fujiwara, et al., "Augmenting Low-latency HPC Network with Free-space Optical Links", HPCA 2015.

### References (3/3)

- Wireless inductive-coupling 3D stacking
  - Takahiro Kagami, et al., "Efficient 3-D Bus Architectures for Inductive-Coupling ThruChip Interfaces", IEEE TVLSI (2016).
  - Hiroki Matsutani, et al., "Low-Latency Wireless 3D NoCs via Randomized Shortcut Chips", DATE 2014.
  - Yasuhiro Take, et al., "3D NoC with Inductive-Coupling Links for Building-Block SiPs", IEEE TC (2014).
  - Hiroki Matsutani, et al., "A Case for Wireless 3D NoCs for CMPs", ASP-DAC 2013. (Best Paper Award)

# Thank you for listening!

Acknowledgement: A part of this work is supported by JST PRESTO