Hiroki Matsutani

Associate Professor, Dept. of Information and Computer Science, Faculty of Science and Technology, Keio University

This page briefly introduces my research topics so that I do not forget them :-)
My biography, publications, awards, professional services, etc. are listed on my top page.

(Last updated 2021-06-30)

Machine Learning

On-Device Learning

Neural-network-based on-device learning for unsupervised anomaly detection, built on online sequential learning, and its low-cost implementation on FPGA were proposed in [1]. A multi-instance design of the on-device learning for multiple normal patterns was proposed in [2]. It was extended to federated learning in [3] so that edge devices can share their intermediate trained results. An on-device Recurrent Neural Network (RNN or Reservoir computing) core was proposed in [4], which also reports a hardware design using a 45nm process technology. An FPGA-based on-device reinforcement learning approach that uses L2 regularization and spectral normalization was proposed in [5].

  1. Mineto Tsukada, et al., "A Neural Network-Based On-device Learning Anomaly Detector for Edge Devices", IEEE Trans. on Computers (2020). Featured Paper in July 2020 Issue of IEEE TC. [Paper]
  2. Rei Ito, et al., "An Adaptive Abnormal Behavior Detection using Online Sequential Learning", EUC'19. [Paper]
  3. Rei Ito, et al., "An On-Device Federated Learning Approach for Cooperative Model Update between Edge Devices", IEEE Access (2021). [Paper]
  4. Takuya Sakuma, et al., "An Area-Efficient Recurrent Neural Network Core for Unsupervised Time-Series Anomaly Detection", IEICE Trans. on Electronics (2021). [Paper]
  5. Hirohisa Watanabe, et al., "An FPGA-Based On-Device Reinforcement Learning Approach using Online Sequential Learning", RAW'21. [Paper]
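
As a rough illustration of online sequential learning for unsupervised anomaly detection, the sketch below trains a single-hidden-layer autoencoder one sample at a time and uses the reconstruction error as the anomaly score. It assumes a recursive least-squares update of the output weights in the style of OS-ELM, which may differ from the design in [1]. The class name OnlineAutoencoder, the parameter p0, and the layer sizes are hypothetical.

```python
import math
import random


class OnlineAutoencoder:
    """Single-hidden-layer autoencoder updated one sample at a time.

    Assumption: input weights are random and fixed; only the output
    weights (beta) are trained with a recursive least-squares update.
    """

    def __init__(self, n_in, n_hidden, seed=0, p0=100.0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-1, 1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.beta = [[0.0] * n_in for _ in range(n_hidden)]
        # P starts as p0 * I (hypothetical regularization constant).
        self.p = [[p0 if i == j else 0.0 for j in range(n_hidden)]
                  for i in range(n_hidden)]

    def _hidden(self, x):
        # Sigmoid activations of the fixed random hidden layer.
        return [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(row, x)) + bi)))
                for row, bi in zip(self.w, self.b)]

    def predict(self, x):
        h = self._hidden(x)
        n_in = len(self.beta[0])
        return [sum(h[k] * self.beta[k][j] for k in range(len(h)))
                for j in range(n_in)]

    def score(self, x):
        """Mean squared reconstruction error, used as the anomaly score."""
        y = self.predict(x)
        return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

    def train(self, x):
        """One sequential update with the input itself as the target."""
        h = self._hidden(x)
        n = len(h)
        ph = [sum(pij * hj for pij, hj in zip(row, h)) for row in self.p]
        denom = 1.0 + sum(hi * phi for hi, phi in zip(h, ph))
        # P <- P - (P h)(P h)^T / (1 + h^T P h); no matrix inversion needed
        # because a single sample is processed per update.
        for i in range(n):
            for j in range(n):
                self.p[i][j] -= ph[i] * ph[j] / denom
        err = [xi - yi for xi, yi in zip(x, self.predict(x))]
        gain = [v / denom for v in ph]  # equals (updated P) times h
        for k in range(n):
            for j in range(len(x)):
                self.beta[k][j] += gain[k] * err[j]
```

After repeated calls to train() on a normal pattern, score() should be small for that pattern and larger for inputs that deviate from it; the threshold separating the two is left to the application.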

Machine Learning Accelerator

Outlier detection based on Mahalanobis distance was implemented in a 10GbE FPGA NIC and achieved almost 10Gbps throughput in [1]. Outlier detection based on k-nearest neighbor (k-NN) and local outlier factor (LOF) algorithms was implemented in a 10GbE FPGA NIC in [2]. Both change-point detection and outlier detection were accelerated by a 10GbE FPGA NIC in [3]. In addition, ordinary differential equation (ODE) based neural networks (e.g., Neural ODE) were designed on low-cost FPGA devices to achieve high accuracy while reducing the parameter size in [4].

  1. Ami Hayashi, et al., "A Line Rate Outlier Filtering FPGA NIC using 10GbE Interface", ACM Computer Architecture News (2015). [Paper]
  2. Ami Hayashi, et al., "An FPGA-Based In-NIC Cache Approach for Lazy Learning Outlier Filtering", PDP'17. [Paper]
  3. Takuma Iwata, et al., "An FPGA-Based Change-Point Detection for 10Gbps Packet Stream", IEICE Trans. on Information and Systems (2019). [Paper]
  4. Hirohisa Watanabe, et al., "Accelerating ODE-Based Neural Networks on Low-Cost FPGAs", RAW'21. [Paper]
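
The Mahalanobis-distance test behind the outlier detection above can be sketched in software as follows. This is a minimal two-feature version, not the FPGA design in [1]; the function names, the choice of two features, and the threshold of 9.0 are assumptions made for illustration.

```python
from statistics import mean


def fit_2d(samples):
    """Estimate the mean vector and inverse covariance of 2-D samples."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mx, my = mean(xs), mean(ys)
    n = len(samples)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of the 2x2 covariance matrix.
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis_sq(point, mu, inv):
    """Squared Mahalanobis distance of point from the fitted distribution."""
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def is_outlier(point, mu, inv, threshold=9.0):
    """Flag a point whose squared distance exceeds a hypothetical threshold."""
    return mahalanobis_sq(point, mu, inv) > threshold
```

In a filtering setting, fit_2d() would be run on samples regarded as normal, and is_outlier() would then be applied to each incoming item.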

Distributed Deep Learning Accelerator

Distributed deep learning was accelerated by multiple GPUs and a 10GbE FPGA switch in [1]. Four optimization algorithms, including SGD and Adam, were accelerated by the FPGA switch in [1]. Alternatively, parameter aggregation was accelerated by a DPDK-based in-network switch in [2].

  1. Tomoya Itsubo, et al., "Accelerating Deep Learning using Multiple GPUs and FPGA-Based 10GbE Switch", PDP'20. [Paper]
  2. Masaki Furukawa, et al., "An In-Network Parameter Aggregation using DPDK for Multi-GPU Deep Learning", CANDAR'20. Best Paper Award. [Paper]
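
To make the parameter aggregation step concrete, the sketch below averages the gradients reported by several workers and applies one plain SGD update. It only shows the arithmetic that an aggregation point would perform; the function names and the learning rate are hypothetical, and nothing here reflects how the FPGA switch in [1] or the DPDK switch in [2] is actually implemented.

```python
def aggregate(gradients):
    """Element-wise mean of the gradient vectors sent by each worker."""
    n = len(gradients)
    return [sum(values) / n for values in zip(*gradients)]


def sgd_step(params, grad, lr=0.1):
    """Plain SGD update: params <- params - lr * grad."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two hypothetical workers, each holding a gradient for the same two parameters.
worker_grads = [[1.0, 2.0], [3.0, 4.0]]
new_params = sgd_step([1.0, 1.0], aggregate(worker_grads), lr=0.5)
```

Performing the averaging at a single point in the network means each worker sends its gradient once and receives the updated parameters once, rather than exchanging gradients with every other worker.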

Simultaneous Localization and Mapping

SLAM Accelerator

A 2D LiDAR SLAM algorithm based on Rao-Blackwellized Particle Filter (RBPF) was accelerated on a PYNQ-Z2 FPGA board in [1]. A graph-based SLAM algorithm (Cartographer) was implemented on the same FPGA board in [2] and compared with the particle-filter-based SLAM implementation.

  1. Keisuke Sugiura, et al., "An FPGA Acceleration and Optimization Techniques for 2D LiDAR SLAM Algorithm", IEICE Trans. on Information and Systems (2021). [Paper]
  2. Keisuke Sugiura, et al., "Particle Filter-based vs. Graph-based: SLAM Acceleration on Low-end FPGAs", arXiv:2103.09523 (2021). [Paper]


Blockchain Search Accelerator

For Bitcoin and blockchain applications, to reduce accesses to full nodes from IoT devices (i.e., SPV nodes), an FPGA-based blockchain transaction cache was proposed in [1], and an FPGA-based blockchain search accelerator was proposed in [2]. The blockchain search was accelerated by GPU in [3]. Blockchain data was cached in GPU device memory, and anomaly detection of the data was accelerated using the GPU in [4].

  1. Yuma Sakakibara, et al., "A Hardware-Based Caching System on FPGA NIC for Blockchain", IEICE Trans. on Information and Systems (2018). [Paper]
  2. Yuma Sakakibara, et al., "Accelerating Blockchain Transfer System Using FPGA-Based NIC", ISPA'18. [Paper]
  3. Shin Morishima, et al., "Accelerating Blockchain Search of Full Nodes Using GPUs", PDP'18. [Paper]
  4. Shin Morishima, et al., "Acceleration of Anomaly Detection in Blockchain Using In-GPU Cache", ISPA'18. [Paper]

Big Data Processing

Data Processing Accelerator

Spark (batch processing) was accelerated with network-attached GPUs via PCI-Express over 10GbE technology in [1]. Spark Streaming (stream processing) was accelerated with an FPGA-based 10GbE NIC in [2]. A message queuing system was accelerated by combining an FPGA-based 10GbE NIC and an in-kernel cache in [3].

  1. Yasuhiro Ohno, et al., "Accelerating Spark RDD Operations with Local and Remote GPU Devices", ICPADS'16. [Paper]
  2. Kohei Nakamura, et al., "An FPGA-Based Low-Latency Network Processing for Spark Streaming", BigData'16 Workshops. [Paper]
  3. Koya Mitsuzuka, et al., "MultiMQC: A Multilevel Message Queuing Cache Combining In-NIC and In-Kernel Memories", FPT'18. [Paper]

NoSQL Accelerator

Key-value store NoSQL (e.g., memcached, Redis) was accelerated by combining an FPGA-based 10GbE NIC and an in-kernel cache in [1]. The former is called the L1 NoSQL cache, and the latter is called the L2 NoSQL cache. Column-oriented NoSQL (e.g., HBase) was accelerated with an FPGA-based 10GbE NIC in [2] and an in-kernel cache in [3]. Document-oriented NoSQL (e.g., MongoDB) was accelerated with network-attached GPUs via PCI-Express over 10GbE technology in [4]. Graph database (e.g., Neo4j) was accelerated with GPU in [5].

  1. Yuta Tokusashi, et al., "A Multilevel NOSQL Cache Design Combining In-NIC and In-Kernel Caches", Hot Interconnects'16. [Paper]
  2. Akihiko Hamada, et al., "Design and Implementation of Hardware Cache Mechanism and NIC for Column-Oriented Databases", ReConFig'16. [Paper]
  3. Korechika Tamura, et al., "An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC", FPGA4GPC'16. [Paper]
  4. Shin Morishima, et al., "Distributed In-GPU Data Cache for Document-Oriented Data Store via PCIe over 10Gbit Ethernet", Euro-Par'16 Workshops. [Paper]
  5. Shin Morishima, et al., "Performance Evaluations of Graph Database using CUDA and OpenMP-Compatible Libraries", ACM Computer Architecture News (2014). [Paper]
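
The two-level lookup order described for the key-value store (L1 in the NIC, then L2 in the kernel, then the NoSQL server itself) can be sketched as below. This is a software model only; the LRU replacement policy, the promotion of L2 hits into L1, and the names LruCache and lookup are assumptions and are not taken from [1].

```python
from collections import OrderedDict


class LruCache:
    """Small fixed-capacity cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def lookup(key, l1, l2, store):
    """Return (value, level) where level names the tier that answered."""
    value = l1.get(key)
    if value is not None:
        return value, "L1"
    value = l2.get(key)
    if value is not None:
        l1.put(key, value)  # assumed policy: promote L2 hits into L1
        return value, "L2"
    value = store.get(key)  # fall through to the backing store
    if value is not None:
        l2.put(key, value)
        l1.put(key, value)
    return value, "store"
```

With a small L1 and a larger L2, a key evicted from L1 can still be answered by L2 without reaching the backing store, which is the benefit a multilevel design aims for.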

Datacenter Network

Random Datacenter Network

Random shortcut topologies were explored for HPC interconnects in [1], and those that take into account the physical rack layout were proposed in [2]. 40Gbps free-space optics (light beam) was exploited as shortcut links for HPC interconnects in [3].

  1. Michihiro Koibuchi, et al., "A Case for Random Shortcut Topologies for HPC Interconnects", ISCA'12. [Paper]
  2. Michihiro Koibuchi, et al., "Layout-conscious Random Topologies for HPC Off-chip Interconnects", HPCA'13. [Paper]
  3. Ikki Fujiwara, et al., "Augmenting Low-latency HPC Network with Free-space Optical Links", HPCA'15. [Paper]
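
The effect of random shortcuts on hop count can be illustrated with a toy model: start from a ring, add a few random links, and compare the average shortest-path length. The ring base topology, the node and shortcut counts, and all function names below are assumptions for illustration and are not the configurations evaluated in [1].

```python
import random
from collections import deque


def ring(n):
    """Adjacency sets of an n-node ring."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


def add_random_shortcuts(adj, count, seed=0):
    """Add `count` random links between nodes that are not yet connected.

    Assumes `count` is small enough that such node pairs exist.
    """
    rng = random.Random(seed)
    n = len(adj)
    added = 0
    while added < count:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b and b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


def average_hops(adj):
    """Average shortest-path length over all ordered node pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


plain = average_hops(ring(16))
with_shortcuts = average_hops(add_random_shortcuts(ring(16), 4))
```

Each added shortcut connects two nodes that were previously at least two hops apart, so the average hop count of the augmented ring is lower than that of the plain ring.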

Approximate Network

An approximate datacenter network that optimizes for low latency while maintaining accuracy was explored in [1]. In-switch approximation methods (e.g., proxy computation and response) for delayed tasks or stragglers in MapReduce were proposed in [2].

  1. Daichi Fujiki, et al., "High-Bandwidth Low-Latency Approximate Interconnection Networks", HPCA'17. [Paper]
  2. Koya Mitsuzuka, et al., "In-Switch Approximate Processing: Delayed Tasks Management for MapReduce Applications", FPL'17. [Paper]


NoC Verilog Generator (nocgen)

An NoC (Network-on-Chip) generator called nocgen, which generates a Verilog HDL model of an NoC consisting of on-chip routers, is available on GitHub.

Wireless 3D NoC

An inductive-coupling-based wireless 3D Network-on-Chip in which each chip or component can be added, removed, and swapped (called "field-stackable") was proposed in [1]. Its routing and topologies were explored in [2] and [3]. The "field-stackable" concept was demonstrated in [4] with the Cube-1 system, in which the numbers of CPU chips and accelerator chips can be customized.

  1. Hiroki Matsutani, et al., "A Vertical Bubble Flow Network using Inductive-Coupling for 3-D CMPs", NOCS'11. [Paper]
  2. Hiroki Matsutani, et al., "A Case for Wireless 3D NoCs for CMPs", ASP-DAC'13. Best Paper Award. [Paper]
  3. Hiroki Matsutani, et al., "Low-Latency Wireless 3D NoCs via Randomized Shortcut Chips", DATE'14. [Paper]
  4. Noriyuki Miura, et al., "A Scalable 3D Heterogeneous Multicore with an Inductive ThruChip Interface", IEEE Micro (2013). [Paper]

Low-Power NoC

A fine-grained power-gating router, where router components (e.g., input VC buffer, output buffer, VC mux, crossbar) are independently power-gated, was proposed in [1]. The fine-grained power-gating router was implemented with a 65nm process technology in [2]. A variable-pipeline router that dynamically adjusts voltage and pipeline depth (not frequency like DVFS) was proposed in [3].

  1. Hiroki Matsutani, et al., "Run-Time Power Gating of On-Chip Routers Using Look-Ahead Routing", ASP-DAC'08. [Paper]
  2. Hiroki Matsutani, et al., "Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs", IEEE Trans. on CAD (2011). [Paper]
  3. Hiroki Matsutani, et al., "A Multi-Vdd Dynamic Variable-Pipeline On-Chip Router for CMPs", ASP-DAC'12. Best Paper Candidate. [Paper]
  4. Akram Ben Ahmed, et al., "AxNoC: Low-power Approximate Network-on-Chips using Critical-Path Isolation", NOCS'18. Best Paper Candidate. [Paper]

Low-Latency NoC

A one-cycle low-latency on-chip router that forwards packets based on "path prediction" was proposed in [1].

  1. Hiroki Matsutani, et al., "Prediction Router: Yet Another Low Latency On-Chip Router Architecture", HPCA'09. [Paper]

NoC Topology

A novel topology, called Fat H-Tree, that forms a torus structure by combining two H-Tree topologies (called Red and Black trees) was proposed in [1]. 3D layouts of Fat H-Tree and Fat Tree were proposed in [2]. A class of 3D network topologies based on 3D crossbar was proposed in [3]. A fault-tolerant mesh-based NoC that includes an additional Hamiltonian ring path, called Default Backup Path, was proposed in [4].

  1. Hiroki Matsutani, et al., "Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network", IPDPS'07. [Paper]
  2. Hiroki Matsutani, et al., "Three-Dimensional Layout of On-Chip Tree-Based Networks", I-SPAN'08. [Paper]
  3. Hiroki Matsutani, et al., "Tightly-Coupled Multi-Layer Topologies for 3-D NoCs", ICPP'07. [Paper]
  4. Michihiro Koibuchi, et al., "A Lightweight Fault-tolerant Mechanism for Network-on-chip", NOCS'08. [Paper]