Multi-GPU Design and Performance Evaluation of Homomorphic Encryption on GPU Clusters

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 32, No. 2, pp. 379-391
Main Authors: Al Badawi, Ahmad; Veeravalli, Bharadwaj; Lin, Jie; Xiao, Nan; Matsumura, Kazuaki; Aung, Khin Mi Mi
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.02.2021
Summary: We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data-parallel approach and uses partitioning methods to distribute the workload of FV primitives evenly across the available GPUs. The design is geared to address the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures and includes efficient GPU-to-GPU data-exchange protocols. Moreover, it is user-friendly: no user intervention is required for task decomposition, scheduling or load balancing. We implement and evaluate the performance of our design on two NVIDIA GPU clusters, one homogeneous and one heterogeneous: K80, and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation, using two homomorphic circuits as workloads: vector addition and vector multiplication.
Moreover, we use our multi-GPU levelled FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs), performing image classification homomorphically on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation achieves 1 to 3 orders of magnitude speedup over the CPU implementation on vector operations. In terms of scalability, our design exhibits reasonable scaling when the GPUs are fully connected.
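The even workload distribution the summary describes can be illustrated with a minimal sketch. Note this is a hypothetical illustration, not the paper's actual code: it assumes the ciphertext workload is a set of independent RNS residue polynomials (as in the HPS variant) and splits them into contiguous blocks whose sizes differ by at most one, so no GPU is assigned more than one extra residue.

```python
# Hypothetical sketch of even data-parallel partitioning across GPUs.
# Assumption: the workload is `num_residues` independent residue polynomials
# (RNS limbs of a ciphertext) that can be processed on separate devices.

def partition_residues(num_residues: int, num_gpus: int) -> list[range]:
    """Assign contiguous blocks of residue indices to each GPU,
    with block sizes differing by at most one."""
    base, extra = divmod(num_residues, num_gpus)
    parts = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs each take one additional residue.
        size = base + (1 if gpu < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts

# e.g. 7 RNS limbs over 3 GPUs -> block sizes 3, 2, 2
print([len(p) for p in partition_residues(7, 3)])  # → [3, 2, 2]
```

A scheme like this needs no user intervention for load balancing, which matches the user-friendliness claim above; the heterogeneous-cluster case in the paper would additionally weight block sizes by device throughput.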
ISSN: 1045-9219; 1558-2183
DOI: 10.1109/TPDS.2020.3021238