BAX: A Bundle Adjustment Accelerator With Decoupled Access/Execute Architecture for Visual Odometry

Bibliographic Details
Published in	IEEE Access, Vol. 8, pp. 75530-75542
Main Authors	Sun, Rongdi; Liu, Peilin; Xue, Jianwei; Yang, Shiyu; Qian, Jiuchao; Ying, Rendong
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020
Subjects	Algorithms; Bundle adjustment; Cameras; Computer architecture; Computer memory; Decoupled architecture; Embedded system; Floating point arithmetic; FPGA; Graphics processing units; Hardware accelerator; Jacobian matrices; Least squares method; Mathematical analysis; Matrix algebra; Matrix decomposition; Matrix methods; Multiplication; Optimization; Power consumption; State estimation; Three-dimensional displays; Vector processing (computers); Visual odometry; Visualization
Summary:	As the demand for embedded vision grows, solving large optimization problems in real time within energy and cost budgets is a challenge. We present BAX, a hardware accelerator for bundle adjustment (BA), which solves the least-squares problem of state estimation in visual odometry (VO). BAX consists of a frontend for control and a backend for computation. The frontend generates instructions on the fly, and the backend executes them to perform the BA algorithm. The backend adopts a decoupled access/execute (DAE) architecture, which separates the memory access unit (MAU) from the pipeline so that the MAU can prefetch vectors and matrices ahead of computation. To further reduce the latency of data reorganization, three transpose-free dataflows are proposed for matrix multiplication on the vector processing unit (VPU). In addition, a unified architecture for both forward and backward substitution is designed for matrix decomposition in the linear solver. All data are stored in 442 kB of on-chip memory, and the local map is maintained efficiently by a hierarchical graph memory. Compared with the baseline architecture, these techniques reduce the processing time by 53.9%. BAX is implemented on an FPGA in 32-bit floating-point precision with data normalization. It completes a full BA in about 63.44 ms at 200 MHz while consuming 1.12 W. BAX is 1.73× and 22.38× faster than the desktop and embedded CPUs, respectively, and achieves 90% of the GPU's performance at much lower power consumption.
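Two ideas named in the summary, transpose-free matrix multiplication and a unified forward/backward substitution, can be illustrated in software. The following is only a minimal sketch and not the BAX design: the summary does not describe the three dataflows or the solver in detail, so the Cholesky factorization, the normal-equation form H x = g, and all function names here are assumptions made for illustration. The sketch accumulates H = JᵀJ by reading J row by row (so Jᵀ is never materialized) and uses one substitution routine for both L y = g and Lᵀ x = y.

```python
import math


def gram_transpose_free(J):
    """Accumulate H = J^T J from row outer products, without forming J^T."""
    n = len(J[0])
    H = [[0.0] * n for _ in range(n)]
    for row in J:  # J is streamed one row at a time
        for i in range(n):
            for j in range(n):
                H[i][j] += row[i] * row[j]
    return H


def jt_times_vec(J, r):
    """Compute g = J^T r, again reading J only by rows."""
    n = len(J[0])
    g = [0.0] * n
    for row, ri in zip(J, r):
        for i in range(n):
            g[i] += row[i] * ri
    return g


def cholesky(H):
    """Return lower-triangular L with H = L L^T (H assumed positive definite)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def substitute(L, b, backward=False):
    """Solve L x = b (forward) or L^T x = b (backward) with one loop body.

    Only the traversal order and the index pair used to read L change,
    so L^T is never stored explicitly.
    """
    n = len(L)
    order = range(n - 1, -1, -1) if backward else range(n)
    x = [0.0] * n
    for i in order:
        solved = range(i + 1, n) if backward else range(i)
        acc = b[i]
        for k in solved:
            coeff = L[k][i] if backward else L[i][k]  # L^T[i][k] == L[k][i]
            acc -= coeff * x[k]
        x[i] = acc / L[i][i]
    return x


def solve_normal_equations(J, r):
    """Solve (J^T J) x = J^T r via Cholesky and the shared substitution routine."""
    H = gram_transpose_free(J)
    g = jt_times_vec(J, r)
    L = cholesky(H)
    y = substitute(L, g)                     # forward pass
    return substitute(L, y, backward=True)   # backward pass
```

For example, with the hypothetical inputs `J = [[1, 0], [0, 2], [1, 1]]` and `r = [1, 4, 3]`, `solve_normal_equations(J, r)` recovers a solution close to `[1.0, 2.0]`, up to floating-point rounding. Sharing one routine for both passes mirrors, in spirit, the idea of reusing a single datapath for the two triangular solves.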
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2020.2988527