Efficient Training and Compression of Deep Neural Networks

Bibliographic Details
Main Author: O' Neill, James
Format: Dissertation
Language: English
Published: ProQuest Dissertations & Theses, 01.01.2021

Summary: The application of deep neural networks is widespread throughout the world, and they are responsible for many crucial applications such as self-driving cars, machine translation, spoken language recognition, procedural content generation and medical diagnosis, to name a few. However, improving the performance of neural networks generally corresponds to deeper and wider architectures, increasing the parameter count, training time and inference time to scales that exceed the typical resources available to the majority of machine learning practitioners. Neural network compression is an area of research that addresses these concerns by reducing the size of the network while aiming to maintain its pre-compression performance. Although this research area has been active for the past three decades, it has seen a notable resurgence recently, in proportion to the rate at which deep neural network sizes have grown. In this context, there are still various limitations to current compression methods and their applicability to the neural networks used for natural language processing and computer vision, which this thesis aims to address.

Firstly, many compression methods sparsify networks, which yields a theoretical parameter reduction but in practice does not reduce storage or inference time, because current hardware is not designed to implement sparse matrix multiplications efficiently. Therefore, in practice, dense matrix multiplications are carried out on a sparse network by multiplying the parameter tensors with a binary mask, leading to more parameters, not fewer. Dynamic weight sharing techniques have been under-explored as an alternative to structured pruning techniques that aim to avoid this pragmatic challenge of using sparse networks efficiently post-compression. Hence, in this thesis we discuss dynamic weight sharing techniques that preserve density in the network without zeroing out whole structures.
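To make the masking point concrete, the following is a minimal illustrative sketch (not code from the thesis) of how unstructured magnitude pruning is commonly simulated in PyTorch: the weight tensor keeps its dense shape and an extra binary mask of the same shape is applied at every forward pass, so no storage or compute is actually saved on standard hardware. The layer sizes, sparsity level and function names are assumptions made for the example.

    # Illustrative sketch only: unstructured magnitude pruning simulated with a
    # binary mask; not code from the thesis.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    layer = nn.Linear(512, 512)

    # Keep roughly the 50% of weights with the largest magnitude; zero the rest.
    threshold = layer.weight.detach().abs().flatten().median()
    mask = (layer.weight.detach().abs() >= threshold).float()

    # The "pruned" forward pass is still a dense matrix multiplication, and the
    # mask is an extra tensor with the same shape as the weights, so the
    # parameter count and compute are not reduced on standard hardware.
    def masked_forward(x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, layer.weight * mask, layer.bias)

    y = masked_forward(torch.randn(8, 512))

A dynamic weight sharing scheme, by contrast, keeps every connection active but ties groups of weights to a smaller set of shared values, so the network stays dense while the number of distinct parameters shrinks.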
Secondly, compression methods are typically evaluated in the supervised learning setting. Thus, little is known about how our assumptions in the supervised learning setting hold in other settings such as few-shot transfer learning or zero-shot domain adaptation (e.g. zero-shot cross-lingual transfer when using cross-lingual models). Therefore, we explore how iterative pruning behaves in the few-shot and zero-shot cases.

Thirdly, compression methods such as pruning and knowledge distillation have primarily been adopted in isolation, without much insight as to how they might be used in tandem to further boost compression performance. We investigate whether this is viable and how both can be used simultaneously to remove the need for a two-stage compression process.

Lastly, compression is usually carried out on the classification model. However, in natural language processing we often learn an embedding model that outputs character, sub-word or word representations that are used as input to the classifier. Hence, we also explore compression methods that reconstruct an ensemble set of sub-word and word representations, where the resulting learned meta-embeddings are used as input to classification models for downstream tasks, generally outperforming the de facto single sub-word or word representations that are typically used.

Overall, this thesis investigates and proposes novel compression methods and more efficient training of pretrained deep networks that improve the current state of the art in domains such as natural language processing and computer vision. This broadly includes contributions to knowledge distillation, pruning, dynamic weight sharing and improving fine-tuning in the transfer learning setting.
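As a loose illustration of using pruning and knowledge distillation in a single stage rather than sequentially, the sketch below masks the student's smallest-magnitude weights while training against a weighted sum of the task loss and a distillation loss on the teacher's softened logits. This is a generic formulation under assumed hyperparameters (temperature, mixing weight, sparsity) and helper names, not the specific method proposed in the thesis.

    # Hedged sketch: joint pruning + knowledge distillation in one training loop.
    # Hyperparameters and helper names are assumptions for illustration only.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Weighted sum of softened KL to the teacher and hard-label cross-entropy."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

    def apply_magnitude_masks(model, sparsity=0.5):
        """Zero the smallest-magnitude weights in place (unstructured pruning)."""
        for p in model.parameters():
            if p.dim() < 2:
                continue  # skip biases and norm parameters
            k = max(1, int(sparsity * p.numel()))
            threshold = p.detach().abs().flatten().kthvalue(k).values
            p.data.mul_((p.detach().abs() > threshold).float())

    def train_step(student, teacher, optimizer, x, labels, sparsity=0.5):
        with torch.no_grad():
            teacher_logits = teacher(x)
        loss = distillation_loss(student(x), teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        apply_magnitude_masks(student, sparsity)  # prune as the student distils
        return loss.item()

In practice the sparsity would usually be increased gradually over training (iterative pruning) rather than held fixed, but the single-loop structure is what removes the separate post-training compression stage.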
ISBN: 9798352972229