Test Input Prioritization for Graph Neural Networks

GNNs have shown remarkable performance in a variety of classification tasks. The reliability of GNN models needs to be thoroughly validated before their deployment to ensure their accurate functioning. Therefore, effective testing is essential for identifying vulnerabilities in GNN models. However,...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on software engineering Vol. 50; no. 6; pp. 1396 - 1424
Main Authors Li, Yinghua, Dang, Xueqi, Pian, Weiguo, Habib, Andrew, Klein, Jacques, Bissyande, Tegawende F.
Format Journal Article
LanguageEnglish
Published New York IEEE 01.06.2024
IEEE Computer Society
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:GNNs have shown remarkable performance in a variety of classification tasks. The reliability of GNN models needs to be thoroughly validated before their deployment to ensure their accurate functioning. Therefore, effective testing is essential for identifying vulnerabilities in GNN models. However, given the complexity and size of graph-structured data, the cost of manual labelling of GNN test inputs can be prohibitively high for real-world use cases. Although several approaches have been proposed in the general domain of Deep Neural Network (DNN) testing to alleviate this labelling cost issue, these approaches are not suitable for GNNs because they do not account for the interdependence between GNN test inputs, which is crucial for GNN inference. In this paper, we propose NodeRank, a novel test prioritization approach specifically for GNNs, guided by ensemble learning-based mutation analysis. Inspired by traditional mutation testing, where specific operators are applied to mutate code statements to identify whether provided test cases reveal faults, NodeRank operates on a crucial premise: If a test input (node) can kill many mutated models and produce different prediction results with many mutated inputs, this input is considered more likely to be misclassified by the GNN model and should be prioritized higher. Through prioritization, these potentially misclassified inputs can be identified earlier with limited manual labeling cost. NodeRank introduces mutation operators suitable for GNNs, focusing on three key aspects: the graph structure, the features of the graph nodes, and the GNN model itself. NodeRank generates mutants and compares their predictions against that of the initial test inputs. Based on the comparison results, a mutation feature vector is generated for each test input and used as the input to ranking models for test prioritization. Leveraging ensemble learning techniques, NodeRank combines the prediction results of the base ranking models and produces a misclassification score for each test input, which can indicate the likelihood of this input being misclassified. NodeRank sorts all the test inputs based on their scores in descending order. To evaluate NodeRank, we build 124 GNN subjects (i.e., a pair of dataset and GNN model), incorporating both natural and adversarial contexts. Our results demonstrate that NodeRank outperforms all the compared test prioritization approaches in terms of both APFD and PFD, which are widely-adopted metrics in this field. Specifically, NodeRank achieves an average improvement of between 4.41% and 58.11% on original datasets and between 4.96% and 62.15% on adversarial datasets.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0098-5589
1939-3520
DOI:10.1109/TSE.2024.3385538