TY - JOUR
T1 - Predicting vulnerability inducing function versions using node embeddings and graph neural networks
AU - Şahin, Sefa Eren
AU - Özyedierler, Ecem Mine
AU - Tosun, Ayse
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/5
Y1 - 2022/5
N2 - Context: Predicting software vulnerabilities over code changes is a difficult task due to obtaining real vulnerability data and their associated code fixes from software projects as software organizations are often reluctant to report those. Objective: We aim to propose a vulnerability prediction model that runs after every code change, and identifies vulnerability inducing functions in that version. We also would like to assess the success of node and token based source code representations over abstract syntax trees (ASTs) on predicting vulnerability inducing functions. Method: We train neural networks to represent node embeddings and token embeddings over ASTs in order to obtain feature representations. Then, we build two Graph Neural Networks (GNNs) with node embeddings, and compare them against Convolutional Neural Network (CNN) and Support Vector Machine (SVM) with token representations. Results: We report our empirical analysis over the change history of vulnerability inducing functions of Wireshark project. GraphSAGE model using source code representation via ASTs achieves the highest AUC rate, while CNN models using token representations achieves the highest recall, precision and F1 measure. Conclusion: Representing functions with their structural information extracted from ASTs, either in token form or in complete graph form, is great at predicting vulnerability inducing function versions. Transforming source code into token frequencies as a natural language text fails to build successful models for vulnerability prediction in a real software project.
AB - Context: Predicting software vulnerabilities over code changes is a difficult task due to obtaining real vulnerability data and their associated code fixes from software projects as software organizations are often reluctant to report those. Objective: We aim to propose a vulnerability prediction model that runs after every code change, and identifies vulnerability inducing functions in that version. We also would like to assess the success of node and token based source code representations over abstract syntax trees (ASTs) on predicting vulnerability inducing functions. Method: We train neural networks to represent node embeddings and token embeddings over ASTs in order to obtain feature representations. Then, we build two Graph Neural Networks (GNNs) with node embeddings, and compare them against Convolutional Neural Network (CNN) and Support Vector Machine (SVM) with token representations. Results: We report our empirical analysis over the change history of vulnerability inducing functions of Wireshark project. GraphSAGE model using source code representation via ASTs achieves the highest AUC rate, while CNN models using token representations achieves the highest recall, precision and F1 measure. Conclusion: Representing functions with their structural information extracted from ASTs, either in token form or in complete graph form, is great at predicting vulnerability inducing function versions. Transforming source code into token frequencies as a natural language text fails to build successful models for vulnerability prediction in a real software project.
KW - Abstract syntax trees
KW - Graph embeddings
KW - Graph neural networks
KW - Software vulnerabilities
UR - http://www.scopus.com/inward/record.url?scp=85123002701&partnerID=8YFLogxK
U2 - 10.1016/j.infsof.2022.106822
DO - 10.1016/j.infsof.2022.106822
M3 - Article
AN - SCOPUS:85123002701
SN - 0950-5849
VL - 145
JO - Information and Software Technology
JF - Information and Software Technology
M1 - 106822
ER -