TY - JOUR
T1 - CNN-based server state monitoring and fault diagnosis using infrared thermal images
AU - Wiysobunri, Beltus Nkwawir
AU - Erden, Hamza Salih
AU - Toreyin, Behcet Ugur
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2024
Y1 - 2024
AB - The recent spike in the demand for high-performance computing (HPC) server systems has created many challenges in data center (DC) facilities. These challenges include but are not limited to thermal management, system reliability sustenance, and server failure minimization. In an attempt to solve the last of these challenges, this paper proposes a deep convolutional neural network-based transfer learning approach for the automatic diagnosis of five server operation states: partial CPU load; maximum CPU load; main fan failure; CPU fan failure; and server entrance blockage. This transfer learning approach involves two main stages. The first stage consists of a deep neural network pre-trained on the large ImageNet dataset that automatically extracts lower-level features. In stage two, the higher layers of the pre-trained deep neural networks are fine-tuned with limited labeled infrared images to classify each server operation state. A stratified five-fold cross-validation resampling method is employed to evaluate the effectiveness and generalization of deep neural network architectures. The performance of the proposed method is evaluated and compared to a traditional support vector machine classifier trained on hand-crafted features. The automatic feature extraction and the knowledge transfer capabilities of our approach are instrumental in the attainment of superior performance results, with the DenseNet-201 architecture achieving the highest average validation accuracy of 99.60% across five dataset sizes. The experimental results not only indicate the effectiveness and the robustness of deep neural networks trained with a small set of data, but also open up the possibility for DC operators to consider non-contact intelligent approaches to improving thermal management, energy efficiency, and system reliability of servers in DCs using infrared thermal sensors and machine learning.
KW - Convolutional neural network
KW - Data Center
KW - Infrared thermography
KW - Server fault diagnosis
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85205340790&partnerID=8YFLogxK
U2 - 10.1007/s00500-024-09792-y
DO - 10.1007/s00500-024-09792-y
M3 - Article
AN - SCOPUS:85205340790
SN - 1432-7643
JO - Soft Computing
JF - Soft Computing
ER -