ML-Powered Intrusion Detection Systems

Attack Pattern Recognition & Performance

📄 Research Document ⏱️ 15 min read 📂 Network Security

Implementation strategies for ML-powered intrusion detection - attack pattern recognition, performance metrics, and production deployment considerations.

IDS, Intrusion Detection, Attack Patterns, ML Security

🎯 Key Insight: This document is part of the Phoenix Technical Documentation Library - a curated collection of peer-reviewed research papers and official guidelines for AI/ML implementation in healthcare, security, and enterprise systems.

Full Document

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 14, No. 5, October 2024, pp. 5894~5905
ISSN: 2088-8708, DOI: 10.11591/ijece.v14i5.pp5894-5905
Journal homepage: http://ijece.iaescore.com

Fortifying network security: machine learning-powered intrusion detection systems and classifier performance analysis

Arar Al Tawil1, Lara Al-Shboul2, Laiali Almazaydeh3, Mohammad Alshinwan1,4
1Faculty of Information Technology, Applied Science Private University, Amman, Jordan
2King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
3College of Information Technology, Al-Hussein Bin Talal University, Ma’an, Jordan
4MEU Research Unit, Middle East University, Amman, Jordan

Article history: Received Feb 21, 2024; Revised Jun 18, 2024; Accepted Jul 1, 2024

Keywords: Class imbalance handling, Classification, Feature selection, Intrusion detection systems, Preprocessing

ABSTRACT

Intrusion detection systems (IDS) protect networks from threats; they actively monitor network activity to identify and prevent malicious actions. This study investigates the application of machine learning methods to strengthen IDS, using the comprehensive CICIDS 2017 dataset. The dataset was refined with stringent preprocessing methods, namely feature normalization, class imbalance management, feature reduction, and feature selection, to ensure its quality and lay the foundation for developing robust models. Three classifiers were evaluated: support vector machine (SVM), extreme gradient boosting (XGBoost), and naive Bayes. SVM achieved accuracy, precision, recall, and F1-score values of 0.984389, 0.984479, 0.984375, and 0.984304, respectively. XGBoost performed best across all metrics, attaining scores of 1.0. Naive Bayes recorded accuracy, precision, recall, and F1-score values of 0.877392, 0.907171, 0.877007, and 0.876986, respectively. The results of this study emphasize the critical importance of preprocessing methods in improving the effectiveness of IDS via machine learning. They further demonstrate the potential of particular classifiers to detect and prevent network intrusions efficiently, thereby substantially contributing to cybersecurity measures.

This is an open access article under the CC BY-SA license.

Corresponding Author: Arar Al Tawil, Faculty of Information Technology, Applied Science Private University, Amman, Jordan. Email: ar_altawil@asu.edu.jo

INTRODUCTION

An intrusion detection system (IDS) is a device or application that monitors a network or system to detect and prevent malicious activity or policy violations. IDS variants are available for diverse levels of security, from individual computer systems to vast networks. The primary classifications are network intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS). Intrusion detection systems are also frequently divided into two categories based on the detection method [1]. An IDS may be deployed as a physical device or a software application to oversee system operations or network activity. Its principal objective is identifying and responding to malicious activities or violations of predetermined policies. IDS varieties serve a range of deployments, including server rooms and enterprise networks. HIDS and NIDS are the primary standard categories. Frequently, classifications of intrusion detection systems are based on their
detection methodologies [2], [3]. The fundamental classification is signature-based detection, which compares distinctive patterns in network traffic (e.g., byte sequences) with an established repository of known attack signatures. In contrast, anomaly-based detection assesses the current state of a network against a predetermined baseline, which enables it to identify both known and novel threats. Furthermore, dimensionality reduction is a prevalent technique in machine learning, mainly when dealing with feature spaces comprising numerous dimensions.

Machine learning is analogous to instructing computers to improve their task performance without being explicitly instructed on each step. It is all about developing programs that can use data to improve themselves. Observing the data and learning from it to identify patterns and generate more precise predictions is the initial step in the learning process. The primary objective is for computers to acquire knowledge autonomously and adjust their behavior without requiring continuous human supervision [4].

Preprocessing can significantly impact the predictive performance of a supervised machine learning algorithm on previously unseen data. One of the most formidable challenges in inductive machine learning is detecting and eliminating noisy instances. These cases commonly deviate substantially from the norm and are frequently distinguished by many missing or irrelevant attribute values. Such exceptionally aberrant instances are often called outliers.
In addition, when working with the full dataset is impractical, it is typical to select a representative sample from it while also addressing the problem of missing data [5].

Our study employed a comprehensive preprocessing strategy to improve data quality and maximize the efficiency of our machine-learning models. The approach used several methods, beginning with data normalization for consistent scaling. Data normalization entails rescaling the numerical features in a dataset to a standard range, typically from 0 to 1. This prevents any one feature from exerting an excessive influence on machine learning models simply because of its magnitude [6].

Feature selection by correlation entails identifying and retaining the most pertinent features in a given dataset. The primary objective is to decrease the dimensionality of the data while keeping the attributes most strongly associated with the target variable. This streamlines the modeling process [7].

Missing-data handling entails addressing instances or attributes that contain null or incomplete values. Conventional approaches are imputation and exclusion. Imputation uses statistical techniques to fill in missing values, while exclusion drops instances containing missing data from the analysis [5].

Class imbalance strategies aim to alleviate the problem that arises when one class is significantly underrepresented relative to the others in a given dataset. These methods restore equilibrium to the class distribution so that machine-learning models generate accurate predictions for all classes, including those with fewer instances, and do not favor the majority class. Methods include oversampling, undersampling, and applying suitable evaluation metrics [8].
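The preprocessing steps described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' pipeline: the toy traffic rows, the 0.5 correlation threshold, the use of Pearson correlation, and the function names are all assumptions made for illustration, and random oversampling is only one of the imbalance-handling options the text mentions.

```python
import random
from collections import Counter
from statistics import mean


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_by_correlation(rows, labels, threshold):
    """Indices of columns whose |correlation| with the label meets the threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, labels)) >= threshold]


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: three numeric flow features, label 1 = attack.
rows = [
    [10.0, 200.0, 5.0],
    [20.0, 180.0, 5.0],
    [30.0, 220.0, 5.0],
    [40.0, 190.0, 5.0],
    [90.0, 210.0, 5.0],
]
labels = [0, 0, 0, 0, 1]

scaled = min_max_normalize(rows)
kept = select_by_correlation(scaled, labels, threshold=0.5)  # assumed threshold
reduced = [[row[i] for i in kept] for row in scaled]
balanced_rows, balanced_labels = random_oversample(reduced, labels)
```

On this toy input only the first feature survives selection: the second correlates weakly with the label and the third is constant. In a real experiment, resampling would normally be applied to the training split only, so that duplicated rows do not leak into the test set.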
Implementing these preprocessing procedures was critical in empowering our machine learning models to generate precise and resilient forecasts, even when confronted with intricate and practical datasets. This paper tackles the critical issue of improving IDS to ensure that they can accurately identify and mitigate network intrusions, which is essential for maintaining a robust cybersecurity system. The proposed solution entails using machine learning techniques, with a particular emphasis on preprocessing methods such as feature normalization, class imbalance management, feature reduction, and feature selection, to enhance the quality of the data and construct robust models. The study assesses the efficacy of three classifiers: Naive Bayes, extreme gradient boosting (XGBoost), and support vector machine (SVM). The results suggest that SVM obtained high accuracy, precision, recall, and F1-score, whereas XGBoost exhibited extraordinary performance with flawless scores across all metrics. Although Naive Bayes was less effective than the other two, it still demonstrated significant precision and accuracy. This research expands upon previous research by utilizing rigorous preprocessing techniques and assessing the efficacy of various classifiers on the CICIDS 2017 dataset. The results emphasize the superior performance of XGBoost and the critical role of data preparation in enhancing the effectiveness of IDS. Following this, the remaining sections are structured as follows: an examination of the literature about intrusion detection systems and machine learning algorithms is presented in section 2. The methodology utilized in this study is delineated in section 3, encompassing the selection of datasets, preprocessing procedures, and experimental configuration. The evaluation and implementation of multiple machine learning classifiers for intrusion detection are described in section 4. 
The results and analysis of the experiments are detailed in section 5, emphasizing performance metrics, including accuracy, false positives, and detection rate (DR). In conclusion, the paper is summarized in section 6, which also analyzes the main findings' implications and proposes potential directions for future research.
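For readers unfamiliar with the metrics named above, the following sketch shows how accuracy, precision, detection rate (recall), false alarm rate, and F1-score follow from confusion-matrix counts. It assumes a binary benign/attack labelling with 1 as the attack class; the paper's own figures may be computed per class or averaged differently, and the function names and sample labels here are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


def ids_metrics(y_true, y_pred, positive=1):
    """Standard binary metrics; detection rate is the same quantity as recall."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "detection_rate": recall,
        "false_alarm_rate": safe_div(fp, fp + tn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 4 attacks and 6 benign flows, with one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = ids_metrics(y_true, y_pred)
```

With these sample labels the counts are 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, giving an accuracy of 0.8, precision and detection rate of 0.75, and a false alarm rate of 1/6.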

LITERATURE REVIEW

The study [9] utilizes a variety of established machine learning classification algorithms, including the Bayesian network, naive Bayes classifier, decision tree (DT), random decision forest, random tree, decision table, and artificial neural network (ANN). The objective is to identify intrusions and improve cyber-security services. The researchers conduct tests using the KDD'99 cup dataset, which encompasses a wide range of cyber-attack categories. The algorithms are assessed with precision, recall, F1-score, and accuracy. The random forest (RF) classifier stands out as the best performer, with an accuracy of 0.94. This highlights its effectiveness in cyber-security intrusion detection.

In [10], the authors present a feature selection model known as ID3-BA, designed to select an optimal subset of attributes for IDS. The technique integrates the ID3 classifier algorithm with the bees algorithm: the bees algorithm generates the subset of features, while the ID3 algorithm builds the classifier. The study used the KDD Cup99 dataset, a well-recognized dataset in the field of knowledge discovery and data mining. This dataset consists of 41 features that are used for both training and testing. The performance assessment uses three primary metrics: false alarm rate (FAR), detection rate (DR), and accuracy rate (AR). The ID3-BA model achieves a detection rate of 91.02% and an accuracy rate of 92.002%. It also maintains a low FAR of 3.917%.
The research's main conclusion is that carefully choosing a subset of features, rather than employing all of them, greatly improves the effectiveness of IDS in terms of detection rate, accuracy, and false alarm rate.

As described in reference [2], the authors developed a methodical approach for selecting features in the field of IDS. This strategy combines filter and wrapper approaches. The filter method utilizes the linear correlation coefficient algorithm (FGLCC), while the wrapper method utilizes the cuttlefish algorithm (CFA). The technique also integrates a decision tree for constructing the classifier, and its performance is evaluated on the well-established KDD Cup 99 dataset. Performance assessment involves accuracy, detection rate, false positives, and a fitness function. The findings are compared to those obtained with 10-fold cross-validation and with other feature-based techniques. The FGLCC-CFA algorithm consistently surpasses the other approaches, with a detection rate of 95.23%, an accuracy rate of 95.03%, and a low false positive rate of 1.65%. The results highlight the effectiveness of the technique in improving IDS performance and its benefits compared to alternative feature selection algorithms.

The main aim of research [11] is to improve the effectiveness of IDS by combining rule-based approaches with learning-based algorithms for detecting and categorizing intrusions. The study utilizes neural networks (NN), RF, and SVM techniques to accomplish this objective. The study used conventional datasets, such as KDD Cup 99, as input to the system.
Before analysis, the KDD 99 dataset undergoes a preprocessing stage to remove noise and ensure data consistency, which gives the researchers clean and uniform input data. The processed data is then fed into machine learning algorithms such as SVM, NN, and RF to carry out classification tasks. The classification results are then used as training data for prediction tasks. In practice, when fresh data about intrusion attempts enters the framework, the system uses the patterns acquired from the training data to predict whether the new data is normal or anomalous. The main result of this study is that the SVM algorithm achieved the best accuracy score of 0.94, highlighting its usefulness in detecting and classifying intrusions in this specific context.

Chung et al. [12] created simplified swarm optimization (SSO), a new and efficient variation of particle swarm optimization (PSO) designed explicitly for feature selection. This approach integrates a local search technique that accelerates feature selection by finding the best neighboring solution. The SSO method significantly decreases the number of features needed to capture the behavioral patterns in network traffic data accurately. More precisely, it reduces the original 41 features of the KDD Cup 99 dataset to just six while also attaining higher accuracy than the typical PSO technique. The SSO technique achieves an accuracy rate of 93.3%, highlighting its usefulness in feature selection for network traffic analysis and intrusion detection.

Almseidin et al. [13] provided valuable insights by examining specific classifiers. The decision table classifier demonstrated superior performance by achieving the lowest false negative rate.
On the other hand, the RF classifier excelled with an accuracy rate of 93.77%, backed by the lowest root mean square error (RMSE) and the fewest false positives. The random tree classifier had the lowest mean accuracy rate and the smallest receiver operating characteristic (ROC) value. Meanwhile, the multi-layer perceptron (MLP) and naive Bayes classifiers showed similar average accuracy rates. The Bayes network algorithm demonstrated exceptional performance in accurately recognizing regular packets. While the decision table algorithm did not achieve the maximum level of accuracy, it exhibited the lowest rate of false negatives and efficient model construction. Ultimately, rule-based classifiers such as the decision table provide a favorable balance by achieving satisfactory accuracy and instilling a greater sense of certainty, mainly because they have the lowest rates of false negatives when used for intrusion detection.

Agarwal et al. [14] conducted a thorough examination of three machine learning classification algorithms: naïve Bayes (NB), SVM, and k-nearest neighbor (KNN). The main objective was to determine their efficacy in improving accuracy and reducing processing time using the UNSW-NB15 dataset, and thereby to identify the most appropriate algorithm for learning the complexities of suspicious network activity. The selection of the most suitable algorithm for training the IDS was facilitated by a comparative study of feature sets. The selected algorithm was then used to predict and analyze future intrusion behavior. During the testing phase of the model, performance measures such as accuracy, recall, and F1-score were systematically computed. Additionally, confusion matrices were created and compared to determine the best validation and support status achieved.
The results show that the SVM outperformed the other algorithms, achieving an accuracy of 0.977. This highlights the suitability of SVM in the study model and its capacity to handle the dataset successfully and improve intrusion detection.

Emanet et al. [15] focus on developing a sophisticated IDS that prioritizes enhanced accuracy using strategic feature selection and ensemble learning techniques. Using the CSE-CIC-IDS2018 dataset, the study progresses through two phases. The first phase refines the dataset through careful feature selection. The second applies ensemble learning, combining the capabilities of several classifiers, which yields a resilient model that improves attack detection and substantially decreases detection time. The ensemble model achieves an accuracy rate of 98.82% by using under-sampling and feature selection techniques. This corresponds to a 73% decrease in intrusion detection time and a modest 3% improvement in accuracy. Spearman's correlation analysis, recursive feature elimination (RFE), and chi-square tests are used to determine the essential features that enhance the efficiency of IDS. A comparison of classifiers, such as extra trees, decision trees, and logistic regression, demonstrates reasonable accuracy rates while considering actual implementation time. The significance of this research is its proposal of an ensemble learning model that surpasses individual classifiers. This affirms the model's potential impact on future intrusion detection systems and strengthens computer security across various domains. Additionally, it paves the way for innovative approaches in the field.
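The idea of combining several classifiers can be sketched with hard (majority) voting. The text above does not state which combination rule the cited ensemble studies used, so majority voting is an assumption here, as are the function name and the hard-coded per-classifier predictions standing in for trained models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard voting.

    `predictions` holds one list of labels per classifier, all of equal
    length. With an odd number of voters and two classes no ties occur;
    otherwise ties go to the label encountered first among the voters.
    """
    voted = []
    for votes in zip(*predictions):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three base classifiers on four network flows.
clf_a = ["attack", "benign", "attack", "benign"]
clf_b = ["attack", "attack", "benign", "benign"]
clf_c = ["benign", "attack", "attack", "benign"]

ensemble = majority_vote([clf_a, clf_b, clf_c])
```

Each of the first three flows is labelled an attack by two of the three classifiers, so the ensemble labels them as attacks even though every base classifier disagrees with it on one flow. This is the sense in which an ensemble can outperform its individual members.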
Fitni and Ramli [16] address the growing concerns about data security in organizational information systems. They emphasize the need for more robust defensive mechanisms to counter sophisticated attacks that may bypass standard security technologies such as firewalls and antivirus software. The study aims to overcome the constraints of existing IDSs by using an ensemble learning technique. The approach combines logistic regression, decision trees, and gradient boosting as classifiers. Using the CSE-CIC-IDS2018 dataset and Spearman's rank correlation coefficient, the research selects 23 essential features from a pool of 80. The experimental results illustrate the strength of the ensemble model: a final accuracy of 98.8%, precision and recall rates of 98.8% and 97.1%, respectively, and an F1-score of 97.9%. These results highlight the effectiveness of ensemble learning in strengthening IDS capabilities and enhancing network security.

Al Tawil and Sabri [17] introduce a novel feature selection algorithm for IDS that employs the moth flame optimization (MFO) algorithm. The objective of the algorithm is to reduce training time and improve model precision by selecting pertinent features. The algorithm was evaluated on the CIC-2017 dataset, reducing the number of features from 78 to 4. It obtained a high detection rate (100%) and accuracy (99.9%) with a lower false alarm rate.

Table 1 summarizes the machine learning algorithms applied to intrusion detection systems in the relevant literature. Every row in the table represents a distinct study and lists the algorithms utilized, the dataset, the performance metrics assessed, and the key findings. This comparison illuminates the efficacy of various methodologies in detecting and classifying intrusions, providing essential perspectives for improving cybersecurity protocols.
Table 1. Machine learning approaches in intrusion detection

Ref. | Algorithms used | Dataset | Performance metrics | Key findings
9 | Bayesian network, NB, DT, random decision forest, random tree, decision table, ANN | KDD'99 cup | Precision, Recall, F1-score, Accuracy | RF classifier achieved the highest accuracy of 0.94.
10 | ID3-BA (ID3 classifier + bees algorithm) | KDD Cup99 | FAR, DR, AR | ID3-BA model achieved a DR of 91.02%, AR of 92.002%, and FAR of 3.917%.
2 | FGLCC-CFA (Filter: FGLCC, Wrapper: CFA) | KDD CUP99 | Accuracy, DR, False Positives, Fitness Function | FGLCC-CFA algorithm achieved a DR of 95.23%, AR of 95.03%, and false positives rate of 1.65%.
11 | NN, RF, SVM | KDD Cup 99 | Accuracy | SVM algorithm achieved the highest accuracy score of 0.94.
12 | SSO | KDDCUP 99 | Accuracy | SSO achieved an accuracy rate of 93.3% and reduced the number of features from 41 to 6.
13 | Decision table, RF, random tree, MLP, NB, Bayes network

References & Citation

Source: Phoenix Technical Documentation Library
Category: Network Security
Original: Peer-reviewed research paper / Official guideline
License: CC BY 4.0 (unless otherwise noted)

Suggested Citation:
ML-Powered Intrusion Detection Systems. Phoenix Technical Documentation Library, Avondale.AI. Accessed May 2026. https://avondale.ai/technical/