Hybrid Stacking Ensemble Model for Phishing URL Detection Using PCA and Machine Learning

Mohammed M Elsheh; Ebtisam Abolawaifa

doi:10.26629/jtr.2025.48

Authors

Mohammed M Elsheh, Computer Science, The Libyan Academy for Postgraduate Studies, Misurata, Libya
Ebtisam Abolawaifa, Computer Science, The Libyan Academy for Postgraduate Studies, Misurata, Libya

Keywords:

Phishing URLs, Voting classifier, Stacking classifier, Logistic Regression, Artificial Neural Network, Random Forest, Cyber Security

Abstract

The rapid growth of internet usage has made cybercrime a global challenge. Among digital threats, phishing is particularly dangerous because it relies on deception: attackers use fake but convincing URLs to steal users' sensitive information, causing significant financial and personal damage. This study addresses the threat by proposing a hybrid machine learning (ML) model designed to improve the identification of malicious URLs. The model combines three algorithms, Logistic Regression (LR), Artificial Neural Networks (ANN), and Random Forest (RF), in a stacking ensemble architecture that leverages the distinct strengths of each base learner. To improve efficiency and performance, Principal Component Analysis (PCA) was applied to reduce the dimensionality of the feature set and lower computational cost. The study used a dataset of over 11,000 labelled URLs sourced from Kaggle. The dataset was prepared with balancing techniques to mitigate class imbalance, which could otherwise compromise model accuracy. Evaluated on accuracy, precision, recall, and F1-score, the hybrid ensemble achieved an accuracy of approximately 99.55%, surpassing all individual base models. This result suggests that the model has strong potential for deployment in real-time phishing detection systems, offering organizations a proactive and reliable defence mechanism against evolving cyber threats.
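
The architecture described in the abstract (PCA followed by a stacking ensemble of LR, ANN, and RF, scored on accuracy, precision, recall, and F1) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. Everything not stated in the abstract is an assumption: the synthetic data from make_classification stands in for the Kaggle URL dataset, and the feature scaling step, the logistic regression meta-learner, the number of principal components, the network size, and the helper names build_stacking_model and evaluate are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_stacking_model(n_components=10, seed=0):
    """Scale features, reduce them with PCA, then stack LR, ANN, and RF."""
    base_learners = [
        ("lr", LogisticRegression(max_iter=1000)),
        # Assumption: a small one-hidden-layer network stands in for the ANN.
        ("ann", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                              random_state=seed)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ]
    # Assumption: logistic regression as the meta-learner. It is trained on
    # out-of-fold predictions of the base learners (cv=5).
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
        stack,
    )


def evaluate(model, X_test, y_test):
    """Return the four metrics named in the abstract for a fitted model."""
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }


if __name__ == "__main__":
    # Synthetic, roughly balanced binary data as a placeholder for real
    # phishing/legitimate URL features.
    X, y = make_classification(n_samples=2000, n_features=30,
                               n_informative=12, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    model = build_stacking_model().fit(X_train, y_train)
    for name, value in evaluate(model, X_test, y_test).items():
        print(f"{name}: {value:.4f}")
```

Scores obtained on the synthetic placeholder data say nothing about the 99.55% accuracy reported in the study. Placing the scaler and PCA inside the pipeline means both are fitted on training data only, which avoids leaking test-set information into the reduced features.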