Project Title: Feature Selection based Phishing URL Detection using Supervised Machine Learning Methods
Team Members
Dr. G. Padmavathi, Dean - PSCS, Professor, Department of Computer Science
Dr. P. Subashini, Professor, Department of Computer Science
Ms. A. Roshni, Research Assistant, Centre for Cyber Intelligence, DST - CURIE - AI
Ms. N. Nandhini, Master of Computer Application
Project Summary
As digital technologies advance, the digital world is expanding rapidly, and cyber crime is growing with it. Cyber criminals rely on the illegal use of digital assets, particularly personal credentials and financial data. Although they have expanded their data collection methods, social engineering attacks remain their preferred approach. Phishing is a type of social engineering crime in which an attacker attempts to steal someone's identity. It is one of the major cyber attacks, with many internet users falling victim to it. Phishing attacks mostly operate through channels such as email, websites, URLs, SMS, and voice calls. Phishers develop cloned websites and distribute their URLs to a large number of people by email, text message, or social media.
The aim of the project is to detect phishing URLs by combining various feature selection methods with supervised machine learning methods. Machine learning is a branch of artificial intelligence that helps detect phishing attacks without human intervention. The process of phishing URL detection using supervised machine learning comprises five phases. Phase 1 is data collection, in which a phishing URL dataset acquired from the Kaggle repository is used. Phase 2 deals with data preprocessing to remove irrelevant data. In Phase 3, filter, wrapper, and embedded feature selection methods are used to identify the significant features of the dataset that lead to an appropriate result. Phase 4 deals with model building using the supervised machine learning methods K-Nearest Neighbor (K-NN), Random Forest, and Logistic Regression. In Phase 5, a comparative analysis of the supervised machine learning models is carried out to suggest a suitable model for phishing URL detection. The models are evaluated on performance metrics such as accuracy, precision, recall, F1 score, and the ROC curve. Based on the comparative analysis, embedded feature selection attains 88% accuracy, and the Random Forest model performs best, detecting phishing URLs with 97% accuracy under the proposed methodology.
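To make Phases 3 and 5 more concrete, the following is a minimal sketch in plain Python rather than the project's actual code. It assumes a simple correlation-based filter as a stand-in for the filter family of feature selection (the summary does not say which filter was used), and it computes accuracy, precision, recall, and F1 score from predicted labels. The feature names (`url_length`, `has_ip_address`, `constant_flag`), the toy rows, and the helper names `pearson`, `filter_select`, and `classification_metrics` are all hypothetical and chosen only for illustration; they are not taken from the Kaggle dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal about the label
    return cov / sqrt(var_x * var_y)


def filter_select(rows, labels, names, k):
    """Filter-style selection: keep the k features whose absolute
    correlation with the label is highest (an assumed scoring rule)."""
    scores = {
        name: abs(pearson([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)[:k]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, treating label 1 (phishing)
    as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy data: one row per URL, label 1 = phishing, 0 = legitimate.
names = ["url_length", "has_ip_address", "constant_flag"]
rows = [
    [20, 0, 1],
    [25, 0, 1],
    [90, 1, 1],
    [85, 1, 1],
    [30, 0, 1],
    [95, 0, 1],
]
labels = [0, 0, 1, 1, 0, 1]

selected = filter_select(rows, labels, names, k=2)
print("Selected features:", selected)

# Hypothetical predictions from some classifier, scored against true labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print("Metrics:", metrics)
```

A filter method like this scores each feature independently of any model, which is what distinguishes it from the wrapper and embedded families named in Phase 3: a wrapper would retrain a classifier on candidate feature subsets, and an embedded method would read feature importance out of the trained model itself.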