This project was created as part of the Digital Forensics course.
The required packages are listed in the requirements.txt file. You can install them with:

```
python3 -m pip install -r requirements.txt
```
The goal of this project is to use Machine Learning models to detect whether a URL is malicious. The dataset used contains 447,502 URL records and was taken from Kaggle.
From each URL in the dataset, 20 lexical features were extracted. The classifiers used are: Decision Tree, Random Forest, Naive Bayes, and Logistic Regression.
Feature Name | Explanation |
---|---|
length | Length of the entire URL |
primary_domain_length | Length of the primary domain |
primary_domain_length_url_length_ratio | The ratio between the length of the primary domain and the length of the entire URL |
path_length | Length of the path |
path_length_url_length_ratio | The ratio between the length of the path and the length of the entire URL |
number_digits | Number of characters that are numbers |
number_special_characters | Number of special characters: ':', '//', '.', '/', '?=', ',', ';', '(', ')', ']', '+' |
number_special_characters_path | Number of special characters in the path |
number_dots | Number of dots in the URL |
number_subdomains | Number of subdomains |
url_scheme | The URL scheme (protocol), e.g. https or http |
host_is_ip | Whether the host is an IP address rather than a domain name |
host_has_port | Whether the host specifies a port |
digit_letter_ratio | Ratio between the number of digits and letters |
number_path_subdirectories | Number of subdirectories in the path |
number_single_character_directories | Number of subdirectories that are single character only |
number_queries | Number of query parameters |
is_encoded | Whether the URL contains percent-encoded characters, e.g. %20 for a space |
number_encoded_char | Number of encoded characters |
url_entropy | Shannon entropy of the URL's character distribution |
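
As an illustration, here is a minimal sketch of how some of these features could be computed with Python's standard library. This is not the project's actual extraction code; the function name `extract_features` and its simplifications (e.g. treating the full host as the primary domain, counting subdomains by dots) are assumptions for the example.

```python
import math
from collections import Counter
from urllib.parse import urlparse

def url_entropy(url: str) -> float:
    """Shannon entropy of the URL's character distribution."""
    if not url:
        return 0.0
    counts = Counter(url)
    total = len(url)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(url: str) -> dict:
    """Compute a subset of the lexical features listed above (illustrative)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path or ""
    return {
        "length": len(url),
        "primary_domain_length": len(host),  # simplification: full host used
        "path_length": len(path),
        "path_length_url_length_ratio": len(path) / len(url) if url else 0.0,
        "number_digits": sum(ch.isdigit() for ch in url),
        "number_dots": url.count("."),
        "number_subdomains": max(host.count(".") - 1, 0),  # rough dot-count heuristic
        "url_scheme": parsed.scheme,
        "host_has_port": parsed.port is not None,
        "is_encoded": "%" in url,
        "url_entropy": url_entropy(url),
    }

print(extract_features("http://sub.example.com:8080/a/b/c.php?id=1"))
```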
The results obtained are good. However, host-based features such as DNS records and the domain creation date are not used, nor are features derived from the content the URL points to, because retrieving them introduces a time delay. Using only lexical features keeps classification fast, since they can be computed directly from the URL string.
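
For completeness, below is a minimal sketch of how the four classifiers could be trained and compared with scikit-learn. The synthetic data from `make_classification` is a stand-in for the real feature matrix (20 features, matching the 20 lexical features above); the split ratio and model settings are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch: train and compare the four classifiers on a synthetic
# stand-in dataset. In the real project, X would hold the 20 lexical
# features per URL and y the malicious/benign labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic placeholder for the extracted feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")
```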