Phishing URL Detection System - Home

System Features

High Accuracy

99.8% Detection Accuracy

Our detection model is trained on data from millions of websites and identifies a wide range of phishing website features, reaching a detection accuracy of up to 99.8%, far above traditional rule-based detection methods.

Real-time Detection

Millisecond-level Response

Built on an optimized neural network architecture, the system completes URL feature extraction and risk assessment within milliseconds, providing immediate security protection.

Batch Detection

Supports Multi-URL Analysis

The system supports batch URL detection and file import. Enterprise users can check hundreds or thousands of URLs at once, which improves efficiency and suits large-scale security audits.

System Statistics

200,000+

URLs Scanned

99.8%

Detection Accuracy

100,000+

Phishing Sites Identified

1,000+

Active Users

Phishing Website Type Distribution

Monthly Detection Trend

Technology Implementation

System Design

Our system utilizes a multi-stage processing workflow to detect phishing URLs through feature extraction and deep learning models. The system primarily includes the following components:

Feature Extraction Module

Manual Features

URL Length
Special Character Count
Numeric Ratio
Sensitive Word Detection
Domain Length
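
The five manual features above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual extractor: the function name manual_features, the SENSITIVE_WORDS list, and the definition of a "special character" (anything that is not a letter or digit) are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical keyword list; the source does not say which words count as sensitive.
SENSITIVE_WORDS = ("login", "secure", "account", "verify", "update", "bank")


def manual_features(url: str) -> dict:
    """Compute the five hand-crafted features for a single URL."""
    host = urlsplit(url).hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        # Assumption: any character that is not a letter or digit is "special".
        "special_char_count": sum(not c.isalnum() for c in url),
        # Share of characters in the URL that are digits.
        "numeric_ratio": sum(c.isdigit() for c in url) / len(url) if url else 0.0,
        # Number of distinct sensitive words that occur anywhere in the URL.
        "sensitive_word_count": sum(word in lowered for word in SENSITIVE_WORDS),
        "domain_length": len(host),
    }
```

Calling manual_features("http://paypa1.example.com/secure/login") returns one dictionary per URL, which can then be concatenated with the automatic and domain features.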

Automatic Features

Branch 1: Character-level Features
Branch 2: Word-level Features
Branch 3: N-gram Analysis
Branch 4: TF-IDF Vectorization

Domain Features

pr_pos: Position Feature
pr_val: Value Feature
harmonic_pos: Harmonic Position
harmonic_val: Harmonic Value

Detection Process

Data Input → Retrieve URL samples from URLhaus and Common Crawl
Similarity Filtering → Ensure Data Set Diversity
Feature Extraction → Generate Multi-dimensional Feature Vectors
Model Prediction → Use MGCF-Net Deep Learning Model
Output Classification → Phishing URL / Legitimate URL

Data Set Construction

Our data set is built from URLhaus and Common Crawl. The construction process is as follows:

Data Source / Stage	Sample Count	Role in Data Set
URLhaus (Phishing URLs)	368,319	50% of the Data Set
Common Crawl (Legitimate URLs)	About 370,000	50% of the Data Set
Similarity Filtering	About 200,000	Used for Model Training
Adversarial Sample Generation	About 32,000	Enhances Model Robustness

Data Set Division

Training Set

80%

Validation Set

10%

Test Set

10%
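
An 80/10/10 division like the one above can be sketched as follows. The function name split_dataset and the fixed seed are assumptions for illustration; the source does not say whether the real split is stratified by class or how it is seeded.

```python
import random


def split_dataset(samples, seed=42):
    """Shuffle the samples and split them 80% / 10% / 10%."""
    items = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    # The test set takes whatever remains, so no sample is dropped.
    test = items[n_train + n_val:]
    return train, val, test
```

Integer arithmetic is used for the boundaries so that rounding never loses or duplicates a sample.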

Adversarial Sample Generation

Character Replacement Technique

Replace characters with look-alikes, such as "o" with "0" or "-" with "_".

Subdomain Addition

Prepend a brand name as a subdomain of the domain

Consecutive Character Replacement

Replace dots in the domain with consecutive characters

Deceptive Path

Add a fake path like "/secure/login/"
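
Three of the four techniques above can be sketched with the standard library. The function names and the two-entry CHAR_MAP are assumptions for illustration. Consecutive character replacement is left out because the text does not say which characters replace the dots.

```python
from urllib.parse import urlsplit, urlunsplit

# Only the two substitutions named in the text; a fuller map is an open choice.
CHAR_MAP = {"o": "0", "-": "_"}


def replace_chars(url: str) -> str:
    """Swap characters in the host part for look-alikes."""
    parts = urlsplit(url)
    host = "".join(CHAR_MAP.get(c, c) for c in parts.netloc)
    return urlunsplit(parts._replace(netloc=host))


def add_brand_subdomain(url: str, brand: str) -> str:
    """Prepend a brand name as a subdomain of the existing host."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"{brand}.{parts.netloc}"))


def add_deceptive_path(url: str, fake_path: str = "/secure/login/") -> str:
    """Insert a trustworthy-looking path in front of the original path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=fake_path + parts.path.lstrip("/")))
```

For example, add_brand_subdomain("http://example.com/", "paypal") yields "http://paypal.example.com/". Each generated URL keeps the label of the URL it was derived from.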

Feature Extraction Method

N-gram (n=3) Analysis

We use trigrams to extract semantic information from URLs:

http://www.example.com/login.php?user=admin&action=delete

http www example com login php user admin action delete

By analyzing each part of the URL (protocol, hostname, path, and parameters), we capture its structural features.
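
The tokenization shown in the example, followed by trigram construction, might look like the sketch below. It assumes that tokens are split on every run of non-alphanumeric characters and that trigrams are built over tokens rather than characters; the source does not state either choice explicitly.

```python
import re


def tokenize(url: str) -> list[str]:
    """Split a URL on every run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url) if tok]


def ngrams(tokens: list[str], n: int = 3) -> list[tuple[str, ...]]:
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Applied to the example URL, tokenize produces the ten tokens listed above, and ngrams turns them into eight trigrams, starting with ("http", "www", "example").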

TF-IDF Vectorization

Term Frequency (TF):

Measures the frequency of term t in URL u.

TF(t,u) = Number of times term t appears in URL u / Total number of terms in URL u

Inverse Document Frequency (IDF):

Measures how rare term t is across the entire corpus.

IDF(t) = log(Total number of URLs in the corpus / (Number of URLs containing term t + 1))

TF-IDF Weight:

TF-IDF(t,u) = TF(t,u) × IDF(t)

This weight highlights distinctive terms in a URL while reducing the impact of common ones.
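
The three formulas translate directly into Python. This sketch assumes each URL has already been split into a list of tokens and that "log" means the natural logarithm; the function names are illustrative.

```python
import math


def tf(term, tokens):
    """Occurrences of term in one URL divided by that URL's token count."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(N / (df + 1)), where df is the number of URLs containing term."""
    containing = sum(term in tokens for tokens in corpus)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(tokens, corpus):
    """TF-IDF weight for every distinct term of one tokenized URL."""
    return {t: tf(t, tokens) * idf(t, corpus) for t in set(tokens)}
```

With the +1 in the denominator, a term that occurs in every URL of the corpus (such as "http") receives a slightly negative weight, so it contributes less than any rarer term.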

Neural Network Model

MGCF-Net

Use Word Embedding to Capture Semantic Information
CNN Extracts Local Context Features
BiLSTM Captures Global Sequence Information
Semantic Fusion Layer Integrates Features
Cross Attention Mechanism Enhances Feature Representation
Integrate Manual Features and Domain Knowledge

DeepCNN_Light_Hybrid

Lightweight Architecture, Suitable for Low Resource Environments
Deep CNN Extracts Semantic Representation
Combine Manual Features to Enhance Performance
Domain Name Reputation Assessment
Efficient Feature Concatenation Strategy

Our Development Team

Meet the talented professionals behind this phishing detection system

Huang Hao

Leading the Project

Oversees the entire project, coordinates the team, and ensures the project's goals are achieved through efficient collaboration and planning.

Leadership, Strategy

Chen Zijie

Designing the Web Interface

Creates and designs the web interface, ensuring the user experience is intuitive, engaging, and aligned with project goals.

UI/UX Design, Web Development

Zhao Chuyu

Analyzing the Data

Focuses on analyzing data to extract meaningful insights, helping guide decisions that drive the project forward.

Data Analysis, Data Insights

Tan Mingshu

Managing Version Control

Handles version control and ensures smooth integration of the team’s work, maintaining a seamless development workflow.

Version Control, Git Management

Li Zhuyi

Extracting Key Features

Focuses on identifying and extracting essential features from raw data to enhance model performance.

Feature Engineering, Data Processing

Wen Tianshu

Collecting Valuable Sources

Gathers and curates the data sets needed for the project, ensuring a high-quality and diverse data foundation.

Data Collection, Source Management

Meng Yu

Drawing Illustrations

Creates visual illustrations, charts, and diagrams to represent data and models in an easily digestible format.

Illustration, Data Visualization

Zhou Yitong

Creating Artistic Designs

Designs creative and artistic visuals to enhance the project's aesthetic appeal and visual communication.

Art, Creative Design

Protect Your Online Security

Deep Learning-Based Phishing URL Intelligent Detection System