Protect Your Online Security
Deep Learning-Based Phishing URL Intelligent Detection System
We use deep learning to provide high-precision phishing website detection and help protect your personal information.
System Features
High Accuracy
99.8% Detection Precision
Real-time Detection
Millisecond-level Response
Batch Detection
Supports Multi-URL Analysis
System Statistics
Phishing Website Type Distribution
Monthly Detection Trend
Technology Implementation
System Design

Our system utilizes a multi-stage processing workflow to detect phishing URLs through feature extraction and deep learning models. The system primarily includes the following components:
Feature Extraction Module
Manual Features
- URL Length
- Special Character Count
- Numeric Ratio
- Sensitive Word Detection
- Domain Length
Automatic Features
- Branch 1: Character-level Features
- Branch 2: Word-level Features
- Branch 3: N-grams Analysis
- Branch 4: TF-IDF Vectorization
Domain Features
- pr_pos: Position Feature
- pr_val: Value Feature
- harmonic_pos: Harmonic Position
- harmonic_val: Harmonic Value
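As a rough illustration, the manual features listed above could be computed as follows. This is a minimal sketch, not the system's actual code: the function name `manual_features`, the `SENSITIVE_WORDS` list, and the choice to count every non-alphanumeric character as "special" are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the actual sensitive words are not published here.
SENSITIVE_WORDS = ("login", "secure", "verify", "account", "update", "bank")


def manual_features(url: str) -> dict:
    """Compute the five hand-crafted features for a single URL."""
    # urlparse only fills in the hostname when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain = parsed.hostname or ""
    # Assumption: any non-alphanumeric character counts as "special".
    special = sum(1 for ch in url if not ch.isalnum())
    digits = sum(1 for ch in url if ch.isdigit())
    lowered = url.lower()
    return {
        "url_length": len(url),
        "special_char_count": special,
        "numeric_ratio": digits / len(url) if url else 0.0,
        "has_sensitive_word": any(w in lowered for w in SENSITIVE_WORDS),
        "domain_length": len(domain),
    }


features = manual_features("http://paypa1-secure.example.com/login")
print(features)
```

In practice these values would be scaled and concatenated with the automatic and domain features before being passed to the model.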
Detection Process
- Data Input → Retrieve URL samples from URLhaus and Common Crawl
- Similarity Filtering → Ensure Data Set Diversity
- Feature Extraction → Generate Multi-dimensional Feature Vectors
- Model Prediction → Use MGCF-Net Deep Learning Model
- Output Classification → Phishing URL / Legitimate URL
Data Set Construction
Our data set is built from URLhaus and Common Crawl. The construction process is as follows:
Source or Stage | Sample Count | Role |
---|---|---|
URLhaus (Phishing URLs) | 368,319 Samples | 50% of the Data Set |
Common Crawl (Legitimate URLs) | About 370,000 Samples | 50% of the Data Set |
Similarity Filtering | About 200,000 Samples | Used for Model Training |
Adversarial Sample Generation | About 32,000 Samples | Enhance Model Robustness |
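The similarity filtering step is not specified in detail here. One plausible way to do it is a greedy filter that drops any URL too similar to one already kept. The sketch below assumes Jaccard similarity over character trigrams and a hypothetical threshold of 0.8; neither is confirmed as the method actually used.

```python
def trigram_set(url: str) -> set:
    """Return the set of character trigrams in a URL."""
    return {url[i:i + 3] for i in range(len(url) - 2)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_similar(urls, threshold=0.8):
    """Keep a URL only if it is below the threshold against every kept URL."""
    kept, kept_sets = [], []
    for url in urls:
        grams = trigram_set(url)
        if all(jaccard(grams, other) < threshold for other in kept_sets):
            kept.append(url)
            kept_sets.append(grams)
    return kept


sample = [
    "http://example.com/login",
    "http://example.com/login1",   # near-duplicate of the first URL
    "http://other.org/home",
]
print(filter_similar(sample))
```

This pairwise comparison is quadratic in the number of URLs, so a data set of several hundred thousand samples would likely need an approximate method such as MinHash instead.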

Data Set Division
Training Set
80%
Validation Set
10%
Test Set
10%
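The 80/10/10 division can be reproduced with a seeded shuffle. This is a minimal sketch; the function name `split_dataset` and the seed value are assumptions, and whether the original split was stratified by class is not stated.

```python
import random


def split_dataset(samples, seed=42):
    """Shuffle the samples and split them 80% / 10% / 10%."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```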
Adversarial Sample Generation
Character Replacement Technique
Replace characters with look-alikes, such as "o" with "0" or "-" with "_".
Subdomain Addition
Add a brand name as a subdomain of the domain
Consecutive Character Replacement
Replace dots in the domain with consecutive characters
Deceptive Path
Add a fake path like "/secure/login/"
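Three of the techniques above can be sketched with the standard library. The function names are hypothetical, and the sketch assumes character replacement applies to the host part only, which is not stated above. Consecutive character replacement is left out because its exact rule is not described.

```python
from urllib.parse import urlsplit, urlunsplit

# Only the two substitutions named in the text; a real map would be larger.
CHAR_MAP = {"o": "0", "-": "_"}


def replace_characters(url: str) -> str:
    """Swap look-alike characters in the host part of the URL."""
    parts = urlsplit(url)
    host = "".join(CHAR_MAP.get(ch, ch) for ch in parts.netloc)
    return urlunsplit(parts._replace(netloc=host))


def add_brand_subdomain(url: str, brand: str) -> str:
    """Prefix the host with a brand name as a subdomain."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"{brand}.{parts.netloc}"))


def add_deceptive_path(url: str, fake_path: str = "/secure/login/") -> str:
    """Prepend a trustworthy-looking path to the existing path."""
    parts = urlsplit(url)
    path = fake_path.rstrip("/") + parts.path if parts.path else fake_path
    return urlunsplit(parts._replace(path=path))


print(replace_characters("http://my-shop.example.com/"))
print(add_brand_subdomain("http://example.com/index.html", "paypal"))
print(add_deceptive_path("http://example.com"))
```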
Feature Extraction Method
N-grams (n=3) Analysis
We use trigrams to extract semantic information from URLs:
http://www.example.com/login.php?user=admin&action=delete
By analyzing each part of the URL (protocol, hostname, path, and parameters), we can capture the structural features of the URL.
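A minimal sketch of trigram extraction over the example URL is shown below. It assumes character-level trigrams computed per URL component; the text above does not say whether the trigrams are character-level or word-level, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def char_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_component_trigrams(url: str) -> dict:
    """Split a URL into its parts and extract trigrams from each part."""
    parts = urlsplit(url)
    components = {
        "protocol": parts.scheme,
        "hostname": parts.hostname or "",
        "path": parts.path,
        "parameters": parts.query,
    }
    return {name: char_ngrams(value) for name, value in components.items()}


url = "http://www.example.com/login.php?user=admin&action=delete"
for name, grams in url_component_trigrams(url).items():
    print(name, grams[:4])
```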

TF-IDF Vectorization
Term Frequency (TF):
Measures the frequency of term t in URL u.
TF(t,u) = Number of times term t appears in URL u / Total number of terms in URL u
Inverse Document Frequency (IDF):
Evaluates the rareness of term t in the entire corpus.
IDF(t) = log(Total number of URLs in the corpus / (Number of URLs containing term t + 1))
TF-IDF Weight:
TF-IDF(t,u) = TF(t,u) × IDF(t)
This weight can highlight key features in the URL while reducing the impact of common terms.
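The formulas above translate directly into a few lines of Python. This is a sketch under stated assumptions: the tokenizer (splitting on non-alphanumeric characters) and the function names are hypothetical, and the logarithm is taken as the natural log.

```python
import math
import re
from collections import Counter


def tokenize(url: str) -> list:
    """Split a URL into lowercase alphanumeric terms (assumed tokenizer)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def tf_idf(corpus: list) -> list:
    """Return one {term: weight} dict per URL, using the formulas above."""
    docs = [tokenize(u) for u in corpus]
    n_docs = len(docs)
    # Document frequency: number of URLs that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            t: (c / total) * math.log(n_docs / (df[t] + 1))
            for t, c in counts.items()
        })
    return weights


corpus = ["http://a.com/login", "http://b.com/home", "http://c.com/about"]
print(tf_idf(corpus)[0])
```

With the "+1" in the denominator, a term that appears in every URL (such as "http" here) gets a slightly negative weight, which is how very common terms are pushed down.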

Neural Network Model
MGCF-Net
- Use Word Embedding to Capture Semantic Information
- CNN Extracts Local Context Features
- BiLSTM Captures Global Sequence Information
- Semantic Fusion Layer Integrates Features
- Cross Attention Mechanism Enhances Feature Representation
- Integrate Manual Features and Domain Knowledge
DeepCNN_Light_Hybrid
- Lightweight Architecture, Suitable for Low Resource Environments
- Deep CNN Extracts Semantic Representation
- Combine Manual Features to Enhance Performance
- Domain Name Reputation Assessment
- Efficient Feature Concatenation Strategy
Our Development Team
Meet the talented professionals behind this phishing detection system

Huang Hao
Leading the Project
Oversees the entire project, coordinates the team, and ensures the project's goals are achieved through efficient collaboration and planning.

Chen Zijie
Designing the Web Interface
Creates and designs the web interface, ensuring the user experience is intuitive, engaging, and aligned with project goals.

Zhao Chuyu
Analyzing the Data
Focuses on analyzing data to extract meaningful insights, helping guide decisions that drive the project forward.

Tan Mingshu
Managing Version Control
Handles version control and ensures smooth integration of the team’s work, maintaining a seamless development workflow.

Li Zhuyi
Extracting Key Features
Focuses on identifying and extracting essential features from raw data to enhance model performance.

Wen Tianshu
Collecting Valuable Sources
Gathers and curates the datasets needed for the project, ensuring a high-quality and diverse data foundation.

Meng Yu
Drawing Illustrations
Creates visual illustrations, charts, and diagrams to represent data and models in an easily digestible format.

Zhou Yitong
Creating Artistic Designs
Designs creative and artistic visuals to enhance the project's aesthetic appeal and visual communication.