Automating Cyber Threat Intelligence Collection from Open-Source Feeds
This project implements an automated Cyber Threat Intelligence (CTI) system that collects, normalizes, stores, and visualizes Indicators of Compromise (IoCs) from multiple open-source threat intelligence feeds.
The system addresses the fragmentation, volume, and heterogeneity of open-source CTI by providing a fully automated, modular pipeline that improves efficiency, reduces manual workload, and enables actionable security insights through analytics and visualization.
This project was developed as part of a Bachelor’s Thesis in Systems and Computing Engineering at Universidad de los Andes.
- Automated ingestion from multiple OSINT feeds
- IoC normalization and automatic type detection
- Deduplication and historical tracking
- Centralized SQLite database
- Scheduled and on-demand collection
- Comprehensive logging and statistics
- Interactive Streamlit dashboard
- Data export (CSV / JSON)
The system follows a modular, layered architecture:
- Data Sources (OTX, AbuseIPDB, MalwareBazaar)
- Ingestion Layer (API-based collectors)
- Processing Layer (Normalization and validation)
- Storage Layer (SQLite + deduplication)
- Presentation Layer (Streamlit dashboard)
- Management Layer (Scheduler, logging, configuration)
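The way these layers hand data to one another can be sketched as follows. This is a minimal illustration and not the project's actual code: `run_pipeline`, the `Collector` type, the whitespace-only `normalize`, and the in-memory `store` dictionary are hypothetical stand-ins for the ingestion, processing, and storage modules.

```python
from typing import Callable, Dict, Iterable

# Hypothetical collector type: returns raw indicator strings from one feed.
Collector = Callable[[], Iterable[str]]


def normalize(raw: str) -> str:
    """Processing-layer stand-in: only trims surrounding whitespace."""
    return raw.strip()


def run_pipeline(collectors: Dict[str, Collector], store: Dict[str, int]) -> Dict[str, str]:
    """Run every collector, normalize its output, and count sightings in `store`.

    A failure in one source is recorded in the returned status map and does
    not prevent the remaining sources from being collected.
    """
    status: Dict[str, str] = {}
    for source, collect in collectors.items():
        try:
            for raw in collect():
                indicator = normalize(raw)
                if indicator:
                    store[indicator] = store.get(indicator, 0) + 1
            status[source] = "success"
        except Exception as exc:  # isolate ingestion errors per source
            status[source] = f"error: {exc}"
    return status
```

Keeping each collector behind a plain callable means a new feed can be added without touching the processing or storage code.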
.
├── main.py # System entry point
├── api_ingestion.py # Threat feed ingestion (OTX, AbuseIPDB, MalwareBazaar)
├── normalization.py # IoC detection, validation, normalization
├── database.py # SQLite schema and data access layer
├── scheduler.py # Collection orchestration and scheduling
├── dashboard.py # Streamlit visualization dashboard
├── config.py # Centralized configuration
├── cti_thesis.db # SQLite database (generated at runtime)
├── logs/
│   └── cti_collector.log # System logs
└── README.md
- IP addresses (IPv4 / IPv6)
- URLs
- Domains
- File hashes (MD5, SHA1, SHA256)
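Automatic type detection for these indicator types could look roughly like the sketch below. It is an assumption about one reasonable approach, not the logic in `normalization.py`: the function name `detect_ioc_type`, the returned labels, and the simplified domain pattern (no IDN or public-suffix checks) are all hypothetical.

```python
import ipaddress
import re
from typing import Optional
from urllib.parse import urlparse

# Hex digest length -> hash type.
_HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}
_HEX_RE = re.compile(r"^[0-9a-fA-F]+$")
# Simplified domain pattern (assumption): dot-separated labels, alphabetic TLD.
_DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.IGNORECASE,
)


def detect_ioc_type(value: str) -> Optional[str]:
    """Return 'ipv4', 'ipv6', 'md5', 'sha1', 'sha256', 'url', 'domain', or None."""
    value = value.strip()
    if not value:
        return None
    # IP addresses first: ipaddress validates both IPv4 and IPv6 literals.
    try:
        ip = ipaddress.ip_address(value)
        return "ipv4" if ip.version == 4 else "ipv6"
    except ValueError:
        pass
    # Hashes are identified purely by hex content and digest length.
    if _HEX_RE.match(value) and len(value) in _HASH_LENGTHS:
        return _HASH_LENGTHS[len(value)]
    # URLs need an http(s) scheme and a host part.
    try:
        parsed = urlparse(value)
    except ValueError:  # e.g. malformed bracketed IPv6 host
        return None
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if _DOMAIN_RE.match(value):
        return "domain"
    return None
```

The checks run from most to least specific, so a dotted-quad string is classified as an IP address before the domain pattern is tried.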
| Source | IoC Type | Authentication |
|---|---|---|
| AlienVault OTX | URLs | API Key |
| AbuseIPDB | IPs | API Key |
| MalwareBazaar | Hashes | None |
git clone https://github.com/your-repo/cti-collection-system.git
cd cti-collection-system
pip install requests schedule streamlit pandas plotly
export OTX_API_KEY="your_otx_key"
export ABUSEIPDB_API_KEY="your_abuseipdb_key"
If not set, the system will use the keys defined in config.py (for academic/demo purposes).
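The environment-first lookup with a configured fallback might be written along these lines. This is a sketch under assumptions: `DEFAULT_KEYS` and `get_api_key` are hypothetical names, and the fallback values are placeholders rather than the keys actually stored in `config.py`.

```python
import os

# Placeholder fallback values (hypothetical; not real keys).
DEFAULT_KEYS = {
    "OTX_API_KEY": "demo-otx-key",
    "ABUSEIPDB_API_KEY": "demo-abuseipdb-key",
}


def get_api_key(name: str) -> str:
    """Return the environment value if set and non-empty, else the fallback."""
    return os.environ.get(name) or DEFAULT_KEYS[name]
```

Using `or` rather than a default argument to `os.environ.get` means an empty exported variable also falls back to the configured value.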
python main.py
This will:
- Execute an initial collection
- Schedule daily collection at 09:00
- Run continuously until interrupted
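The daily 09:00 trigger comes down to computing the next run time and waiting until then. The dependency list includes the `schedule` package, which presumably handles this in `scheduler.py`. The sketch below uses only the standard library to show the underlying logic, and `next_run` is a hypothetical helper that assumes naive local datetimes.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(9, 0)) -> datetime:
    """Return the first datetime strictly after `now` whose clock time is `at`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

A long-running loop would sleep for `(next_run(datetime.now()) - datetime.now()).total_seconds()` seconds and then start a collection.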
python main.py single
python main.py stats
streamlit run dashboard.py
The dashboard provides:
- IoC trends over time
- Distribution by type, source, threat level, and confidence
- Top IoCs by frequency
- Collection success metrics
- CSV and JSON export
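As an illustration of the export feature, the sketch below serializes a list of IoC records to CSV and JSON with the standard library. The dashboard itself is built on Streamlit and pandas and may export differently. `to_csv` and `to_json` are hypothetical helpers, and the sketch assumes every record shares the keys of the first one.

```python
import csv
import io
import json
from typing import Dict, List


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize records to CSV text; columns follow the first record's keys."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: List[Dict[str, object]]) -> str:
    """Serialize records to indented JSON; non-JSON values fall back to str()."""
    return json.dumps(rows, indent=2, default=str)
```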
Stores normalized and deduplicated IoCs with historical tracking.
Key fields:
- indicator
- type
- source
- first_seen
- last_seen
- seen_count
- confidence
- threat_level
- metadata
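Deduplication with historical tracking maps naturally onto an SQLite upsert. The schema and helper below are a minimal sketch and not the contents of `database.py`: the table name `iocs`, the column types, the `(indicator, source)` uniqueness key, and the `record_sighting` function are assumptions. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or later.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema built from the key fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS iocs (
    indicator    TEXT NOT NULL,
    type         TEXT NOT NULL,
    source       TEXT NOT NULL,
    first_seen   TEXT NOT NULL,
    last_seen    TEXT NOT NULL,
    seen_count   INTEGER NOT NULL DEFAULT 1,
    confidence   INTEGER,
    threat_level TEXT,
    metadata     TEXT,
    UNIQUE (indicator, source)
)
"""

# Insert a new IoC, or bump last_seen and seen_count if it already exists.
UPSERT = """
INSERT INTO iocs (indicator, type, source, first_seen, last_seen)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (indicator, source) DO UPDATE SET
    last_seen  = excluded.last_seen,
    seen_count = seen_count + 1
"""


def record_sighting(conn: sqlite3.Connection, indicator: str, ioc_type: str, source: str) -> None:
    """Store one sighting; first_seen is kept from the original insert."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(UPSERT, (indicator, ioc_type, source, now, now))
    conn.commit()
```

Because `first_seen` is never touched by the update branch, each row keeps its original discovery time while `last_seen` and `seen_count` track repeat sightings.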
Tracks each automated collection run:
- source
- time
- processed / new / updated IoCs
- status and errors
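A run log of this kind could be recorded with one row per source per collection. The sketch below is an assumption about the shape of such a table: the name `collection_runs`, its columns, and the `log_run` helper are hypothetical and may not match the real schema.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Assumed schema for the run log described above.
RUNS_SCHEMA = """
CREATE TABLE IF NOT EXISTS collection_runs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    source          TEXT NOT NULL,
    run_time        TEXT NOT NULL,
    processed_count INTEGER NOT NULL,
    new_count       INTEGER NOT NULL,
    updated_count   INTEGER NOT NULL,
    status          TEXT NOT NULL,
    error           TEXT
)
"""


def log_run(conn: sqlite3.Connection, source: str, processed: int, new: int,
            updated: int, error: Optional[str] = None) -> int:
    """Insert one run record and return its row id."""
    status = "error" if error else "success"
    cur = conn.execute(
        "INSERT INTO collection_runs "
        "(source, run_time, processed_count, new_count, updated_count, status, error) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (source, datetime.now(timezone.utc).isoformat(), processed, new, updated, status, error),
    )
    conn.commit()
    return cur.lastrowid
```

Collection success metrics, such as the share of runs with `status = 'success'` per source, can then be computed with a simple `GROUP BY` query over this table.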
- Processes 8,000–12,000 IoCs per run
- Accumulated 37,000+ unique IoCs
- Demonstrated effective deduplication (average seen count ≈ 4)
- Automated collection significantly outperformed manual methods
- Dashboard enabled rapid identification of trends and dominant threat sources
- Free-tier API rate limits
- Some feeds provide limited contextual information
- SQLite is suitable for prototyping but not large-scale deployment
- Machine-learning-based threat classification
- Integration of additional open-source CTI feeds
- Alerting and automated defensive actions
- Migration to PostgreSQL or Elasticsearch
- SOC / SIEM integration
Esteban Orjuela Perdomo
Systems and Computing Engineering
Universidad de los Andes
Advisor:
Carlos Andrés Lozano Garzón