This project scrapes content from multiple websites asynchronously, splits it into semantic chunks using NLTK sentence tokenization and Sentence Transformers embeddings, and stores the chunks in a Milvus vector database for efficient similarity search and retrieval. It is built with Python, LangChain, SentenceTransformers, and Milvus.
- Python 3.9
- aiohttp
- nltk
- pandas
- sentence-transformers
- pymilvus
- langchain
- scikit-learn
- numpy
🌐 Scrapes web content asynchronously
✂️ Tokenizes and chunks data using NLTK
🧠 Converts text into dense semantic embeddings
🧲 Stores embeddings in Milvus for fast vector search
🔄 Supports Dockerized deployment for consistent setup
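As a rough illustration of the asynchronous scraping step, the sketch below fetches several pages concurrently with a bounded number of in-flight requests. The function names (`scrape_all`, `scrape_with_aiohttp`), the concurrency limit, and the 30-second timeout are assumptions made for this example and are not taken from the project's script. It uses aiohttp only and leaves out the LangChain part.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# A fetcher is any coroutine function that maps a URL to its page text.
Fetcher = Callable[[str], Awaitable[str]]


async def scrape_all(
    urls: Iterable[str],
    fetch: Fetcher,
    max_concurrency: int = 5,  # hypothetical default, tune as needed
) -> Dict[str, Optional[str]]:
    """Fetch all URLs concurrently; a failed URL maps to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str):
        # The semaphore caps how many requests run at the same time.
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                return url, None

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)


async def scrape_with_aiohttp(urls: Iterable[str]) -> Dict[str, Optional[str]]:
    """Run scrape_all with a real HTTP fetcher backed by aiohttp."""
    import aiohttp  # imported lazily so scrape_all works without it

    timeout = aiohttp.ClientTimeout(total=30)  # assumed timeout
    async with aiohttp.ClientSession(timeout=timeout) as session:

        async def fetch(url: str) -> str:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()

        return await scrape_all(urls, fetch)
```

A call such as `asyncio.run(scrape_with_aiohttp(["https://example.com"]))` would return a dictionary from URL to raw HTML. Passing the fetcher in as a parameter keeps the concurrency logic easy to exercise without network access.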
- Asynchronous web scraping with aiohttp and langchain
- Semantic chunking using NLTK sentence tokenization
- Embedding with sentence-transformers/all-MiniLM-L6-v2
- Vector similarity search using Milvus
- Dockerized setup with Milvus, MinIO, Etcd, and Python environment
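One possible way to do the semantic chunking is sketched below: consecutive sentences stay in the same chunk while their embeddings remain similar, and a new chunk starts when the cosine similarity drops below a threshold. This grouping rule, the `0.5` threshold, and the function names are assumptions for illustration. The project's actual chunking logic may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, tune per corpus
) -> List[str]:
    """Group adjacent sentences whose embeddings are similar enough."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    chunks: List[str] = []
    current: List[str] = []
    for i, sentence in enumerate(sentences):
        # Start a new chunk when this sentence drifts from the previous one.
        if current and cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, threshold: float = 0.5) -> List[str]:
    """Tokenize with NLTK, embed with all-MiniLM-L6-v2, then chunk."""
    import nltk  # requires the NLTK sentence tokenizer data to be downloaded
    from sentence_transformers import SentenceTransformer

    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return []
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)  # one 384-dimensional vector each
    return semantic_chunks(sentences, embeddings.tolist(), threshold)
```

The heavy dependencies are imported inside `chunk_text`, so the grouping function can be tried with hand-written vectors before any model is downloaded.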
1. Clone the Repository
git clone https://github.com/yourusername/semantic-web-milvus.git
cd semantic-web-milvus
2. Start Docker Services
docker-compose up --build -d
This will spin up:
- Milvus vector database
- Etcd (metadata service)
- MinIO (object storage)
- Python container (milvus-python) with all dependencies pre-installed
3. Access Python Container
docker exec -it milvus-python bash
4. Run Your Main Script
python your_script.py
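For orientation, here is a minimal sketch of what the storage and search part of such a script might look like with the `MilvusClient` interface from pymilvus. The URI `http://localhost:19530`, the collection name `web_chunks`, and the helper names are assumptions for this example. Inside the Docker network the host is likely the Milvus service name from docker-compose.yml instead of `localhost`.

```python
from typing import Dict, List, Sequence

EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2


def build_rows(
    chunks: Sequence[str], vectors: Sequence[Sequence[float]]
) -> List[Dict]:
    """Pair each text chunk with its vector in the row format Milvus expects."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    return [
        {"id": i, "vector": list(vec), "text": chunk}
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]


def store_and_search(
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
    query_vector: Sequence[float],
    uri: str = "http://localhost:19530",  # assumed address
    collection: str = "web_chunks",       # hypothetical collection name
    top_k: int = 3,
):
    """Insert chunks into Milvus and return the top_k nearest to the query."""
    from pymilvus import MilvusClient  # imported lazily; needs a running Milvus

    client = MilvusClient(uri=uri)
    if not client.has_collection(collection_name=collection):
        # Quick-setup collection: an "id" primary key, a "vector" field,
        # and dynamic fields, which is where "text" is stored.
        client.create_collection(
            collection_name=collection, dimension=EMBEDDING_DIM
        )
    # Note: ids restart at 0 on every call, so repeated runs would need
    # a different id scheme to avoid duplicate primary keys.
    client.insert(collection_name=collection, data=build_rows(chunks, vectors))
    return client.search(
        collection_name=collection,
        data=[list(query_vector)],
        limit=top_k,
        output_fields=["text"],
    )
```

The query vector should come from the same embedding model as the stored chunks so that distances are comparable.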
├── docker-compose.yml
├── Dockerfile.python
├── scripts/
│ └── your_script.py
├── volumes/
│ ├── etcd/
│ ├── milvus/
│ └── minio/
- Building search engines over scraped web content
- Knowledge base construction with semantic search
- Content recommendation systems
🙋 Author
LinkedIn: http://www.linkedin.com/in/SwapnilTaware
GitHub: https://github.com/itsSwapnil
Email: tawareswapnil23@gmail.com
This project is licensed under the MIT License.