This guide provides comprehensive instructions for using TinySearch, a lightweight vector retrieval system designed for embedding, indexing, and searching over text data.
```bash
pip install tinysearch

# With API support
pip install tinysearch[api]

# With embedding models support
pip install tinysearch[embedders]

# With all document adapters
pip install tinysearch[adapters]

# With all features
pip install tinysearch[full]
```

TinySearch consists of several key components that work together to enable efficient text search:
- DataAdapter: Extracts text from various file formats (TXT, PDF, CSV, Markdown, JSON)
- TextSplitter: Chunks text into appropriate segments for embedding
- Embedder: Generates vector embeddings from text chunks
- VectorIndexer: Builds and maintains a FAISS index for efficient similarity search
- QueryEngine: Processes queries and retrieves relevant context
- FlowController: Orchestrates the entire data flow
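The way these components fit together can be illustrated with a self-contained toy, independent of the library: a character splitter, a stand-in embedder, and an exact cosine-similarity index. All names below are illustrative, not TinySearch APIs; a real system would use the actual adapter, embedder, and FAISS indexer shown later in this guide.

```python
import math

def split_characters(text, chunk_size=12, chunk_overlap=4):
    # TextSplitter role: fixed-size character windows with overlap
    # (assumes chunk_overlap < chunk_size)
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text):
    # Embedder role: a normalized letter-frequency vector stands in
    # for a real embedding model
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class ToyIndex:
    # VectorIndexer role: exact cosine search over an in-memory list
    def __init__(self):
        self.vectors, self.chunks = [], []

    def add(self, chunk):
        self.vectors.append(embed(chunk))
        self.chunks.append(chunk)

    def search(self, query, top_k=2):
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), c)
                  for v, c in zip(self.vectors, self.chunks)]
        return sorted(scored, reverse=True)[:top_k]

# QueryEngine/FlowController role: wire the pieces together
index = ToyIndex()
for chunk in ["cats purr softly", "dogs bark loudly"]:
    index.add(chunk)
print(index.search("a purring cat", top_k=1))
```

Swapping the stand-ins for the real adapter, embedder, and indexer is exactly what the configuration file and the FlowController do for you.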
TinySearch uses a YAML configuration file to control all aspects of the system. Here's a sample configuration:
```yaml
# Data adapter configuration
adapter:
  type: text              # Options: text, pdf, csv, markdown, json, custom
  params:
    encoding: utf-8

# Text splitter configuration
splitter:
  type: character
  chunk_size: 300         # Characters per chunk
  chunk_overlap: 50       # Overlap between chunks
  separator: "\n\n"       # Optional paragraph separator

# Embedding model configuration
embedder:
  type: huggingface
  model: Qwen/Qwen-Embedding   # Or any HuggingFace model
  device: cuda                 # Set to "cpu" if no GPU is available
  normalize: true

# Vector indexer configuration
indexer:
  type: faiss
  index_path: index.faiss
  metric: cosine          # Options: cosine, l2, ip (inner product)
  index_type: Flat        # Options: Flat, IVF, HNSW

# Query engine configuration
query_engine:
  method: template
  template: "Please help me find: {query}"
  top_k: 5

# Flow controller configuration
flow:
  use_cache: true
  cache_dir: .cache
```

To build a search index from your documents:
```bash
tinysearch index --data ./your_documents --config config.yaml
```

Options:
- `--data`: Path to a file or directory containing documents
- `--config`: Path to your configuration file
- `--force`: Force reprocessing of all files, ignoring cache
To search your indexed documents:
```bash
tinysearch query --q "Your search query" --config config.yaml --top-k 5
```

Options:
- `--q` or `--query`: Your search query
- `--config`: Path to your configuration file
- `--top-k`: Number of results to return (overrides config file)
To start the API server:
```bash
tinysearch-api --config config.yaml --port 8000
```

Options:
- `--config`: Path to your configuration file
- `--port`: Port to run the server on (default: 8000)
- `--host`: Host to bind to (default: 127.0.0.1)
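The server can also be called from Python. A minimal sketch using only the standard library, assuming the server is running on the default host and port; the `search` helper and `build_query_payload` function are illustrative, not part of TinySearch:

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # default host and port

def build_query_payload(query: str, top_k: int = 5) -> bytes:
    # JSON body with the fields the /query endpoint expects
    return json.dumps({"query": query, "top_k": top_k}).encode("utf-8")

def search(query: str, top_k: int = 5) -> list:
    # POST to /query and return the "results" list from the response
    req = urllib.request.Request(
        API_URL + "/query",
        data=build_query_payload(query, top_k),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]
```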
Once the API server is running, you can query it using HTTP requests:
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Your search query", "top_k": 5}'
```

Response format:

```json
{
  "results": [
    {
      "text": "The relevant text chunk",
      "score": 0.95,
      "metadata": {
        "source": "/path/to/original/file.txt"
      }
    },
    ...
  ]
}
```

To build or rebuild the index over a directory of documents:

```bash
curl -X POST http://localhost:8000/build-index \
  -H "Content-Type: application/json" \
  -d '{"data_path": "./your_documents", "force_reprocess": false}'
```

You can create custom data adapters by implementing the DataAdapter interface:
```python
from tinysearch.base import DataAdapter

class MyCustomAdapter(DataAdapter):
    def __init__(self, special_param=None):
        self.special_param = special_param

    def extract(self, filepath):
        # Your code to extract text from the file
        # ...
        return [text1, text2, ...]
```

Then configure it in your config.yaml:
```yaml
adapter:
  type: custom
  params:
    module: my_module
    class: MyCustomAdapter
    init:
      special_param: value
```

You can use TinySearch directly in your Python code:
```python
from tinysearch.adapters.text import TextAdapter
from tinysearch.splitters.character import CharacterTextSplitter
from tinysearch.embedders.huggingface import HuggingFaceEmbedder
from tinysearch.indexers.faiss_indexer import FAISSIndexer
from tinysearch.query.template import TemplateQueryEngine
from tinysearch.flow.controller import FlowController

# Create components
adapter = TextAdapter()
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
embedder = HuggingFaceEmbedder(model_name="Qwen/Qwen-Embedding", device="cpu")
indexer = FAISSIndexer()
query_engine = TemplateQueryEngine(indexer=indexer, embedder=embedder)

# Configuration
config = {
    "flow": {
        "use_cache": True,
        "cache_dir": ".cache"
    },
    "query_engine": {
        "top_k": 5
    }
}

# Create FlowController
controller = FlowController(
    data_adapter=adapter,
    text_splitter=splitter,
    embedder=embedder,
    indexer=indexer,
    query_engine=query_engine,
    config=config
)

# Build index
controller.build_index("./your_documents")

# Query
results = controller.query("Your search query")

# Process results
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Text: {result['chunk'].text}")
    print(f"Source: {result['chunk'].metadata.get('source', 'Unknown')}")
    print("---")
```

TinySearch includes a simple web-based user interface that makes it easy to search your indexed documents and manage your index without using the command line.
The web UI is built into the API server. To start it, run:
```bash
tinysearch-api
```

By default, the server starts on http://localhost:8000. Open this URL in your web browser to access the UI.
The web interface consists of three main sections:
The Search tab allows you to query your index. Simply enter your search query and select how many results you'd like to see. The results show:
- The text content of each matching chunk
- The source document
- The relevance score (higher is better)
The Index Management tab provides tools to:
- Upload Document: Upload individual files to be indexed
- Build Index: Process a directory of files to build or update your index
- Clear Index: Remove all indexed documents and start fresh
The Stats tab shows information about your current index, including:
- The number of processed files
- Whether caching is enabled
- A list of all processed files
The web UI is built using Bootstrap 5 and vanilla JavaScript. If you'd like to customize its appearance or behavior, you can modify the files in the `tinysearch/api/static` directory.
- Model Download Failures:
  - Ensure you have internet connectivity when using HuggingFace models for the first time
  - Set `cache_dir` in the embedder config to a writable directory
- Out of Memory Errors:
  - Reduce `batch_size` in the embedder configuration
  - Use a smaller embedding model
  - Process smaller datasets or reduce chunk size
- Slow Indexing:
  - Enable caching with `use_cache: true` in the flow configuration
  - Use a faster FAISS index type (like IVF), but note this may reduce accuracy
- Poor Search Results:
  - Adjust chunk size and overlap to better match your content
  - Use a more appropriate embedding model for your domain
  - Try different similarity metrics (cosine, L2, inner product)
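Several of these fixes map directly onto the sample configuration above. As a sketch, speed- and memory-oriented overrides might look like this (the `batch_size` key follows the troubleshooting note above; check your embedder's actual options before relying on it):

```yaml
embedder:
  type: huggingface
  model: Qwen/Qwen-Embedding
  device: cpu
  batch_size: 8          # smaller batches lower peak memory during embedding

indexer:
  type: faiss
  index_path: index.faiss
  metric: cosine
  index_type: IVF        # faster than Flat on large corpora, slightly approximate
```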
If you encounter any issues or have questions, please:
- Check the documentation
- Open an issue on GitHub
TinySearch is licensed under the MIT License.