```shell
# 1. Install dependencies
uv sync

# 2. Start the local database
cd local_database
docker compose up -d
cd ..

# 3. Create your .env file (see Environment Variables below)

# 4. Run the app
fastapi dev main.py
```

Then open http://localhost:8000/api for the interactive API docs.
## Environment Variables

Create a `.env` file in the repository root. See `ENV.md` for the full reference. At minimum, you need the database connection variables:

```shell
POSTGRES_USER=test_source_collector_user
POSTGRES_PASSWORD=HanviliciousHamiltonHilltops
POSTGRES_DB=source_collector_test_db
POSTGRES_HOST=127.0.0.1
POSTGRES_PORT=5432
DEV=true
```

These match the defaults in `local_database/docker-compose.yml`.
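As a quick sanity check that these variables are wired together correctly, you can assemble them into a standard `postgresql://` connection URL. This helper is purely illustrative (it is not part of the repo; the app builds its own connection internally):

```python
import os


def postgres_dsn(env=os.environ) -> str:
    """Assemble a libpq-style connection URL from the POSTGRES_* variables.

    Illustrative only — the application constructs its own connection.
    """
    return (
        f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env['POSTGRES_PORT']}/{env['POSTGRES_DB']}"
    )
```

With the values above this yields `postgresql://test_source_collector_user:...@127.0.0.1:5432/source_collector_test_db`, which you can pass to `psql` to verify the container is reachable.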
You'll need additional keys depending on which features you're working on:

| Variable | Required For |
|---|---|
| `DS_APP_SECRET_KEY` | Any authenticated endpoint |
| `GOOGLE_API_KEY`, `GOOGLE_CSE_ID` | Auto-Googler collector |
| `DEEPSEEK_API_KEY` or `OPENAI_API_KEY` | LLM-powered tasks |
| `HUGGINGFACE_INFERENCE_API_KEY` | ML classification tasks |
| `HUGGINGFACE_HUB_TOKEN` | Uploading to HuggingFace |
| `PDAP_EMAIL`, `PDAP_PASSWORD`, `PDAP_API_KEY`, `PDAP_API_URL` | Syncing to the Data Sources App |
| `DISCORD_WEBHOOK_URL` | Error notifications |
| `INTERNET_ARCHIVE_S3_KEYS` | Internet Archive integration |
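If you want to see at a glance which features your current `.env` covers, a small check like the following works. The feature names and groupings here are taken from the table above; the mapping itself is illustrative, not the app's actual configuration code:

```python
import os

# Feature-to-variable mapping based on the table above (illustrative only).
FEATURE_KEYS = {
    "authenticated_endpoints": ["DS_APP_SECRET_KEY"],
    "auto_googler": ["GOOGLE_API_KEY", "GOOGLE_CSE_ID"],
    "huggingface_upload": ["HUGGINGFACE_HUB_TOKEN"],
    "discord_notifications": ["DISCORD_WEBHOOK_URL"],
}


def unmet_features(env=os.environ) -> list[str]:
    """Return features whose required variables are not all set."""
    return [
        feature
        for feature, keys in FEATURE_KEYS.items()
        if not all(env.get(k) for k in keys)
    ]
```

Running `unmet_features()` with an empty environment lists every feature; setting both Google keys removes `auto_googler` from the result.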
## Feature Flags

All features are enabled by default. To disable a feature during development, set its flag to `0`:

```shell
SCHEDULED_TASKS_FLAG=0   # Disable all scheduled tasks
POST_TO_DISCORD_FLAG=0   # Disable Discord notifications
```

See `ENV.md` for the full list of flags.
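The documented behavior — on by default, off only when explicitly set to `0` — can be sketched like this. The real parsing lives in `EnvVarManager` and may differ in detail:

```python
import os


def flag_enabled(name: str, env=os.environ) -> bool:
    """Treat a feature flag as enabled unless it is explicitly set to "0".

    Sketch of the documented default-on behavior; EnvVarManager is the
    source of truth.
    """
    return env.get(name, "1") != "0"
```

Note that under this rule an unset flag and a flag set to `1` behave identically.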
### Empty Database

This gives you an empty database — good for running tests and isolated development.

```shell
cd local_database
docker compose up -d
```

The database schema is automatically created on app startup via Alembic migrations.
To stop the database:

```shell
cd local_database
docker compose down
```

### Mirrored Database

This gives you a local copy of production data — useful for debugging or working with realistic data.

```shell
python start_mirrored_local_app.py
```

This script:
- Starts the local database container.
- Runs the DataDumper to pull a snapshot from production (cached for 24 hours).
- Restores the snapshot into your local database.
- Applies any pending Alembic migrations.
- Starts the FastAPI server.
The mirrored approach requires additional environment variables for the production database connection. See the Data Dumper section in ENV.md.
## Database Migrations

This project uses Alembic for database migrations. To create a new migration:

```shell
alembic revision --autogenerate -m "Description for migration"
```

Then review the generated file in `alembic/versions/` and adjust the `upgrade()` and `downgrade()` functions as needed.

Migrations are applied automatically on app startup. To apply them manually:

```shell
python apply_migrations.py
```

Or using Alembic directly:

```shell
alembic upgrade head
```

See `alembic/README.md` for more details.
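A plausible shape for `apply_migrations.py` is a thin wrapper that shells out to `alembic upgrade head`. This is a sketch, not the repo's actual script; the command is injectable here purely so the sketch can be exercised without a database:

```python
import subprocess


def apply_migrations(revision: str = "head", alembic_cmd=("alembic",)) -> int:
    """Run `alembic upgrade <revision>` and return the process exit code.

    Sketch only — the real apply_migrations.py may use Alembic's Python
    API instead of a subprocess.
    """
    result = subprocess.run(
        [*alembic_cmd, "upgrade", revision],
        capture_output=True,
        text=True,
    )
    return result.returncode
```

A nonzero return code means the upgrade failed; in that case re-run with `alembic` directly to see the full error output.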
```
.
├── src/                  # Application source code
│   ├── api/              # FastAPI routers and endpoints
│   ├── core/             # Integration layer and task system
│   ├── db/               # Database models, client, queries
│   ├── collectors/       # URL collection strategies
│   ├── external/         # External service clients
│   ├── security/         # Authentication and authorization
│   └── util/             # Shared utilities
├── tests/                # Test suite
├── alembic/              # Database migrations
├── local_database/       # Docker setup for local PostgreSQL
├── docs/                 # Documentation (you are here)
├── main.py               # Alternative entry point
├── docker-compose.yml    # Test environment (app + database)
├── Dockerfile            # Production container
└── ENV.md                # Full environment variable reference
```
### Adding an Endpoint

- Create a directory under `src/api/endpoints/<group>/`.
- Follow the existing pattern: `routes.py` for the router, subdirectories for each HTTP method.
- Include the router in `src/api/main.py`.
### Adding a Collector

See `collectors.md` for the full guide.
### Adding a Scheduled Task

- Create a new task operator in `src/core/tasks/scheduled/impl/`.
- Register it in the scheduled task loader (`src/core/tasks/scheduled/loader.py`).
- Add a corresponding flag in `EnvVarManager` and document it in `ENV.md`.
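The register-plus-flag pattern described above can be sketched as a small registry. The registry, decorator, operator class, and `EXAMPLE_TASK_FLAG` name are all hypothetical; the real loader in `src/core/tasks/scheduled/loader.py` and the `EnvVarManager` flags will differ:

```python
import os

# Hypothetical registry of scheduled task operators, keyed by class name.
SCHEDULED_OPERATORS: dict[str, type] = {}


def register_operator(flag_name: str):
    """Register an operator class unless its feature flag is set to "0"."""
    def decorator(cls):
        if os.environ.get(flag_name, "1") != "0":
            SCHEDULED_OPERATORS[cls.__name__] = cls
        return cls
    return decorator


@register_operator("EXAMPLE_TASK_FLAG")  # hypothetical flag name
class ExampleTaskOperator:
    def run(self) -> None:
        print("running example task")
```

Gating registration on the flag means a disabled task never enters the scheduler at all, which matches the default-on flag behavior documented earlier.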
### Adding a URL Task Operator

- Create a new operator in `src/core/tasks/url/operators/`.
- Register it in the URL task loader (`src/core/tasks/url/loader.py`).
- Add a corresponding flag in `EnvVarManager` and document it in `ENV.md`.