High-level view of how data moves through Boost Data Collector: external sources, Django collector apps, shared storage, and vector search. For execution order, Celery Beat, and YAML groups, see Workflow.md. For the schedule → command wiring, see Development_guideline.md.
Schema vs flow: Schema.md documents tables and relationships per domain. This page complements it by showing cross-app data movement (ingest → PostgreSQL / workspace → Pinecone), which is easy to miss when reading ORM diagrams alone.
flowchart LR
subgraph sources [External_sources]
GH[GitHub_API]
SL[Slack_API]
DC[Discord]
ML[Mailing_lists]
YT[YouTube]
end
subgraph collectors [Django_apps_commands]
CMD[Management_commands]
end
subgraph storage [Local_and_DB]
WS[workspace_raw]
PG[(PostgreSQL)]
end
subgraph vectors [Search]
PC[cppa_pinecone_sync]
PIN[Pinecone]
end
GH --> CMD
SL --> CMD
DC --> CMD
ML --> CMD
YT --> CMD
CMD --> WS
CMD --> PG
PG --> PC
PC --> PIN
- Collectors are Django apps exposing
management/commands(scheduled viaboost_collector_runnerYAML + Celery Beat, or run manually withmanage.py). - Workspace holds clones, exports, and intermediate files under
WORKSPACE_DIR(see Workspace.md). - PostgreSQL is the system of record for ORM models across apps.
- cppa_pinecone_sync (and app-specific upsert paths) push embeddings/metadata to Pinecone namespaces.
Scheduled batch work is driven by boost_collector_runner reading config/boost_collector_schedule.yaml, which invokes run_scheduled_collectors → individual run_* management commands (one command per task line in the YAML).
flowchart TB
subgraph orch [Orchestration]
YAML[boost_collector_schedule.yaml]
RSC[run_scheduled_collectors]
end
YAML -.schedule.-> RSC
subgraph apps [Collector_commands_project_apps]
CMDS[Per_app_run_commands_in_INSTALLED_APPS]
end
RSC --> CMDS
CMDS --> PG[(PostgreSQL)]
CMDS --> WS[WORKSPACE_DIR]
CMDS --> PC[cppa_pinecone_sync]
PG --> PC
PC --> PIN[(Pinecone)]
Typical persistence (column below): where this app usually stores durable data during normal runs. PostgreSQL = Django ORM tables in the shared DB. WORKSPACE_DIR = files under the configured workspace root (workspace/<app>/… or workspace/raw/…; see Workspace.md). Pinecone = remote vector index. Apps may also call external APIs without listing them here (e.g. GitHub for pushes).
Django apps (project) — each row is an INSTALLED_APPS entry that participates in collection, sync, or shared identity (not an exhaustive command list; see each app’s management/commands/ and docs/README.md).
| Django app | Role (summary) | Typical persistence |
|---|---|---|
core |
Shared collector base types (CollectorBase, BaseCollectorCommand), structured errors, core.operations (GitHub/Slack I/O helpers — not a separate app) |
N/A (library code; core has no ORM models) |
boost_collector_runner |
YAML schedule resolution; run_scheduled_collectors entry point |
N/A (no ORM models) |
cppa_user_tracker |
Shared Slack/GitHub user identity used by trackers | PostgreSQL |
github_activity_tracker |
GitHub raw fetch utilities, workspace JSON, models used across GitHub pipelines | PostgreSQL, WORKSPACE_DIR |
boost_library_tracker |
Primary Boost org GitHub activity and library metadata | PostgreSQL, WORKSPACE_DIR |
boost_library_docs_tracker |
Boost documentation crawl / ingest | PostgreSQL, WORKSPACE_DIR |
boost_library_usage_dashboard |
Usage dashboard inputs | PostgreSQL, WORKSPACE_DIR (generated/export files) |
boost_usage_tracker |
Stars / content monitoring | PostgreSQL, WORKSPACE_DIR (CSV / staging paths) |
boost_mailing_list_tracker |
Mailing list archives | PostgreSQL, WORKSPACE_DIR (raw / message JSON) |
clang_github_tracker |
LLVM/Clang GitHub activity | PostgreSQL, WORKSPACE_DIR |
cppa_slack_tracker |
Slack messages and channels | PostgreSQL, WORKSPACE_DIR (per-channel JSON, raw) |
discord_activity_tracker |
Discord server activity | PostgreSQL, WORKSPACE_DIR |
cppa_youtube_script_tracker |
YouTube transcript collection | PostgreSQL, WORKSPACE_DIR (metadata, raw VTT) |
wg21_paper_tracker |
WG21 committee papers pipeline | PostgreSQL, WORKSPACE_DIR |
cppa_pinecone_sync |
Hybrid vector upserts / namespaces | PostgreSQL (sync status, fail lists), Pinecone |
slack_event_handler |
Slack Bolt (Socket Mode): huddles / PR bot — long-running process, not the same as the YAML nightly batch | WORKSPACE_DIR (JSON state / queue; no ORM models), GitHub (optional MD uploads) |
Pinecone paths: Many collectors write rows first; cppa_pinecone_sync (and some commands’ built-in sync phases) read from PostgreSQL and/or files and upsert into Pinecone. Namespace and field conventions vary by source; see Pinecone_preprocess_guideline.md and per-app docs under service_api/.