WB Data Catalog v2: AI-powered data catalog for BigQuery datasets #421

Draft
vrajat44 wants to merge 3 commits into master from wb-data-catalog-v2

vrajat44 commented May 25, 2026

Summary

Self-service data catalog for BigQuery datasets on Verily Workbench. Browse datasets, profile tables with AI-generated metadata, explore data visually, build cohorts, and query with natural language.

Features

  • Data Catalog — progressive dataset/table browsing with profiling status badges
  • Technical Profiling — column stats, null rates, distinct counts, patterns (pure BigQuery)
  • Semantic Profiling — AI-generated business names, definitions, HIPAA sensitivity, terminology bindings, join paths, cohort dimensions (Gemini)
  • Profiling Wizard — guided 4-step workflow: select tables → configure domain/docs → review → live progress
  • Key Insights — AI-suggested chart visualizations from profile data
  • Interactive Explorer — drag-and-drop visual analytics (Graphic Walker)
  • Terminology Registry — cross-table standardized codes (LOINC, SNOMED, ICD-10, RxNorm, CPT)
  • Cohort Builder — three-tab interface (Table Filters, Terminology, Natural Language) with expanded operators
  • Data AMA Agent — chat with Q&A mode (metadata) and Agent mode (SQL generation + execution)
  • Settings — workspace picker, custom GCP project, model selection, profile reset
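
The Technical Profiling feature above is described as pure BigQuery: column stats, null rates, and distinct counts. As a minimal sketch of how such a query could be assembled (this is not the code in `verily_profiler/`, which is not shown in this description), the hypothetical helper `build_profile_sql` below builds one BigQuery Standard SQL statement that profiles several columns in a single table scan. The function names, the `__nulls` / `__distinct` alias suffixes, and the identifier checks are all assumptions made for illustration.

```python
import re

# Hypothetical validation patterns; real BigQuery naming rules are broader.
_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_TABLE = re.compile(r"^[A-Za-z0-9_\-]+(\.[A-Za-z0-9_\-]+){1,2}$")


def build_profile_sql(table: str, columns: list) -> str:
    """Return one SQL statement computing row count, null count and
    distinct count for every requested column in a single scan."""
    if not _TABLE.match(table):
        raise ValueError(f"unsafe table reference: {table!r}")
    if not columns:
        raise ValueError("at least one column is required")
    for col in columns:
        if not _COLUMN.match(col):
            raise ValueError(f"unsafe column name: {col!r}")

    select_parts = ["COUNT(*) AS row_count"]
    for col in columns:
        # COUNTIF and COUNT(DISTINCT ...) are standard BigQuery aggregates.
        select_parts.append(f"COUNTIF(`{col}` IS NULL) AS `{col}__nulls`")
        select_parts.append(f"COUNT(DISTINCT `{col}`) AS `{col}__distinct`")

    return "SELECT\n  " + ",\n  ".join(select_parts) + f"\nFROM `{table}`"


def null_rate(null_count: int, row_count: int) -> float:
    """Fraction of rows that are NULL; 0.0 for an empty table."""
    return null_count / row_count if row_count else 0.0


if __name__ == "__main__":
    # Placeholder table and column names, for illustration only.
    print(build_profile_sql("my-project.my_dataset.patients", ["age", "sex"]))
```

The resulting string could then be passed to a BigQuery client. Validating identifiers before interpolating them matters here because table and column names cannot be supplied as query parameters.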

Install

| Field | Value |
| --- | --- |
| App name | WB Data Catalog v2 |
| Repository URL | https://github.com/vrajat44/workbench-app-devcontainers.git |
| Repository branch | wb-data-catalog-v2 |
| Repository folder path | src/wb-data-catalog-v2 |
| Machine type | e2-standard-4 (4 vCPU, 16 GB) or higher |

See INSTALL_GUIDE.md for full instructions (Workbench + local dev setup).

Architecture

| Layer | Stack |
| --- | --- |
| Backend | FastAPI + Python 3.11, BigQuery, GCS, Vertex AI (Gemini) |
| Frontend | Vite + React + TypeScript + Recharts + Graphic Walker |
| Profiling | verily_profiler/ (technical stats + semantic LLM profiling) |
| Chat | verily_chat/ (LangGraph agent with BigQuery tools) |
| Deploy | Multi-stage Docker (Node 20 + Python 3.11), devcontainer pattern |
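
The profiling layer is described as two passes: technical stats from BigQuery, and semantic metadata from an LLM. One way to picture how the two results might sit together on a single column record is sketched below. Every class and field name is hypothetical and chosen only to mirror the features listed in this description; the actual data model in `verily_profiler/` may look quite different.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TechnicalStats:
    """Numbers that can be computed from BigQuery alone (assumed shape)."""
    row_count: int
    null_count: int
    distinct_count: int

    @property
    def null_rate(self) -> float:
        return self.null_count / self.row_count if self.row_count else 0.0


@dataclass
class SemanticMetadata:
    """LLM-generated description of a column (assumed shape)."""
    business_name: str
    definition: str
    hipaa_sensitive: bool
    terminology: Optional[str] = None  # e.g. "LOINC" or "ICD-10"


@dataclass
class ColumnProfile:
    column: str
    technical: TechnicalStats
    # Stays None until the semantic pass has run for this column.
    semantic: Optional[SemanticMetadata] = None

    def status(self) -> str:
        """Label a catalog UI could show as a profiling status badge."""
        return "semantic" if self.semantic is not None else "technical"


if __name__ == "__main__":
    # Made-up values, for illustration only.
    profile = ColumnProfile(
        column="dx_code",
        technical=TechnicalStats(row_count=1000, null_count=50, distinct_count=120),
    )
    print(profile.status(), profile.technical.null_rate)
    profile.semantic = SemanticMetadata(
        business_name="Diagnosis code",
        definition="Primary diagnosis recorded for the encounter.",
        hipaa_sensitive=True,
        terminology="ICD-10",
    )
    print(asdict(profile))
```

Keeping the semantic part optional lets a table be browsable as soon as the cheaper technical pass finishes, which fits the progressive browsing described under Features.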

What's in this PR

  • Complete app under src/wb-data-catalog-v2/
  • Multi-stage Dockerfile + docker-compose.yaml + devcontainer.json + start.sh
  • 8 backend modules, 15+ frontend components, full API surface
  • Install guide (Workbench deploy + local dev), usage guide, README

🤖 Generated with Claude Code

vrajat44 and others added 3 commits May 8, 2026 10:58
React + FastAPI data catalog with technical and semantic profiling,
terminology registry, three-mode cohort builder (table filters,
terminology, natural language), and chat agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>