Merged
117 commits
8828d7a
Continue draft
maxachis Oct 14, 2025
00248c4
Add schema column to `urls` table and update associated logic
maxachis Oct 14, 2025
16f2b66
Address alembic duplicates
maxachis Oct 14, 2025
2626db8
Merge pull request #490 from Police-Data-Accessibility-Project/mc_456…
maxachis Oct 14, 2025
dabc10b
Add logic for adding `updated_at` triggers, add triggers to relevant …
maxachis Oct 14, 2025
fdf40f9
Merge pull request #492 from Police-Data-Accessibility-Project/mc_491…
maxachis Oct 14, 2025
105eefa
Update root URL and redirect URL logic
maxachis Oct 18, 2025
fbda9c1
Merge pull request #497 from Police-Data-Accessibility-Project/mc_489…
maxachis Oct 18, 2025
6cf4e5f
Begin draft
maxachis Oct 18, 2025
4340d4a
Update URL Status View Enum
maxachis Oct 18, 2025
8c8e792
Merge pull request #501 from Police-Data-Accessibility-Project/mc_393…
maxachis Oct 18, 2025
d93d90a
Add agency endpoints
maxachis Oct 20, 2025
7c86759
Add URL suggestion endpoint
maxachis Oct 21, 2025
20568a4
Fix tests
maxachis Oct 21, 2025
dd341e2
Merge branch 'dev' into mc_494_suggest_data_source
maxachis Oct 21, 2025
1fe2235
Merge from dev
maxachis Oct 21, 2025
08c0c4a
Fix alembic migration bug
maxachis Oct 21, 2025
c9e8150
Merge pull request #499 from Police-Data-Accessibility-Project/mc_494…
maxachis Oct 21, 2025
4fd01c2
Have Request Models forbid extra parameters
maxachis Oct 21, 2025
67cd9ef
Merge pull request #505 from Police-Data-Accessibility-Project/mc_504…
maxachis Oct 21, 2025
f24f6c4
Add linking to batch logic, remove required user id for batches
maxachis Oct 21, 2025
560fb75
Merge pull request #507 from Police-Data-Accessibility-Project/mc_494…
maxachis Oct 21, 2025
83d458e
Add linking to batch logic, remove required user id for batches
maxachis Oct 24, 2025
fcd183d
Add meta url and data source endpoints
maxachis Oct 25, 2025
f541d6d
Merge pull request #508 from Police-Data-Accessibility-Project/mc_met…
maxachis Oct 25, 2025
2611109
Begin draft
maxachis Nov 9, 2025
8229199
Update draft
maxachis Nov 9, 2025
13e92ce
Complete pre-test draft
maxachis Nov 10, 2025
ce95750
Finish draft
maxachis Nov 14, 2025
b9dafff
Bump up max postgres connections
maxachis Nov 14, 2025
c4052a5
Fix connection creation leak
maxachis Nov 14, 2025
38210a6
Merge pull request #513 from Police-Data-Accessibility-Project/mc_ds_…
maxachis Nov 14, 2025
595f896
Add delete_url_ds_app_links
maxachis Nov 14, 2025
dd90f6c
Remove test logic from DatabaseClient
maxachis Nov 14, 2025
ab13ea8
Add sync loaders
maxachis Nov 14, 2025
e63b7ab
Fix misnamed env vars and refine task print logic
maxachis Nov 14, 2025
4b41860
Update PDAP Access Client
maxachis Nov 14, 2025
87146fd
Include URL status in Sync Content
maxachis Nov 15, 2025
ca26deb
Fix bug with access type
maxachis Nov 15, 2025
caaaa50
Fix bug with access type
maxachis Nov 15, 2025
7e3c2b1
Correct bug with Data Source Sync Delete pulling too many data sources
maxachis Nov 15, 2025
9179b84
Add per-request entity limit
maxachis Nov 15, 2025
e635323
Add README for synchronization logic.
maxachis Nov 15, 2025
7bc6348
Add README for synchronization logic.
maxachis Nov 15, 2025
08159e0
Set alter record formats and access types columns to be not null, def…
maxachis Nov 15, 2025
438d2bd
Merge pull request #517 from Police-Data-Accessibility-Project/mc_516…
maxachis Nov 15, 2025
ce27f50
Add check for duplicate entries
maxachis Nov 15, 2025
3b1feb3
Add condition to check for no extant URL Task Error
maxachis Nov 15, 2025
a952f1d
Fix bug in data sources and meta URL GET queries
maxachis Nov 15, 2025
35875dd
Merge pull request #520 from Police-Data-Accessibility-Project/mc_519…
maxachis Nov 15, 2025
94c32e3
Add `data-sources/:id` `GET` endpoint
maxachis Nov 16, 2025
b9805df
Merge pull request #521 from Police-Data-Accessibility-Project/mc_get…
maxachis Nov 16, 2025
9da9a8f
Add more strict requirements to record_formats, access_types, url_status
maxachis Nov 16, 2025
4f41bdd
Address bug in get user contributions
maxachis Nov 16, 2025
169037b
Add task logs to app syncs
maxachis Nov 16, 2025
dab483b
Merge pull request #522 from Police-Data-Accessibility-Project/mc_ds_…
maxachis Nov 16, 2025
37d5bf4
Add handling for when no results found
maxachis Nov 16, 2025
772ef34
Bump PDAP Access Manager to latest version
maxachis Nov 16, 2025
5f8248c
Continue draft
maxachis Nov 18, 2025
309b105
Remove unused import
maxachis Nov 18, 2025
968b064
Update number of entries
maxachis Nov 18, 2025
6b0fccc
Merge pull request #523 from Police-Data-Accessibility-Project/mc_822…
maxachis Nov 18, 2025
2cb3b53
Change namespace `source-manager` to `sync`
maxachis Nov 19, 2025
8920518
Merge pull request #524 from Police-Data-Accessibility-Project/mc_494…
maxachis Nov 19, 2025
bdcd211
Rename link tables
maxachis Nov 19, 2025
3e728ab
Merge pull request #525 from Police-Data-Accessibility-Project/mc_512…
maxachis Nov 19, 2025
67b4e47
Add descriptions to data source submissions
maxachis Nov 20, 2025
052343b
Continue draft
maxachis Nov 20, 2025
90ecbae
Add logic for handling duplicates in data source submission
maxachis Nov 20, 2025
6f8ed42
Merge pull request #526 from Police-Data-Accessibility-Project/mc_494…
maxachis Nov 20, 2025
fe67257
Begin draft
maxachis Nov 20, 2025
77c18c6
Add check/unique-url endpoint
maxachis Nov 20, 2025
0f3de3c
Migrate select endpoints
maxachis Nov 21, 2025
e63f53c
Merge pull request #529 from Police-Data-Accessibility-Project/mc_511…
maxachis Nov 21, 2025
e10624c
Remove URL Error Status
maxachis Nov 21, 2025
a5ff333
Merge pull request #530 from Police-Data-Accessibility-Project/mc_527…
maxachis Nov 21, 2025
89b0955
Change get Location Suggestions to return full display name
maxachis Nov 22, 2025
b6c9cf5
Update URL to set rows with `error` status to `ok`
maxachis Nov 23, 2025
1c4e373
Create Jenkinsfile
maxachis Nov 23, 2025
1816c4e
Remove dependencies from apply_migrations
maxachis Nov 23, 2025
0aef8d3
Update Record Task type not to repeat on error.
maxachis Nov 23, 2025
6613a7f
Merge pull request #532 from Police-Data-Accessibility-Project/mc_531…
maxachis Nov 23, 2025
bed6088
Add script for deleting hanging app links
maxachis Nov 23, 2025
0142cd7
Merge pull request #534 from Police-Data-Accessibility-Project/mc_533…
maxachis Nov 23, 2025
a0d0e5e
Begin draft
maxachis Nov 24, 2025
5ed52f3
Continue draft
maxachis Nov 24, 2025
5ffda47
Finish integrity monitor draft
maxachis Nov 24, 2025
62a7429
Merge pull request #536 from Police-Data-Accessibility-Project/mc_502…
maxachis Nov 24, 2025
09f7a77
Migrate sync information into main README
maxachis Nov 24, 2025
746acae
Merge pull request #537 from Police-Data-Accessibility-Project/mc_ds_…
maxachis Nov 24, 2025
5fd6904
Ensure consistent router capitalization
maxachis Nov 24, 2025
3ef9671
Merge pull request #538 from Police-Data-Accessibility-Project/mc_498…
maxachis Nov 24, 2025
a61ccd6
Adjust CKAN/Muckrock Agency ID Logic
maxachis Nov 27, 2025
2d32b57
Merge pull request #540 from Police-Data-Accessibility-Project/mc_405…
maxachis Nov 27, 2025
3ce5642
Add logic to prevent HTML content duplicates from being sent to Huggi…
maxachis Nov 27, 2025
f838cbc
Merge pull request #542 from Police-Data-Accessibility-Project/mc_371…
maxachis Nov 27, 2025
e8b0023
Add logic to prevent HTML content duplicates from being sent to Huggi…
maxachis Nov 27, 2025
6d79fb8
Change `UPDATE URL STATUS` task interval to daily
maxachis Nov 27, 2025
4ef61c6
Remove redundant ID columns
maxachis Nov 29, 2025
4e548e1
Merge pull request #545 from Police-Data-Accessibility-Project/mc_544…
maxachis Nov 29, 2025
9fe1818
Revise agency agreement logic
maxachis Nov 29, 2025
bda65ec
Merge pull request #546 from Police-Data-Accessibility-Project/mc_528…
maxachis Nov 29, 2025
9f3047e
Add CORSMiddleware
maxachis Nov 29, 2025
fa8359f
Revert AgencyIDSubtaskSuggestion
maxachis Nov 30, 2025
c2ddbe3
Remove adding primary key to duplicates (ironically)
maxachis Nov 30, 2025
a6f27db
Begin draft
maxachis Dec 1, 2025
6981bf9
Update annotations to join user and robo suggestions for locations an…
maxachis Dec 1, 2025
9d5e672
Merge pull request #548 from Police-Data-Accessibility-Project/mc_547…
maxachis Dec 1, 2025
3ed9106
Update source collector permission
maxachis Dec 1, 2025
a0c2dae
Fix bug where locations/agencies without annotations were being returned
maxachis Dec 1, 2025
00e5095
Add sorting for suggestions
maxachis Dec 1, 2025
a82fcf4
Add pagination for agency search
maxachis Dec 2, 2025
2868577
Merge pull request #552 from Police-Data-Accessibility-Project/mc_age…
maxachis Dec 2, 2025
a098291
Add sessions for anonymous annotations
maxachis Dec 5, 2025
3a876a7
Merge pull request #553 from Police-Data-Accessibility-Project/mc_434…
maxachis Dec 5, 2025
f41e095
Update annotation endpoints
maxachis Dec 8, 2025
b776754
Merge pull request #556 from Police-Data-Accessibility-Project/mc_ann…
maxachis Dec 8, 2025
39 changes: 24 additions & 15 deletions ENV.md
@@ -57,19 +57,30 @@ Note that some tasks/subtasks are themselves enabled by other tasks.

### Scheduled Task Flags

| Flag | Description |
|-------------------------------------|-------------------------------------------------------------------------------|
| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. Disabling disables all other scheduled tasks. |
| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
| `POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG` | Populates the backlog snapshot. |
| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
| `IA_SAVE_TASK_FLAG` | Saves URLs to Internet Archives. |
| `MARK_TASK_NEVER_COMPLETED_TASK_FLAG` | Marks tasks that were started but never completed (usually due to a restart). |
| `DELETE_STALE_SCREENSHOTS_TASK_FLAG` | Deletes stale screenshots for URLs already validated. |
| `TASK_CLEANUP_TASK_FLAG` | Cleans up tasks that are no longer needed. |
| `REFRESH_MATERIALIZED_VIEWS_TASK_FLAG` | Refreshes materialized views. |
| Flag | Description |
|--------------------------------------------|-------------------------------------------------------------------------------|
| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. Disabling disables all other scheduled tasks. |
| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
| `POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG` | Populates the backlog snapshot. |
| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
| `IA_SAVE_TASK_FLAG` | Saves URLs to Internet Archives. |
| `MARK_TASK_NEVER_COMPLETED_TASK_FLAG` | Marks tasks that were started but never completed (usually due to a restart). |
| `DELETE_STALE_SCREENSHOTS_TASK_FLAG` | Deletes stale screenshots for URLs already validated. |
| `TASK_CLEANUP_TASK_FLAG` | Cleans up tasks that are no longer needed. |
| `REFRESH_MATERIALIZED_VIEWS_TASK_FLAG` | Refreshes materialized views. |
| `UPDATE_URL_STATUS_TASK_FLAG` | Updates the status of URLs. |
| `DS_APP_SYNC_AGENCY_ADD_TASK_FLAG` | Adds new agencies to the Data Sources App. |
| `DS_APP_SYNC_AGENCY_UPDATE_TASK_FLAG` | Updates existing agencies in the Data Sources App. |
| `DS_APP_SYNC_AGENCY_DELETE_TASK_FLAG` | Deletes agencies in the Data Sources App. |
| `DS_APP_SYNC_DATA_SOURCE_ADD_TASK_FLAG` | Adds new data sources to the Data Sources App. |
| `DS_APP_SYNC_DATA_SOURCE_UPDATE_TASK_FLAG` | Updates existing data sources in the Data Sources App. |
| `DS_APP_SYNC_DATA_SOURCE_DELETE_TASK_FLAG` | Deletes data sources in the Data Sources App. |
| `DS_APP_SYNC_META_URL_ADD_TASK_FLAG` | Adds new meta URLs to the Data Sources App. |
| `DS_APP_SYNC_META_URL_UPDATE_TASK_FLAG` | Updates existing meta URLs in the Data Sources App. |
| `DS_APP_SYNC_META_URL_DELETE_TASK_FLAG` | Deletes meta URLs in the Data Sources App. |
| `INTEGRITY_MONITOR_TASK_FLAG` | Runs integrity checks. |
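
The table states that disabling `SCHEDULED_TASKS_FLAG` disables every other scheduled task. A minimal sketch of that gating follows, assuming flags are read from environment variables, that an unset flag counts as enabled, and that a small set of strings means "disabled". The helper names `flag_enabled` and `scheduled_task_enabled` and the accepted values are hypothetical, not the project's actual parsing logic.

```python
from typing import Mapping

# Hypothetical set of values treated as "disabled"; the real parser may differ.
_FALSY = {"0", "false", "no", "off"}


def flag_enabled(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Return True unless the flag is explicitly set to a falsy value."""
    raw = env.get(name)
    if raw is None:
        return default  # assumption: unset flags default to enabled
    return raw.strip().lower() not in _FALSY


def scheduled_task_enabled(env: Mapping[str, str], task_flag: str) -> bool:
    """A scheduled task runs only if the master flag and its own flag are enabled."""
    return flag_enabled(env, "SCHEDULED_TASKS_FLAG") and flag_enabled(env, task_flag)
```

In practice `env` would be `os.environ`; passing a mapping keeps the sketch easy to test.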

### URL Task Flags

@@ -81,7 +92,6 @@ URL Task Flags are collectively controlled by the `RUN_URL_TASKS_TASK_FLAG` flag
| `URL_HTML_TASK_FLAG` | URL HTML scraping task. |
| `URL_RECORD_TYPE_TASK_FLAG` | Automatically assigns Record Types to URLs. |
| `URL_AGENCY_IDENTIFICATION_TASK_FLAG` | Automatically assigns and suggests Agencies for URLs. |
| `URL_SUBMIT_APPROVED_TASK_FLAG` | Submits approved URLs to the Data Sources App. |
| `URL_MISC_METADATA_TASK_FLAG` | Adds misc metadata to URLs. |
| `URL_AUTO_RELEVANCE_TASK_FLAG` | Automatically assigns Relevances to URLs. |
| `URL_PROBE_TASK_FLAG` | Probes URLs for web metadata. |
@@ -90,7 +100,6 @@ URL Task Flags are collectively controlled by the `RUN_URL_TASKS_TASK_FLAG` flag
| `URL_AUTO_VALIDATE_TASK_FLAG` | Automatically validates URLs. |
| `URL_AUTO_NAME_TASK_FLAG` | Automatically names URLs. |
| `URL_SUSPEND_TASK_FLAG` | Suspends URLs meeting suspension criteria. |
| `URL_SUBMIT_META_URLS_TASK_FLAG` | Submits meta URLs to the Data Sources App. |

### Agency ID Subtasks

68 changes: 68 additions & 0 deletions README.md
@@ -156,3 +156,71 @@ if it detects any missing docstrings or type hints in files that you have modifi
These will *not* block any Pull request, but exist primarily as advisory comments to encourage good coding standards.

Note that `python_checks.yml` will only function on pull requests made from within the repo, not from a forked repo.

# Syncing to Data Sources App

The Source Manager (SM) is part of a two-app system, with the other app being the Data Sources (DS) App.


## Add, Update, and Delete

These are the core synchronization actions.

In order to propagate changes to DS, we synchronize additions, updates, and deletions of the following entities:
- Agencies
- Data Sources
- Meta URLs

Each action for each entity occurs through a separate task. At the moment, there are nine tasks total.

Each task gathers requisite information from the SM database and sends a request to one of nine corresponding endpoints in the DS API.

Each DS endpoint has the following format:

```text
/v3/sync/{entity}/{action}
```
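
The nine tasks correspond to the three entities crossed with the three actions. The sketch below enumerates the resulting paths from the format above. The entity path segments (`agencies`, `data-sources`, `meta-urls`) are assumptions made for illustration; the actual segment names are not given here.

```python
from itertools import product
from typing import List

# Assumed path segments for the three entities; the real names may differ.
ENTITIES = ("agencies", "data-sources", "meta-urls")
ACTIONS = ("add", "update", "delete")


def sync_endpoints(prefix: str = "/v3/sync") -> List[str]:
    """Build one endpoint path per (entity, action) pair."""
    return [f"{prefix}/{entity}/{action}" for entity, action in product(ENTITIES, ACTIONS)]
```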

Synchronizations are designed to occur on an hourly basis.

Here is a high-level description of how each action works:

### Add

Adds the given entities to DS.

These are denoted with the `/{entity}/add` path in the DS API.

When an entity is added, the DS API returns a unique DS ID, which is mapped to the internal SM database ID via the DS app link tables.

For an entity to be added, it must meet preconditions which are distinct for each entity:
- Agencies: Must have an agency entry in the database and be linked to a location.
- Data Sources: Must be a URL that has been internally validated as a data source and linked to an agency.
- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency.
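
As a rough illustration of the add flow for agencies, the sketch below selects agencies that are linked to a location and then records the DS IDs returned for them. It assumes that an agency which already has a DS app link is skipped, and it models the link table as a plain dictionary. The `Agency` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Collection, Dict, Iterable, List, Tuple


@dataclass
class Agency:
    sm_id: int  # internal SM database ID
    location_ids: List[int] = field(default_factory=list)


def agencies_to_add(agencies: Iterable[Agency], linked_sm_ids: Collection[int]) -> List[Agency]:
    """Agencies linked to a location that have no DS app link yet (assumed rule)."""
    return [a for a in agencies if a.location_ids and a.sm_id not in linked_sm_ids]


def record_links(app_links: Dict[int, int], returned: Iterable[Tuple[int, int]]) -> None:
    """Store the SM ID -> DS ID mapping returned by the add endpoint."""
    for sm_id, ds_id in returned:
        app_links[sm_id] = ds_id
```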

### Update

Updates the given entities in DS.

These are denoted with the `/{entity}/update` path in the DS API.

An update submits the updated entities (in full) to the requisite endpoint and then updates the local app link to record that the update occurred. All updates are designed to be full overwrites of the entity.

For an entity to be updated, it must meet preconditions which are distinct for each entity:
- Agencies: Must have either an agency row updated or an agency/location link updated or deleted.
- Data Sources: One of the following must be updated:
  - The URL table
  - The record type table
  - The optional data sources metadata table
  - The agency link table (either an addition or deletion)
- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency. Either the URL table or the agency link table (addition or deletion) must be updated.
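
One way to picture the update precondition is as a timestamp comparison: an entity needs syncing if any of its related tables changed after the app link last recorded a sync. This is a sketch under that assumption only; the function name `needs_update` and the idea of comparing against a last-synced timestamp are illustrative, not a description of the actual queries.

```python
from datetime import datetime
from typing import Iterable, Optional


def needs_update(last_synced_at: datetime, change_times: Iterable[Optional[datetime]]) -> bool:
    """True if any related row changed after the last recorded sync.

    `change_times` holds one timestamp per related table (for example the URL,
    record type, metadata, and agency link tables); None means no row exists.
    """
    return any(t is not None and t > last_synced_at for t in change_times)
```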

### Delete

Deletes the given entities from DS.

These are denoted with the `/{entity}/delete` path in the DS API.

This consists of submitting a set of DS IDs to the requisite endpoint, and removing the associated DS app link entry in the SM database.

When an entity with a corresponding DS App Link is deleted from the Source Manager, the core data is removed but a deletion flag is appended to the DS App Link entry, indicating that the entry is not yet removed from the DS App. The deletion task uses this flag to identify entities to be deleted, submits the deletion request to the DS API, and removes both the flag and the DS App Link.
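
The deletion flow described above can be sketched as follows. The sketch assumes the deletion flag is a boolean on the link entry and models the link table as a dictionary keyed by SM ID; the real schema may store the flag differently. `AppLink`, `pending_delete`, and `run_delete_task` are hypothetical names, and `send_delete` stands in for the request to the DS delete endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AppLink:
    ds_id: int
    pending_delete: bool = False  # assumed representation of the deletion flag


def run_delete_task(
    links: Dict[int, AppLink],
    send_delete: Callable[[List[int]], None],
) -> List[int]:
    """Submit flagged DS IDs for deletion, then drop their link entries."""
    flagged = {sm_id: link for sm_id, link in links.items() if link.pending_delete}
    ds_ids = [link.ds_id for link in flagged.values()]
    if ds_ids:
        send_delete(ds_ids)  # links are removed only after the request returns
        for sm_id in flagged:
            del links[sm_id]
    return ds_ids
```

Removing the link only after the request returns means a failed request leaves the flag in place, so the entity is picked up again on the next run.
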
30 changes: 30 additions & 0 deletions alembic/Jenkinsfile
@@ -0,0 +1,30 @@
pipeline {
    agent {
        dockerfile {
            filename 'Dockerfile'
            args '-e POSTGRES_USER=POSTGRES_USER -e POSTGRES_PASSWORD=POSTGRES_PASSWORD -e POSTGRES_DB=POSTGRES_DB -e POSTGRES_HOST=POSTGRES_HOST -e POSTGRES_PORT=POSTGRES_PORT'
        }
    }

    stages {
        stage('Migrate using Alembic') {
            steps {
                echo 'Building..'
                sh 'python apply_migrations.py'
            }
        }
    }
    post {
        failure {
            script {
                def payload = """{
                    "content": "🚨 Build Failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
                }"""

                sh """
                    curl -X POST -H "Content-Type: application/json" -d '${payload}' ${env.WEBHOOK_URL}
                """
            }
        }
    }
}