diff --git a/README.md b/README.md index 5a39d2bd..34211198 100644 --- a/README.md +++ b/README.md @@ -69,15 +69,15 @@ Be sure to inspect the `docker-compose.yml` file in the root directory -- some e ```mermaid flowchart TD - SourceCollectors["**Source Collectors:** automatic searches, citation follower, portal scrapers, agency crawlers, common crawler"] + SourceCollectors["**Source Collectors:** batches of potentially useful URLs created with a variety of strategies"] Logging["Logging source collection attempts"] API["Submitting sources to the **Data Sources API** for approval"] - Identifier["**Data Source Identifier:** agency matcher, duplicate checker, tag collector, ML metadata labelers"] - LabelStudio["Human labeling of missing or uncertain metadata in LabelStudio"] + Identifier["**Data Source Identifier:** automatically collecting metadata and attempting to identify properties"] + SourceCollectorLabeling["Human labeling of missing or uncertain metadata in Source Collector"] - Identifier --> LabelStudio + Identifier --> SourceCollectorLabeling Identifier ---> API - LabelStudio --> API + SourceCollectorLabeling --> API Identifier --> Logging SourceCollectors --> Identifier @@ -114,72 +114,31 @@ sequenceDiagram participant HF as Hugging Face participant GH as GitHub -participant LS as Label Studio +participant SC as Source Collector app participant PDAP as PDAP API loop create batches of URLs
for human labeling - GH ->> GH: Crawl for a new batch
of URLs with common_crawler
or other methods - GH ->> GH: Add metadata to each batch
with source_tag_collector - GH ->> LS: Add the batch as
labeling tasks in
the Label Studio project - LS -->> GH: Confirm batch created - GH ->> GH: add batches to a log file
in this repo with URL
and batch IDs -end + SC ->> SC: Crawl for a new batch
of URLs with common_crawler
or other methods + SC ->> SC: Add metadata to each batch
with source_tag_collector + SC ->> SC: Add labeling tasks in
the Source Collector app loop annotate URLs - LS ->> LS: Users annotate using
Label Studio interface + SC ->> SC: Users label using
Retool interface + SC ->> SC: Reviewers finalize
and submit labels end loop update training data
with new annotations - GH ->> LS: Check for completed
annotation tasks - LS -->> GH: Confirm new annotations
since last check - GH ->> HF: Write new annotations to
training-urls dataset - GH ->> GH: log batch status to file -end - -loop check PDAP database
for new sources - GH ->> PDAP: Trigger action to check
for new data sources - PDAP -->> GH: confirm sources available
since last check - GH ->> GH: Collect additional metadata - GH ->> HF: Write sources to
training dataset + SC ->> SC: Check for completed
annotation tasks + SC -->> PDAP: Submit labeled URLs to the app + SC ->> HF: Write all annotations to
training-urls dataset + SC ->> SC: maintain batch status end loop model training - GH ->> HF: retrain ML models with
updated data using
trainer in hugging_face + HF ->> HF: retrain ML models with
updated data using
trainer in hugging_face end -``` - -## Using trained models to identify URLs - -Each of these steps may be attempted with regex, human identification, or machine learning. We combine several machine learning (ML) models, each focusing on a specific task or property. - -```mermaid -%% Here's a guide to mermaid syntax: https://mermaid.js.org/syntax/flowchart.html - -sequenceDiagram - -participant HF as Hugging Face -participant GH as GitHub -participant PDAP as PDAP API - -GH ->> GH: Start with a batch of URLs from
common_crawler or another source
with a batch log file -GH ->> PDAP: Check for duplicate URLs -PDAP ->> GH: Report back duplicates to remove -GH ->> HF: Create batch for identification -HF -->> GH: Confirm batch created - -loop trigger Hugging Face models to add
labels to the same dataset - GH ->> HF: Check URLs for relevance
to police, courts, or jails - HF -->> GH: complete - GH ->> HF: Check relevant URLs for
"individual records" - HF -->> GH: complete - note over HF,GH: Ignore irrelevant and
individual record sources
for following steps - GH ->> HF: Identify an agency or
geographic area - GH ->> HF: Identify record_type,
name, and description - HF -->> GH: Confirm batch complete end - -GH ->> PDAP: Submit URLs for manual approval ``` # Docstring and Type Checking