Merged
16 changes: 10 additions & 6 deletions README.md
@@ -9,13 +9,16 @@ agency_identifier | Matches URLs with an agency from the PDAP database
annotation_pipeline | Automated pipeline for generating training data in our ML data source identification models. Manages common crawl, HTML tag collection, and Label Studio import/export
html_tag_collector | Collects HTML header, meta, and title tags and appends them to a JSON file. The idea is to make a richer dataset for algorithm training and data labeling.
identification_pipeline.py | The core Python script uniting this modular pipeline. More details below.
openai-playground | Scripts for accessing the OpenAI API on PDAP's shared account
llm_api_logic | Scripts for accessing the OpenAI API on PDAP's shared account
source_collectors| Tools for extracting metadata from different sources, including CKAN data portals and Common Crawler
collector_db | Database for storing data from source collectors
collector_manager | A module which provides a unified interface for interacting with source collectors and relevant data
core | A module which integrates other components, such as collector_manager and collector_db
api | API for interacting with collector_manager, core, and collector_db
local_database | Resources for setting up a test database for local development
security_manager| A module which provides a unified interface for interacting with authentication and authorization
tests | Unit and integration tests
util | Various utility functions

## Installation

@@ -25,11 +28,12 @@ uv sync

## How to use

1. Create an `.env` file in this directory with these contents, or set the environment variable another way: `VUE_APP_PDAP_API_KEY=KeyGoesHere`
2. Create a file in this directory containing the URLs to be identified, or modify the existing `urls.csv` file. The file needs at least a `url` column, with one URL per line.
3. Run `python3 identification_pipeline.py urls.csv`
4. Results are written to `results.csv` in the same directory.
5. When imported, the `identification_pipeline_main` function expects a dataframe as its argument and returns the resulting dataframe.
1. Create an `.env` file in this directory following the instructions in `ENV.md`
2. If necessary, start the database by running `docker compose up -d` from the `local_database` directory
3. Run `fastapi dev main.py` to start the FastAPI server
4. In a browser, navigate to `http://localhost:8000/docs` to see the full list of API endpoints

Note that accessing the API endpoints requires a valid Bearer token from the Data Sources API at `https://data-sources.pdap.io/api`
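
As a rough illustration of the Bearer token requirement, the sketch below builds (but does not send) an authenticated request against the local development server. The helper name `build_authorized_request`, the path `/example-endpoint`, and the token value are hypothetical placeholders, not part of this repository; the real endpoint paths are listed at `http://localhost:8000/docs`.

```python
import urllib.request

# Assumption: the development server address from the steps above.
BASE_URL = "http://localhost:8000"


def build_authorized_request(path: str, token: str) -> urllib.request.Request:
    """Return a GET request that carries the token in the Authorization header.

    This helper is hypothetical and only shows the header format.
    """
    url = f"{BASE_URL}/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


if __name__ == "__main__":
    # "/example-endpoint" and "YourTokenHere" are placeholders, not real values.
    request = build_authorized_request("/example-endpoint", "YourTokenHere")
    print(request.full_url)
    print(request.get_header("Authorization"))
```

Passing the resulting object to `urllib.request.urlopen` would send it, assuming the server is running and the token is valid.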

# Contributing
