39 changes: 20 additions & 19 deletions ENV.md
@@ -2,25 +2,26 @@ This page provides a full list, with description, of all the environment variabl

Please ensure these are properly defined in a `.env` file in the root directory.

| Name | Description | Example |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|
| `GOOGLE_API_KEY` | The API key required for accessing the Google Custom Search API | `abc123` |
| `GOOGLE_CSE_ID` | The CSE ID required for accessing the Google Custom Search API | `abc123` |
|`POSTGRES_USER` | The username for the test database | `test_source_collector_user` |
|`POSTGRES_PASSWORD` | The password for the test database | `HanviliciousHamiltonHilltops` |
|`POSTGRES_DB` | The database name for the test database | `source_collector_test_db` |
|`POSTGRES_HOST` | The host for the test database | `127.0.0.1` |
|`POSTGRES_PORT` | The port for the test database | `5432` |
|`DS_APP_SECRET_KEY`| The secret key used for decoding JWT tokens produced by the Data Sources App. Must match the secret token `JWT_SECRET_KEY` that is used in the Data Sources App for encoding. | `abc123` |
|`DEV`| Set to any value to run the application in development mode. | `true` |
|`DEEPSEEK_API_KEY`| The API key required for accessing the DeepSeek API. | `abc123` |
|`OPENAI_API_KEY`| The API key required for accessing the OpenAI API. | `abc123` |
|`PDAP_EMAIL`| An email address for accessing the PDAP API.[^1] | `abc123@test.com` |
|`PDAP_PASSWORD`| A password for accessing the PDAP API.[^1] | `abc123` |
|`PDAP_API_KEY`| An API key for accessing the PDAP API. | `abc123` |
|`PDAP_API_URL`| The URL for the PDAP API| `https://data-sources-v2.pdap.dev/api`|
|`DISCORD_WEBHOOK_URL`| The URL for the Discord webhook used for notifications| `abc123` |
|`HUGGINGFACE_INFERENCE_API_KEY` | The API key required for accessing the Huggingface Inference API. | `abc123` |
| Name | Description | Example |
|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| `GOOGLE_API_KEY` | The API key required for accessing the Google Custom Search API | `abc123` |
| `GOOGLE_CSE_ID` | The CSE ID required for accessing the Google Custom Search API | `abc123` |
| `POSTGRES_USER` | The username for the test database | `test_source_collector_user` |
| `POSTGRES_PASSWORD` | The password for the test database | `HanviliciousHamiltonHilltops` |
| `POSTGRES_DB` | The database name for the test database | `source_collector_test_db` |
| `POSTGRES_HOST` | The host for the test database | `127.0.0.1` |
| `POSTGRES_PORT` | The port for the test database | `5432` |
| `DS_APP_SECRET_KEY` | The secret key used for decoding JWT tokens produced by the Data Sources App. Must match the secret token `JWT_SECRET_KEY` that is used in the Data Sources App for encoding. | `abc123` |
| `DEV` | Set to any value to run the application in development mode. | `true` |
| `DEEPSEEK_API_KEY` | The API key required for accessing the DeepSeek API. | `abc123` |
| `OPENAI_API_KEY` | The API key required for accessing the OpenAI API. | `abc123` |
| `PDAP_EMAIL` | An email address for accessing the PDAP API.[^1] | `abc123@test.com` |
| `PDAP_PASSWORD` | A password for accessing the PDAP API.[^1] | `abc123` |
| `PDAP_API_KEY` | An API key for accessing the PDAP API. | `abc123` |
| `PDAP_API_URL` | The URL for the PDAP API | `https://data-sources-v2.pdap.dev/api` |
| `DISCORD_WEBHOOK_URL` | The URL for the Discord webhook used for notifications | `abc123` |
| `HUGGINGFACE_INFERENCE_API_KEY` | The API key required for accessing the Hugging Face Inference API. | `abc123` |
| `HUGGINGFACE_HUB_TOKEN` | The API key required for uploading to the PDAP Hugging Face account via the Hugging Face Hub API. | `abc123` |

[^1]: The user account in question will require elevated permissions to access certain endpoints. At a minimum, the user will require the `source_collector` and `db_write` permissions.

74 changes: 74 additions & 0 deletions alembic/versions/2025_07_26_0830-637de6eaa3ab_setup_for_upload_to_huggingface_task.py
@@ -0,0 +1,74 @@
"""Setup for upload to huggingface task

Revision ID: 637de6eaa3ab
Revises: 59d2af1bab33
Create Date: 2025-07-26 08:30:37.940091

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

from src.util.alembic_helpers import id_column, switch_enum_type

# revision identifiers, used by Alembic.
revision: str = '637de6eaa3ab'
down_revision: Union[str, None] = '59d2af1bab33'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None

TABLE_NAME = "huggingface_upload_state"


def upgrade() -> None:
    op.create_table(
        TABLE_NAME,
        id_column(),
        sa.Column(
            "last_upload_at",
            sa.DateTime(),
            nullable=False
        ),
    )

    switch_enum_type(
        table_name='tasks',
        column_name='task_type',
        enum_name='task_type',
        new_enum_values=[
            'HTML',
            'Relevancy',
            'Record Type',
            'Agency Identification',
            'Misc Metadata',
            'Submit Approved URLs',
            'Duplicate Detection',
            '404 Probe',
            'Sync Agencies',
            'Sync Data Sources',
            'Push to Hugging Face'
        ]
    )


def downgrade() -> None:
    op.drop_table(TABLE_NAME)

    switch_enum_type(
        table_name='tasks',
        column_name='task_type',
        enum_name='task_type',
        new_enum_values=[
            'HTML',
            'Relevancy',
            'Record Type',
            'Agency Identification',
            'Misc Metadata',
            'Submit Approved URLs',
            'Duplicate Detection',
            '404 Probe',
            'Sync Agencies',
            'Sync Data Sources'
        ]
    )
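
Reviewer note (illustrative, not part of the diff): the body of `switch_enum_type` in `src.util.alembic_helpers` is not shown here, so the following is only a sketch of the usual PostgreSQL approach to replacing an enum's value list, under the assumption that the helper does something similar. The function name `enum_swap_statements` and the `_old` suffix are hypothetical.

```python
def enum_swap_statements(
    table_name: str,
    column_name: str,
    enum_name: str,
    new_enum_values: list[str],
) -> list[str]:
    """Build the SQL for swapping a PostgreSQL enum type (hypothetical helper).

    This is NOT the project's switch_enum_type; it only illustrates the
    common rename / create / cast / drop sequence.
    """
    old_name = f"{enum_name}_old"  # assumed temporary name for the old type
    # Escape single quotes so each value is a valid SQL string literal.
    values_sql = ", ".join(
        "'" + value.replace("'", "''") + "'" for value in new_enum_values
    )
    return [
        f'ALTER TYPE "{enum_name}" RENAME TO "{old_name}"',
        f'CREATE TYPE "{enum_name}" AS ENUM ({values_sql})',
        f'ALTER TABLE "{table_name}" ALTER COLUMN "{column_name}" '
        f'TYPE "{enum_name}" USING "{column_name}"::text::"{enum_name}"',
        f'DROP TYPE "{old_name}"',
    ]
```

If the helper does follow this pattern, the cast in the third statement fails for any row whose value is missing from the new list, so `downgrade()` would need `tasks` rows of type 'Push to Hugging Face' to be removed or remapped first.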
6 changes: 5 additions & 1 deletion src/api/main.py
@@ -31,6 +31,7 @@
 from src.db.client.async_ import AsyncDatabaseClient
 from src.db.client.sync import DatabaseClient
 from src.core.tasks.url.operators.url_html.scraper.root_url_cache.core import RootURLCache
+from src.external.huggingface.hub.client import HuggingFaceHubClient
 from src.external.huggingface.inference.client import HuggingFaceInferenceClient
 from src.external.pdap.client import PDAPClient

@@ -101,7 +102,10 @@ async def lifespan(app: FastAPI):
         handler=task_handler,
         loader=ScheduledTaskOperatorLoader(
             adb_client=adb_client,
-            pdap_client=pdap_client
+            pdap_client=pdap_client,
+            hf_client=HuggingFaceHubClient(
+                token=env_var_manager.hf_hub_token
+            )
         )
     )
     await async_scheduled_task_manager.setup()
1 change: 1 addition & 0 deletions src/core/env_var_manager.py
@@ -30,6 +30,7 @@ def _load(self):

         self.openai_api_key = self.require_env("OPENAI_API_KEY")
         self.hf_inference_api_key = self.require_env("HUGGINGFACE_INFERENCE_API_KEY")
+        self.hf_hub_token = self.require_env("HUGGINGFACE_HUB_TOKEN")

         self.postgres_user = self.require_env("POSTGRES_USER")
         self.postgres_password = self.require_env("POSTGRES_PASSWORD")
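
Reviewer note (illustrative, not part of the diff): `require_env` itself is outside this hunk, so the sketch below only assumes what a "required" lookup typically does, namely fail fast when the variable is missing or empty. The exception type and message are assumptions.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail loudly.

    Hypothetical stand-in for EnvVarManager.require_env; the real method
    may raise a different exception or treat empty strings differently.
    """
    value = os.getenv(name)
    if not value:
        raise ValueError(f"Missing required environment variable: {name}")
    return value
```

Under that assumption, any deployment that does not define `HUGGINGFACE_HUB_TOKEN` would fail at startup after this change, even if the Hugging Face upload task is never run.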
36 changes: 36 additions & 0 deletions src/core/tasks/scheduled/huggingface/operator.py
@@ -0,0 +1,36 @@
from src.core.tasks.scheduled.templates.operator import ScheduledTaskOperatorBase
from src.db.client.async_ import AsyncDatabaseClient
from src.db.enums import TaskType
from src.external.huggingface.hub.client import HuggingFaceHubClient


class PushToHuggingFaceTaskOperator(ScheduledTaskOperatorBase):

    @property
    def task_type(self) -> TaskType:
        return TaskType.PUSH_TO_HUGGINGFACE

    def __init__(
        self,
        adb_client: AsyncDatabaseClient,
        hf_client: HuggingFaceHubClient
    ):
        super().__init__(adb_client)
        self.hf_client = hf_client

    async def inner_task_logic(self):
        # Check if any valid urls have been updated
        valid_urls_updated = await self.adb_client.check_valid_urls_updated()
        print(f"Valid urls updated: {valid_urls_updated}")
        if not valid_urls_updated:
            print("No valid urls updated, skipping.")
            return

        # Otherwise, push to huggingface
        run_dt = await self.adb_client.get_current_database_time()
        outputs = await self.adb_client.get_data_sources_raw_for_huggingface()
        self.hf_client.push_data_sources_raw_to_hub(outputs)

        await self.adb_client.set_hugging_face_upload_state(run_dt.replace(tzinfo=None))
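
Reviewer note (illustrative, not part of the diff): the control flow of `inner_task_logic` can be exercised without a database or the Hub by substituting in-memory fakes. `FakeDB`, `FakeHub`, and `run_once` are hypothetical names; only the method names on the fakes come from the operator above.

```python
import asyncio
from datetime import datetime, timezone


class FakeDB:
    """In-memory stand-in for AsyncDatabaseClient (hypothetical)."""

    def __init__(self, updated: bool):
        self.updated = updated
        self.upload_state = None

    async def check_valid_urls_updated(self) -> bool:
        return self.updated

    async def get_current_database_time(self) -> datetime:
        return datetime(2025, 7, 26, 8, 30, tzinfo=timezone.utc)

    async def get_data_sources_raw_for_huggingface(self) -> list:
        return [{"url": "https://example.com"}]

    async def set_hugging_face_upload_state(self, dt: datetime) -> None:
        self.upload_state = dt


class FakeHub:
    """In-memory stand-in for HuggingFaceHubClient (hypothetical)."""

    def __init__(self):
        self.pushed = []

    def push_data_sources_raw_to_hub(self, outputs) -> None:
        self.pushed.append(outputs)


async def run_once(db: FakeDB, hub: FakeHub) -> bool:
    """Mirror the operator's steps; return True if an upload happened."""
    if not await db.check_valid_urls_updated():
        return False
    # Timestamp is taken before the data is read, as in the operator.
    run_dt = await db.get_current_database_time()
    outputs = await db.get_data_sources_raw_for_huggingface()
    hub.push_data_sources_raw_to_hub(outputs)
    await db.set_hugging_face_upload_state(run_dt.replace(tzinfo=None))
    return True
```

Because the timestamp is captured before the rows are read, a URL updated while the upload is in flight still has `updated_at` later than the stored state and is picked up on the next run. The `tzinfo` is stripped presumably because the column is declared as `sa.DateTime()` without `timezone=True`.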
Empty file.
Empty file.
14 changes: 14 additions & 0 deletions src/core/tasks/scheduled/huggingface/queries/check/core.py
@@ -0,0 +1,14 @@
from sqlalchemy.ext.asyncio import AsyncSession

from src.core.tasks.scheduled.huggingface.queries.check.requester import CheckValidURLsUpdatedRequester
from src.db.queries.base.builder import QueryBuilderBase


class CheckValidURLsUpdatedQueryBuilder(QueryBuilderBase):

    async def run(self, session: AsyncSession) -> bool:
        requester = CheckValidURLsUpdatedRequester(session=session)
        latest_upload = await requester.latest_upload()
        return await requester.has_valid_urls(latest_upload)
53 changes: 53 additions & 0 deletions src/core/tasks/scheduled/huggingface/queries/check/requester.py
@@ -0,0 +1,53 @@
from datetime import datetime

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.sql.functions import count

from src.collectors.enums import URLStatus
from src.db.helpers.session import session_helper as sh
from src.db.models.instantiations.state.huggingface import HuggingFaceUploadState
from src.db.models.instantiations.url.compressed_html import URLCompressedHTML
from src.db.models.instantiations.url.core.sqlalchemy import URL


class CheckValidURLsUpdatedRequester:

    def __init__(self, session: AsyncSession):
        self.session = session

    async def latest_upload(self) -> datetime | None:
        # Returns None when no upload has been recorded yet.
        query = (
            select(
                HuggingFaceUploadState.last_upload_at
            )
        )
        return await sh.scalar(
            session=self.session,
            query=query
        )

    async def has_valid_urls(self, last_upload_at: datetime | None) -> bool:
        query = (
            select(count(URL.id))
            .join(
                URLCompressedHTML,
                URL.id == URLCompressedHTML.url_id
            )
            .where(
                URL.outcome.in_(
                    [
                        URLStatus.VALIDATED.value,
                        URLStatus.NOT_RELEVANT.value,
                        URLStatus.SUBMITTED.value,
                    ]
                ),
            )
        )
        # Only restrict by update time once a previous upload exists.
        if last_upload_at is not None:
            query = query.where(URL.updated_at > last_upload_at)
        url_count = await sh.scalar(
            session=self.session,
            query=query
        )
        return url_count > 0
Empty file.
16 changes: 16 additions & 0 deletions src/core/tasks/scheduled/huggingface/queries/get/convert.py
@@ -0,0 +1,16 @@
from src.collectors.enums import URLStatus
from src.core.enums import RecordType
from src.core.tasks.scheduled.huggingface.queries.get.enums import RecordTypeCoarse
from src.core.tasks.scheduled.huggingface.queries.get.mappings import FINE_COARSE_RECORD_TYPE_MAPPING, \
    OUTCOME_RELEVANCY_MAPPING


def convert_fine_to_coarse_record_type(
    fine_record_type: RecordType
) -> RecordTypeCoarse:
    return FINE_COARSE_RECORD_TYPE_MAPPING[fine_record_type]


def convert_url_status_to_relevant(
    url_status: URLStatus
) -> bool:
    return OUTCOME_RELEVANCY_MAPPING[url_status]
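
Reviewer note (illustrative, not part of the diff): both converters are plain dictionary lookups, so an enum member without a mapping entry raises `KeyError` at upload time. The sketch below shows the lookup plus a completeness check that a test could run. The mapping contents, the string values, and the `PENDING` member are assumptions, since `mappings.py` is not shown here.

```python
from enum import Enum


class URLStatus(Enum):
    # Member names VALIDATED, NOT_RELEVANT, SUBMITTED appear in the diff;
    # the string values and the PENDING member are placeholders.
    VALIDATED = "validated"
    NOT_RELEVANT = "not relevant"
    SUBMITTED = "submitted"
    PENDING = "pending"


# Assumed mapping: validated and submitted URLs count as relevant.
OUTCOME_RELEVANCY_MAPPING = {
    URLStatus.VALIDATED: True,
    URLStatus.SUBMITTED: True,
    URLStatus.NOT_RELEVANT: False,
}


def convert_url_status_to_relevant(url_status: URLStatus) -> bool:
    """Look up relevancy; raises KeyError for unmapped statuses."""
    return OUTCOME_RELEVANCY_MAPPING[url_status]


def unmapped_members(enum_cls, mapping) -> list:
    """Return enum members that have no entry in the mapping."""
    return [member for member in enum_cls if member not in mapping]
```

Whether unmapped statuses can reach the converter depends on the query that selects the rows. If that query filters on the same three outcomes as the check query, the `KeyError` path should be unreachable in practice.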