Merged
3 changes: 0 additions & 3 deletions Dockerfile
@@ -22,9 +22,6 @@ COPY collector_db ./collector_db
COPY collector_manager ./collector_manager
COPY core ./core
COPY html_tag_collector ./html_tag_collector
-COPY hugging_face/url_relevance ./hugging_face/url_relevance
-COPY hugging_face/url_record_type_labeling ./hugging_face/url_record_type_labeling
-COPY hugging_face/HuggingFaceInterface.py ./hugging_face/HuggingFaceInterface.py
COPY source_collectors ./source_collectors
COPY util ./util
COPY alembic.ini ./alembic.ini
1 change: 0 additions & 1 deletion README.md
@@ -8,7 +8,6 @@ name | description of purpose
agency_identifier | Matches URLs with an agency from the PDAP database
annotation_pipeline | Automated pipeline for generating training data in our ML data source identification models. Manages common crawl, HTML tag collection, and Label Studio import/export
html_tag_collector | Collects HTML header, meta, and title tags and appends them to a JSON file. The idea is to make a richer dataset for algorithm training and data labeling.
-hugging_face | Utilities for interacting with our machine learning space at [Hugging Face](https://huggingface.co/PDAP)
identification_pipeline.py | The core python script uniting this modular pipeline. More details below.
openai-playground | Scripts for accessing the openai API on PDAP's shared account
source_collectors| Tools for extracting metadata from different sources, including CKAN data portals and Common Crawler
2 changes: 0 additions & 2 deletions api/main.py
@@ -26,7 +26,6 @@
from html_tag_collector.ResponseParser import HTMLResponseParser
from html_tag_collector.RootURLCache import RootURLCache
from html_tag_collector.URLRequestInterface import URLRequestInterface
-from hugging_face.HuggingFaceInterface import HuggingFaceInterface
from pdap_access_manager import AccessManager
from pdap_api_client.PDAPClient import PDAPClient
from util.DiscordNotifier import DiscordPoster
@@ -54,7 +53,6 @@
)
task_manager = TaskManager(
adb_client=adb_client,
-huggingface_interface=HuggingFaceInterface(),
url_request_interface=URLRequestInterface(),
html_parser=HTMLResponseParser(
root_url_cache=RootURLCache()
@@ -118,7 +116,7 @@
)

@app.get("/docs", include_in_schema=False)
async def redirect_docs():

[flake8] ./api/main.py:119:1: D103 Missing docstring in public function
return RedirectResponse(url="/api")


@@ -139,3 +137,3 @@


if __name__ == "__main__":
10 changes: 0 additions & 10 deletions core/DTOs/task_data_objects/URLRelevanceHuggingfaceTDO.py

This file was deleted.

14 changes: 1 addition & 13 deletions core/TaskManager.py
@@ -1,4 +1,4 @@
import logging

[flake8] ./core/TaskManager.py:1:1: D100 Missing docstring in public module

from core.classes.task_operators.URL404ProbeTaskOperator import URL404ProbeTaskOperator
from core.classes.task_operators.URLDuplicateTaskOperator import URLDuplicateTaskOperator
@@ -15,32 +15,28 @@
from core.classes.task_operators.URLHTMLTaskOperator import URLHTMLTaskOperator
from core.classes.task_operators.URLMiscellaneousMetadataTaskOperator import URLMiscellaneousMetadataTaskOperator
from core.classes.task_operators.URLRecordTypeTaskOperator import URLRecordTypeTaskOperator
-from core.classes.task_operators.URLRelevanceHuggingfaceTaskOperator import URLRelevanceHuggingfaceTaskOperator
from core.enums import BatchStatus
from html_tag_collector.ResponseParser import HTMLResponseParser
from html_tag_collector.URLRequestInterface import URLRequestInterface
-from hugging_face.HuggingFaceInterface import HuggingFaceInterface
from llm_api_logic.OpenAIRecordClassifier import OpenAIRecordClassifier
from pdap_api_client.PDAPClient import PDAPClient
from util.DiscordNotifier import DiscordPoster

TASK_REPEAT_THRESHOLD = 20

class TaskManager:

[flake8] ./core/TaskManager.py:27:1: D101 Missing docstring in public class

def __init__(

[flake8] ./core/TaskManager.py:29:1: D107 Missing docstring in __init__
self,
adb_client: AsyncDatabaseClient,
-huggingface_interface: HuggingFaceInterface,
url_request_interface: URLRequestInterface,
html_parser: HTMLResponseParser,
discord_poster: DiscordPoster,
-pdap_client: PDAPClient
+pdap_client: PDAPClient,
):
# Dependencies
self.adb_client = adb_client
self.pdap_client = pdap_client
-self.huggingface_interface = huggingface_interface
self.url_request_interface = url_request_interface
self.html_parser = html_parser
self.discord_poster = discord_poster
@@ -53,8 +49,8 @@



#region Task Operators

[flake8] ./core/TaskManager.py:52:5: E303 too many blank lines (3)
[flake8] ./core/TaskManager.py:52:5: E265 block comment should start with '# '
async def get_url_html_task_operator(self):

[flake8] ./core/TaskManager.py:53:1: D102 Missing docstring in public method
[flake8] ./core/TaskManager.py:53:5: E301 expected 1 blank line, found 0
operator = URLHTMLTaskOperator(
adb_client=self.adb_client,
url_request_interface=self.url_request_interface,
@@ -62,21 +58,14 @@
)
return operator

-async def get_url_relevance_huggingface_task_operator(self):
-operator = URLRelevanceHuggingfaceTaskOperator(
-adb_client=self.adb_client,
-huggingface_interface=self.huggingface_interface
-)
-return operator

async def get_url_record_type_task_operator(self):

[flake8] ./core/TaskManager.py:61:1: D102 Missing docstring in public method
operator = URLRecordTypeTaskOperator(
adb_client=self.adb_client,
classifier=OpenAIRecordClassifier()
)
return operator

async def get_agency_identification_task_operator(self):

[flake8] ./core/TaskManager.py:68:1: D102 Missing docstring in public method
muckrock_api_interface = MuckrockAPIInterface()
operator = AgencyIdentificationTaskOperator(
adb_client=self.adb_client,
@@ -85,52 +74,51 @@
)
return operator

async def get_submit_approved_url_task_operator(self):

[flake8] ./core/TaskManager.py:77:1: D102 Missing docstring in public method
operator = SubmitApprovedURLTaskOperator(
adb_client=self.adb_client,
pdap_client=self.pdap_client
)
return operator

async def get_url_miscellaneous_metadata_task_operator(self):

[flake8] ./core/TaskManager.py:84:1: D102 Missing docstring in public method
operator = URLMiscellaneousMetadataTaskOperator(
adb_client=self.adb_client
)
return operator

async def get_url_duplicate_task_operator(self):

[flake8] ./core/TaskManager.py:90:1: D102 Missing docstring in public method
operator = URLDuplicateTaskOperator(
adb_client=self.adb_client,
pdap_client=self.pdap_client
)
return operator

async def get_url_404_probe_task_operator(self):

[flake8] ./core/TaskManager.py:97:1: D102 Missing docstring in public method
operator = URL404ProbeTaskOperator(
adb_client=self.adb_client,
url_request_interface=self.url_request_interface
)
return operator

async def get_task_operators(self) -> list[TaskOperatorBase]:

[flake8] ./core/TaskManager.py:104:1: D102 Missing docstring in public method
return [
await self.get_url_html_task_operator(),
await self.get_url_duplicate_task_operator(),
await self.get_url_404_probe_task_operator(),
-# await self.get_url_relevance_huggingface_task_operator(),
await self.get_url_record_type_task_operator(),
await self.get_agency_identification_task_operator(),
await self.get_url_miscellaneous_metadata_task_operator(),
await self.get_submit_approved_url_task_operator()
]

#endregion

[flake8] ./core/TaskManager.py:115:5: E265 block comment should start with '# '

#region Tasks

[flake8] ./core/TaskManager.py:117:5: E265 block comment should start with '# '
async def set_task_status(self, task_type: TaskType):

[flake8] ./core/TaskManager.py:118:1: D102 Missing docstring in public method
self.task_status = task_type

async def run_tasks(self):

[flake8] ./core/TaskManager.py:121:1: D102 Missing docstring in public method
operators = await self.get_task_operators()
for operator in operators:
count = 0
@@ -153,23 +141,23 @@
meets_prereq = await operator.meets_task_prerequisites()
await self.set_task_status(task_type=TaskType.IDLE)

async def trigger_task_run(self):

[flake8] ./core/TaskManager.py:144:1: D102 Missing docstring in public method
await self.task_trigger.trigger_or_rerun()


async def conclude_task(self, run_info):

[flake8] ./core/TaskManager.py:148:1: D102 Missing docstring in public method
[flake8] ./core/TaskManager.py:148:5: E303 too many blank lines (2)
await self.adb_client.link_urls_to_task(
task_id=run_info.task_id,
url_ids=run_info.linked_url_ids
)
await self.handle_outcome(run_info)

async def initiate_task_in_db(self, task_type: TaskType) -> int:

[flake8] ./core/TaskManager.py:155:1: D102 Missing docstring in public method
self.logger.info(f"Initiating {task_type.value} Task")
task_id = await self.adb_client.initiate_task(task_type=task_type)
return task_id

async def handle_outcome(self, run_info: TaskOperatorRunInfo):

[flake8] ./core/TaskManager.py:160:1: D102 Missing docstring in public method
match run_info.outcome:
case TaskOperatorOutcome.ERROR:
await self.handle_task_error(run_info)
@@ -179,7 +167,7 @@
status=BatchStatus.READY_TO_LABEL
)

async def handle_task_error(self, run_info: TaskOperatorRunInfo):

[flake8] ./core/TaskManager.py:170:1: D102 Missing docstring in public method
await self.adb_client.update_task_status(
task_id=run_info.task_id,
status=BatchStatus.ERROR)
@@ -190,10 +178,10 @@
await self.discord_poster.post_to_discord(
message=f"Task {run_info.task_id} ({self.task_status.value}) failed with error.")

async def get_task_info(self, task_id: int) -> TaskInfo:

[flake8] ./core/TaskManager.py:181:1: D102 Missing docstring in public method
return await self.adb_client.get_task_info(task_id=task_id)

async def get_tasks(

[flake8] ./core/TaskManager.py:184:1: D102 Missing docstring in public method
self,
page: int,
task_type: TaskType,
@@ -206,7 +194,7 @@
)


#endregion

[flake8] ./core/TaskManager.py:197:5: E303 too many blank lines (2)
[flake8] ./core/TaskManager.py:197:5: E265 block comment should start with '# '



[flake8] ./core/TaskManager.py:200:1: W391 blank line at end of file
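
The flake8 findings reported for core/TaskManager.py fall mostly into two groups: missing docstrings (D100, D101, D102, D107) and region markers written as `#region` / `#endregion` without a space after `#` (E265). A minimal sketch of how these could be addressed, assuming the region markers are kept. `TaskManagerSketch`, its single constructor parameter, the docstring wording, and the `missing_docstrings` helper are hypothetical illustrations and not part of this PR:

```python
"""Sketch of a module shaped to avoid the flake8 findings listed above (D100)."""
import ast


class TaskManagerSketch:
    """Hypothetical stand-in for TaskManager, not the project's class (D101)."""

    def __init__(self, adb_client):
        """Store the injected database client (D107)."""
        self.adb_client = adb_client

    # region Task Operators
    # E265 is avoided by the space after '#': '# region', not '#region'.
    async def get_task_operators(self) -> list:
        """Return the task operators to run, in order (D102)."""
        return []
    # endregion


def missing_docstrings(source: str) -> list[str]:
    """List the names of classes and functions in `source` without a docstring."""
    tree = ast.parse(source)
    definitions = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, definitions) and ast.get_docstring(node) is None
    ]
```

The helper is only a quick local check for the docstring findings; it does not replace running flake8, which also covers the blank-line findings (E301, E303, W391) that this sketch does not try to detect.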

This file was deleted.

40 changes: 0 additions & 40 deletions hugging_face/HuggingFaceInterface.py

This file was deleted.

58 changes: 0 additions & 58 deletions hugging_face/README.md

This file was deleted.

Empty file removed hugging_face/__init__.py
Empty file.
Empty file removed hugging_face/example/__init__.py
Empty file.
53 changes: 0 additions & 53 deletions hugging_face/example/huggingface_test.py

This file was deleted.

28 changes: 0 additions & 28 deletions hugging_face/example/labels.txt

This file was deleted.

43 changes: 0 additions & 43 deletions hugging_face/example/split_data.py

This file was deleted.
