The current task scheduling system is fairly straightforward:
Every hour, the source identifier rotates through a suite of tasks: HTML scraping, automatic relevancy labeling, and so on.
For each task, the app runs a query against the database to check whether the task's preconditions are met (e.g., for HTML scraping, whether there is a pending URL where HTML scraping has not yet been attempted).
If the precondition is met, the task runs.
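The loop described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the task names, the in-memory lists standing in for database queries, and the function names are all hypothetical.

```python
# Hypothetical stand-ins for database state; the real app queries a DB.
PENDING_URLS = ["https://example.com/a", "https://example.com/b"]
SCRAPED = []

def html_scraping_precondition():
    """Stand-in for the precondition query: is there a pending URL
    where HTML scraping has not yet been attempted?"""
    return bool(PENDING_URLS)

def run_html_scraping():
    """Stand-in for the HTML scraping task itself."""
    SCRAPED.append(PENDING_URLS.pop())

# Each entry pairs a precondition check with the task it gates.
TASKS = [
    (html_scraping_precondition, run_html_scraping),
    # (relevancy_labeling_precondition, run_relevancy_labeling), ...
]

def run_cycle():
    """One hourly pass: check each task's precondition and run the
    task only if the precondition holds."""
    for precondition, task in TASKS:
        if precondition():
            task()

run_cycle()
```

The key property to note is that work only happens when a cycle fires, which is what produces the latency discussed below.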
This works for completing tasks eventually, but it runs into problems if we initiate a batch collection process and then want to begin annotating the data immediately. Currently, after a batch finishes collecting data, it can take up to an hour before follow-up tasks run on that data. And because tasks operate on at most 100 URLs at a time, a batch with more than 100 URLs could take several hours to fully process.
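To make the latency concrete, here is the worst-case arithmetic implied by the numbers above (one cycle per hour, 100 URLs per task invocation); the function name and constants are illustrative, not taken from the codebase.

```python
import math

BATCH_LIMIT = 100   # URLs a task processes per invocation (from the issue)
CYCLE_HOURS = 1     # the scheduler fires once per hour

def worst_case_hours(n_urls: int) -> int:
    """Hours until every URL in a batch has been handled by a single
    task stage, assuming each hourly cycle processes at most
    BATCH_LIMIT of the batch's URLs."""
    return math.ceil(n_urls / BATCH_LIMIT) * CYCLE_HOURS
```

So a 250-URL batch needs three hourly cycles per task stage before every URL has been touched, and each additional stage (scraping, then labeling, and so on) adds its own cycles on top.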
There are ways to optimize the process so that data is processed more quickly, but it will take some design work first and then some implementation. That's what this issue is about.