The current task scheduling system is fairly straightforward:
Every hour, the source identifier rotates through a suite of tasks: HTML scraping, automatic relevancy labeling, and so on.
For each task, the app runs a query against the database to check whether the task's preconditions are met (e.g., for HTML scraping, whether there is a pending URL where HTML scraping has not yet been attempted).
If the precondition is met, the task runs.
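The loop described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the task names, the in-memory lists standing in for database queries, and the function names are all hypothetical.

```python
# Hypothetical stand-ins for database state; the real app queries a DB.
PENDING_URLS = ["https://example.com/a", "https://example.com/b"]
SCRAPED = []

def html_scraping_precondition():
    """Stand-in for the precondition query: is there a pending URL
    where HTML scraping has not yet been attempted?"""
    return bool(PENDING_URLS)

def run_html_scraping():
    """Stand-in for the HTML scraping task itself."""
    SCRAPED.append(PENDING_URLS.pop())

# Each entry pairs a precondition check with the task it gates.
TASKS = [
    (html_scraping_precondition, run_html_scraping),
    # (relevancy_labeling_precondition, run_relevancy_labeling), ...
]

def run_cycle():
    """One hourly pass: check each task's precondition and run the
    task only if the precondition holds."""
    for precondition, task in TASKS:
        if precondition():
            task()

run_cycle()
```

The key property to note is that work only happens when a cycle fires, which is what produces the latency discussed below.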
This works for completing tasks eventually, but it runs into problems if we initiate a batch collection process and then want to begin annotating the data immediately. Currently, after a batch finishes collecting data, it can take up to an hour before follow-up tasks run on that data. And because tasks operate on at most 100 URLs at a time, a batch with more than 100 URLs could take several hours to fully process.
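To make the latency concrete, here is the worst-case arithmetic implied by the numbers above (one cycle per hour, 100 URLs per task invocation); the function name and constants are illustrative, not taken from the codebase.

```python
import math

BATCH_LIMIT = 100   # URLs a task processes per invocation (from the issue)
CYCLE_HOURS = 1     # the scheduler fires once per hour

def worst_case_hours(n_urls: int) -> int:
    """Hours until every URL in a batch has been handled by a single
    task stage, assuming each hourly cycle processes at most
    BATCH_LIMIT of the batch's URLs."""
    return math.ceil(n_urls / BATCH_LIMIT) * CYCLE_HOURS
```

So a 250-URL batch needs three hourly cycles per task stage before every URL has been touched, and each additional stage (scraping, then labeling, and so on) adds its own cycles on top.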
There are ways to optimize the process so that data is processed more quickly, but it will take some design work first and then some implementation. That's what this issue is about.