This script processes files recursively from a specified directory using an API, logs results in a local SQLite database, and provides options for retrying failed or pending files. It includes features for skipping specific files, generating reports, and running multiple API calls in parallel.
- Parallel Processing: Process files in parallel, with the number of parallel calls configurable.
- Status Tracking: Tracks the execution status, results, and time taken for each file in an SQLite database.
- Retry Logic: Options to retry failed or pending files, or to skip them.
- Detailed Reporting: Prints a summary of file processing and provides a detailed report.
- Polling: Polls the API until the result is complete, with customizable intervals.
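The parallel-processing and polling behavior described above can be sketched as follows. This is an illustration only, not the script's actual implementation: the `fetch` callable stands in for an HTTP GET against the file's status endpoint, and the function names are invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_until_done(fetch, poll_interval=5):
    """Poll a status source until it leaves the PENDING/EXECUTING states.

    `fetch` is any callable returning a status dict (e.g. one wrapping a
    GET request to the file's status API endpoint)."""
    while True:
        status = fetch()
        if status.get("execution_status") not in ("PENDING", "EXECUTING"):
            return status  # terminal state: COMPLETED or ERROR
        time.sleep(poll_interval)


def process_all(files, process_one, parallel_call_count=10):
    """Run process_one over all files with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=parallel_call_count) as pool:
        return list(pool.map(process_one, files))
```

`ThreadPoolExecutor` is a natural fit here because the work is I/O-bound (waiting on API responses), so threads rather than processes suffice.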
Ensure you have the required dependencies installed:
```bash
pip install -r requirements.txt
```

The script uses a local SQLite database (`file_processing.db`) with the following schema:
- `file_status`:
  - `id` (INTEGER): Primary key
  - `file_name` (TEXT): Unique name of the file
  - `execution_status` (TEXT): Status of the file (`STARTING`, `COMPLETED`, `ERROR`, etc.)
  - `result` (TEXT): API result in JSON format
  - `time_taken` (REAL): Time taken to process the file
  - `status_code` (INTEGER): API status code
  - `status_api_endpoint` (TEXT): API endpoint for checking status
  - `total_embedding_cost` (REAL): Total cost incurred for embeddings
  - `total_embedding_tokens` (INTEGER): Total tokens used for embeddings
  - `total_llm_cost` (REAL): Total cost incurred for LLM operations
  - `total_llm_tokens` (INTEGER): Total tokens used for LLM operations
  - `error_message` (TEXT): Details of errors if `execution_status` is `ERROR`; otherwise NULL
  - `updated_at` (TEXT): Last updated timestamp
  - `created_at` (TEXT): Creation timestamp
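The schema above translates to an equivalent `sqlite3` table definition along these lines (the column types come from the schema; the `UNIQUE` constraint on `file_name` follows from its description, and any further constraints are assumptions):

```python
import sqlite3

# Connect to (and create, if absent) the local results database.
conn = sqlite3.connect("file_processing.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS file_status (
        id INTEGER PRIMARY KEY,
        file_name TEXT UNIQUE,
        execution_status TEXT,
        result TEXT,
        time_taken REAL,
        status_code INTEGER,
        status_api_endpoint TEXT,
        total_embedding_cost REAL,
        total_embedding_tokens INTEGER,
        total_llm_cost REAL,
        total_llm_tokens INTEGER,
        error_message TEXT,
        updated_at TEXT,
        created_at TEXT
    )
    """
)
conn.commit()
```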
Run the script with the following options:
```bash
python main.py -h
```

This will display detailed usage information.
- `-e`, `--api_endpoint`: API endpoint for processing files.
- `-k`, `--api_key`: API key for authenticating API calls.
- `-f`, `--input_folder_path`: Folder path containing the files to process.
- `-t`, `--api_timeout`: Timeout (in seconds) for API requests (default: 10).
- `-i`, `--poll_interval`: Interval (in seconds) between API status polls (default: 5).
- `-p`, `--parallel_call_count`: Number of parallel API calls (default: 10).
- `--csv_report`: Path to export the detailed report as a CSV file.
- `--db_path`: Path where the SQLite DB file is stored (default: `./file_processing.db`).
- `--recursive`: Recursively identify and process files from the input folder path (default: False).
- `--retry_failed`: Retry processing of failed files.
- `--retry_pending`: Retry processing of pending files by making new requests.
- `--skip_pending`: Skip processing of pending files.
- `--skip_unprocessed`: Skip unprocessed files when retrying failed files.
- `--log_level`: Log level (default: `INFO`).
- `--print_report`: Print a detailed report of all processed files at the end.
- `--exclude_metadata`: Exclude metadata on tokens consumed and the context passed to LLMs for Prompt Studio exported tools in the result for each file.
- `--no_verify`: Disable SSL certificate verification. (By default, SSL verification is enabled.)
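A subset of these options could be declared with `argparse` roughly as follows. This is a sketch based on the flags and defaults listed above, not the script's actual parser:

```python
import argparse

parser = argparse.ArgumentParser(description="Process files via an API")
parser.add_argument("-e", "--api_endpoint", required=True, help="API endpoint")
parser.add_argument("-k", "--api_key", required=True, help="API key")
parser.add_argument("-f", "--input_folder_path", required=True, help="Input folder")
parser.add_argument("-t", "--api_timeout", type=int, default=10)
parser.add_argument("-i", "--poll_interval", type=int, default=5)
parser.add_argument("-p", "--parallel_call_count", type=int, default=10)
parser.add_argument("--retry_failed", action="store_true")
parser.add_argument("--print_report", action="store_true")

# Example invocation with only the required arguments; defaults fill the rest.
args = parser.parse_args(
    ["-e", "https://api.example.com/process", "-k", "your_api_key", "-f", "/path/to/files"]
)
```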
To process files in the directory /path/to/files using the provided API:
```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files
```

To retry files that previously encountered errors:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --retry_failed
```

To skip files that are still pending:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --skip_pending
```

To process 20 files in parallel:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files -p 20
```

To generate and display a detailed report at the end of the run:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --print_report
```

- Database: Results and statuses are stored in a local SQLite database (`file_processing.db`).
- Logging: Logs are printed to stdout with configurable log levels (e.g., `DEBUG`, `INFO`, `ERROR`).
```
Status 'COMPLETED': 50
Status 'ERROR': 10
Status 'PENDING': 5
```
For more detailed output, use the `--print_report` option to get a per-file breakdown.
The following statuses are tracked for each file during processing:
- STARTING: Initial state when processing begins.
- EXECUTING: File is currently being processed.
- PENDING: File processing is pending or waiting for external actions.
- ERROR: File processing encountered an error.
- COMPLETED: File was processed successfully and will not be processed again unless forced by rerun options.
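The statuses and the retry semantics above can be modelled as a small enum plus a decision helper. Both are illustrations: the script may store statuses as plain strings, and `should_reprocess` is an assumed reading of how `--retry_failed` and `--retry_pending` interact with each state:

```python
from enum import Enum


class ExecutionStatus(str, Enum):
    STARTING = "STARTING"
    EXECUTING = "EXECUTING"
    PENDING = "PENDING"
    ERROR = "ERROR"
    COMPLETED = "COMPLETED"


def should_reprocess(status, retry_failed=False, retry_pending=False):
    """Decide whether a file needs another run, per the retry flags.

    Assumed semantics: COMPLETED files are never rerun unless forced,
    ERROR files rerun only with --retry_failed, PENDING files rerun
    only with --retry_pending."""
    if status == ExecutionStatus.COMPLETED:
        return False
    if status == ExecutionStatus.ERROR:
        return retry_failed
    if status == ExecutionStatus.PENDING:
        return retry_pending
    # STARTING / EXECUTING left over from an interrupted run
    return True
```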
For more about statuses, see the API Docs.
On Slack, join great conversations around LLMs, their ecosystem and leveraging them to automate the previously unautomatable!
Unstract client: Learn more about the Unstract API client.
Unstract Cloud: Sign up and try it!
Unstract developer documentation: Learn more about Unstract and its API.