This script processes files recursively from a specified directory using an API, logs results in a local SQLite database, and provides options for retrying failed or pending files. It includes features for skipping specific files, generating reports, and running multiple API calls in parallel.
- Parallel Processing: Process files in parallel, with the number of parallel calls configurable.
- Status Tracking: Tracks the execution status, results, and time taken for each file in an SQLite database.
- Retry Logic: Options to retry failed or pending files, or to skip them.
- Detailed Reporting: Prints a summary of file processing and provides a detailed report.
- Polling: Polls the API until the result is complete, with customizable intervals.
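The parallel-processing and polling behavior described above can be sketched as follows. This is an illustration only, not the script's actual implementation: the `fetch` callable stands in for an HTTP GET against the file's status endpoint, and the function names are invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_until_done(fetch, poll_interval=5):
    """Poll a status source until it leaves the PENDING/EXECUTING states.

    `fetch` is any callable returning a status dict (e.g. one wrapping a
    GET request to the file's status API endpoint)."""
    while True:
        status = fetch()
        if status.get("execution_status") not in ("PENDING", "EXECUTING"):
            return status  # terminal state: COMPLETED or ERROR
        time.sleep(poll_interval)


def process_all(files, process_one, parallel_call_count=10):
    """Run process_one over all files with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=parallel_call_count) as pool:
        return list(pool.map(process_one, files))
```

`ThreadPoolExecutor` is a natural fit here because the work is I/O-bound (waiting on API responses), so threads rather than processes suffice.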
Ensure you have the required dependencies installed:
```bash
pip install -r requirements.txt
```

The script uses a local SQLite database (`file_processing.db`) with the following schema:
- `file_status`:
  - `id` (INTEGER): Primary key
  - `file_name` (TEXT): Unique name of the file
  - `execution_status` (TEXT): Status of the file (`STARTING`, `COMPLETED`, `ERROR`, etc.)
  - `result` (TEXT): API result in JSON format
  - `time_taken` (REAL): Time taken to process the file
  - `status_code` (INTEGER): API status code
  - `status_api_endpoint` (TEXT): API endpoint for checking status
  - `total_embedding_cost` (REAL): Total cost incurred for embeddings
  - `total_embedding_tokens` (INTEGER): Total tokens used for embeddings
  - `total_llm_cost` (REAL): Total cost incurred for LLM operations
  - `total_llm_tokens` (INTEGER): Total tokens used for LLM operations
  - `error_message` (TEXT): Details of errors if `execution_status` is `ERROR`; otherwise NULL
  - `updated_at` (TEXT): Last updated timestamp
  - `created_at` (TEXT): Creation timestamp
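The schema above translates to an equivalent `sqlite3` table definition along these lines (the column types come from the schema; the `UNIQUE` constraint on `file_name` follows from its description, and any further constraints are assumptions):

```python
import sqlite3

# Connect to (and create, if absent) the local results database.
conn = sqlite3.connect("file_processing.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS file_status (
        id INTEGER PRIMARY KEY,
        file_name TEXT UNIQUE,
        execution_status TEXT,
        result TEXT,
        time_taken REAL,
        status_code INTEGER,
        status_api_endpoint TEXT,
        total_embedding_cost REAL,
        total_embedding_tokens INTEGER,
        total_llm_cost REAL,
        total_llm_tokens INTEGER,
        error_message TEXT,
        updated_at TEXT,
        created_at TEXT
    )
    """
)
conn.commit()
```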
Run the script with the following options:
```bash
python main.py -h
```

This will display detailed usage information.
- `-e`, `--api_endpoint`: API endpoint for processing files.
- `-k`, `--api_key`: API key for authenticating API calls.
- `-f`, `--input_folder_path`: Folder path containing the files to process.
- `-t`, `--api_timeout`: Timeout (in seconds) for API requests (default: 10).
- `-i`, `--poll_interval`: Interval (in seconds) between API status polls (default: 5).
- `-p`, `--parallel_call_count`: Number of parallel API calls (default: 10).
- `--csv_report`: Path to export the detailed report as a CSV file.
- `--db_path`: Path where the SQLite DB file is stored (default: `./file_processing.db`).
- `--recursive`: Recursively identify and process files from the input folder path (default: False).
- `--retry_failed`: Retry processing of failed files.
- `--retry_pending`: Retry processing of pending files by making new requests.
- `--skip_pending`: Skip processing of pending files.
- `--skip_unprocessed`: Skip unprocessed files when retrying failed files.
- `--log_level`: Log level (default: `INFO`).
- `--print_report`: Print a detailed report of all processed files at the end.
- `--exclude_metadata`: Exclude metadata on tokens consumed and the context passed to LLMs for Prompt Studio exported tools in the result for each file.
- `--no_verify`: Disable SSL certificate verification. (By default, SSL verification is enabled.)
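A subset of these options could be declared with `argparse` roughly as follows. This is a sketch based on the flags and defaults listed above, not the script's actual parser:

```python
import argparse

parser = argparse.ArgumentParser(description="Process files via an API")
parser.add_argument("-e", "--api_endpoint", required=True, help="API endpoint")
parser.add_argument("-k", "--api_key", required=True, help="API key")
parser.add_argument("-f", "--input_folder_path", required=True, help="Input folder")
parser.add_argument("-t", "--api_timeout", type=int, default=10)
parser.add_argument("-i", "--poll_interval", type=int, default=5)
parser.add_argument("-p", "--parallel_call_count", type=int, default=10)
parser.add_argument("--retry_failed", action="store_true")
parser.add_argument("--print_report", action="store_true")

# Example invocation with only the required arguments; defaults fill the rest.
args = parser.parse_args(
    ["-e", "https://api.example.com/process", "-k", "your_api_key", "-f", "/path/to/files"]
)
```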
To process files in the directory /path/to/files using the provided API:
```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files
```

To retry files that previously encountered errors:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --retry_failed
```

To skip files that are still pending:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --skip_pending
```

To process 20 files in parallel:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files -p 20
```

To generate and display a detailed report at the end of the run:

```bash
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --print_report
```

- Database: Results and statuses are stored in a local SQLite database (`file_processing.db`).
- Logging: Logs are printed to stdout with configurable log levels (e.g., `DEBUG`, `INFO`, `ERROR`).
```
Status 'COMPLETED': 50
Status 'ERROR': 10
Status 'PENDING': 5
```
For more detailed output, use the `--print_report` option to get a per-file breakdown.
The following statuses are tracked for each file during processing:
- STARTING: Initial state when processing begins.
- EXECUTING: File is currently being processed.
- PENDING: File processing is pending or waiting for external actions.
- ERROR: File processing encountered an error.
- COMPLETED: File was processed successfully and will not be processed again unless forced by rerun options.
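The statuses and the retry semantics above can be modelled as a small enum plus a decision helper. Both are illustrations: the script may store statuses as plain strings, and `should_reprocess` is an assumed reading of how `--retry_failed` and `--retry_pending` interact with each state:

```python
from enum import Enum


class ExecutionStatus(str, Enum):
    STARTING = "STARTING"
    EXECUTING = "EXECUTING"
    PENDING = "PENDING"
    ERROR = "ERROR"
    COMPLETED = "COMPLETED"


def should_reprocess(status, retry_failed=False, retry_pending=False):
    """Decide whether a file needs another run, per the retry flags.

    Assumed semantics: COMPLETED files are never rerun unless forced,
    ERROR files rerun only with --retry_failed, PENDING files rerun
    only with --retry_pending."""
    if status == ExecutionStatus.COMPLETED:
        return False
    if status == ExecutionStatus.ERROR:
        return retry_failed
    if status == ExecutionStatus.PENDING:
        return retry_pending
    # STARTING / EXECUTING left over from an interrupted run
    return True
```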
For more about statuses, see the API Docs.
On Slack, join great conversations around LLMs, their ecosystem and leveraging them to automate the previously unautomatable!
Unstract client: Learn more about the Unstract API client.
Unstract Cloud: Sign up and try it!
Unstract developer documentation: Learn more about Unstract and its API.