
Use chunked file reading to avoid loading entire files into memory #37

Open
JanTvrdik wants to merge 1 commit into main from chunked-file-reading

Conversation

@JanTvrdik
Member

Summary

  • Added BufferedFileParseTrait with shared logic that reads files in 64 KiB chunks via fopen()/fread() instead of file_get_contents(), so memory usage is proportional to the largest single query rather than the entire file size (a rough sketch of such a loop follows this list)
  • Refactored all three parsers (MySqlMultiQueryParser, PostgreSqlMultiQueryParser, SqlServerMultiQueryParser) to use the trait — no changes to regex patterns or public API
  • Includes a safety check to prevent \z regex anchors from falsely matching at chunk boundaries before EOF
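
A rough sketch of the kind of loop described above, assuming hypothetical names (readQueriesChunked, CHUNK_SIZE) and a pattern anchored at the start of the buffer; this illustrates the technique, not the actual BufferedFileParseTrait code:

<?php

const CHUNK_SIZE = 64 * 1024; // 64 KiB

/** @return iterable<string> */
function readQueriesChunked(string $path, string $pattern): iterable
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open $path");
    }

    $buffer = '';
    $eof = false;

    while (true) {
        // Accept a match only if it stops before the end of the buffer or we
        // are at EOF; this is the guard that keeps \z from matching early at
        // a chunk boundary.
        if (preg_match($pattern, $buffer, $match) === 1
            && ($eof || strlen($match[0]) < strlen($buffer))
        ) {
            yield $match[1] ?? $match[0];                 // hand out one query
            $buffer = substr($buffer, strlen($match[0])); // drop consumed input
            continue;
        }

        if ($eof) {
            if (trim($buffer) !== '') {
                throw new \RuntimeException('Unparsable trailing input');
            }
            break;
        }

        // No complete query buffered yet: append another 64 KiB chunk.
        $chunk = fread($handle, CHUNK_SIZE);
        if ($chunk === false) {
            throw new \RuntimeException("Read error on $path");
        }
        $buffer .= $chunk;
        $eof = feof($handle);
    }

    fclose($handle);
}

The buffer only ever holds roughly one unconsumed query plus one chunk, which is where the "proportional to the largest single query" bound comes from.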

Test plan

  • All existing tests pass (composer tests — 4 tests, 0 failures)
  • PHPStan reports no errors (composer phpstan)
  • Verify with a large SQL file (100+ MB) that memory stays bounded (one way to do this by hand is sketched below)
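
For the large-file check, something like the following would do; the MySqlMultiQueryParser class is from this PR, but its constructor and the parse() entry point used here are assumptions for illustration:

<?php

// Generate a SQL file well over 100 MB, stream-parse it, and compare peak
// memory against the file size.
$path = sys_get_temp_dir() . '/large.sql';
$out = fopen($path, 'wb');
for ($i = 0; $i < 1_000_000; $i++) {
    fwrite($out, "INSERT INTO t (id, payload) VALUES ($i, '" . str_repeat('x', 80) . "');\n");
}
fclose($out);

$count = 0;
foreach ((new MySqlMultiQueryParser())->parse($path) as $query) {
    $count++; // consume the generator without retaining queries
}

printf(
    "%d queries, peak memory %.1f MiB, file size %.1f MiB\n",
    $count,
    memory_get_peak_usage(true) / 1048576,
    filesize($path) / 1048576
);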

Copilot AI review requested due to automatic review settings February 18, 2026 15:06
@JanTvrdik force-pushed the chunked-file-reading branch from a7322e5 to 04aafe6 on February 18, 2026 15:15
Replace file_get_contents() with buffered fopen()/fread() via a shared
BufferedFileParseTrait. Memory usage is now proportional to the largest
single query rather than the entire file size, which matters for large
SQL files (100+ MB). The parsers already use generators for output, so
this completes the streaming pipeline on the input side.
@JanTvrdik force-pushed the chunked-file-reading branch from 04aafe6 to d768638 on February 18, 2026 15:19

Copilot AI left a comment


Pull request overview

This PR refactors the multi-query parser to use chunked file reading instead of loading entire files into memory, aiming to reduce memory usage for large SQL files. The implementation introduces a shared BufferedFileParseTrait that reads files in 64 KiB chunks using fopen()/fread() and refactors all three database-specific parsers (MySQL, PostgreSQL, SQL Server) to use this trait. A safety mechanism is included to prevent false \z regex anchor matches at chunk boundaries.

Changes:

  • Introduced BufferedFileParseTrait with chunked file reading logic (64 KiB chunks)
  • Refactored MySQL, PostgreSQL, and SQL Server parsers to use the new trait with callback-based pattern processing (a hypothetical sketch of that hand-off follows this list)
  • Added safety check to handle \z anchor edge cases at chunk boundaries
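
As referenced above, a hypothetical sketch of the callback hand-off; parseBuffered(), its signature, and the placeholder pattern are guesses for illustration, not the API actually introduced in this PR:

<?php

final class PostgreSqlMultiQueryParser
{
    use BufferedFileParseTrait; // provides the chunked-reading loop

    // Placeholder pattern, not the real one: one statement up to a semicolon.
    private const QUERY_PATTERN = '~\G\s*([^;]+);~';

    /** @return iterable<string> */
    public function parseFile(string $path): iterable
    {
        // The trait owns buffering and matching; the callback only decides
        // what to produce for each match (here, the trimmed query text).
        yield from $this->parseBuffered(
            $path,
            self::QUERY_PATTERN,
            static fn (array $match): string => trim($match[1])
        );
    }
}

Under a shape like this, the PostgreSQL and SQL Server parsers can use static callbacks, while the MySQL parser would pass a bound closure so it can react to DELIMITER changes mid-file, matching the per-file notes further down.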

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • src/BufferedFileParseTrait.php: New trait implementing chunked file reading with buffering, pattern matching, and memory management logic
  • src/MySqlMultiQueryParser.php: Refactored to use BufferedFileParseTrait with callback handling for dynamic delimiter changes
  • src/PostgreSqlMultiQueryParser.php: Refactored to use BufferedFileParseTrait with a static callback for query extraction
  • src/SqlServerMultiQueryParser.php: Refactored to use BufferedFileParseTrait with a static callback for query extraction


break;
}

if (preg_match($pattern, $buffer, $match, 0, $offset) !== 1) {

Copilot AI Feb 18, 2026


The buffering logic will fail for queries larger than 64 KiB. After reading the first chunk (line 38), if the remaining buffer is >= chunkSize (line 37), the inner loop exits. If preg_match then fails because the query is incomplete (no delimiter found yet), the outer loop breaks (line 52), and a RuntimeException is thrown (line 92).

The fix should ensure that if preg_match fails and we're not at EOF, we continue reading more data instead of breaking. One approach would be to:

  1. Always try to read at least one more chunk after a failed match (if not at EOF)
  2. Or, change the inner loop condition to ensure we have enough buffer data to match complete patterns (a rough sketch of this variant follows the suggested change below)

This is critical because the PR's goal is to handle large files without loading them entirely into memory, but it fails for any individual query exceeding the chunk size.

Suggested change (replace the existing line with the block below):

if (preg_match($pattern, $buffer, $match, 0, $offset) !== 1) {
    // If no match is found but we're not at EOF, try to read more data
    if (!$eof) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk !== false && $chunk !== '') {
            $buffer .= $chunk;
            $eof = feof($handle);
            continue; // retry matching with more data in the buffer
        }
        $eof = feof($handle);
    }
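
For the second approach listed above (ensure the buffer holds enough data before matching), the inner read loop's condition could instead be relaxed along these lines; a sketch against the local names visible in the suggestion, not a tested patch:

// Keep filling the buffer while not at EOF and no match is possible yet,
// instead of stopping as soon as one chunk's worth of data is buffered.
while (!$eof && preg_match($pattern, $buffer, $match, 0, $offset) !== 1) {
    $chunk = fread($handle, $chunkSize);
    if ($chunk === false || $chunk === '') {
        $eof = feof($handle);
        break;
    }
    $buffer .= $chunk;
    $eof = feof($handle);
}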
