 * A lightweight, fast, and optimized XML file splitter with built-in tag data validation. Its main goal is to split an XML file into multiple small chunks (hence the name) and save them as separate, smaller XML files, so that slower servers, plugins, etc. can process XML files with more than 10,000 records. It is built on XMLParser, a powerful PHP XML processing library.
 *
 * @author Borsodi Gergő
 * @version 1.0
 */
class Chunker {

    private string $xmlFile;
    private readonly int $chunkSize;
    private int $CHUNKS;
    private string $PAYLOAD = '';
    private string $PAYLOAD_TEMP = '';
    private string $rootTag;
    private string $CHARSET;
    private string $outputFilePrefix;
    private int $ITEMCOUNT = 0;
    private string $CHUNKON;
    private string $log = "";
    private int $totalItems = 0;
    private bool $excludedItemFound = false;
    private bool $checkNextData = false;
    private string $checkNextDataTag = '';
    private array $checkingTags = array();
    private $passesValidation;

    /**
     * The constructor of the class; it creates an instance of Chunker.
     *
     * @param string $xmlfile The path of the XML file.
     * @param int $chunkSize The maximum number of main XML elements (the main tag is specified later) that each chunk file may contain. **Default: 100**
     * @param string $outputFilePrefix The prefix for the chunk filenames. The pattern is the following: *{outputFilePrefix}{CHUNK_NUMBER}.xml* **Default: 'out-'.** Example files with the default prefix: 'out-1.xml', 'out-2.xml', etc.
     * @param callable $validationFunction The validator function, run every time the parser finds a tag that is listed in $checkingTags. If the function returns **true** (the tag data is *valid*), the element is included in the chunk; otherwise it is ignored. The validator must return **bool** and cannot be **null**; if it is null, a fatal error is raised. The callback must accept the following parameters:
     *   - $data: string, the data inside the currently processed tag
     *   - $tag: string, the name of the currently processed tag
     * @param array $checkingTags The names of the tags whose data has to be validated. It can be empty or omitted if no validation is required (unlike the validator function, which must always be provided; otherwise an error is raised).
        if(empty($xmlfile)) trigger_error("[Chunker] Fatal error: no XML file/empty filestring specified in __construct.", E_USER_ERROR);
        if(!$validationFunction) trigger_error("[Chunker] Fatal error: no callback handler specified for validation checks.", E_USER_ERROR);
        $this->checkingTags = $checkingTags;
        $this->passesValidation = $validationFunction;
        $this->xmlFile = $xmlfile;
        $this->chunkSize = $chunkSize;
        $this->CHUNKS = 0;
        $this->outputFilePrefix = $outputFilePrefix;
    }

    /**
     * Processes a whole chunk (size <= $chunkSize) by writing the **PAYLOAD** into a chunk file and resetting all state variables.
     * @param bool $lastChunk Indicates whether the current chunk is the last one in the file. If this is not indicated for the last chunk, the closing tag may be missing.
     * A handler function used by the parser for starting elements. It checks whether the currently parsed tag is present in the $checkingTags array and, if validation is needed, sets the corresponding state variables.
     * @param XMLParser $xml The parser
     * @param string $tag The currently parsed tag
     * @param array $attrs An array of the tag's attributes. It is not used here; it is only present to match the handler signature.
     * A handler function used by the parser for ending elements. It checks whether the main element being parsed contained any tags listed in $checkingTags whose data failed validation. If so, the last parsed main element is excluded from the chunking process; otherwise it is written into a chunk file. Once the number of processed main elements reaches the $chunkSize limit, a new chunk is written to disk.
     * @param XMLParser $xml The parser
     * @param string $tag The currently parsed tag
     */
    private function endElement($xml, $tag) {
        //GLOBAL $CHUNKON, $ITEMCOUNT, $ITEMLIMIT;
        //$this->logging("New closing element: " .$tag);
        $this->dataHandler(null, "</{$tag}>");
        if ($this->CHUNKON == $tag) {
            $this->logging("Closing ".$this->CHUNKON." element found");

            if($this->excludedItemFound){
                // there was an item that did not pass validation
                $this->logging("Excluded item found, skipping current " .$this->CHUNKON."..");
     * A handler function used by the parser for data between tags. If the $checkNextData property is set to true, the currently parsed data has to be validated. If it does not pass validation, the main element is flagged as 'excluded from chunking' and is not written to disk.
     * A function to start the chunking process. It initializes the parser instance and starts the XML parsing, writing out a chunk every $chunkSize main elements.
     * @param string $mainTag The tag used to count the number of main elements in a chunk. Usually the second-level XML tag in a document.
     * @param string $rootTag The root tag, of which every $mainTag element is a child. There is only one of these in an XML document (not to be confused with the XML header on the first line).
     * @param string $charset The character set used by the parser. **Default: UTF-8**
     *
     * @return string The main log that was created during the chunking
            trigger_error("Could not open XML file", E_USER_ERROR);
        }
        $this->logging("Opened XML File");
        $this->CHUNKS = 0;
        $this->totalItems = 0;
        $this->excludedItemFound = false;
        $this->checkNextData = false;
        $this->checkNextDataTag = '';
        $this->PAYLOAD = '';
        $this->PAYLOAD_TEMP = '';
        while(!feof($fp)) {
            //$this->logging("Reading new line...");
            $chunk = fgets($fp, 10240);
            if(!$chunk){
                $this->logging("Reading new line failed, next try");
            }
            if(xml_parse($xml, $chunk, feof($fp)) == 0){
                $this->logging("Could not parse line. Next try...");
            }

        }
        xml_parser_free($xml);

        // Now, it is possible that one last chunk is still queued for processing.
        $this->processChunk(true);
        $this->logging("Ended chunking. Total processed '" .$this->CHUNKON."' objects: " .$this->totalItems);
        return nl2br($this->log);
    }

    /**
     * Used for administrative purposes. A message can be logged into the internal logging variable, which some functions later return.
     * @param string $msg The message to be logged
     * @param bool $start Indicates whether logging should start over (previously logged messages are deleted and the logging variable is cleared). **Default: false**
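The docblocks above describe a flow that may be easier to follow in isolation: start, data, and end handlers walk the XML stream; a validator callback taking `($data, $tag)` decides whether a main element is kept; and a chunk is closed once `$chunkSize` main elements have accumulated. The following is a minimal sketch of that idea, not the Chunker implementation. The function name `chunkSketch` is hypothetical, it parses an in-memory string and returns arrays instead of writing files, it drops attributes, and it validates a tag's text at the closing tag rather than in the data handler.

```php
<?php
/**
 * Hypothetical helper (not part of Chunker): groups the validated main
 * elements of an XML string into chunks of at most $chunkSize items.
 */
function chunkSketch(string $xml, string $mainTag, int $chunkSize,
                     callable $validator, array $checkingTags = []): array
{
    // Shared parser state; closures receive the same object handle.
    $s = (object) [
        'chunks' => [], 'current' => [], 'item' => '',
        'inItem' => false, 'excluded' => false, 'text' => '',
    ];

    $parser = xml_parser_create('UTF-8');
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $start = function ($p, string $tag, array $attrs) use ($s, $mainTag): void {
        if ($tag === $mainTag) {          // a new main element begins
            $s->inItem = true;
            $s->item = '';
            $s->excluded = false;
        }
        if ($s->inItem) {
            $s->item .= "<{$tag}>";       // attributes are ignored in this sketch
            $s->text = '';
        }
    };

    $data = function ($p, string $data) use ($s): void {
        if ($s->inItem) {
            // The parser may deliver one text node in several pieces.
            $s->item .= htmlspecialchars($data, ENT_XML1);
            $s->text .= $data;
        }
    };

    $end = function ($p, string $tag) use ($s, $mainTag, $chunkSize, $validator, $checkingTags): void {
        if (!$s->inItem) {
            return;
        }
        $s->item .= "</{$tag}>";
        // Flag the whole main element if a checked tag fails validation.
        if (in_array($tag, $checkingTags, true) && !$validator($s->text, $tag)) {
            $s->excluded = true;
        }
        $s->text = '';
        if ($tag === $mainTag) {
            $s->inItem = false;
            if (!$s->excluded) {
                $s->current[] = $s->item;
            }
            if (count($s->current) >= $chunkSize) {   // chunk is full
                $s->chunks[] = $s->current;
                $s->current = [];
            }
        }
    };

    xml_set_element_handler($parser, $start, $end);
    xml_set_character_data_handler($parser, $data);

    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException((string) xml_error_string(xml_get_error_code($parser)));
    }
    if ($s->current !== []) {             // flush the last, partially filled chunk
        $s->chunks[] = $s->current;
    }
    return $s->chunks;
}

// Usage: keep only <item> elements whose <price> is numeric, two per chunk.
$sample = '<items><item><name>a</name><price>5</price></item>'
        . '<item><name>b</name><price>x</price></item>'
        . '<item><name>c</name><price>7</price></item></items>';
$result = chunkSketch($sample, 'item', 2,
    fn (string $data, string $tag): bool => is_numeric($data), ['price']);
```

Accumulating character data and validating at the closing tag is a deliberate choice in this sketch: PHP's XML extension may call the character data handler more than once for a single text node, so validating each piece separately could reject valid values.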