| 1 | +# Chunker - A lightweight, blazing fast XML splitter written in PHP |
| 2 | + |
| 3 | +The main goal of this library is to split a big XML file into multiple chunks of a predefined size. |
| 4 | + |
| 5 | +The algorithm was written using the XMLParser PHP library, which parses an XML file line by line (or tag by tag) without state control, rather than relying on string-to-string comparison or simple I/O operations. Because every tag passes through the parser, the library can validate tags each time they are parsed. |
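
To illustrate what tag-by-tag parsing looks like, here is a minimal sketch. It assumes the "XMLParser PHP library" mentioned above is PHP's built-in XML Parser extension (`xml_parser_create` and related functions). The sample XML, the variable names, and the handlers are illustrative only and are not taken from Chunker's source.

```php
<?php
// Minimal sketch of event-based parsing with PHP's XML Parser extension.
// Assumption: this is the parser the README refers to; nothing below is
// copied from Chunker itself.

$xml = '<root><item><weight_kg>12</weight_kg></item><item><weight_kg>3</weight_kg></item></root>';

$currentTag = null;   // name of the tag whose data is currently being read
$weights = [];        // text content collected from every <weight_kg> tag

$parser = xml_parser_create();
// Keep tag names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$currentTag) {
        $currentTag = $name;    // an opening tag was seen
    },
    function ($parser, $name) use (&$currentTag) {
        $currentTag = null;     // a closing tag was seen
    }
);

xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$currentTag, &$weights) {
        if ($currentTag === 'weight_kg') {
            $weights[] = $data; // a validator could be called at this point
        }
    }
);

xml_parse($parser, $xml, true); // true marks the final piece of input

print_r($weights);
```

The parser calls the handlers as it meets each tag, so a check on a tag's data can run at the moment the tag is read, without loading the whole document into memory.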
| 6 | + |
| 7 | +## Usage |
| 8 | +### Simple Chunking |
| 9 | +The implementation is object-oriented, so in order to split a file, an instance of Chunker has to be created first. |
| 10 | + |
| 11 | +An example of a simple Chunker instance without validation, with **a maximum of 100 main tags** per chunk, and with output file names of *"out-{CHUNK}.xml"*: |
| 12 | +``` |
| 13 | +$chunkSize = 100; |
| 14 | +$outputFilePrefix = "out-"; |
| 15 | +$xmlfile = "bigFile.xml"; |
| 16 | +$validationFunction = function ($data, $tag) { |
| 17 | +    return true; |
| 18 | +}; |
| 19 | +$checkingTags = array(); |
| 20 | + |
| 21 | +$chunker = new Chunker($xmlfile, $chunkSize, $outputFilePrefix, $validationFunction, $checkingTags); |
| 22 | +``` |
| 23 | + |
| 24 | + |
| 25 | +### Constructor variables |
| 26 | +The following table contains the parameters that can be (and should be) passed to the constructor. |
| 27 | + |
| 28 | +| Parameter | Type | Description | Default value | Is required | |
| 29 | +| --------- | ---- | ----------- | ------------- | ----------- | |
| 30 | +| $xmlfile | string | The big XML file to be chunked | empty string | Yes | |
| 31 | +| $chunkSize | int | The maximum number of main tags in a chunk | 100 | No | |
| 32 | +| $outputFilePrefix | string | The prefix that will be used as the filename for the output chunks. Pattern: **'{outputFilePrefix}{CHUNK-NUMBER}.xml'** | 'out-' | No | |
| 33 | +| $validationFunction | callable | The validator function that is called every time a tag listed in $checkingTags is found. If the tag data passes the validation, it is included in the chunks; otherwise it is left out. It has to receive **two parameters**: the first is the *data* inside the tag to be validated, and the second is the *tag* itself (both are strings). It has to **return a boolean**. | null | Yes | |
| 34 | +| $checkingTags | array | An array of tags whose data has to be validated using the $validationFunction callable. If we don't want any validation, we can pass an empty array to this parameter, or not specify it at all since it's not required. | empty array | No | |
| 35 | + |
| 36 | +If any of the required parameters is empty or not specified, a fatal error will be raised. |
| 37 | + |
| 38 | +### Launch the chunking! |
| 39 | + |
| 40 | +After you have created an instance of Chunker and set all the parameters, you can start the chunking process with the `Chunker::chunkXML` method. An example is shown below: |
| 41 | +``` |
| 42 | +// ... the instance is created in $chunker |
| 43 | +$chunker->chunkXML("item", "root"); |
| 44 | +``` |
| 45 | + |
| 46 | +This example creates XML chunks from the big file, each containing at most `$chunkSize` *main tags* (here called **"item"**). If validation is enabled, only the main tags that pass validation are included. In every file the main tags are enclosed in a single *root tag* (here called **"root"**), so every chunk file contains **one root tag** with up to `$chunkSize` **main tags inside** it. |
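
The arithmetic behind this can be sketched as follows. The helper `expectedChunkFiles` is hypothetical and not part of Chunker. It assumes that every chunk except the last one is filled completely and that chunk numbers start at 1; the README states neither, so treat the file names as illustrative.

```php
<?php
// Hypothetical helper, not part of Chunker: lists the file names one would
// expect for a given number of validated main tags.
// Assumptions: chunks are filled up to $chunkSize before a new one starts,
// and chunk numbers start at 1.
function expectedChunkFiles(int $mainTagCount, int $chunkSize = 100, string $prefix = 'out-'): array
{
    $chunks = (int) ceil($mainTagCount / $chunkSize);
    $files = [];
    for ($i = 1; $i <= $chunks; $i++) {
        $files[] = $prefix . $i . '.xml';
    }
    return $files;
}

// 250 main tags at 100 per chunk give three files; the last one holds 50 tags.
print_r(expectedChunkFiles(250));
```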
| 47 | + |
| 48 | +The method returns the logging session converted into a string (see below for more information). |
| 49 | + |
| 50 | +## Logging |
| 51 | + |
| 52 | +The class has a built-in logging feature. Every time `Chunker::chunkXML` is run, a new logging session is started. When the method finishes, it returns that logging session converted into a string: |
| 53 | +``` |
| 54 | +// ... |
| 55 | +$log = $chunker->chunkXML(....); |
| 56 | +echo $log; |
| 57 | + |
| 58 | +/* |
| 59 | + |
| 60 | +[timestamp] Starting new chunking... |
| 61 | +[timestamp] .. |
| 62 | +[timestamp] .. |
| 63 | +*/ |
| 64 | + |
| 65 | +``` |
| 66 | +This is helpful when something does not work as expected and has to be debugged step by step. **It is not necessary to capture the return value, so you can call the function as if it returned void.** |
| 67 | + |
| 68 | +## Examples |
| 69 | + |
| 70 | +### Basic validation |
| 71 | + |
| 72 | +Let's say you have an XML file (*"feed.xml"*) with a **Shop** root element and more than 10,000 **shopItem** elements inside it. You want to split it into files named *"feed-{chunk}.xml"*, each containing at most 1000 **shopItem**s. You also want to include only **shopItem**s that have a *weight_kg* tag whose value is greater than 10 (that is, heavier than 10 kg). A solution could look like the following: |
| 73 | + |
| 74 | +``` |
| 75 | +$chunkSize = 1000; |
| 76 | +$xmlfile = "feed.xml"; |
| 77 | +$outPrefix = "feed-"; |
| 78 | +$checkingTags = array("weight_kg"); |
| 79 | +function validation($data, $tag) { |
| 80 | + if($tag == "weight_kg"){ |
| 81 | +        if(!empty($data) && intval($data) > 10) return true; |
| 82 | + } |
| 83 | + return false; |
| 84 | +} |
| 85 | + |
| 86 | +$mainTag = "shopItem"; |
| 87 | +$rootTag = "Shop"; |
| 88 | + |
| 89 | +$chunker = new Chunker($xmlfile, $chunkSize, $outPrefix, "validation", $checkingTags); |
| 90 | +$chunker->chunkXML($mainTag, $rootTag); |
| 91 | +``` |
| 92 | + |
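
Before running a full chunking, it may help to try the validator on a few sample values. The sketch below defines a validator for the rule described in this example (a *weight_kg* value greater than 10) and calls it directly; the sample inputs are made up for illustration.

```php
<?php
// Validator for the rule in this example: accept only weight_kg values above 10.
function validation($data, $tag) {
    if ($tag == "weight_kg") {
        if (!empty($data) && intval($data) > 10) return true;
    }
    return false;
}

// Call the callback directly with sample values, without involving Chunker.
var_dump(validation("12", "weight_kg"));  // bool(true)
var_dump(validation("10", "weight_kg"));  // bool(false): 10 is not greater than 10
var_dump(validation("", "weight_kg"));    // bool(false): empty data
var_dump(validation("12", "price"));      // bool(false): other tags are rejected
```

Note that `intval` drops the fractional part, so a value such as "10.5" is rejected by this check. If the feed contains fractional weights, `floatval` would be an alternative.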