29 changes: 21 additions & 8 deletions README.md
@@ -597,6 +597,11 @@ root
|-- Record_Bytes: binary (nullable = false)
```

You can generate a `_corrupt_fields` column that contains the original binary values of fields Cobrix was unable to decode:
```scala
.option("generate_corrupt_fields", "true")
```
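
For example, a read that enables the option could look like the sketch below (paths are placeholders, and treating `_corrupt_fields` as an array column is an assumption about the generated schema):
```scala
import org.apache.spark.sql.functions.{col, size}

// Hypothetical paths; substitute your own copybook and data locations.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("generate_corrupt_fields", "true")
  .load("/path/to/data")

// Records that decoded cleanly have an empty `_corrupt_fields` value;
// for the rest, inspect which fields failed and their raw bytes.
df.filter(size(col("_corrupt_fields")) > 0)
  .select("_corrupt_fields")
  .show(truncate = false)
```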

### Locality optimization for variable-length records parsing

Variable-length records depend on headers to have their length calculated, which makes it hard to achieve parallelism while parsing.
@@ -1557,6 +1562,7 @@ The output looks like this:
| .option("non_terminals", "GROUP1,GROUP2") | Specifies groups to also be added to the schema as string fields. When this option is specified, the reader will add one extra data field after each matching group containing the string data for the group. |
| .option("generate_record_id", false) | Generate autoincremental 'File_Id', 'Record_Id' and 'Record_Byte_Length' fields. This is used for processing record order dependent data. |
| .option("generate_record_bytes", false) | Generate 'Record_Bytes', the binary field that contains raw contents of the original unparsed records. |
| .option("generate_corrupt_fields", false) | Generate `_corrupt_fields` field that contains values of fields Cobrix was unable to decode. |
| .option("with_input_file_name_col", "file_name") | Generates a column containing input file name for each record (Similar to Spark SQL `input_file_name()` function). The column name is specified by the value of the option. This option only works for variable record length files. For fixed record length and ASCII files use `input_file_name()`. |
| .option("metadata", "basic") | Specifies wat kind of metadata to include in the Spark schema: `false`, `basic`(default), or `extended` (PIC, usage, etc). |
| .option("debug", "hex") | If specified, each primitive field will be accompanied by a debug field containing raw bytes from the source file. Possible values: `none` (default), `hex`, `binary`, `string` (ASCII only). The legacy value `true` is supported and will generate debug fields in HEX. |
@@ -1945,19 +1951,26 @@ at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
A: Update `hadoop.dll` to version 3.2.2 or newer.

## Changelog
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate the `_corrupt_fields` field, which contains the names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupt_fields", "true")
```

Comment on lines 1954 to 1960

⚠️ Potential issue | 🟡 Minor

Fix heading level increment (MD001).
The changelog jumps from ## to ####. Use ### for the 2.9.8 entry.

📝 Suggested fix
```diff
-- #### 2.9.8 will be released soon.
+- ### 2.9.8 will be released soon.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupted_fields", "true")
```
- ### 2.9.8 will be released soon.
- [`#723`](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 1954-1954: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

🤖 Prompt for AI Agents
In `@README.md` around lines 1954 - 1960, The changelog heading for the 2.9.8
entry uses #### which skips a level (MD001); update the heading text "2.9.8 will
be released soon." to use a triple-hash heading (###) instead of #### so it
correctly follows the prior `##` level; ensure the surrounding block (the bullet
about [`#723`] and the code example `.option("generate_corrupted_fields",
"true")`) remains unchanged and properly nested under the new `###` heading.

- #### 2.9.7 released 29 January 2026.
- [#816](https://github.com/AbsaOSS/cobrix/pull/816) Fixed the reliance on log4j libraries in the classpath. Cobrix can now be run on clusters that do not use Log4j for logging.

- #### 2.9.6 released 7 January 2026.
- [#813](https://github.com/AbsaOSS/cobrix/pull/813) Fixed compatibility of the relaxed sign overpunching. Allow numbers
with an overpunched sign in unsigned numbers, and allow multiple digits with an overpunched sign, when `strict_sign_overpunching = true`.

- #### 2.9.5 released 22 December 2025.
- [#809](https://github.com/AbsaOSS/cobrix/pull/809) Add support for reading compressed EBCDIC files. All compression formats
supported by Hadoop (.gz, .bz2, etc.) are also supported by Cobrix because Cobrix uses Hadoop compressed streams for
reading such files.
- [#811](https://github.com/AbsaOSS/cobrix/pull/811) Add the read properties hash code as an index key to avoid false cache hits.
This makes index caching safe to use by default, so index caching is now turned on by default.
- #### 2.9.4 released 26 November 2025.
- [#805](https://github.com/AbsaOSS/cobrix/pull/805) Added the option to cache VRL indexes for better performance when the same files are processed multiple times.
```scala
@@ -16,7 +16,6 @@

package za.co.absa.cobrix.cobol.parser.antlr

import java.nio.charset.Charset
import org.antlr.v4.runtime.{ParserRuleContext, RuleContext}
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.parser.CopybookParser.CopybookAST
@@ -25,13 +24,14 @@ import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive}
import za.co.absa.cobrix.cobol.parser.common.Constants
import za.co.absa.cobrix.cobol.parser.decoders.DecoderSelector
import za.co.absa.cobrix.cobol.parser.decoders.FloatingPointFormat.FloatingPointFormat
import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePage
import za.co.absa.cobrix.cobol.parser.encoding._
import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePage
import za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException
import za.co.absa.cobrix.cobol.parser.policies.CommentPolicy
import za.co.absa.cobrix.cobol.parser.policies.StringTrimmingPolicy.StringTrimmingPolicy
import za.co.absa.cobrix.cobol.parser.position.{Left, Position, Right}

import java.nio.charset.Charset
import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.util.matching.Regex
@@ -168,7 +168,7 @@ class ParserVisitor(enc: Encoding,
int.copy(compact=usage)
case x: AlphaNumeric if usageVal == COMP3U() =>
Integral(x.pic, x.length*2, None, false, None, Some(COMP3U()), None, x.originalPic)
case x: AlphaNumeric if usageVal == COMP1() || usageVal == COMP4() =>
case x: AlphaNumeric if usageVal == COMP4() || usageVal == COMP9() =>
val enc = if (decodeBinaryAsHex) HEX else RAW
x.copy(compact=usage, enc=Some(enc))
case x: AlphaNumeric =>
@@ -16,9 +16,9 @@

package za.co.absa.cobrix.cobol.parser.ast

import za.co.absa.cobrix.cobol.parser.ast.datatype.{AlphaNumeric, CobolType, Decimal, Integral}
import za.co.absa.cobrix.cobol.parser.ast.datatype.{AlphaNumeric, COMP3, CobolType, Decimal, Integral}
import za.co.absa.cobrix.cobol.parser.decoders.{BinaryUtils, DecoderSelector}
import za.co.absa.cobrix.cobol.parser.encoding.EncoderSelector
import za.co.absa.cobrix.cobol.parser.encoding.{ASCII, EBCDIC, EncoderSelector}

/** An abstraction of the statements describing fields of primitive data types in the COBOL copybook
*
@@ -63,6 +63,44 @@ case class Primitive(
/** This is cached value specifying if the field is a string */
private val isString = dataType.isInstanceOf[AlphaNumeric]

/** This is a cached value to speed up checking for empty values */
private val spaceChar: Byte = {
dataType match {
case t: AlphaNumeric =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
case t: Integral =>
t.compact match {
case Some(COMP3()) => 0x40
case Some(_) => 0
case None =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
}
case t: Decimal =>
t.compact match {
case Some(COMP3()) => 0x40
case Some(_) => 0
case None =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
}
case _ => 0
}
Comment on lines +66 to +101

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

The review ran the following verification scripts against the AbsaOSS/cobrix repository (script outputs collapsed):

```shell
# First, let's understand how spaceChar is used
rg -A 5 -B 5 "spaceChar" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Search for usages of spaceChar throughout the codebase
rg "spaceChar" --type scala -A 3 -B 3

# Look for COMP-3 handling in reader/extractor code
rg "COMP3|0x40" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 -B 2

# Check for empty/null value detection in readers
rg "isEmpty|isNull|empty.*value" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 | head -50

# Look for test cases related to COMP-3 empty values
fd ".*Suite\.scala$" cobol-parser/src/test --exec rg -l "COMP3|comp3" {} \;

# Find test files mentioning COMP3 or packed decimal
find cobol-parser/src/test -name "*.scala" -exec grep -l "COMP3\|comp3\|packed" {} \;

# Check for corrupted field handling related to COMP-3
rg "corrupted|Corrupt" cobol-parser/src/main/scala --type scala -B 2 -A 2 | grep -A 5 -B 5 "COMP3"

# Look for isEmpty method implementation and all its usages
rg "def isEmpty" cobol-parser/src/main/scala --type scala -A 10

# Search for any configuration or documentation about null encoding
rg "null.*encod|encod.*null|0x00|zero.*byte" cobol-parser/src/main/scala --type scala | head -20

# Let's look at the BinaryDecoderSpec test
cat cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala | head -100

# Check isEmpty usage in tests
rg "isEmpty" cobol-parser/src/test --type scala -B 3 -A 3

# Look for COMP3 in the test files we found
rg "COMP3|comp3" cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala -B 5 -A 5

# Let's check the full isEmpty method
sed -n '102,130p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Look for the complete Primitive.scala spaceChar definition with context
sed -n '60,110p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Search for any test that calls isEmpty or checks empty values with COMP-3
rg "isEmpty|empty" cobol-parser/src/test --type scala | grep -i "comp3\|packed"

# Check for usage of isEmpty method in the codebase
rg "\.isEmpty\(" cobol-parser/src/main --type scala -B 2 | head -40
```


Change COMP‑3 empty byte from 0x40 to 0x00 to properly detect corruption.

COMP-3 is a packed-decimal format in which 0x40 is not meaningful data. Using 0x40 as the empty indicator causes fields consisting entirely of invalid 0x40 bytes to be incorrectly classified as empty and excluded from corruption reporting. This contradicts the design principle that null values are encoded as 0x00 for all field types. Changing this to 0x00 aligns with the null encoding scheme and ensures corrupted COMP-3 data is properly detected.

Suggested fix
```diff
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
```
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`
around lines 66 - 101, The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.
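
To make the reviewer's concern concrete, here is a small standalone sketch (hypothetical values, not code from this PR) of how the `spaceChar` comparison behaves on an all-0x40 packed field:
```scala
// Hypothetical: a 3-byte COMP-3 field filled with 0x40 is not valid packed decimal.
val record: Array[Byte] = Array.fill(3)(0x40.toByte)

// With the current COMP3() branch (spaceChar = 0x40), every byte matches the
// emptiness test, so the field is classified as empty and skipped by corruption reporting.
val currentSpaceChar: Byte = 0x40
println(record.forall(b => b == 0 || b == currentSpaceChar)) // true -- corruption is masked

// With spaceChar = 0x00 for COMP-3, the same bytes fail the emptiness test,
// the field is decoded, and the invalid packed value can be reported.
val proposedSpaceChar: Byte = 0x00
println(record.forall(b => b == 0 || b == proposedSpaceChar)) // false
```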

}

/** Returns a string representation of the field */
override def toString: String = {
s"${" " * 2 * level}$camelCased ${camelCase(redefines.getOrElse(""))} $dataType"
@@ -90,7 +128,6 @@ case class Primitive(
copy(dependingOnHandlers = newDependingOnHandlers)(parent)
}


/** Returns the binary size in bits for the field */
def getBinarySizeBytes: Int = {
dataType match {
@@ -110,6 +147,63 @@
* @param record A record in a binary format represented as a vector of bits
*/
def decodeTypeValue(itOffset: Int, record: Array[Byte]): Any = {
val bytes = getRawValue(itOffset, record)

if (bytes == null) null else decode(bytes)
}

/**
* Checks if a value extracted from a given binary record at a specified offset is considered empty.
* A value is considered empty if it contains only null bytes or bytes equal to a space character
* of the underlying encoding (e.g., 0x40 for EBCDIC, 0x20 for ASCII, 0x00 for binary).
*
* @param itOffset The offset within the binary record where the value starts.
* @param record The binary record represented as an array of bytes.
* @return `true` if the value is empty, otherwise `false`.
*/
def isEmpty(itOffset: Int, record: Array[Byte]): Boolean = {
val bytesCount = binaryProperties.dataSize
val idx = itOffset

if (isString) {
// The length of a string can be smaller for varchar fields at the end of a record
if (idx > record.length) {
return true
}
} else {
// Non-string field size should exactly fit the required bytes
if (idx + bytesCount > record.length) {
return true
}
}

// Determine the actual number of bytes to check based on the record size.
// Varchar fields can be trimmed by the record size.
val endIndex = if (idx + bytesCount > record.length) {
record.length
} else {
idx + bytesCount
}
var i = idx
while (i < endIndex) {
if (record(i) != 0 && record(i) != spaceChar) {
return false
}
i += 1
}
true
}

/**
* Extracts a raw byte array representation of a value from a binary record
* based on the specified offset and the field's properties.
*
* @param itOffset The offset within the binary record where the value starts.
* @param record The binary record represented as an array of bytes.
* @return An array of bytes representing the value, or `null` if the offset
* or size is invalid for the given binary record.
*/
def getRawValue(itOffset: Int, record: Array[Byte]): Array[Byte] = {
val bytesCount = binaryProperties.dataSize
val idx = itOffset

@@ -119,7 +213,7 @@ case class Primitive(
return null
}
} else {
// Non-string field size should exactly fix the required bytes
// Non-string field size should exactly fit the required bytes
if (idx + bytesCount > record.length) {
return null
}
@@ -132,9 +226,7 @@ case class Primitive(
} else {
bytesCount
}
val bytes = java.util.Arrays.copyOfRange(record, idx, idx + bytesToCopy)

decode(bytes)
java.util.Arrays.copyOfRange(record, idx, idx + bytesToCopy)
}

}
@@ -19,8 +19,8 @@ package za.co.absa.cobrix.cobol.parser.ast.datatype

sealed trait Usage

//case class COMP() extends Usage
//case class COMP0() extends Usage
//case class COMP() extends Usage // Use COMP4()
//case class COMP0() extends Usage // Use COMP4()
case class COMP1() extends Usage {
override def toString = "COMP-1"
}
@@ -44,8 +44,8 @@ case class COMP5() extends Usage {
case class COMP9() extends Usage { // artificial little-endian binary
override def toString = "COMP-9"
}
//case class DISPLAY() extends Usage {
//case class DISPLAY() extends Usage { // Use None for the USAGE instead
// override def toString = "DISPLAY"
//}
//case class BINARY() extends Usage
//case class BINARY() extends Usage // Use COMP4()

@@ -64,6 +64,9 @@ object Constants {
val recordIdField = "Record_Id"
val recordByteLength = "Record_Byte_Length"
val recordBytes = "Record_Bytes"
val corruptFieldsField = "_corrupt_fields"
val fieldNameColumn = "field_name"
val rawValueColumn = "raw_value"

// Non-terminals
val nonTerminalsPostfix = "_NT"
@@ -0,0 +1,22 @@
/*
* Copyright 2018 ABSA Group Limited
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package za.co.absa.cobrix.cobol.reader.extractors.record

case class CorruptField(
fieldName: String,
rawValue: Array[Byte]
)
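
Taken together with the new constants (`_corrupt_fields`, `field_name`, `raw_value`), one plausible way the record extractor could surface `CorruptField` values in Spark is as an array of structs. This is a sketch under assumptions; the actual mapping performed by the extractor is not shown in this diff.
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, BinaryType, StringType, StructField, StructType}

// Hypothetical element schema for the `_corrupt_fields` array column.
val corruptFieldStruct = StructType(Seq(
  StructField("field_name", StringType, nullable = false),
  StructField("raw_value", BinaryType, nullable = true)
))
val corruptFieldsColumnType = ArrayType(corruptFieldStruct, containsNull = false)

// One Row per field that could not be decoded.
def toRow(corrupt: CorruptField): Row = Row(corrupt.fieldName, corrupt.rawValue)
```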