29 changes: 21 additions & 8 deletions README.md
@@ -597,6 +597,11 @@ root
|-- Record_Bytes: binary (nullable = false)
```

You can generate a `_corrupt_fields` column that contains the original binary values of fields Cobrix was unable to decode:
```scala
.option("generate_corrupt_fields", "true")
```
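
For example, a read that enables the option could look like the sketch below (paths are placeholders, and treating `_corrupt_fields` as an array column is an assumption about the generated schema):
```scala
import org.apache.spark.sql.functions.{col, size}

// Hypothetical paths; substitute your own copybook and data locations.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("generate_corrupt_fields", "true")
  .load("/path/to/data")

// Records that decoded cleanly have an empty `_corrupt_fields` value;
// for the rest, inspect which fields failed and their raw bytes.
df.filter(size(col("_corrupt_fields")) > 0)
  .select("_corrupt_fields")
  .show(truncate = false)
```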

### Locality optimization for variable-length records parsing

Variable-length records depend on headers to have their length calculated, which makes it hard to achieve parallelism while parsing.
@@ -1557,6 +1562,7 @@ The output looks like this:
| .option("non_terminals", "GROUP1,GROUP2") | Specifies groups to also be added to the schema as string fields. When this option is specified, the reader will add one extra data field after each matching group containing the string data for the group. |
| .option("generate_record_id", false) | Generate autoincremental 'File_Id', 'Record_Id' and 'Record_Byte_Length' fields. This is used for processing record order dependent data. |
| .option("generate_record_bytes", false) | Generate 'Record_Bytes', the binary field that contains raw contents of the original unparsed records. |
| .option("generate_corrupt_fields", false) | Generate `_corrupt_fields` field that contains values of fields Cobrix was unable to decode. |
| .option("with_input_file_name_col", "file_name") | Generates a column containing input file name for each record (Similar to Spark SQL `input_file_name()` function). The column name is specified by the value of the option. This option only works for variable record length files. For fixed record length and ASCII files use `input_file_name()`. |
| .option("metadata", "basic") | Specifies wat kind of metadata to include in the Spark schema: `false`, `basic`(default), or `extended` (PIC, usage, etc). |
| .option("debug", "hex") | If specified, each primitive field will be accompanied by a debug field containing raw bytes from the source file. Possible values: `none` (default), `hex`, `binary`, `string` (ASCII only). The legacy value `true` is supported and will generate debug fields in HEX. |
@@ -1945,19 +1951,26 @@ at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
A: Update `hadoop.dll` to version 3.2.2 or newer.

## Changelog
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate the `_corrupt_fields` field, which contains the names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupt_fields", "true")
```

Comment on lines 1954 to 1960

⚠️ Potential issue | 🟡 Minor

Fix heading level increment (MD001).
The changelog jumps from ## to ####. Use ### for the 2.9.8 entry.

📝 Suggested fix
```diff
-- #### 2.9.8 will be released soon.
+- ### 2.9.8 will be released soon.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupted_fields", "true")
```
- ### 2.9.8 will be released soon.
- [`#723`](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 1954-1954: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

🤖 Prompt for AI Agents
In `@README.md` around lines 1954 - 1960, The changelog heading for the 2.9.8
entry uses #### which skips a level (MD001); update the heading text "2.9.8 will
be released soon." to use a triple-hash heading (###) instead of #### so it
correctly follows the prior `##` level; ensure the surrounding block (the bullet
about [`#723`] and the code example `.option("generate_corrupted_fields",
"true")`) remains unchanged and properly nested under the new `###` heading.

- #### 2.9.7 released 29 January 2026.
- [#816](https://github.com/AbsaOSS/cobrix/pull/816) Fixed the reliance on log4j libraries in the classpath. Cobrix can now be run on clusters that do not use Log4j for logging.

- #### 2.9.6 released 7 January 2026.
- [#813](https://github.com/AbsaOSS/cobrix/pull/813) Fixed compatibility of the relaxed sign overpunching. Allow numbers
with an overpunched sign in unsigned numbers, and allow multiple digits with an overpunched sign, when `strict_sign_overpunching = true`.

- #### 2.9.5 released 22 December 2025.
- [#809](https://github.com/AbsaOSS/cobrix/pull/809) Add support for reading compressed EBCDIC files. All compression formats
supported by Hadoop (.gz, .bz2, etc.) are also supported by Cobrix because Cobrix uses Hadoop compressed streams for
reading such files.
- [#811](https://github.com/AbsaOSS/cobrix/pull/811) Add the read properties hash code as an index key to avoid false cache hits.
This makes index caching safe to use by default, so index caching is now turned on by default.
- #### 2.9.4 released 26 November 2025.
- [#805](https://github.com/AbsaOSS/cobrix/pull/805) Added the option to cache VRL indexes for better performance when the same files are processed multiple times.
```scala
@@ -16,7 +16,6 @@

package za.co.absa.cobrix.cobol.parser.antlr

import java.nio.charset.Charset
import org.antlr.v4.runtime.{ParserRuleContext, RuleContext}
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.parser.CopybookParser.CopybookAST
@@ -25,13 +24,14 @@ import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive}
import za.co.absa.cobrix.cobol.parser.common.Constants
import za.co.absa.cobrix.cobol.parser.decoders.DecoderSelector
import za.co.absa.cobrix.cobol.parser.decoders.FloatingPointFormat.FloatingPointFormat
import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePage
import za.co.absa.cobrix.cobol.parser.encoding._
import za.co.absa.cobrix.cobol.parser.encoding.codepage.CodePage
import za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException
import za.co.absa.cobrix.cobol.parser.policies.CommentPolicy
import za.co.absa.cobrix.cobol.parser.policies.StringTrimmingPolicy.StringTrimmingPolicy
import za.co.absa.cobrix.cobol.parser.position.{Left, Position, Right}

import java.nio.charset.Charset
import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.util.matching.Regex
@@ -168,7 +168,7 @@ class ParserVisitor(enc: Encoding,
int.copy(compact=usage)
case x: AlphaNumeric if usageVal == COMP3U() =>
Integral(x.pic, x.length*2, None, false, None, Some(COMP3U()), None, x.originalPic)
case x: AlphaNumeric if usageVal == COMP1() || usageVal == COMP4() =>
case x: AlphaNumeric if usageVal == COMP4() || usageVal == COMP9() =>
val enc = if (decodeBinaryAsHex) HEX else RAW
x.copy(compact=usage, enc=Some(enc))
case x: AlphaNumeric =>
@@ -16,9 +16,9 @@

package za.co.absa.cobrix.cobol.parser.ast

import za.co.absa.cobrix.cobol.parser.ast.datatype.{AlphaNumeric, CobolType, Decimal, Integral}
import za.co.absa.cobrix.cobol.parser.ast.datatype.{AlphaNumeric, COMP3, CobolType, Decimal, Integral}
import za.co.absa.cobrix.cobol.parser.decoders.{BinaryUtils, DecoderSelector}
import za.co.absa.cobrix.cobol.parser.encoding.EncoderSelector
import za.co.absa.cobrix.cobol.parser.encoding.{ASCII, EBCDIC, EncoderSelector}

/** An abstraction of the statements describing fields of primitive data types in the COBOL copybook
*
@@ -63,6 +63,44 @@ case class Primitive(
/** This is cached value specifying if the field is a string */
private val isString = dataType.isInstanceOf[AlphaNumeric]

/** This is a cached value to speed up checking for empty values */
private val spaceChar: Byte = {
dataType match {
case t: AlphaNumeric =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
case t: Integral =>
t.compact match {
case Some(COMP3()) => 0x40
case Some(_) => 0
case None =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
}
case t: Decimal =>
t.compact match {
case Some(COMP3()) => 0x40
case Some(_) => 0
case None =>
t.enc match {
case Some(EBCDIC) => 0x40
case Some(ASCII) => 0x20
case Some(_) => 0
case None => 0x40
}
}
case _ => 0
}
Comment on lines +66 to +101

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

The review ran the following verification scripts against the AbsaOSS/cobrix repository (script outputs collapsed):

```shell
# First, let's understand how spaceChar is used
rg -A 5 -B 5 "spaceChar" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Search for usages of spaceChar throughout the codebase
rg "spaceChar" --type scala -A 3 -B 3

# Look for COMP-3 handling in reader/extractor code
rg "COMP3|0x40" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 -B 2

# Check for empty/null value detection in readers
rg "isEmpty|isNull|empty.*value" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 | head -50

# Look for test cases related to COMP-3 empty values
fd ".*Suite\.scala$" cobol-parser/src/test --exec rg -l "COMP3|comp3" {} \;

# Find test files mentioning COMP3 or packed decimal
find cobol-parser/src/test -name "*.scala" -exec grep -l "COMP3\|comp3\|packed" {} \;

# Check for corrupted field handling related to COMP-3
rg "corrupted|Corrupt" cobol-parser/src/main/scala --type scala -B 2 -A 2 | grep -A 5 -B 5 "COMP3"

# Look for isEmpty method implementation and all its usages
rg "def isEmpty" cobol-parser/src/main/scala --type scala -A 10

# Search for any configuration or documentation about null encoding
rg "null.*encod|encod.*null|0x00|zero.*byte" cobol-parser/src/main/scala --type scala | head -20

# Let's look at the BinaryDecoderSpec test
cat cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala | head -100

# Check isEmpty usage in tests
rg "isEmpty" cobol-parser/src/test --type scala -B 3 -A 3

# Look for COMP3 in the test files we found
rg "COMP3|comp3" cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala -B 5 -A 5

# Let's check the full isEmpty method
sed -n '102,130p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Look for the complete Primitive.scala spaceChar definition with context
sed -n '60,110p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

# Search for any test that calls isEmpty or checks empty values with COMP-3
rg "isEmpty|empty" cobol-parser/src/test --type scala | grep -i "comp3\|packed"

# Check for usage of isEmpty method in the codebase
rg "\.isEmpty\(" cobol-parser/src/main --type scala -B 2 | head -40
```


Change COMP‑3 empty byte from 0x40 to 0x00 to properly detect corruption.

COMP-3 is a packed-decimal format in which 0x40 is not meaningful data. Using 0x40 as the empty indicator causes fields consisting entirely of invalid 0x40 bytes to be incorrectly classified as empty and excluded from corruption reporting. This contradicts the design principle that null values are encoded as 0x00 for all field types. Changing this to 0x00 aligns with the null encoding scheme and ensures corrupted COMP-3 data is properly detected.

Suggested fix
```diff
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
```
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`
around lines 66 - 101, The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.
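
To make the reviewer's concern concrete, here is a small standalone sketch (hypothetical values, not code from this PR) of how the `spaceChar` comparison behaves on an all-0x40 packed field:
```scala
// Hypothetical: a 3-byte COMP-3 field filled with 0x40 is not valid packed decimal.
val record: Array[Byte] = Array.fill(3)(0x40.toByte)

// With the current COMP3() branch (spaceChar = 0x40), every byte matches the
// emptiness test, so the field is classified as empty and skipped by corruption reporting.
val currentSpaceChar: Byte = 0x40
println(record.forall(b => b == 0 || b == currentSpaceChar)) // true -- corruption is masked

// With spaceChar = 0x00 for COMP-3, the same bytes fail the emptiness test,
// the field is decoded, and the invalid packed value can be reported.
val proposedSpaceChar: Byte = 0x00
println(record.forall(b => b == 0 || b == proposedSpaceChar)) // false
```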

}

/** Returns a string representation of the field */
override def toString: String = {
s"${" " * 2 * level}$camelCased ${camelCase(redefines.getOrElse(""))} $dataType"
@@ -90,7 +128,6 @@ case class Primitive(
copy(dependingOnHandlers = newDependingOnHandlers)(parent)
}


/** Returns the binary size in bits for the field */
def getBinarySizeBytes: Int = {
dataType match {
@@ -110,6 +147,63 @@
* @param record A record in a binary format represented as a vector of bits
*/
def decodeTypeValue(itOffset: Int, record: Array[Byte]): Any = {
val bytes = getRawValue(itOffset, record)

if (bytes == null) null else decode(bytes)
}

/**
* Checks if a value extracted from a given binary record at a specified offset is considered empty.
* A value is considered empty if it contains only null bytes or bytes equal to a space character
* of the underlying encoding (e.g., 0x40 for EBCDIC, 0x20 for ASCII, 0x00 for binary).
*
* @param itOffset The offset within the binary record where the value starts.
* @param record The binary record represented as an array of bytes.
* @return `true` if the value is empty, otherwise `false`.
*/
def isEmpty(itOffset: Int, record: Array[Byte]): Boolean = {
val bytesCount = binaryProperties.dataSize
val idx = itOffset

if (isString) {
// The length of a string can be smaller for varchar fields at the end of a record
if (idx > record.length) {
return true
}
} else {
// Non-string field size should exactly fit the required bytes
if (idx + bytesCount > record.length) {
return true
}
}

// Determine the actual number of bytes to check based on the record size.
// Varchar fields can be trimmed by the record size.
val endIndex = if (idx + bytesCount > record.length) {
record.length
} else {
idx + bytesCount
}
var i = idx
while (i < endIndex) {
if (record(i) != 0 && record(i) != spaceChar) {
return false
}
i += 1
}
true
}

/**
* Extracts a raw byte array representation of a value from a binary record
* based on the specified offset and the field's properties.
*
* @param itOffset The offset within the binary record where the value starts.
* @param record The binary record represented as an array of bytes.
* @return An array of bytes representing the value, or `null` if the offset
* or size is invalid for the given binary record.
*/
def getRawValue(itOffset: Int, record: Array[Byte]): Array[Byte] = {
val bytesCount = binaryProperties.dataSize
val idx = itOffset

@@ -119,7 +213,7 @@ case class Primitive(
return null
}
} else {
// Non-string field size should exactly fix the required bytes
// Non-string field size should exactly fit the required bytes
if (idx + bytesCount > record.length) {
return null
}
@@ -132,9 +226,7 @@ case class Primitive(
} else {
bytesCount
}
val bytes = java.util.Arrays.copyOfRange(record, idx, idx + bytesToCopy)

decode(bytes)
java.util.Arrays.copyOfRange(record, idx, idx + bytesToCopy)
}

}
@@ -19,8 +19,8 @@ package za.co.absa.cobrix.cobol.parser.ast.datatype

sealed trait Usage

//case class COMP() extends Usage
//case class COMP0() extends Usage
//case class COMP() extends Usage // Use COMP4()
//case class COMP0() extends Usage // Use COMP4()
case class COMP1() extends Usage {
override def toString = "COMP-1"
}
@@ -44,8 +44,8 @@ case class COMP5() extends Usage {
case class COMP9() extends Usage { // artificial little-endian binary
override def toString = "COMP-9"
}
//case class DISPLAY() extends Usage {
//case class DISPLAY() extends Usage { // Use None for the USAGE instead
// override def toString = "DISPLAY"
//}
//case class BINARY() extends Usage
//case class BINARY() extends Usage // Use COMP4()

@@ -64,6 +64,9 @@ object Constants {
val recordIdField = "Record_Id"
val recordByteLength = "Record_Byte_Length"
val recordBytes = "Record_Bytes"
val corruptFieldsField = "_corrupt_fields"
val fieldNameColumn = "field_name"
val rawValueColumn = "raw_value"

// Non-terminals
val nonTerminalsPostfix = "_NT"
@@ -0,0 +1,22 @@
/*
* Copyright 2018 ABSA Group Limited
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package za.co.absa.cobrix.cobol.reader.extractors.record

case class CorruptField(
fieldName: String,
rawValue: Array[Byte]
)
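
Taken together with the new constants (`_corrupt_fields`, `field_name`, `raw_value`), one plausible way the record extractor could surface `CorruptField` values in Spark is as an array of structs. This is a sketch under assumptions; the actual mapping performed by the extractor is not shown in this diff.
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, BinaryType, StringType, StructField, StructType}

// Hypothetical element schema for the `_corrupt_fields` array column.
val corruptFieldStruct = StructType(Seq(
  StructField("field_name", StringType, nullable = false),
  StructField("raw_value", BinaryType, nullable = true)
))
val corruptFieldsColumnType = ArrayType(corruptFieldStruct, containsNull = false)

// One Row per field that could not be decoded.
def toRow(corrupt: CorruptField): Row = Row(corrupt.fieldName, corrupt.rawValue)
```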