fix: coerce None text to empty string in Text element #4231

Open
themavik wants to merge 1 commit into Unstructured-IO:main from themavik:fix-none-text-image-element

Summary

Fixes #4084

When yolox detects a Picture element with bounding-box coordinates but no OCR text, the text attribute is set to None. When this element is normalized to an Image (which extends Text), the None value is preserved. Text.__str__ then returns None (the Python object, not the string "None"), so str(element) raises a TypeError. This violates the type contract (text: str) and causes errors in downstream processing.
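
The failure mode can be reproduced without the library. The following is a minimal sketch, assuming that Text.__str__ simply returns self.text; TextStandIn is a hypothetical stand-in written for illustration, not the library's class.

```python
class TextStandIn:
    """Hypothetical stand-in for Text before the fix; not the library class."""

    def __init__(self, text):
        # No coercion: None is stored as-is despite the `text: str` contract.
        self.text = text

    def __str__(self):
        # Assumption: mirrors the library by returning the stored text unchanged.
        return self.text


element = TextStandIn(None)

try:
    str(element)
    outcome = "no error"
except TypeError as exc:
    # Python rejects a non-str return value from __str__.
    outcome = f"TypeError: {exc}"

print(outcome)
```

Code that reads element.text directly sees None instead, which is where calls such as element.text.strip() would fail.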

Changes

In Text.__init__, coerce None to the empty string:

# Before
self.text: str = text

# After
self.text: str = text if text is not None else ""

This mirrors the existing defensive pattern in Element.__init__, which sets self.text = self.text if hasattr(self, "text") else "". It also ensures that __str__ always returns a valid string.
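
The effect of the change can be sketched in isolation. TextSketch and ImageSketch below are hypothetical, simplified stand-ins (the real classes take more arguments and carry metadata); only the coercion line is taken from this PR.

```python
from typing import Optional


class TextSketch:
    """Hypothetical, simplified stand-in for Text with the fix applied."""

    def __init__(self, text: Optional[str]) -> None:
        # The coercion from this PR: None becomes "", any other value is kept.
        self.text: str = text if text is not None else ""

    def __str__(self) -> str:
        return self.text


class ImageSketch(TextSketch):
    """Hypothetical stand-in for Image, which extends Text."""


no_text = ImageSketch(text=None)       # e.g. a Picture with no OCR text
with_text = ImageSketch(text="Fig. 1")

# Both attribute access and str() now yield a real string.
assert no_text.text == ""
assert str(no_text) == ""
assert str(with_text) == "Fig. 1"
```

An explicit `is not None` check leaves every other value untouched, so only the missing-text case changes behavior.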

Test plan

  • Create an Image element with text=None and verify that element.text and str(element) are both ""
  • Run partition_image with the yolox model on images that produce Picture annotations without text
  • Verify all existing tests pass

When a Picture element detected by yolox has only bounding-box
coordinates but no OCR text, the `text` attribute is set to `None`.
Normalizing this element to an `Image` (which extends `Text`) preserves
the `None` value.  `Text.__str__` then returns `None` instead of a
string, so `str(element)` raises a `TypeError`; this violates the type
contract (`text: str`) and causes downstream errors.

Coerce `None` to the empty string in `Text.__init__` so that
`__str__` always returns a valid string.

Fixes Unstructured-IO#4084

Development

Successfully merging this pull request may close these issues.

bug/None text attribute when normalizing Picture to Image element