OCR_AGENT environment variable is ignored - API always uses Tesseract
Description
The OCR_AGENT environment variable is not respected by the API. Even when OCR_AGENT is set to use PaddleOCR (or any other OCR agent), the API always uses Tesseract because the ocr_agent parameter is never passed to the partition() function.
Steps to Reproduce
- Build a container with PaddleOCR installed:
FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest
RUN pip install "paddlepaddle>=3.0.0b1" "unstructured.paddleocr==2.10.0"
ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle
- Start the container and send a PDF for processing.
- Observe that Tesseract is used instead of PaddleOCR:
$ docker exec unstructured ps aux | grep tesseract
tesseract /tmp/tess_xxx_input.PNG /tmp/tess_xxx -l eng -c tessedit_create_hocr=1
Expected Behavior
When the OCR_AGENT environment variable is set, the API should use that OCR agent for processing.
Actual Behavior
The API ignores the OCR_AGENT environment variable and always uses Tesseract.
Root Cause
In prepline_general/api/general.py, the partition_kwargs dictionary does not include ocr_agent. The partition() function is called without this parameter, so it defaults to OCR_AGENT_TESSERACT.
The env_config.OCR_AGENT property correctly reads the environment variable, but its value is never passed to partition().
Current code (around lines 580-600 in general.py):
partition_kwargs = {
    "strategy": strategy,
    "xml_keep_tags": xml_keep_tags,
    "languages": languages,
    # ... other params ...
    # NOTE: ocr_agent is missing!
}
elements = partition(**partition_kwargs)
Proposed Fix
Add ocr_agent to partition_kwargs:
from unstructured.partition.utils.config import env_config

partition_kwargs = {
    "strategy": strategy,
    "ocr_agent": env_config.OCR_AGENT,  # Add this line
    "xml_keep_tags": xml_keep_tags,
    # ... rest of params ...
}
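The effect of the missing key can be illustrated without installing unstructured. The following is a minimal sketch, not the API's actual code: fake_partition and build_partition_kwargs are hypothetical stand-ins, and the OCR_AGENT_TESSERACT value here is a placeholder string rather than the library's real constant.

```python
import os
from typing import Any, Mapping

# Placeholder for the library default named in this issue (assumption:
# the real constant is defined inside unstructured and has a different value).
OCR_AGENT_TESSERACT = "OCR_AGENT_TESSERACT"


def fake_partition(strategy: str, ocr_agent: str = OCR_AGENT_TESSERACT, **_: Any) -> str:
    """Hypothetical stand-in for partition(): returns the OCR agent it would use."""
    return ocr_agent


def build_partition_kwargs(strategy: str, env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Build kwargs as in the proposed fix, resolving ocr_agent from the environment."""
    return {
        "strategy": strategy,
        "ocr_agent": env.get("OCR_AGENT", OCR_AGENT_TESSERACT),
    }


paddle = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
env = {"OCR_AGENT": paddle}

# Without the fix: ocr_agent is never passed, so the default is used
# regardless of what the environment says.
before = fake_partition(strategy="hi_res")

# With the fix: the configured agent is part of the kwargs and reaches the call.
after = fake_partition(**build_partition_kwargs("hi_res", env))

print("before:", before)
print("after: ", after)
```

In the real fix the lookup goes through env_config.OCR_AGENT rather than reading the environment directly; the sketch only shows why the value must be included in partition_kwargs to have any effect.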
Workaround
Patch the general.py file in the container:
RUN sed -i \
    -e '1a from unstructured.partition.utils.config import env_config' \
    -e 's/"strategy": strategy,/"strategy": strategy,\n "ocr_agent": env_config.OCR_AGENT,/' \
    /home/notebook-user/prepline_general/api/general.py
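To confirm that the sed patch actually took effect, the patched file can be checked for the two strings the workaround inserts. This is a rough sketch; is_patched is a hypothetical helper, and it only does plain substring matching, so it does not validate that the resulting file is syntactically correct.

```python
def is_patched(source: str) -> bool:
    """Return True if the source contains both lines added by the workaround."""
    has_import = "from unstructured.partition.utils.config import env_config" in source
    has_kwarg = '"ocr_agent": env_config.OCR_AGENT' in source
    return has_import and has_kwarg


# Small inline samples standing in for the contents of general.py.
unpatched_sample = 'partition_kwargs = {\n    "strategy": strategy,\n}\n'
patched_sample = (
    "from unstructured.partition.utils.config import env_config\n"
    'partition_kwargs = {\n    "strategy": strategy,\n'
    '    "ocr_agent": env_config.OCR_AGENT,\n}\n'
)

print(is_patched(unpatched_sample))
print(is_patched(patched_sample))
```

Inside the container, the same function could be applied to the text of /home/notebook-user/prepline_general/api/general.py, for example via pathlib.Path(...).read_text().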
Environment
- unstructured-api: latest (as of Dec 2024)
- unstructured: 0.18.18
- Docker image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
Additional Context
This issue affects anyone trying to use an alternative OCR agent (PaddleOCR, Google Vision OCR) with the self-hosted API. The documentation at https://docs.unstructured.io/open-source/core-functionality/set-the-ocr-agent suggests setting the OCR_AGENT environment variable, but this doesn't work with the API due to this bug.