From 9182625db98a7d560fff1ed22798cf34958f71ca Mon Sep 17 00:00:00 2001
From: mrhapile
Date: Tue, 3 Feb 2026 07:45:52 +0530
Subject: [PATCH] plugin-docs

Signed-off-by: mrhapile
---
 docs/contribute/source/plugin/README.md | 287 ++++++++++++++++++++++++
 1 file changed, 287 insertions(+)
 create mode 100644 docs/contribute/source/plugin/README.md

diff --git a/docs/contribute/source/plugin/README.md b/docs/contribute/source/plugin/README.md
new file mode 100644
index 00000000..a0e8340f
--- /dev/null
+++ b/docs/contribute/source/plugin/README.md
@@ -0,0 +1,287 @@
+# wasmedge_ocr Plugin
+
+The `wasmedge_ocr` plugin provides Optical Character Recognition (OCR) capabilities to WasmEdge applications by integrating with the Tesseract OCR engine. It allows WebAssembly modules to extract text and layout information from images located on the host filesystem.
+
+## Overview
+
+This plugin exposes host functions that enable Wasm modules to:
+1. Trigger OCR processing on a specified image file.
+2. Retrieve the results in TSV (Tab-Separated Values) format, which includes recognized text, confidence scores, and bounding box coordinates.
+
+### Quick Start
+
+Get the plugin running immediately with these steps.
+
+1. **Install Dependencies**
+   - **Linux (Ubuntu/Debian)**:
+     ```bash
+     sudo apt-get install libtesseract-dev libleptonica-dev tesseract-ocr-eng
+     ```
+   - **macOS**:
+     ```bash
+     brew install tesseract leptonica
+     ```
+
+2. **Build Plugin**
+   ```bash
+   # From WasmEdge root
+   cmake -DWASMEDGE_PLUGIN_WASMEDGE_OCR:BOOL=TRUE -B ./build -G "Unix Makefiles"
+   cmake --build ./build
+   ```
+
+3. **Set Plugin Path**
+   ```bash
+   export WASMEDGE_PLUGIN_PATH=$(pwd)/build/plugins/wasmedge_ocr/
+   ```
+
+4. **Run**
+   ```bash
+   wasmedge app.wasm
+   ```
+
+### Intended Use Cases
+- Extracting text from artifacts (scanned documents, photos).
+- Getting bounding box coordinates for text in images (layout analysis).
+- Processing images where the file resides on the host system.
+
+### Supported Image Formats
+The plugin uses Leptonica for image loading, supporting formats such as:
+- PNG
+- JPEG
+- TIFF
+- BMP
+- GIF
+- WebP
+- PNM
+
+## Architecture
+
+The plugin links against `libtesseract` and `libleptonica`. It exposes a module named `wasmedge_ocr` with two stateful host functions.
+
+### Integration
+- **Direct Host Access**: The plugin follows a "host-passthrough" model where the Wasm module provides a file path string. The plugin reads this file directly from the **host filesystem**, effectively bypassing WASI file system sandboxing for the input file.
+- **Stateful Execution**: The process requires a two-step call sequence: triggering the extraction and then fetching the result.
+- **Output Format**: Hardcoded to return TSV data (level `RIL_WORD`), providing detailed word-level data.
+
+> [!WARNING]
+> **Security Notice**: This plugin accesses files on the **Host Filesystem** using direct paths provided by the Wasm module. This explicitly **bypasses the WASI sandbox** isolation. Only use this plugin with properly reviewed and trusted Wasm modules, as they can probe for files on your host system.
+
+### Dependencies
+
+To build and run this plugin, the host system must have the development libraries for Tesseract and Leptonica installed, along with the English language data for Tesseract.
+
+#### Linux (Debian/Ubuntu)
+```bash
+sudo apt-get install libtesseract-dev libleptonica-dev tesseract-ocr-eng
+```
+
+#### macOS
+```bash
+brew install tesseract leptonica
+```
+
+#### Minimum Versions
+- **Tesseract**: 4.x or higher (requires `libtesseract`).
+- **Leptonica**: A version compatible with the installed Tesseract.
+
+## Build Instructions
+
+This plugin is built as part of the WasmEdge project source tree.
+
+### 1. Enable the Plugin in CMake
+When configuring the WasmEdge build, you must enable the OCR plugin using the `WASMEDGE_PLUGIN_WASMEDGE_OCR` flag.
+
+```bash
+cmake -DWASMEDGE_PLUGIN_WASMEDGE_OCR:BOOL=TRUE -B ./build -G "Unix Makefiles"
+```
+
+### 2. Build WasmEdge
+```bash
+cmake --build ./build
+```
+
+### 3. Verify Build
+After building, check that the plugin library exists:
+```bash
+ls ./build/plugins/wasmedge_ocr/libwasmedgePluginWasmEdgeOCR.so
+# On macOS, it will be .dylib
+```
+
+## API Reference
+
+**Module Name**: `wasmedge_ocr`
+
+### API Call Flow
+This API is **stateful** and must be called in a specific order:
+1. **Initialize & Process**: Call `num_of_extractions` with the image path. This initializes Tesseract and runs the recognition.
+2. **Buffer Preparation**: The return value tells you how many bytes to allocate.
+3. **Retrieve Data**: Call `get_output` to copy the data into your buffer.
+4. **Cleanup**: `get_output` automatically calls `TesseractApi->End()`, cleaning up resources. You cannot call it again for the same image.
+
+### 1. `num_of_extractions`
+
+Triggers the OCR process on the image and returns the length of the result string.
+
+```wasm
+(func $num_of_extractions (param i32 i32) (result i32))
+```
+
+- **Parameters**:
+  - `image_path_ptr` (i32): Pointer to the null-terminated string containing the absolute or relative path to the image file on the host.
+  - `image_path_len` (i32): Length of the file path string.
+- **Returns**:
+  - `(i32)`: The length (in bytes) of the generated TSV output string. If processing fails, the function returns 0 or an error code instead, so treat a zero result as a failure.
+
+### 2. `get_output`
+
+Retrieves the TSV output buffer generated by the previous call to `num_of_extractions`.
+
+```wasm
+(func $get_output (param i32 i32) (result i32))
+```
+
+- **Parameters**:
+  - `out_buf_ptr` (i32): Pointer to the memory buffer where the result should be written.
+  - `max_len` (i32): Maximum size of the buffer (should be at least the size returned by `num_of_extractions`).
+- **Returns**:
+  - `(i32)`: Returns 0 (`ErrNo::Success`) on success. Returns error codes otherwise.
+- **Side Effect**:
+  - **Clears State**: This function calls `TesseractApi->End()`, which cleans up the Tesseract instance. You cannot call `get_output` multiple times for the same extraction.
+
+### TSV Output Format
+The `get_output` function returns raw TSV (Tab-Separated Values) data generated by Tesseract at the `RIL_WORD` (Word) level.
+
+| Column | Description |
+| :--- | :--- |
+| **level** | Hierarchy level (always 5 for Word) |
+| **page_num** | Page number in the document |
+| **block_num** | Block number |
+| **par_num** | Paragraph number |
+| **line_num** | Line number |
+| **word_num** | Word number |
+| **left** | X coordinate of the top-left corner |
+| **top** | Y coordinate of the top-left corner |
+| **width** | Width of the bounding box |
+| **height** | Height of the bounding box |
+| **conf** | Confidence score (0-100) |
+| **text** | The recognized text string |
+
+## Usage Examples
+
+### Step-by-Step Workflow
+
+1. **Prepare Image**: Have an image file (e.g., `test.png`) on the host.
+2. **Call `num_of_extractions`**: Pass the path to the image. Receive the result length.
+3. **Allocate Memory**: Create a buffer of the received length.
+4. **Call `get_output`**: Pass the buffer pointer and length to retrieve data.
+
+### Rust Example
+
+```rust
+#[link(wasm_import_module = "wasmedge_ocr")]
+extern "C" {
+    pub fn num_of_extractions(path_ptr: *const u8, path_len: usize) -> u32;
+    pub fn get_output(out_ptr: *mut u8, max_len: usize) -> u32;
+}
+
+pub fn main() {
+    let image_path = "test.png\0"; // NUL-terminated, as the API reference requires
+
+    unsafe {
+        // 1. Trigger OCR and get result length (path length excludes the NUL)
+        let len = num_of_extractions(image_path.as_ptr(), image_path.len() - 1);
+
+        if len > 0 {
+            // 2. Allocate buffer
+            let mut buf = vec![0u8; len as usize];
+
+            // 3. Retrieve output
+            let res = get_output(buf.as_mut_ptr(), len as usize);
+
+            if res == 0 {
+                let output = String::from_utf8_lossy(&buf);
+                println!("OCR Result (TSV):\n{}", output);
+            } else {
+                eprintln!("Failed to get output, error code: {}", res);
+            }
+        }
+    }
+}
+```
+
+### Execution
+
+Run the compiled Wasm file using the WasmEdge CLI with the plugin path set.
+
+```bash
+# Set plugin path if installed in a custom location, otherwise default is used
+export WASMEDGE_PLUGIN_PATH=./build/plugins/wasmedge_ocr/
+
+# Run the wasm file
+wasmedge app.wasm
+```
+
+## Performance & Limitations
+
+### 1. Language Support
+- **English Only**: The plugin hardcodes the initialization language to `"eng"`. It requires `tesseract-ocr-eng` data to be present on the host. Multi-language or custom trained data selection is **not** currently exposed via the API.
+
+### 2. Output Format
+- **TSV Fixed**: The output is strictly Tesseract's TSV format (Tab-Separated Values). It contains word confidence, bounding boxes, and text. Plain text extraction is not provided as a separate option; users must parse the TSV themselves.
+
+### 3. One-Shot Lifecycle
+- The `get_output` function calls `TesseractApi->End()`. This destroys the Tesseract instance associated with the module environment.
+- **Implication**: Because the current code initializes Tesseract only in the constructor, processing a second image on the same module instance may fail. Treat a module instance as single-use, or test sequential calls carefully before relying on them.
+
+### 4. File Access
+- The plugin uses `pixRead` with the path provided. This file must exist on the **Host Filesystem** where WasmEdge is running. It does not read from the Wasm virtual filesystem (WASI). If running in a container, map the image file into the container.
+
+### Compatibility Matrix
+
+| Feature | Support | Notes |
+| :--- | :--- | :--- |
+| **Interpreter Mode** | ✅ Supported | Standard execution |
+| **AOT / JIT** | ✅ Supported | Validated on x86_64, aarch64 |
+| **WASI Filesystem** | ❌ Not Supported | Files read directly from Host FS |
+| **Host OS** | Linux, macOS | Windows support experimental/untested |
+
+## Common Pitfalls
+
+* **Calling `get_output` twice**: Will cause a crash or undefined behavior because the Tesseract instance is destroyed after the first call.
+* **Reuse of Module Instance**: The module is designed for single-use per Tesseract session. Re-instantiate the module for processing new images if you encounter issues.
+* **Relative Paths**: Paths are relative to the *working directory of the `wasmedge` process*, not the Wasm file location.
+* **Missing Data**: Forgetting to install `tesseract-ocr-eng` will cause silent initialization failures (calls return 0).
+
+## Troubleshooting
+
+### Common Build Failures
+
+**1. `Tesseract` or `Leptonica` not found**
+- **Error**: `Could NOT find Tesseract (missing: TESSERACT_LIBRARIES TESSERACT_INCLUDE_DIRS)`
+- **Fix**: Ensure development headers are installed.
+  - Ubuntu: `sudo apt install libtesseract-dev libleptonica-dev`
+  - macOS: `brew install tesseract leptonica`
+- **Fix**: If using custom paths, set `PKG_CONFIG_PATH` to help CMake find the libraries.
+
+### Runtime Failure Modes
+
+**1. Initialization Error (Error Code 1 or 2)**
+- **Symptom**: `num_of_extractions` returns 0 or logs `[WasmEdge-OCR] Error occurred when initializing tesseract.`
+- **Cause**: Missing `tessdata` (specifically `eng.traineddata`).
+- **Fix**: Install the language data packages (`sudo apt install tesseract-ocr-eng`) or set the `TESSDATA_PREFIX` environment variable to the directory containing `eng.traineddata`.
+
+**2. File Not Found**
+- **Symptom**: `pixRead` fails, `num_of_extractions` might return 0 or unexpected length.
+- **Cause**: The path provided is relative to the *host's* current working directory, not the Wasm file location.
+- **Fix**: Use absolute paths for images or ensure the WasmEdge runner is executed from the correct directory.
+
+**3. "Symbol not found" when running Wasm**
+- **Cause**: The plugin is not loaded.
+- **Fix**: Ensure the `WASMEDGE_PLUGIN_PATH` environment variable points to the directory containing `libwasmedgePluginWasmEdgeOCR.so` (or `.dylib`).
+
+## Future Improvements
+
+* Expose Tesseract language selection via API.
+* Implement direct memory buffer support (passing image bytes instead of paths).
+* Add support for other output formats (text, HOCR, PDF).
+* Allow re-initialization of Tesseract engine without destroying module instance.
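Until a plain-text output format such as the one listed above exists, a guest has to derive plain text from the TSV itself. The following is a minimal sketch of that step, not part of the plugin. It assumes the twelve-column layout from the TSV Output Format table, keeps only rows whose `level` is 5, and skips anything that does not parse (including a header row, if one is present). The names `Word`, `words_from_tsv`, and `plain_text` are hypothetical helpers introduced here for illustration.

```rust
/// One word-level row of the TSV output (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct Word {
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
    pub conf: f32,
    pub text: String,
}

/// Parses TSV text and keeps only well-formed rows whose `level` column is 5.
pub fn words_from_tsv(tsv: &str) -> Vec<Word> {
    tsv.lines().filter_map(parse_row).collect()
}

fn parse_row(line: &str) -> Option<Word> {
    // Assumed column order: level, page_num, block_num, par_num, line_num,
    // word_num, left, top, width, height, conf, text.
    let cols: Vec<&str> = line.splitn(12, '\t').collect();
    if cols.len() != 12 || cols[0] != "5" {
        return None;
    }
    Some(Word {
        left: cols[6].parse().ok()?,
        top: cols[7].parse().ok()?,
        width: cols[8].parse().ok()?,
        height: cols[9].parse().ok()?,
        conf: cols[10].parse().ok()?,
        text: cols[11].to_string(),
    })
}

/// Joins recognized words into plain text, dropping those below `min_conf`.
pub fn plain_text(words: &[Word], min_conf: f32) -> String {
    words
        .iter()
        .filter(|w| w.conf >= min_conf)
        .map(|w| w.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

With the `output` string from the Rust example in the README, a call such as `plain_text(&words_from_tsv(&output), 60.0)` would yield the recognized words separated by spaces. The threshold of 60 is an arbitrary illustration, not a recommended value.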