Rule-Based vs. LLM Document Extraction: A Hands-On Comparison for B2B Orders


Introduction

Extracting structured data from business documents—such as purchase orders, invoices, or delivery receipts—is a common yet challenging task in B2B workflows. Traditional rule-based systems have long been the default choice, but the rise of large language models (LLMs) offers a new, more flexible alternative. This article presents a practical comparison between a rule-based PDF extractor built with pytesseract and an LLM-based solution powered by Ollama and LLaMA 3. Both were applied to the same realistic B2B order scenario to evaluate their strengths and weaknesses.

The B2B Order Scenario

The test dataset consisted of scanned PDF purchase orders containing fields such as order number, vendor name, line items (quantities, part numbers, descriptions), pricing, and totals. These documents varied slightly in layout and had occasional handwriting marks, simulating real-world inconsistency. The goal was to extract all relevant fields accurately and quickly—without manual intervention.

Rule-Based Extraction with Pytesseract

Implementation

For the rule-based approach, I used pytesseract, a Python wrapper for Google's Tesseract OCR engine. The workflow was:

  1. Preprocess the PDF pages (convert to grayscale, apply thresholding, and deskew).
  2. Run OCR to extract raw text and bounding boxes.
  3. Apply handcrafted regular expressions and layout heuristics to locate and parse fields (e.g., "Order Number:" followed by alphanumeric characters).
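The three steps above can be sketched roughly as follows. This is a minimal illustration, not the exact rules used in the experiment: the field names, regex patterns, and threshold value are all assumptions for demonstration.

```python
import re

def ocr_page(image_path):
    """Steps 1-2: preprocess a page image and run Tesseract on it.
    Requires the Tesseract binary and the pytesseract/Pillow packages."""
    import pytesseract
    from PIL import Image
    img = Image.open(image_path).convert("L")          # grayscale
    img = img.point(lambda p: 255 if p > 150 else 0)   # simple thresholding
    return pytesseract.image_to_string(img)

# Step 3: handcrafted regexes for a handful of fields (illustrative).
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order\s*Number:\s*([A-Z0-9-]+)", re.I),
    "vendor":       re.compile(r"Vendor:\s*(.+)", re.I),
    "total":        re.compile(r"Total:\s*\$?([\d,]+\.\d{2})", re.I),
}

def parse_fields(raw_text):
    """Apply each pattern to OCR text; missing fields come back as None."""
    out = {}
    for name, pat in FIELD_PATTERNS.items():
        m = pat.search(raw_text)
        out[name] = m.group(1).strip() if m else None
    return out

sample = "Order Number: PO-4821\nVendor: Acme Industrial\nTotal: $1,240.50"
print(parse_fields(sample))
# {'order_number': 'PO-4821', 'vendor': 'Acme Industrial', 'total': '1,240.50'}
```

Every parsing decision here maps to one explicit pattern, which is exactly what makes the approach both transparent and fragile.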

Strengths

  • Deterministic output: Once rules were finely tuned, extraction was predictable and repeatable.
  • Low resource usage: Execution was fast and could run on a basic CPU.
  • Transparency: Every decision could be traced to a specific rule.

Weaknesses

  • Fragility: Small layout changes (different font, margin shift, or handwritten corrections) broke many rules.
  • Maintenance overhead: Each new document type required custom rules and extensive testing.
  • Limited semantic understanding: The system could not interpret ambiguous or missing fields.

LLM-Based Extraction with Ollama and LLaMA 3

Implementation

For the LLM approach, I used Ollama to serve the locally hosted LLaMA 3 model (8B parameters). The pipeline was:

  1. Convert PDF pages to images (as before).
  2. Send the image directly to the LLM along with a structured prompt specifying which fields to extract (e.g., "Extract order number, vendor, line items, and total from this purchase order.").
  3. The model returned a JSON object containing the extracted data.
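A sketch of the LLM pipeline is below. Assumptions to note: it uses the `ollama` Python client with an assumed model tag, and sending images only works with a vision-capable model served by Ollama (plain LLaMA 3 is text-only, so in practice the image step may need a multimodal variant or OCR text in its place). The JSON-parsing helper is testable without a running server.

```python
import json
import re

PROMPT = (
    "Extract order number, vendor, line items, and total from this "
    "purchase order. Respond with a single JSON object only."
)

def extract_with_llm(image_path, model="llama3"):
    """Send one page image plus the prompt to a locally served model.
    Requires the `ollama` client and a running Ollama server."""
    import ollama
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
    )
    return parse_json_reply(resp["message"]["content"])

def parse_json_reply(text):
    """Models often wrap JSON in prose or code fences; pull out the object."""
    m = re.search(r"\{.*\}", text, re.S)
    if m is None:
        raise ValueError("no JSON object in model reply")
    return json.loads(m.group(0))

# The parsing step, demonstrated on a typical chatty model reply:
reply = 'Here you go:\n```json\n{"order_number": "PO-4821", "total": 1240.50}\n```'
print(parse_json_reply(reply))
```

Forcing the reply through a JSON parser is also a cheap first line of defense: malformed output fails loudly instead of silently corrupting downstream data.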

Strengths

  • Adaptability: The LLM handled layout variations without explicit rules, even with handwriting or minor occlusions.
  • Zero-shot capability: No training or rule-tuning needed for new document formats.
  • Contextual understanding: It could infer missing information (e.g., totalling line items if not explicitly summed).

Weaknesses

  • Higher latency: Inference took 5–15 seconds per page on a consumer GPU (NVIDIA RTX 3060).
  • Resource demands: Required a GPU with at least 8GB VRAM, making it less accessible for low-budget setups.
  • Hallucinations: Occasionally the model invented plausible-looking but incorrect values (e.g., wrong vendor name).
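One practical mitigation for the hallucination problem is to validate the model's output against simple invariants before accepting it. The sketch below is something to layer on top of either pipeline, not part of the original experiment; the field names and tolerance are illustrative.

```python
def validate_extraction(data):
    """Return a list of problems; an empty list means the record passes.

    Checks that required fields are present, and that the stated total
    matches the sum of line items (quantity x unit price) within a small
    tolerance -- a cheap way to catch invented totals."""
    problems = []
    for field in ("order_number", "vendor", "line_items", "total"):
        if not data.get(field):
            problems.append(f"missing field: {field}")
    items = data.get("line_items") or []
    if items and data.get("total") is not None:
        computed = sum(i["quantity"] * i["unit_price"] for i in items)
        if abs(computed - data["total"]) > 0.01:
            problems.append(f"total {data['total']} != line-item sum {computed:.2f}")
    return problems

record = {
    "order_number": "PO-4821",
    "vendor": "Acme Industrial",
    "line_items": [
        {"quantity": 10, "unit_price": 12.00},
        {"quantity": 4, "unit_price": 280.125},
    ],
    "total": 9999.99,  # hallucinated: the real sum is 1240.50
}
print(validate_extraction(record))
# ['total 9999.99 != line-item sum 1240.50']
```

Records that fail validation can be routed to manual review rather than trusted blindly.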

Head-to-Head Comparison

I evaluated both systems on 50 documents drawn from the same B2B order scenario. Key metrics were:

Metric                           | Rule-Based (pytesseract)   | LLM (Ollama + LLaMA 3)
Accuracy (field-level F1)        | 0.85                       | 0.93
Average processing time per page | 0.4 seconds                | 9.2 seconds
Set-up effort                    | 3 days of rule tweaking    | 30 minutes of prompt engineering
Robustness to layout change      | Low (broke on 20% of docs) | High (handled all variations)
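For clarity on the accuracy row: field-level F1 treats each (document, field, value) pair as one prediction, counting exact matches as true positives. A minimal sketch with made-up ground truth, not the actual evaluation harness:

```python
def field_f1(gold, pred):
    """Micro-averaged field-level F1 over parallel lists of field dicts.

    A field is a true positive only on an exact value match; a wrong
    value counts as both a false positive and a false negative, and a
    missing value counts as a false negative."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for field, gold_val in g.items():
            pred_val = p.get(field)
            if pred_val is None:
                fn += 1
            elif pred_val == gold_val:
                tp += 1
            else:
                fp += 1
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"order_number": "PO-1", "total": "100.00"}]
pred = [{"order_number": "PO-1", "total": "900.00"}]  # one of two fields wrong
print(field_f1(gold, pred))
# 0.5
```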

When to use each approach

  • Rule-based is ideal for high-volume, stable document formats where speed and low cost are critical and layout is predictable.
  • LLM-based shines in heterogeneous environments, fast prototyping, or when documents contain unstructured or semi-structured data.

Conclusion

Building the same B2B document extractor twice revealed clear trade-offs. The rule-based system with pytesseract offered speed and determinism but required constant maintenance. The LLM approach with Ollama and LLaMA 3 provided superior flexibility and accuracy at the cost of latency and hardware requirements. For many real-world B2B scenarios, a hybrid solution may be best: use rules for simple, well-known fields and fall back to an LLM for complex or unrecognized extraction tasks.
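The hybrid fallback idea can be sketched as a simple router: try the cheap rule first and only pay the LLM's latency when it fails. The regex and the stub standing in for a real model call are illustrative assumptions.

```python
import re

ORDER_RE = re.compile(r"Order\s*Number:\s*([A-Z0-9-]+)", re.I)

def extract_hybrid(ocr_text, llm_fallback):
    """Try the rule first; call the LLM only when the rule fails.

    `llm_fallback` is any callable taking the raw text and returning
    the field value (e.g. a wrapper around an Ollama chat call)."""
    m = ORDER_RE.search(ocr_text)
    if m:
        return {"order_number": m.group(1), "source": "rules"}
    return {"order_number": llm_fallback(ocr_text), "source": "llm"}

# Stub standing in for a real model call:
stub_llm = lambda text: "PO-UNKNOWN"

print(extract_hybrid("Order Number: PO-4821", stub_llm))        # rules path
print(extract_hybrid("0rder Nr: PO-4821 (smudged)", stub_llm))  # falls back to LLM
```

On a corpus where most documents match the rules, this keeps average latency close to the rule-based system while retaining the LLM's robustness for the long tail.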

This article is based on practical experiments and was first published on Towards Data Science.
