Scaling Multilingual DTP by Automating Mandatory Text Checks in PDFs Before Delivery

January 21, 2026

In multilingual documentation workflows, the final deliverable is often a PDF.

Even when layout, links, and styles are correct, a document can still fail for a very simple reason:

a required text string is missing.

Typical examples include:

- Safety headings such as Warning
- Regulatory sections such as Intended Use
- Copyright statements
- Document or revision codes

These are easy to overlook, especially when you deliver many PDFs across multiple languages.

Why I built this script in the first place

In my experience, many PDF proofreading or QA tools are good at finding forbidden words (for example, detecting if Draft is still present).

However, tools that can reliably detect when a required text is missing are surprisingly rare.

When such tools do exist, they are often enterprise-level solutions, and they are usually very expensive.

That gap is what motivated me to create a small, purpose-built script that answers a simple question:

“Is this required text actually present in every PDF?”

The automation approach

The basic idea is straightforward:

- Extract text from each PDF
- Search for a predefined list of required texts
- Report which PDFs pass or fail
- Save the results as a log file for traceability

This does not replace full QA.
It provides a fast, repeatable safety net before delivery.
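
The search step is only as reliable as the text comparison behind it. Text extracted from a PDF often differs from the required string in case, line breaks, ligatures (such as "ﬁ"), or soft hyphens. The following is a minimal sketch of the search and report steps on plain strings, with no PDF library involved. The names normalize_text and find_missing are hypothetical, and the use of Unicode NFKC normalization is an assumption that may not suit every language or font.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"


def normalize_text(s: str) -> str:
    """Fold common extraction differences before comparing strings."""
    # NFKC maps compatibility characters to plain equivalents, for example
    # the single-character ligature "ﬁ" becomes the two letters "fi".
    s = unicodedata.normalize("NFKC", s)
    # Soft hyphens are invisible break hints that can split a word.
    s = s.replace(SOFT_HYPHEN, "")
    # casefold() is a more aggressive lowercase intended for matching.
    s = s.casefold()
    # Collapse runs of spaces, tabs, and line breaks into one space.
    return re.sub(r"\s+", " ", s).strip()


def find_missing(extracted_text: str, required: list[str]) -> list[str]:
    """Return the required strings that do not occur in the text."""
    haystack = normalize_text(extracted_text)
    return [r for r in required if normalize_text(r) not in haystack]


if __name__ == "__main__":
    extracted = "WARNING\nRead the  Intended\nUse section ﬁrst."
    required = ["Warning", "Intended Use", "Copyright"]
    print(find_missing(extracted, required))  # prints ['Copyright']
```

An empty list means the document passes. Anything else is the list of strings to report as missing.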

Minimal script example (folder-based batch check)

⚠ Important note for readers
This example is written in Python, not JavaScript.

If you mainly work with InDesign scripting, note that this choice is intentional:

- InDesign scripts are ideal for layout and asset automation
- PDF-level QA is often easier and more reliable in Python

To run this script, you need:

- Python 3.10 or later
- PyMuPDF (imported as fitz)
- tkinter (included with most Python installers; some Linux distributions package it separately)

Module installation example:


pip install pymupdf

Example required texts (sample only)


NEEDLES = [
    "Warning",
    "Intended Use",
    "Copyright",
    "1234-5678A",
]

These strings are intentionally generic and represent typical mandatory elements in technical or regulated documentation.

Python sample script (MIT License)


from __future__ import annotations

from pathlib import Path
import re
import time
import fitz  # PyMuPDF
import tkinter as tk
from tkinter import filedialog

# Sample mandatory strings (English-only example)
NEEDLES = [
    "Warning",
    "Intended Use",
    "Copyright",
    "1234-5678A",
]

LOG_FILENAME = "pdf_mandatory_text_check_results.txt"


def normalize(s: str) -> str:
    """
    Minimal normalization for robust matching:
    - lowercase comparison
    - collapsed whitespace
    """
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()
    return s


def select_target_dir() -> Path | None:
    root = tk.Tk()
    root.withdraw()
    root.attributes("-topmost", True)

    folder = filedialog.askdirectory(
        title="Select a folder containing PDFs"
    )
    root.destroy()

    return Path(folder) if folder else None


def pdf_contains_all_text(
    pdf_path: Path,
    needles: list[str],
) -> tuple[bool, list[str], str | None]:
    """
    Returns:
      ok: bool
      missing: list[str]
      error: str | None
    """
    try:
        # The context manager closes the PDF even if text extraction fails.
        with fitz.open(pdf_path) as doc:
            texts = [page.get_text("text") for page in doc]
        all_text = normalize("\n".join(texts))
    except Exception as e:
        return False, needles, f"{type(e).__name__}: {e}"

    missing: list[str] = []
    for needle in needles:
        if normalize(needle) not in all_text:
            missing.append(needle)

    return len(missing) == 0, missing, None


def main() -> int:
    target_dir = select_target_dir()
    if target_dir is None:
        print("Cancelled (no folder selected).")
        time.sleep(5)
        return 0

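    # Compare the extension case-insensitively so files named *.PDF are included.
    pdfs = sorted(
        p for p in target_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )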
    if not pdfs:
        print("No PDFs found in the selected folder.")
        time.sleep(5)
        return 0

    log_path = target_dir / LOG_FILENAME
    lines: list[str] = []

    lines.append("PDF Mandatory Text Check Results")
    lines.append(f"Folder: {target_dir}")
    lines.append(f"Mandatory strings: {NEEDLES}")
    lines.append("")

    has_ng = False

    print("Results:")
    for pdf in pdfs:
        ok, missing, error = pdf_contains_all_text(pdf, NEEDLES)

        if error:
            has_ng = True
            msg = f"{pdf.name}: ERROR  {error}"
        elif ok:
            msg = f"{pdf.name}: OK"
        else:
            has_ng = True
            msg = f"{pdf.name}: NG  Missing: {missing}"

        print(msg)
        lines.append(msg)

    log_path.write_text(
        "\n".join(lines) + "\n",
        encoding="utf-8"
    )

    print(f"\nLog written to: {log_path}")
    print("\nProcess completed.")
    print("This window will close automatically in 5 seconds...")
    time.sleep(5)

    return 1 if has_ng else 0


if __name__ == "__main__":
    raise SystemExit(main())

Note:
This script is intentionally simplified and provided as a sample.
Because the required texts are defined in English, it cannot be used as-is for full multilingual validation.
In production, the validation logic itself is typically shared, while the rule sets are switched automatically by detecting the target language from the file name or from the PDF content.
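
As a rough illustration of switching rule sets by file name, here is a minimal sketch. It assumes a hypothetical naming convention in which the file name ends with an underscore and a two-letter language code (for example, manual_de.pdf). The RULES_BY_LANG table, the function names, and the translated strings are placeholders for illustration only. Real rule sets should come from approved terminology.

```python
from __future__ import annotations

import re
from pathlib import Path

# Placeholder rule sets (illustrative only, not approved translations).
RULES_BY_LANG: dict[str, list[str]] = {
    "en": ["Warning", "Copyright"],
    "de": ["Warnung", "Copyright"],
    "fr": ["Avertissement", "Copyright"],
}

# Assumed convention: the file stem ends with "_<two-letter code>".
LANG_SUFFIX = re.compile(r"_([a-z]{2})$", re.IGNORECASE)


def detect_language(pdf_path: Path) -> str | None:
    """Return the lowercase language code from the file name, if any."""
    match = LANG_SUFFIX.search(pdf_path.stem)
    return match.group(1).lower() if match else None


def rules_for(pdf_path: Path) -> list[str] | None:
    """Return the rule set for this file, or None if the language is unknown."""
    lang = detect_language(pdf_path)
    if lang is None:
        return None
    return RULES_BY_LANG.get(lang)


if __name__ == "__main__":
    for name in ["manual_en.pdf", "manual_DE.pdf", "manual.pdf"]:
        print(name, "->", rules_for(Path(name)))
```

Returning None for an unknown language lets the caller report the file as an error, so a misnamed PDF is not silently checked against the wrong rule set.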

What I add in real production workflows

In real projects, this kind of checker is usually extended with:

- Language-specific required text sets
- CSV reports for PMs and QA teams
- Additional PDF checks (file naming, etc.)
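
To give an idea of what a CSV report might look like, here is a minimal sketch built on Python's standard csv module. The CheckResult type, the column names, and the write_csv_report function are hypothetical. They are not part of the sample script above.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, NamedTuple, TextIO


class CheckResult(NamedTuple):
    file_name: str
    status: str            # "OK", "NG", or "ERROR"
    missing: list[str]     # required strings that were not found
    error: str | None      # error message when the PDF could not be read


def write_csv_report(results: Iterable[CheckResult], out: TextIO) -> None:
    """Write one header row, then one row per checked PDF."""
    writer = csv.writer(out)
    writer.writerow(["file", "status", "missing", "error"])
    for r in results:
        writer.writerow(
            [r.file_name, r.status, "; ".join(r.missing), r.error or ""]
        )


if __name__ == "__main__":
    sample = [
        CheckResult("a.pdf", "OK", [], None),
        CheckResult("b.pdf", "NG", ["Warning", "Copyright"], None),
    ]
    buffer = io.StringIO()
    write_csv_report(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends. If the report will be opened in Excel, encoding="utf-8-sig" helps non-ASCII file names display correctly.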

This article intentionally focuses on the simplest reliable starting point.

If you are interested in scaling multilingual DTP workflows, you may also find these useful:

Structural Automation for Multilingual Production