Scaling Multilingual DTP by Automating Mandatory Text Checks in PDFs Before Delivery

 

In multilingual documentation workflows, the final deliverable is often a PDF.

Even when layout, links, and styles are correct, a document can still fail for a very simple reason:

a required text string is missing.

Typical examples include:

  • Safety headings such as Warning

  • Regulatory sections such as Intended Use

  • Copyright statements

  • Document or revision codes

These are easy to overlook, especially when you deliver many PDFs across multiple languages.


Why I built this script in the first place

In my experience, many PDF proofreading or QA tools are good at finding forbidden words (for example, detecting if Draft is still present).

However, tools that can reliably detect when a required text is missing are surprisingly rare.

When such tools do exist:

  • they are often enterprise-level solutions

  • and they are usually very expensive

That gap is what motivated me to create a small, purpose-built script that answers a simple question:

“Is this required text actually present in every PDF?”


The automation approach

The basic idea is straightforward:

  1. Extract text from each PDF

  2. Search for a predefined list of required texts

  3. Report which PDFs pass or fail

  4. Save the results as a log file for traceability

This does not replace full QA.
It provides a fast, repeatable safety net before delivery.


Minimal script example (folder-based batch check)

Important note for readers
This example is written in Python, not JavaScript.

If you mainly work with InDesign scripting, this is intentional:

  • InDesign scripts are ideal for layout and asset automation

  • PDF-level QA is often easier and more reliable in Python

To run this script, you need:

  • Python 3.10 or later

  • The following modules:

    • PyMuPDF (fitz)

    • tkinter (usually included with Python)

Module installation example:

pip install pymupdf

Example required texts (sample only)

NEEDLES = [ "Warning", "Intended Use", "Copyright", "1234-5678A", ]

These strings are intentionally generic and represent typical mandatory elements in technical or regulated documentation.


Python sample script (MIT License)

from __future__ import annotations

from pathlib import Path
import re
import time

import fitz  # PyMuPDF
import tkinter as tk
from tkinter import filedialog

# Sample mandatory strings (English-only example)
NEEDLES = [
    "Warning",
    "Intended Use",
    "Copyright",
    "1234-5678A",
]

LOG_FILENAME = "pdf_mandatory_text_check_results.txt"


def normalize(s: str) -> str:
    """
    Minimal normalization for robust matching:
    - lowercase comparison
    - collapsed whitespace
    """
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()
    return s


def select_target_dir() -> Path | None:
    # Show a folder picker on top of other windows, without a main window.
    root = tk.Tk()
    root.withdraw()
    root.attributes("-topmost", True)
    folder = filedialog.askdirectory(
        title="Select a folder containing PDFs"
    )
    root.destroy()
    return Path(folder) if folder else None


def pdf_contains_all_text(
    pdf_path: Path,
    needles: list[str],
) -> tuple[bool, list[str], str | None]:
    """
    Returns:
        ok: bool
        missing: list[str]
        error: str | None
    """
    try:
        # The context manager closes the document even if extraction fails.
        with fitz.open(pdf_path) as doc:
            texts: list[str] = []
            for page in doc:
                texts.append(page.get_text("text"))
        all_text = normalize("\n".join(texts))
    except Exception as e:
        return False, needles, f"{type(e).__name__}: {e}"

    missing: list[str] = []
    for needle in needles:
        if normalize(needle) not in all_text:
            missing.append(needle)

    return len(missing) == 0, missing, None


def main() -> int:
    target_dir = select_target_dir()
    if target_dir is None:
        print("Cancelled (no folder selected).")
        time.sleep(5)
        return 0

    pdfs = sorted(target_dir.glob("*.pdf"))
    if not pdfs:
        print("No PDFs found in the selected folder.")
        time.sleep(5)
        return 0

    log_path = target_dir / LOG_FILENAME
    lines: list[str] = []
    lines.append("PDF Mandatory Text Check Results")
    lines.append(f"Folder: {target_dir}")
    lines.append(f"Mandatory strings: {NEEDLES}")
    lines.append("")

    has_ng = False
    print("Results:")
    for pdf in pdfs:
        ok, missing, error = pdf_contains_all_text(pdf, NEEDLES)
        if error:
            has_ng = True
            msg = f"{pdf.name}: ERROR {error}"
        elif ok:
            msg = f"{pdf.name}: OK"
        else:
            has_ng = True
            msg = f"{pdf.name}: NG\nMissing: {missing}"
        print(msg)
        lines.append(msg)

    log_path.write_text(
        "\n".join(lines) + "\n",
        encoding="utf-8"
    )

    print(f"\nLog written to: {log_path}")
    print("\nProcess completed.")
    print("This window will close automatically in 5 seconds...")
    time.sleep(5)

    # A non-zero exit code signals that at least one PDF failed or errored.
    return 1 if has_ng else 0


if __name__ == "__main__":
    raise SystemExit(main())
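To see how the matching behaves without opening a PDF, the normalization and search step can be tried on plain strings. The following is a minimal sketch: `find_missing` is a hypothetical helper name that mirrors the loop inside `pdf_contains_all_text`, and the sample text is made up for illustration.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace, as in the sample script."""
    return re.sub(r"\s+", " ", s.lower()).strip()


def find_missing(text: str, needles: list[str]) -> list[str]:
    """Return the needles that do not occur in the normalized text."""
    haystack = normalize(text)
    return [n for n in needles if normalize(n) not in haystack]


# Made-up extracted text: the heading "Intended Use" is wrapped across lines.
sample_text = (
    "WARNING\nRead before use.\n\n"
    "Intended\nUse\n"
    "Copyright 2024 Example Corp."
)

missing = find_missing(
    sample_text,
    ["Warning", "Intended Use", "Copyright", "1234-5678A"],
)
print(missing)  # ['1234-5678A']
```

Because whitespace is collapsed, a heading split across lines still matches. Note that this is plain substring matching, so "Warning" would also match inside "Warnings"; whether that is acceptable depends on the rule.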

Note:
This script is intentionally simplified and provided as a sample.
Because the required texts are defined in English, it cannot be used as-is for full multilingual validation.
In production, the validation logic itself is typically shared, while the rule sets are switched automatically by detecting the target language from the file name or from the PDF content.
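
One way to switch rule sets is to read a language code from the file name. The sketch below assumes a hypothetical naming convention such as `manual_de.pdf` or `manual_fr-FR.pdf`. The `RULES` table, its translated strings, and the helper names are illustrative placeholders, not the production rule sets.

```python
from __future__ import annotations

import re
from pathlib import Path

# Hypothetical rule sets keyed by language code (placeholder strings only).
RULES: dict[str, list[str]] = {
    "en": ["Warning", "Intended Use"],
    "de": ["Warnung", "Zweckbestimmung"],
    "fr": ["Avertissement", "Utilisation prévue"],
}

# Assumed convention: the file stem ends with _xx or _xx-YY.
LANG_SUFFIX = re.compile(r"_([a-z]{2})(?:-[A-Za-z]{2})?$")


def detect_language(pdf_path: Path) -> str | None:
    """Return the two-letter language code from the file name, if present."""
    match = LANG_SUFFIX.search(pdf_path.stem)
    return match.group(1) if match else None


def needles_for(pdf_path: Path) -> list[str] | None:
    """Look up the rule set for a PDF; None means no rule set is known."""
    lang = detect_language(pdf_path)
    return RULES.get(lang) if lang else None


for name in ["manual_de.pdf", "manual_fr-FR.pdf", "manual.pdf"]:
    print(name, "->", needles_for(Path(name)))
```

In a setup like this, a `None` result is probably best reported as an error rather than skipped, so that a misnamed file cannot pass silently.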

What I add in real production workflows

In real projects, this kind of checker is usually extended with:

  • Language-specific required text sets

  • CSV reports for PMs and QA teams

  • Additional PDF checks (file naming, etc.)
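
As an illustration of the CSV report idea, the results can be written with Python's standard `csv` module. This sketch assumes the per-file results have already been collected as tuples shaped like the return value of `pdf_contains_all_text`, with the file name added. The column names, the `write_csv_report` helper, and the sample rows are assumptions made for the example.

```python
import csv
import io
from typing import Optional, TextIO

# (file name, ok, missing strings, error message or None)
Result = tuple[str, bool, list[str], Optional[str]]


def write_csv_report(results: list[Result], out: TextIO) -> None:
    """Write one row per PDF with its status and any missing strings."""
    writer = csv.writer(out)
    writer.writerow(["file", "status", "missing", "error"])
    for name, ok, missing, error in results:
        if error:
            status = "ERROR"
        elif ok:
            status = "OK"
        else:
            status = "NG"
        # Keep all missing strings in one cell; leave it empty on errors.
        missing_cell = "" if error else "; ".join(missing)
        writer.writerow([name, status, missing_cell, error or ""])


# Made-up sample results for illustration.
sample = [
    ("manual_en.pdf", True, [], None),
    ("manual_de.pdf", False, ["Copyright", "1234-5678A"], None),
]

buffer = io.StringIO()
write_csv_report(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""`. Using `encoding="utf-8-sig"` may also help spreadsheet applications display non-ASCII strings correctly.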

This article intentionally focuses on the simplest reliable starting point.


