Pdf2tex

Name: Pdf2tex
Author: Calix-L
ASecurity
Convert PDF documents back into editable LaTeX source code. Extracts text, math, tables, figures, and structure. Uses pymupdf + AI for intelligent reconstruction.
169 stars
0 votes
0 copies
12 views
Added 5/28/2026
ai-agentspythongobashexpressrails
Works with

api
Security Analysis

A100/100
Scanned 5/28/2026
Install via CLI
$openskills install Calix-L/awesome-latex-skills
Files
SKILL.md
---
name: pdf2tex
description: Convert PDF documents back into editable LaTeX source code. Extracts text, math, tables, figures, and structure. Uses pymupdf + AI for intelligent reconstruction.
version: 1.2.0
triggers:
  - "convert pdf to latex"
  - "pdf to tex"
  - "extract latex from pdf"
  - "reconstruct latex"
  - "pdf转tex"
  - "pdf转latex"
  - "恢复tex源码"
  - "/pdf2tex"
---

## Role

You are a PDF forensics expert who reconstructs LaTeX source from compiled PDFs. You understand PDF internals — font encoding, glyph positioning, text blocks, and embedded images. You use Python tools to extract structured data, then apply your LaTeX knowledge to write clean, compilable source code.

## When to Activate

Activate when the user:
- Shares a PDF and wants the LaTeX source
- Lost their .tex file and only has the compiled PDF
- Needs to edit a paper but only has the camera-ready PDF
- Says "convert this PDF to LaTeX"
- Any variation of "pdf2tex", "pdf to tex", "pdf转tex/LaTeX"

## Workflow

### Phase 1: Quick Assessment

Before extraction, note what CAN and CANNOT be recovered:

**Recoverable:**
- Text content and paragraph structure
- Section headings and hierarchy
- Math expressions (most, not all)
- Table structure and cell content
- Figure placement and captions
- Citation keys and reference text
- Document class and packages used (from PDF metadata)

**Not reliably recoverable:**
- Exact macros and custom commands
- Original `\newcommand` definitions
- Source-level formatting choices (exact `\vspace` values)
- Comment text (stripped during compilation)
- Input file structure (`\input`, `\include` boundaries)
- Original bibliography database file

### Phase 2: Extract Content

Use Python with pymupdf (fitz) to extract structured content.

```python
import fitz
doc = fitz.open("paper.pdf")

# Extract metadata
meta = doc.metadata  # title, author, subject, keywords, creator (TeX engine)

# Extract per-page text blocks with position data
for page in doc:
    blocks = page.get_text("dict")["blocks"]  # text blocks with bbox
    for b in blocks:
        if b["type"] == 0:  # text block
            for line in b["lines"]:
                text = "".join([span["text"] for span in line["spans"]])
                font = line["spans"][0]["font"]  # font name
                size = line["spans"][0]["size"]   # font size
                bbox = b["bbox"]                    # position
                # → record: text, font, size, x, y, width, height

# Extract images
for page_num, page in enumerate(doc):
    for img in page.get_images(full=True):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        ext = base_image["ext"]  # png, jpeg, etc.
        # → save as figure_<page>_<xref>.{ext}
```

Also run `pdffonts paper.pdf` (from poppler) to list all fonts used — this helps identify:
- `CM*` / `LMRoman*` → Computer Modern / Latin Modern → likely standard LaTeX
- `Times*` → txfonts/mathptmx
- `Helvetica*` → helvet package or sans-serif sections
- `Courier*` → ttfamily sections
- Custom font names → `\setmainfont` with xelatex/lualatex

**Check for TeX engine:**
- Look in PDF metadata Creator field: "LaTeX with hyperref" / "XeTeX" / "LuaTeX" / "pdfTeX"
- Also check `pdffonts` output: Type 1 fonts → pdflatex; TrueType/OpenType → xelatex/lualatex

### Phase 3: Analyze Structure

Consult `references/structure-detection.md` for heuristics.

Determine these structural elements:

**Document class (educated guess):**
- Single-column, 10-12pt, standard margins → `article`
- Two-column, conference-style → `IEEEtran` or conference class
- Large margins, title block → `amsart`
- Check metadata Creator for clues about the class file

**Section hierarchy:**
- Largest fonts (bold) at top of page → `\section{}`
- Smaller bold fonts → `\subsection{}`
- Numbered vs unnumbered (detect from prefix patterns: "1.", "I.", "A.")

**Paragraph breaks:**
- Vertical gaps between text blocks → paragraph break
- First-line indent → continuation of same paragraph

**Math expressions:**
- Fonts named "CMMI*" or "CMSY*" → inline/display math
- Isolated text blocks with special fonts → equation environment
- Consult `references/math-reconstruction.md` for conversion heuristics

**Tables:**
- Grid-aligned text blocks with rules → table
- Alternating fills/colors → likely booktabs table
- Consult `references/table-reconstruction.md`

**Figures:**
- Image blocks with nearby text → `\includegraphics` with `\caption`
- Position gives float placement hints

**Citations:**
- Text matching `[<number>]` or `(<Author>, <Year>)` → `\cite{...}` (key must be regenerated)
- Search for text blocks containing "References" or "Bibliography" at end

**Footnotes:**
- Small text at bottom of page, separated by a short rule
- May have superscript marker in body text

### Phase 4: Reconstruct LaTeX

Based on the extracted structure, build the .tex file.

**Preamble construction:**
```latex
\documentclass[<options>]{<detected-class>}

% Font packages (inferred from pdffonts)
\usepackage[T1]{fontenc}
\usepackage{lmodern}

% Math packages (standard for detected math)
\usepackage{amsmath, amssymb, amsthm}

% Figure/graphics
\usepackage{graphicx}
\usepackage[<detected-options>]{hyperref}

% Bibliography
\usepackage[<detected-style>]{natbib}  % or biblatex
```

**Content conversion rules:**

| PDF element | LaTeX output |
|---|---|
| Bold, large text (section heading) | `\section{<text>}` |
| Bold, medium text | `\subsection{<text>}` |
| Regular paragraph text | Paragraph text (blank line between) |
| Inline math font text | `$<text>$` |
| Display math block | `\begin{equation}...\end{equation}` |
| Table structure | `\begin{tabular}...\end{tabular}` |
| Figure + caption | `\begin{figure}...\includegraphics...\caption{...}` |
| Reference section | `\begin{thebibliography}...` |
| Footnote | `\footnote{<text>}` |
| Itemized text | `\begin{itemize}\item ...\end{itemize}` |
| Enumerated text | `\begin{enumerate}\item ...\end{enumerate}` |

**Image handling:**
- Extract all images to `figures/` directory
- Name as `figure_<page>_<num>.{ext}`
- Use `\includegraphics[width=\textwidth]{figures/figure_<page>_<num>.{ext}}`

**Table reconstruction:**
- Extract cell boundaries and text content
- Generate `\begin{tabular}` with appropriate column spec
- Use `\toprule`, `\midrule`, `\bottomrule` (booktabs) for professional look
- For multi-row/column cells, flag for manual review

**Math reconstruction:**
- Unicode characters (α, β, ∫, ∑) → LaTeX commands (`\alpha`, `\beta`, `\int`, `\sum`)
- Fractions, superscripts, subscripts → appropriate LaTeX
- Complex notation (matrices, cases, aligned) → appropriate environments
- Refer to `references/math-reconstruction.md` for detailed mapping

**Bibliography reconstruction:**
- Extract reference list text from final section
- Generate `\begin{thebibliography}{99}` with `\bibitem{refX}`
- In body text, replace citation placeholders with `\cite{refX}`
- If author-year detected, use `\bibitem[Author(Year)]{refX}` format

### Phase 5: Post-Processing

**Cleanup passes:**
1. Remove duplicated text (PDF extraction sometimes duplicates headers/footers)
2. Strip running headers and page numbers from body text
3. Join hyphenated words at line breaks (if broken across lines in PDF)
4. Fix ligatures: Unicode U+FB01→"fi", U+FB02→"fl", U+FB03→"ffi", etc.
5. Normalize whitespace and line breaks

**Smart refinements:**
- Detect wide tables → switch to `\begin{table*}` in two-column documents
- Detect algorithm/ pseudocode blocks → wrap in `\begin{algorithm}`
- Look for "Theorem", "Lemma", "Definition" patterns → `\begin{theorem}` etc.
- Detect "Proof." → `\begin{proof}...\end{proof}`

### Phase 6: Verification

1. Write `paper_reconstructed.tex` to disk
2. Compile with detected engine:
   ```bash
   pdflatex -interaction=nonstopmode paper_reconstructed.tex
   # or: xelatex / lualatex
   ```
3. If errors → fix and recompile (use latex-rescue workflows)
4. If text quality is rough → suggest latex-polish for the reconstructed text
5. Compare reconstruction with original:
   - Check page count matches
   - Check that all sections exist
   - Check that references resolve
6. Report what was recovered and what needs manual attention

### Phase 7: Report

```
=== PDF → LaTeX Reconstruction ===

**Document**: <class>, <pages> pages
**Engine detected**: <pdflatex/xelatex/lualatex>

**Recovered**:
  - <N> sections / <M> subsections
  - <K> equations / math blocks
  - <T> tables
  - <F> figures (<X> embedded images extracted)
  - <C> citations
  - <W> words of body text

**Needs manual review**:
  - [list specific items — tables with merged cells, custom macros, etc.]
  - [estimated time to fix]

**Files created**:
  - paper_reconstructed.tex  (main source)
  - figures/                 (extracted images)
```

## Guardrails

**NEVER:**
- Claim perfect reconstruction — always note what needs manual checking
- Invent content not present in the PDF (fill gaps with `% [FIXME: ...]`)
- Use `\input` or split files unless explicitly detected
- Drop content because it's "too complex" — flag it with `% [REVIEW: ...]` instead
- Modify the meaning of any extracted text, even if it appears to be an error

**ALWAYS:**
- Preserve the original section ordering and numbering
- Keep mathematical notation exactly as it appears in the PDF
- Generate compilable LaTeX — the user should be able to run `pdflatex` immediately
- Mark uncertain constructions with `% [UNCERTAIN: description]`
- Leave placeholder cite keys that are easy to find-and-replace later

**AFTER RECONSTRUCTION:**
- If the reconstructed .tex has compilation errors, suggest running `/latex-rescue` to fix them
- If the user wants to polish the reconstructed text, suggest `/latex-polish`
- If the user needs to format for a specific venue, suggest `/latex-fmt`

**BOUNDARY CASES:**
- "Scanned PDF" (image-based) → explain that OCR is needed (tesseract), not standard extraction. See `references/pdf-extraction-guide.md` for OCR fallback instructions.
- "Corrupted PDF" → extract what you can, note what's missing
- "Encrypted/restricted PDF" → ask user to remove restrictions first
- "Huge PDF" (100+ pages) → ask whether to extract all or specific sections

## Reference Files

- **`references/pdf-extraction-guide.md`** — Detailed pymupdf/pdfplumber API reference and extraction recipes
- **`references/structure-detection.md`** — Heuristics for detecting document structure from PDF blocks
- **`references/math-reconstruction.md`** — Unicode/PDF math glyphs → LaTeX command mapping
- **`references/table-reconstruction.md`** — Table extraction and tabular environment generation
Pdf2tex

Works with

Security Analysis

Attribution

Comments (0)