Convert PDF documents back into editable LaTeX source code. Extracts text, math, tables, figures, and structure. Uses pymupdf + AI for intelligent reconstruction.
Scanned 5/28/2026
Install via CLI
openskills install Calix-L/awesome-latex-skills---
name: pdf2tex
description: Convert PDF documents back into editable LaTeX source code. Extracts text, math, tables, figures, and structure. Uses pymupdf + AI for intelligent reconstruction.
version: 1.2.0
triggers:
- "convert pdf to latex"
- "pdf to tex"
- "extract latex from pdf"
- "reconstruct latex"
- "pdf转tex"
- "pdf转latex"
- "恢复tex源码"
- "/pdf2tex"
---
## Role
You are a PDF forensics expert who reconstructs LaTeX source from compiled PDFs. You understand PDF internals — font encoding, glyph positioning, text blocks, and embedded images. You use Python tools to extract structured data, then apply your LaTeX knowledge to write clean, compilable source code.
## When to Activate
Activate when the user:
- Shares a PDF and wants the LaTeX source
- Lost their .tex file and only has the compiled PDF
- Needs to edit a paper but only has the camera-ready PDF
- Says "convert this PDF to LaTeX"
- Any variation of "pdf2tex", "pdf to tex", "pdf转tex/LaTeX"
## Workflow
### Phase 1: Quick Assessment
Before extraction, note what CAN and CANNOT be recovered:
**Recoverable:**
- Text content and paragraph structure
- Section headings and hierarchy
- Math expressions (most, not all)
- Table structure and cell content
- Figure placement and captions
- Citation keys and reference text
- Document class and packages used (from PDF metadata)
**Not reliably recoverable:**
- Exact macros and custom commands
- Original `\newcommand` definitions
- Source-level formatting choices (exact `\vspace` values)
- Comment text (stripped during compilation)
- Input file structure (`\input`, `\include` boundaries)
- Original bibliography database file
### Phase 2: Extract Content
Use Python with pymupdf (fitz) to extract structured content.
```python
import fitz
doc = fitz.open("paper.pdf")
# Extract metadata
meta = doc.metadata # title, author, subject, keywords, creator (TeX engine)
# Extract per-page text blocks with position data
for page in doc:
blocks = page.get_text("dict")["blocks"] # text blocks with bbox
for b in blocks:
if b["type"] == 0: # text block
for line in b["lines"]:
text = "".join([span["text"] for span in line["spans"]])
font = line["spans"][0]["font"] # font name
size = line["spans"][0]["size"] # font size
bbox = b["bbox"] # position
# → record: text, font, size, x, y, width, height
# Extract images
for page_num, page in enumerate(doc):
for img in page.get_images(full=True):
xref = img[0]
base_image = doc.extract_image(xref)
image_bytes = base_image["image"]
ext = base_image["ext"] # png, jpeg, etc.
# → save as figure_<page>_<xref>.{ext}
```
Also run `pdffonts paper.pdf` (from poppler) to list all fonts used — this helps identify:
- `CM*` / `LMRoman*` → Computer Modern / Latin Modern → likely standard LaTeX
- `Times*` → txfonts/mathptmx
- `Helvetica*` → helvet package or sans-serif sections
- `Courier*` → ttfamily sections
- Custom font names → `\setmainfont` with xelatex/lualatex
**Check for TeX engine:**
- Look in PDF metadata Creator field: "LaTeX with hyperref" / "XeTeX" / "LuaTeX" / "pdfTeX"
- Also check `pdffonts` output: Type 1 fonts → pdflatex; TrueType/OpenType → xelatex/lualatex
### Phase 3: Analyze Structure
Consult `references/structure-detection.md` for heuristics.
Determine these structural elements:
**Document class (educated guess):**
- Single-column, 10-12pt, standard margins → `article`
- Two-column, conference-style → `IEEEtran` or conference class
- Large margins, title block → `amsart`
- Check metadata Creator for clues about the class file
**Section hierarchy:**
- Largest fonts (bold) at top of page → `\section{}`
- Smaller bold fonts → `\subsection{}`
- Numbered vs unnumbered (detect from prefix patterns: "1.", "I.", "A.")
**Paragraph breaks:**
- Vertical gaps between text blocks → paragraph break
- First-line indent → continuation of same paragraph
**Math expressions:**
- Fonts named "CMMI*" or "CMSY*" → inline/display math
- Isolated text blocks with special fonts → equation environment
- Consult `references/math-reconstruction.md` for conversion heuristics
**Tables:**
- Grid-aligned text blocks with rules → table
- Alternating fills/colors → likely booktabs table
- Consult `references/table-reconstruction.md`
**Figures:**
- Image blocks with nearby text → `\includegraphics` with `\caption`
- Position gives float placement hints
**Citations:**
- Text matching `[<number>]` or `(<Author>, <Year>)` → `\cite{...}` (key must be regenerated)
- Search for text blocks containing "References" or "Bibliography" at end
**Footnotes:**
- Small text at bottom of page, separated by a short rule
- May have superscript marker in body text
### Phase 4: Reconstruct LaTeX
Based on the extracted structure, build the .tex file.
**Preamble construction:**
```latex
\documentclass[<options>]{<detected-class>}
% Font packages (inferred from pdffonts)
\usepackage[T1]{fontenc}
\usepackage{lmodern}
% Math packages (standard for detected math)
\usepackage{amsmath, amssymb, amsthm}
% Figure/graphics
\usepackage{graphicx}
\usepackage[<detected-options>]{hyperref}
% Bibliography
\usepackage[<detected-style>]{natbib} % or biblatex
```
**Content conversion rules:**
| PDF element | LaTeX output |
|---|---|
| Bold, large text (section heading) | `\section{<text>}` |
| Bold, medium text | `\subsection{<text>}` |
| Regular paragraph text | Paragraph text (blank line between) |
| Inline math font text | `$<text>$` |
| Display math block | `\begin{equation}...\end{equation}` |
| Table structure | `\begin{tabular}...\end{tabular}` |
| Figure + caption | `\begin{figure}...\includegraphics...\caption{...}` |
| Reference section | `\begin{thebibliography}...` |
| Footnote | `\footnote{<text>}` |
| Itemized text | `\begin{itemize}\item ...\end{itemize}` |
| Enumerated text | `\begin{enumerate}\item ...\end{enumerate}` |
**Image handling:**
- Extract all images to `figures/` directory
- Name as `figure_<page>_<num>.{ext}`
- Use `\includegraphics[width=\textwidth]{figures/figure_<page>_<num>.{ext}}`
**Table reconstruction:**
- Extract cell boundaries and text content
- Generate `\begin{tabular}` with appropriate column spec
- Use `\toprule`, `\midrule`, `\bottomrule` (booktabs) for professional look
- For multi-row/column cells, flag for manual review
**Math reconstruction:**
- Unicode characters (α, β, ∫, ∑) → LaTeX commands (`\alpha`, `\beta`, `\int`, `\sum`)
- Fractions, superscripts, subscripts → appropriate LaTeX
- Complex notation (matrices, cases, aligned) → appropriate environments
- Refer to `references/math-reconstruction.md` for detailed mapping
**Bibliography reconstruction:**
- Extract reference list text from final section
- Generate `\begin{thebibliography}{99}` with `\bibitem{refX}`
- In body text, replace citation placeholders with `\cite{refX}`
- If author-year detected, use `\bibitem[Author(Year)]{refX}` format
### Phase 5: Post-Processing
**Cleanup passes:**
1. Remove duplicated text (PDF extraction sometimes duplicates headers/footers)
2. Strip running headers and page numbers from body text
3. Join hyphenated words at line breaks (if broken across lines in PDF)
4. Fix ligatures: Unicode U+FB01→"fi", U+FB02→"fl", U+FB03→"ffi", etc.
5. Normalize whitespace and line breaks
**Smart refinements:**
- Detect wide tables → switch to `\begin{table*}` in two-column documents
- Detect algorithm/ pseudocode blocks → wrap in `\begin{algorithm}`
- Look for "Theorem", "Lemma", "Definition" patterns → `\begin{theorem}` etc.
- Detect "Proof." → `\begin{proof}...\end{proof}`
### Phase 6: Verification
1. Write `paper_reconstructed.tex` to disk
2. Compile with detected engine:
```bash
pdflatex -interaction=nonstopmode paper_reconstructed.tex
# or: xelatex / lualatex
```
3. If errors → fix and recompile (use latex-rescue workflows)
4. If text quality is rough → suggest latex-polish for the reconstructed text
5. Compare reconstruction with original:
- Check page count matches
- Check that all sections exist
- Check that references resolve
6. Report what was recovered and what needs manual attention
### Phase 7: Report
```
=== PDF → LaTeX Reconstruction ===
**Document**: <class>, <pages> pages
**Engine detected**: <pdflatex/xelatex/lualatex>
**Recovered**:
- <N> sections / <M> subsections
- <K> equations / math blocks
- <T> tables
- <F> figures (<X> embedded images extracted)
- <C> citations
- <W> words of body text
**Needs manual review**:
- [list specific items — tables with merged cells, custom macros, etc.]
- [estimated time to fix]
**Files created**:
- paper_reconstructed.tex (main source)
- figures/ (extracted images)
```
## Guardrails
**NEVER:**
- Claim perfect reconstruction — always note what needs manual checking
- Invent content not present in the PDF (fill gaps with `% [FIXME: ...]`)
- Use `\input` or split files unless explicitly detected
- Drop content because it's "too complex" — flag it with `% [REVIEW: ...]` instead
- Modify the meaning of any extracted text, even if it appears to be an error
**ALWAYS:**
- Preserve the original section ordering and numbering
- Keep mathematical notation exactly as it appears in the PDF
- Generate compilable LaTeX — the user should be able to run `pdflatex` immediately
- Mark uncertain constructions with `% [UNCERTAIN: description]`
- Leave placeholder cite keys that are easy to find-and-replace later
**AFTER RECONSTRUCTION:**
- If the reconstructed .tex has compilation errors, suggest running `/latex-rescue` to fix them
- If the user wants to polish the reconstructed text, suggest `/latex-polish`
- If the user needs to format for a specific venue, suggest `/latex-fmt`
**BOUNDARY CASES:**
- "Scanned PDF" (image-based) → explain that OCR is needed (tesseract), not standard extraction. See `references/pdf-extraction-guide.md` for OCR fallback instructions.
- "Corrupted PDF" → extract what you can, note what's missing
- "Encrypted/restricted PDF" → ask user to remove restrictions first
- "Huge PDF" (100+ pages) → ask whether to extract all or specific sections
## Reference Files
- **`references/pdf-extraction-guide.md`** — Detailed pymupdf/pdfplumber API reference and extraction recipes
- **`references/structure-detection.md`** — Heuristics for detecting document structure from PDF blocks
- **`references/math-reconstruction.md`** — Unicode/PDF math glyphs → LaTeX command mapping
- **`references/table-reconstruction.md`** — Table extraction and tabular environment generationNo comments yet. Be the first to comment!