How to Convert PDF to Markdown Format
Markdown has become the lingua franca of technical documentation, developer blogs, GitHub repositories, static site generators, and modern note-taking applications. Its plaintext nature, portability, and version-control friendliness make it the format of choice for content that developers and technical writers create and maintain over time. But substantial content often arrives as PDFs (research papers, technical specs, legacy documentation, vendor guides), and converting that content into clean Markdown lets it be integrated into modern workflows.
Converting a PDF to Markdown is not a simple one-click operation. PDFs store visual formatting (bold, headers, lists) as spatial and style properties, not as explicit semantic markup. Turning those visual properties into correct Markdown syntax (# for headers, ** for bold, - for lists) requires either intelligent software or a combination of automated conversion and manual cleanup.
This guide explains the practical approaches to PDF-to-Markdown conversion, from converting through Word as an intermediate format to using dedicated tools and Python libraries. You will learn how to handle common conversion challenges including multi-column layouts, tables, code blocks, and mathematical notation, and how to produce clean Markdown that works well in your target system.
Why Convert PDF to Markdown?
Understanding the use cases helps choose the right conversion approach and manage expectations about the result.
- Documentation maintenance: Technical documentation distributed as PDFs needs to be maintained and updated. Converting to Markdown allows the documentation to live in a Git repository, be collaboratively edited through pull requests, and be published through tools like MkDocs, Docusaurus, or GitBook.
- Static site generators: Jekyll, Hugo, Gatsby, Eleventy, and similar static site generators use Markdown as the primary content format. Importing PDF content into these systems requires Markdown conversion.
- Note-taking and knowledge management: Obsidian, Logseq, Notion (via Markdown import), and Bear use Markdown natively. Importing PDF research papers or documents into these systems as Markdown enables linking, tagging, and search within the note-taking ecosystem.
- AI and LLM workflows: Large language model processing pipelines often work best with plaintext or Markdown inputs rather than PDFs. Converting documents to Markdown can improve the quality of AI-generated summaries, questions and answers, and content transformations.
- Version control: Markdown files are diff-friendly in a way PDFs are not. Git can show exactly what changed between versions of a Markdown document, which is essential for collaborative technical documentation.
The PDF-to-Word-to-Markdown Route
The most reliable conversion path for most PDFs goes: PDF → Word (DOCX) → Markdown. This two-step approach uses well-established conversion tools at each stage and allows inspection and correction at the intermediate DOCX stage.
Step 1: Convert PDF to Word using LazyPDF's PDF to Word tool. This extracts the text with basic formatting preserved (headings, bold, lists) as Word styles, which is a better foundation for Markdown conversion than raw plaintext.
Step 2: Convert the DOCX to Markdown. Pandoc (free, open source, command-line) is the gold standard for this conversion: pandoc input.docx -o output.md. Pandoc handles heading levels, bold, italic, lists, links, and tables remarkably well. For a GitHub-compatible Markdown flavor: pandoc input.docx -t gfm -o output.md (GFM = GitHub Flavored Markdown).
Alternatives for the DOCX-to-Markdown step: mammoth.js (a JavaScript library focused on producing clean HTML from DOCX, which can be further converted to Markdown); MarkdownFromDocx (a Python utility); or the Writage plugin for Word (adds a native Markdown save option to Word).
After the two-step conversion, always review the output in a Markdown editor (VS Code with the Markdown Preview extension, Typora, or online at dillinger.io). Check heading levels, bold/italic, list indentation, and table formatting.
1. Upload the PDF to LazyPDF's PDF to Word tool and download the resulting .docx file
2. Open the DOCX in Word or Google Docs to review the content and fix any obvious extraction errors
3. Install Pandoc from pandoc.org if not already installed (free, cross-platform)
4. Run: pandoc input.docx -t gfm -o output.md in terminal/command prompt
5. Open the resulting .md file in a Markdown editor (VS Code, Typora) and review structure, headings, and tables
6. Fix common issues: duplicate blank lines, incorrect heading levels, broken table formatting, and code blocks that need triple-backtick fencing
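When several DOCX files need the same treatment, the Pandoc step above can be scripted. The following is a minimal sketch, assuming Pandoc is installed and on the PATH; the function names build_pandoc_command and convert_folder are hypothetical helpers introduced for illustration, and the optional --extract-media flag (a standard Pandoc option) is an addition not mentioned in the steps above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path, md_path, flavor="gfm", media_dir=None):
    """Return the argument list for a DOCX-to-Markdown pandoc call."""
    command = ["pandoc", str(docx_path), "-t", flavor, "-o", str(md_path)]
    if media_dir is not None:
        # --extract-media writes images embedded in the DOCX into this folder.
        command.append(f"--extract-media={media_dir}")
    return command


def convert_folder(folder):
    """Convert every .docx file in `folder`, writing .md files beside them."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH; install it from pandoc.org")
    converted = []
    for docx_path in sorted(Path(folder).glob("*.docx")):
        md_path = docx_path.with_suffix(".md")
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(build_pandoc_command(docx_path, md_path), check=True)
        converted.append(md_path)
    return converted
```

Calling convert_folder("docs") would run the same command as the manual step once per file. Building the argument list separately from running it keeps the command easy to inspect before anything is executed.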
Direct PDF to Markdown Tools
Several tools attempt direct PDF-to-Markdown conversion without the Word intermediate step.
- PDF2MD (open source, Python): A Python library specifically designed for PDF-to-Markdown conversion. It attempts to detect heading structure from font sizes, identifies tables, and preserves list formatting. Install with pip install pdf2md.
- Mathpix Snip: While primarily a tool for extracting mathematical notation from images and PDFs, Mathpix has a PDF-to-Markdown mode that handles mathematical equations in LaTeX notation, which is essential for academic papers with formulas. It is a commercial product with a free tier.
- Unstructured.io: An open-source Python library designed for processing documents (PDF, DOCX, HTML) into structured output for AI/LLM pipelines. It can output Markdown-formatted content.
- Amazon Textract and Google Cloud Document AI: Cloud services that extract text and structure from PDFs. Both can output structured content that, with some post-processing, produces reasonable Markdown. They are more appropriate for batch processing at scale than for individual document conversion.
- LLM-based conversion: Sending PDF content (extracted via PyPDF2, its successor pypdf, or similar) to GPT-4 or Claude with a prompt to 'reformat this content as clean Markdown' can produce surprisingly good results for well-structured text, especially when you want the LLM to also clean up OCR artifacts, normalize spacing, and fix formatting issues that automated tools miss.
- GROBID (open source, Java): For academic papers, GROBID is a machine learning tool that extracts structured content from scientific PDFs, including section headings, abstract, references, and body text, in TEI XML format, which can be converted to Markdown.
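To make the idea of detecting heading structure from font sizes concrete, here is a minimal sketch of that heuristic. It is not the implementation of any tool named above. It assumes the text lines and their font sizes have already been extracted from the PDF, and the function name lines_to_markdown and the (text, font_size) input shape are hypothetical.

```python
from collections import Counter


def lines_to_markdown(lines, max_levels=3):
    """Convert (text, font_size) pairs to Markdown, using font size as a heading cue.

    The most common font size is treated as body text. Larger sizes become
    headings: the largest maps to '#', the next to '##', capped at max_levels.
    """
    if not lines:
        return ""
    sizes = [round(size, 1) for _, size in lines]
    body_size = Counter(sizes).most_common(1)[0][0]
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_for_size = {s: min(i + 1, max_levels) for i, s in enumerate(heading_sizes)}

    output = []
    for (text, _), size in zip(lines, sizes):
        text = text.strip()
        if not text:
            continue
        level = level_for_size.get(size)  # None for body-sized or smaller text
        output.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(output)
```

A heuristic like this fails in the same way the tools do: a PDF that uses one font size throughout, or that enlarges text for emphasis rather than structure, will produce missing or spurious headings, so the result still needs review.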
Handling Complex PDF Elements in Markdown
Several PDF elements require special handling during Markdown conversion:
- Multi-column layouts: Academic papers and magazines typically have two or three columns of text per page. Many conversion tools cannot determine the correct reading order across columns, producing output that interleaves text from different columns. Best approach: use a tool like pdfplumber to extract text from each column separately based on x-coordinate boundaries, then concatenate in the correct reading order before converting to Markdown.
- Tables: Complex tables with merged cells or nested headers are difficult to represent in standard Markdown table syntax (which only supports simple grids). For these, HTML tables within the Markdown file may be the best option, since GitHub Flavored Markdown and most static site generators render HTML within Markdown. Alternatively, simplify complex tables to their most important data for the Markdown version.
- Code blocks: If the PDF contains code examples, identify them during review and add triple-backtick fencing with the language identifier: ```python. Code in PDFs often loses indentation and line breaks during conversion, and these need manual correction to produce runnable code examples.
- Images and figures: Images embedded in PDFs need to be extracted separately and saved as image files, then referenced in Markdown with image syntax: ![Alt text](path/to/image.png). PDF-to-Word conversion may extract images into the DOCX, from which Pandoc creates image references.
- Footnotes and endnotes: Academic papers have extensive footnotes. Pandoc generally handles these when converting from DOCX, producing footnote syntax [^1]: Note text. Review footnote conversion carefully, as footnote numbering can get displaced during conversion.
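The column-splitting approach described for multi-column layouts can be sketched as follows. This is a minimal illustration, assuming a two-column page and a known boundary x-coordinate. The input is a list of word dictionaries with "text", "x0", and "top" keys, a shape modeled on what pdfplumber's extract_words() is understood to return; the function name words_in_reading_order and the line_tolerance parameter are hypothetical.

```python
def words_in_reading_order(words, column_boundary, line_tolerance=3.0):
    """Return the text of a two-column page, left column first.

    `words` is a list of dicts with "text", "x0", and "top" keys.
    `column_boundary` is the x-coordinate that separates the two columns.
    """
    left = [w for w in words if w["x0"] < column_boundary]
    right = [w for w in words if w["x0"] >= column_boundary]

    def column_lines(column):
        lines = []  # each entry is the list of words on one visual line
        for word in sorted(column, key=lambda w: w["top"]):
            # Words whose top edge is close to the line's first word share a line.
            if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tolerance:
                lines[-1].append(word)
            else:
                lines.append([word])
        return [
            " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
            for line in lines
        ]

    # Read the whole left column before starting the right column.
    return "\n".join(column_lines(left) + column_lines(right))
```

In practice the boundary is often near the horizontal midpoint of the page, but full-width elements such as titles, abstracts, and wide figures break that assumption and need to be handled separately.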
Cleaning and Polishing Converted Markdown
Automated conversion always produces output that needs review and cleanup. The quality of the cleanup determines whether the resulting Markdown is genuinely useful or just a rough approximation.
Common cleanup tasks for converted Markdown:
- Remove excessive blank lines (more than 2 consecutive blank lines is usually unintentional).
- Fix heading hierarchy. Conversion often produces H1/H2/H3 levels that do not match the document's logical structure.
- Remove page headers and footers that were extracted as text and ended up as content.
- Remove page numbers that appear as isolated numbers between paragraphs.
- Fix word-wrap artifacts where PDF line breaks become Markdown line breaks in the middle of sentences (Markdown treats a single newline as a space, but hard line breaks in paragraphs look messy and cause issues in some renderers).
- Table cleanup: Regenerate any complex tables that converted incorrectly. The most reliable approach for important tables is to re-create them manually in Markdown table syntax, copying the values directly from the PDF.
- Link preservation: PDFs may have hyperlinks that survived conversion to Word and then to Markdown. Verify that important links are correctly formatted in Markdown: [Link text](https://url.com). Non-functioning links may need manual correction.
For large documents, create a cleanup checklist and work through it systematically. Trying to clean as you go is slower than doing a single pass for each type of issue (first pass: remove headers/footers; second pass: fix heading levels; third pass: fix tables; fourth pass: review links).
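Some of the mechanical cleanup tasks can be automated before the manual passes. The sketch below is one possible approach, not a complete cleaner: it collapses runs of blank lines, drops lines that contain only a page number, and joins hard-wrapped paragraph lines, while leaving fenced code untouched. The function names and the regular expressions are assumptions chosen for illustration, and the page-number rule will also drop any legitimate line that consists only of a short number.

```python
import re

FENCE = "`" * 3  # triple-backtick code-fence marker
# A line holding only a number, optionally prefixed by "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# Headings, list items, block quotes, and table rows keep their own line.
STRUCTURAL = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+[.)]\s|>|\|)")


def is_structural(line):
    """True for lines that must not be merged into a neighbouring paragraph."""
    return (
        line.startswith(("    ", "\t"))  # indented code
        or line.lstrip().startswith(FENCE)
        or bool(STRUCTURAL.match(line))
    )


def clean_markdown(text):
    cleaned = []
    in_fence = False
    for raw_line in text.splitlines():
        if raw_line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            cleaned.append(raw_line)
            continue
        if in_fence:
            cleaned.append(raw_line)  # never alter code inside a fence
            continue
        line = raw_line.rstrip()  # note: this also drops trailing-space hard breaks
        if PAGE_NUMBER.match(line):
            continue
        if not line:
            # Keep at most one blank line between blocks.
            if cleaned and cleaned[-1] != "":
                cleaned.append("")
            continue
        previous = cleaned[-1] if cleaned else ""
        if previous and not is_structural(previous) and not is_structural(line):
            cleaned[-1] = previous + " " + line  # join a hard-wrapped paragraph
        else:
            cleaned.append(line)
    return "\n".join(cleaned).strip() + "\n"
```

The sketch is deliberately conservative: continuation lines after a list item are left alone, and constructs it does not recognize, such as underline-style headings or horizontal rules placed directly against a paragraph, may be merged incorrectly. Running it on a copy and diffing the result is a reasonable precaution.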
Frequently Asked Questions
What is the best free tool to convert PDF to Markdown?
The most reliable free approach is Pandoc with a DOCX intermediate: convert PDF to Word using LazyPDF (free), then use Pandoc (free, command-line) to convert DOCX to Markdown. Pandoc is the gold standard for document format conversion and produces cleaner Markdown from DOCX than most alternatives. For a single-step free tool, pdf2md (Python library) or the online tool PDF2Markdown.com are options, though output quality varies with document complexity.
Does PDF to Markdown conversion preserve formatting like bold and headers?
Heading levels and bold/italic formatting are usually preserved when the source PDF was created from a properly formatted Word or InDesign document with consistent styles. PDFs where formatting was applied manually (text enlarged and bolded without using heading styles) are less reliable — the conversion tool may not recognize manual formatting as structural headings. Going through a DOCX intermediate gives you the opportunity to fix heading style assignments before the final Markdown conversion.
How do I convert a PDF with LaTeX equations to Markdown?
Standard PDF-to-Markdown tools do not recognize mathematical notation and will output garbled text or skip equations entirely. Mathpix Snip is the most widely used tool for this use case — it recognizes equations in PDFs and outputs LaTeX markup that can be embedded in Markdown as $...$ (inline) or $$...$$ (display math). For academic papers, GROBID and specialized academic paper extraction tools have better equation handling than general-purpose PDF converters.
Can I convert a scanned PDF to Markdown?
Yes, but with extra steps and lower accuracy. First apply OCR to create a searchable text layer (using LazyPDF's OCR tool, Adobe Acrobat, or Tesseract). The OCR creates a PDF with embedded text. Then convert that text-layer PDF through the PDF-to-Word-to-Markdown pipeline. Accuracy is limited by OCR quality — handwriting and degraded documents will produce poor results. For academic content, check if a digital version (HTML or EPUB) exists at the publisher's site before attempting OCR-based PDF conversion.
How do I handle images in PDF-to-Markdown conversion?
Images embedded in PDFs need to be extracted as separate image files and then referenced from the Markdown document. When converting PDF to DOCX using LazyPDF, images are typically embedded in the DOCX. When Pandoc converts to Markdown, it extracts these images to a media folder and creates image references automatically. Review all images: some may be formatting artifacts (borders, decorative elements) that should be removed, while others are content images that need descriptive alt text added.
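Finding the images that still need alt text can be partly automated. The following is a small sketch, assuming the images are written in standard Markdown image syntax; the function name images_missing_alt_text is hypothetical. Converters may emit some images as HTML img tags instead, and this check would not see those.

```python
import re

# Matches ![alt](target) and ![alt](target "title") on a single line.
IMAGE_REF = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)[^)]*\)")


def images_missing_alt_text(markdown_text):
    """Return (line_number, target) pairs for images whose alt text is empty."""
    missing = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in IMAGE_REF.finditer(line):
            if not match.group("alt").strip():
                missing.append((line_number, match.group("target")))
    return missing
```

The returned line numbers make it straightforward to jump to each image in an editor, decide whether it is decorative and should be removed, and write alt text for the rest.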