How to Convert a PDF to a Text File

Extracting text from a PDF is a common need: feeding document content into language models or text analysis tools, importing text into a writing application for editing, pulling data into a spreadsheet, archiving content for full-text search, or simply getting the words into a format that other applications can use.

How you extract the text depends on what type of PDF you have. Searchable PDFs contain actual text data embedded in the file: you can select and copy text in a PDF viewer, so extraction is straightforward. Scanned PDFs are essentially photographs of pages with no embedded text; extracting text from these requires OCR (Optical Character Recognition) software that reads the image and identifies characters. This guide covers both scenarios, with practical methods ranging from browser-based tools to command-line options, along with advice for difficult cases such as complex layouts and multiple columns.

Method 1: PDF to Word, then Save as Text

For searchable PDFs, converting to Word format first and then saving as plain text from Word is a reliable two-step approach. LazyPDF's PDF-to-Word converter extracts the text with basic formatting preserved (headings, paragraphs, basic tables) into a .docx file. Once in Word or Google Docs, you can save or export as .txt (plain text) using File → Save As → Plain Text, or copy all the text and paste into a text editor like Notepad or TextEdit. This method works well for most business documents, reports, and articles. The Word conversion step preserves paragraph structure better than direct PDF-to-text extraction, which helps maintain readable document organization. Complex layouts like multi-column magazine-style text or tables with many merged cells may not extract with perfect structure, but the text content will be present and readable.

1. Upload your searchable PDF to LazyPDF's pdf-to-word tool.
2. Download the converted .docx file and open it in Word or Google Docs.
3. Review the extracted text for any formatting issues and correct them.
4. Save as plain text (.txt) via File → Save As, or copy and paste into a text editor.
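
If you would rather script the final step, a .docx file is a ZIP archive whose main text lives in word/document.xml, so the plain text can be pulled out with Python's standard library alone. The sketch below is a minimal illustration and not how Word or LazyPDF performs the export. The function name docx_to_text is made up for this example, and it ignores headers, footers, footnotes, and text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W_NS + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W_NS + "t":        # a run of text
                parts.append(node.text or "")
            elif node.tag == W_NS + "tab":    # tab character
                parts.append("\t")
            elif node.tag in (W_NS + "br", W_NS + "cr"):  # line break
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)


def save_as_text(docx_path, txt_path):
    """Write the extracted text to a UTF-8 .txt file."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path))
```

Calling save_as_text("report.docx", "report.txt") (both file names are placeholders) would write one line per paragraph, which is usually enough for search indexing or text analysis.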

Method 2: OCR for Scanned PDFs

Scanned PDFs (documents created by scanning physical paper) contain page images with no embedded text. Standard PDF-to-text tools produce nothing useful from these files because there is no text data to extract. OCR is required to analyze the image and identify the characters, words, and paragraphs visible in the scan. LazyPDF's OCR tool uses Tesseract, an open-source OCR engine originally developed at HP, later sponsored by Google, and now maintained as a community open-source project. It produces accurate results for clean, clearly printed documents with standard fonts. Upload your scanned PDF, select the language, and LazyPDF will analyze each page and return the recognized text, which you can copy into a text editor or save as needed. OCR accuracy is excellent for clearly printed text at 300 DPI or higher, and decreases for poor-quality scans, handwriting, unusual fonts, or very small text.
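
To run the same kind of OCR locally, one common approach is to render each page to an image and pass the images to Tesseract's command-line tool. The sketch below assumes that Poppler's pdftoppm and the tesseract executable are installed and on your PATH. The function names and the "page" file prefix are illustrative, and this is not a description of LazyPDF's internal pipeline.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Command that renders each PDF page to a PNG with Poppler's pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Command that OCRs one page image and writes the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a name like page-07.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, work_dir, lang="eng", dpi=300):
    """Rasterize a scanned PDF and return the recognized text of all pages."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, work_dir / "page", dpi), check=True)

    pages = []
    # Sort numerically so page 10 does not land before page 2.
    for image in sorted(work_dir.glob("page-*.png"), key=page_number):
        result = subprocess.run(ocr_cmd(image, lang), check=True,
                                capture_output=True, text=True,
                                encoding="utf-8")
        pages.append(result.stdout)
    return "\n".join(pages)
```

A call such as ocr_pdf("scan.pdf", "ocr_work") (placeholder names) renders at 300 DPI by default, matching the resolution recommended above.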

Handling Complex Layouts

Multi-column layouts are the most common challenge in PDF text extraction. Academic papers, newspaper articles, and magazine content often use two or three columns of text per page. When extracted naively, these read incorrectly: the first line of every column is concatenated before the second line of any column, producing garbled output. Some PDF-to-text tools include layout analysis that attempts to identify column boundaries and extract text in reading order, but results vary. For the highest accuracy on complex layouts, consider extracting text from a clearly formatted single-column version of the document if one is available (such as the HTML version of an academic paper on the publisher's website). For scanned multi-column documents, the OCR engine's page segmentation mode matters (Tesseract's --psm parameter controls this): a mode that matches the page layout produces better results than one that assumes a single block of text.
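
The interleaving problem is easy to reproduce with a toy example. The sketch below assumes you already have words with page coordinates (as many extraction and OCR tools can provide) and that the columns are equally wide; both are simplifying assumptions, and real layout analysis is considerably more involved.

```python
from collections import defaultdict


def naive_order(words):
    """Sort by line (top to bottom), then left to right: interleaves columns."""
    return [w for _, _, w in sorted(words, key=lambda t: (t[1], t[0]))]


def column_order(words, page_width, columns=2):
    """Assign each word to a column by x position, then read column by column."""
    col_width = page_width / columns
    by_col = defaultdict(list)
    for x, y, w in words:
        col = min(int(x // col_width), columns - 1)
        by_col[col].append((y, x, w))
    ordered = []
    for col in sorted(by_col):
        ordered.extend(w for _, _, w in sorted(by_col[col]))
    return ordered


# Each tuple is (x, y, word); y grows downward in this toy example.
words = [
    (10, 0, "Alpha"), (60, 0, "Gamma"),
    (10, 10, "Beta"), (60, 10, "Delta"),
]
print(" ".join(naive_order(words)))        # Alpha Gamma Beta Delta
print(" ".join(column_order(words, 100)))  # Alpha Beta Gamma Delta
```

The first line of output shows the garbled order described above; the second reads the left column fully before moving to the right one.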

Improving OCR Accuracy

If your OCR results contain errors, several factors may be contributing. Scan quality is the most important one: a resolution of at least 300 DPI, a clean white background, no skew, and no dark borders from the scanner lid all improve accuracy significantly. If you can re-scan the document, doing so at a higher resolution and with better lighting will produce better OCR than any post-processing. For existing poor-quality scans, some pre-processing before OCR can help: deskewing (straightening a skewed scan), binarization (converting to pure black and white), and denoising (removing scanner artifacts). These operations are available in image editing software (GIMP, ImageMagick) and some dedicated document scanning apps. After pre-processing, run OCR again and compare results. For handwritten content, Tesseract's accuracy is limited; specialized handwriting recognition tools or cloud OCR services (Google Cloud Vision, Amazon Textract) perform better.
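
To make binarization concrete, the sketch below picks a black/white cutoff with Otsu's method, one widely used thresholding technique (chosen here for illustration; the tools named above may use other methods). It works on a flat list of 0-255 grayscale values rather than an image file, so it shows the idea only and is not a replacement for a real image library.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that best separates dark and light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    sum_bg = 0      # sum of intensities at or below the candidate threshold
    weight_bg = 0   # number of pixels at or below the candidate threshold
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue            # no dark pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break               # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Dark ink around 30-40, light paper around 200-220.
scan = [30, 40, 35, 200, 210, 220]
cutoff = otsu_threshold(scan)
print(binarize(scan, cutoff))   # [0, 0, 0, 255, 255, 255]
```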

Frequently Asked Questions

How do I know if my PDF is searchable or a scan?

Try to select and copy text in your PDF viewer. If you can highlight individual words and copy them to the clipboard, the PDF contains embedded text and is searchable. If you can't select text at all, or if you can only select the entire page as an image, the PDF is a scan with no embedded text. Another test: search for a word you can see in the document using Ctrl+F (Cmd+F on macOS). If the search finds the word, the PDF is searchable; if it finds nothing, the PDF is image-based.
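
The same check can be automated when you have many files to sort. The sketch below assumes Poppler's pdftotext is installed and uses a rough heuristic: if extraction yields almost no characters per page, the file is probably a scan. The 50-character cutoff and the function names are arbitrary choices for illustration, not established thresholds.

```python
import subprocess


def looks_searchable(extracted_text, min_chars_per_page=50):
    """Guess whether text extracted from a PDF came from a real text layer."""
    # pdftotext ends each page with a form feed, so splitting on it gives pages.
    pages = extracted_text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty chunk after the final form feed
    if not pages:
        return False
    # Count non-whitespace characters only.
    chars = sum(len("".join(p.split())) for p in pages)
    return chars / len(pages) >= min_chars_per_page


def extract_text(pdf_path):
    """Run pdftotext and return its output ("-" sends the text to stdout)."""
    result = subprocess.run(["pdftotext", str(pdf_path), "-"],
                            check=True, capture_output=True,
                            text=True, encoding="utf-8")
    return result.stdout
```

With these two helpers, looks_searchable(extract_text("input.pdf")) (placeholder file name) returning False suggests the file should go through OCR instead.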

How accurate is OCR for old or poor-quality scanned documents?

OCR accuracy for clean, modern printed documents at 300 DPI is typically 97–99% — meaning roughly 1–3 errors per 100 characters. For older documents with faded ink, yellowed paper, or damaged pages, accuracy can drop to 80–90%, requiring significant manual correction. For very old documents with obsolete typefaces or archaic spelling, accuracy may be lower still. For important documents requiring high accuracy, plan for a proofreading pass after OCR extraction.
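
To put those percentages in perspective, a page of about 2,000 characters (an assumed figure) at 97 to 99% accuracy contains roughly 20 to 60 wrong characters. If you have a hand-corrected sample, you can estimate accuracy for your own scans by comparing it with the OCR output. The sketch below uses Levenshtein edit distance for that comparison, which is a common way to compute a character error rate; the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Fraction of reference characters the OCR got right (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


def expected_errors(char_count, accuracy):
    """Approximate number of wrong characters at a given accuracy."""
    return round(char_count * (1 - accuracy))


print(expected_errors(2000, 0.99))  # 20
print(expected_errors(2000, 0.97))  # 60
```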

Can I extract text from a password-protected PDF?

If a PDF has a user password (required to open the file), you must enter that password before any text extraction can occur. If you have the password and the document is otherwise a normal searchable PDF, any PDF-to-text tool can extract the text once the password is entered. If the PDF has content restrictions (an owner password that disallows copying), the file still opens normally; the restriction is a permission setting that software chooses whether to honor, so some tools will still extract the text. If you own the document and need to remove restrictions to extract text, LazyPDF's unlock tool can help if you have the appropriate access rights.
