How to Extract Text from PDF Files with Google Apps Script

Google Apps Script can extract text from PDF files stored in Google Drive, either by converting them to Google Docs (which runs Optical Character Recognition, or OCR, through the Drive API) or by calling an external API. This guide covers manual and automated methods to extract text from PDFs in Google Drive.

✅ Methods to Extract Text from PDFs

🔹 1️⃣ Using Google Drive’s Built-in OCR (Manual Method)

🔹 2️⃣ Using Google Apps Script + Google Drive API (Automated OCR Extraction)

🔹 3️⃣ Using Apps Script to Extract Text from Non-Scanned PDFs

🔹 4️⃣ Using Google Docs for Manual PDF to Text Conversion

🔹 5️⃣ Using Third-Party APIs (For Complex PDFs with Tables & Images)

Each method serves a different purpose, so let’s explore them in detail.

📌 1️⃣ Manual Method: Convert PDF to Text Using Google Drive’s OCR

Google Drive has a built-in OCR (Optical Character Recognition) feature that extracts text from PDFs.

Steps:

Upload the PDF to Google Drive.
Right-click the PDF → Open with → Google Docs.
Google Docs will convert the PDF to editable text.
Copy & Save the extracted text.

📌 Pros: Simple, no coding needed.
📌 Cons: Does not preserve formatting (tables and images are lost).

🤖 2️⃣ Automated OCR Extraction Using Google Apps Script

This method automates PDF-to-text conversion using the Google Drive API, which Apps Script exposes as an advanced service.

🔹 Step 1: Enable Google Drive API

Open Google Apps Script (Extensions → Apps Script).
In the editor's left sidebar, click Services (+) → Drive API → Add. This enables the Drive advanced service for the project.
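
Enabling the service in the editor writes an entry to the project manifest, so the same setting can also be made by editing appsscript.json directly. The fragment below is a sketch under two assumptions: that the manifest is visible (Project Settings → Show "appsscript.json" manifest file in editor), and that v3 is the Drive API version you want. Keep any other keys your manifest already contains, such as timeZone.

```json
{
  "dependencies": {
    "enabledAdvancedServices": [
      {
        "userSymbol": "Drive",
        "serviceId": "drive",
        "version": "v3"
      }
    ]
  }
}
```

The userSymbol value is the name the script uses to call the service, for example Drive.Files.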

🔹 Step 2: Copy and Paste the Script

javascript
function extractTextFromPDF() {
  var folderId = "YOUR_FOLDER_ID"; // Folder where PDFs are stored
  var folder = DriveApp.getFolderById(folderId);
  var files = folder.getFilesByType(MimeType.PDF);

  while (files.hasNext()) {
    var file = files.next();
    var blob = file.getBlob();

    // Convert the PDF to a Google Doc; Drive runs OCR during the conversion.
    // Requires the Drive advanced service, v3 (Editor → Services → Drive API).
    // With v2, use Drive.Files.insert, { title: ... } and the { ocr: true } option.
    var resource = { name: file.getName(), mimeType: MimeType.GOOGLE_DOCS };
    var newFile = Drive.Files.create(resource, blob);

    // Extract text from the converted Google Docs file
    var doc = DocumentApp.openById(newFile.id);
    var text = doc.getBody().getText();

    Logger.log("Extracted text from " + file.getName() + ":\n" + text);

    // Optional: Save text as a new file
    DriveApp.createFile(file.getName() + ".txt", text, MimeType.PLAIN_TEXT);

    // Move the temporary Google Doc to the trash
    DriveApp.getFileById(newFile.id).setTrashed(true);
  }
}

🔹 Step 3: Run the Script

Click Run ▶ in Apps Script.
Grant permissions if asked.
It will extract text from PDFs in the specified folder and save it as a .txt file.

📌 Pros: Automated, works for multiple PDFs.
📌 Cons: May struggle with complex layouts (tables, images).
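
The script names each output by appending ".txt" to the full PDF name, so report.pdf is saved as report.pdf.txt. A minimal sketch of a helper that swaps the extension instead follows; toTextFileName is a hypothetical name chosen for illustration, not part of Apps Script.

```javascript
// Hypothetical helper: derive the .txt file name from a PDF file name.
function toTextFileName(pdfName) {
  // Strip one trailing ".pdf" (any letter case), then append ".txt".
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}
```

In the script, it would be called as DriveApp.createFile(toTextFileName(file.getName()), text, MimeType.PLAIN_TEXT). Names without a .pdf extension simply get ".txt" appended.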

📑 3️⃣ Extract Text from Non-Scanned PDFs

If the PDF already contains selectable text (not a scanned document), no OCR is needed and a PDF parsing API can read the text directly.

🔹 Step 1: Get Access to a PDF Parsing Service

Google Apps Script has no built-in PDF parser and cannot load Java libraries such as PDFBox, so use a third-party API like PDF.co and get an API key for it.

🔹 Step 2: Use a PDF Parsing API

javascript
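function extractTextFromNonScannedPDF() {
  var fileId = "YOUR_PDF_FILE_ID"; // Replace with your PDF file ID
  var file = DriveApp.getFileById(fileId);
  var blob = file.getBlob();

  var url = "https://api.pdf.co/v1/pdf/convert/to/text";

  var options = {
    method: "POST",
    headers: {
      "x-api-key": "YOUR_API_KEY" // Get an API key from PDF.co
    },
    // A payload object that contains a blob is sent as multipart/form-data.
    // Check PDF.co's documentation for the request fields this endpoint expects.
    payload: {
      file: blob
    },
    muteHttpExceptions: true // Return error responses instead of throwing
  };

  var response = UrlFetchApp.fetch(url, options);
  // With muteHttpExceptions on, check the status before trusting the body
  Logger.log("HTTP " + response.getResponseCode());
  Logger.log(response.getContentText());
}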

📌 Pros: Works for text-based PDFs.
📌 Cons: Requires third-party API (PDF.co).
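
Because the request sets muteHttpExceptions: true, UrlFetchApp.fetch returns error responses instead of throwing, so an error body can be logged as if it were extracted text. A minimal sketch of a status check follows. parseApiResponse is a hypothetical helper, and it assumes the API answers with a JSON body; the fields inside that body depend on the service, so check PDF.co's documentation for them.

```javascript
// Hypothetical helper: validate an HTTP response before using its body.
// Assumes the body is JSON; the fields inside depend on the API.
function parseApiResponse(statusCode, bodyText) {
  // Treat anything outside the 2xx range as a failed request.
  if (statusCode < 200 || statusCode >= 300) {
    throw new Error("Request failed with HTTP " + statusCode + ": " + bodyText);
  }
  try {
    return JSON.parse(bodyText);
  } catch (e) {
    throw new Error("Response was not valid JSON: " + bodyText);
  }
}
```

In the script, it would be called as parseApiResponse(response.getResponseCode(), response.getContentText()).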

📄 4️⃣ Convert PDF to Text Using Google Docs (Manual Method)

If you don’t want to use scripts, Google Docs can extract the text manually. This is the same conversion described in Method 1.

Steps:

Upload the PDF to Google Drive.
Right-click → Open with → Google Docs.
Google Docs converts the PDF into an editable text file.
Copy-paste or save it.

📌 Pros: No coding required.
📌 Cons: Manual process, loses formatting.

🛠 5️⃣ Using Third-Party APIs (For Complex PDFs with Tables & Images)

For structured data (tables, images, etc.), use external OCR APIs.

Popular APIs for PDF Text Extraction:

Google Cloud Vision API (cloud.google.com/vision)
Adobe PDF Extract API (developer.adobe.com)
PDF.co (pdf.co)

Example: Using Google Cloud Vision API

javascript
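function extractTextWithVisionAPI() {
  var fileId = "YOUR_PDF_FILE_ID";
  var file = DriveApp.getFileById(fileId);
  var blob = file.getBlob();

  var apiKey = "YOUR_GOOGLE_CLOUD_VISION_API_KEY";
  // PDFs go to files:annotate; images:annotate only accepts image files.
  var url = "https://vision.googleapis.com/v1/files:annotate?key=" + apiKey;

  var payload = {
    requests: [{
      inputConfig: {
        content: Utilities.base64Encode(blob.getBytes()),
        mimeType: "application/pdf"
      },
      // DOCUMENT_TEXT_DETECTION is tuned for dense text such as documents.
      features: [{ type: "DOCUMENT_TEXT_DETECTION" }]
      // Synchronous requests cover at most 5 pages; add a pages array
      // (for example pages: [1, 2]) to choose which ones.
    }]
  };

  var options = {
    method: "POST",
    contentType: "application/json",
    payload: JSON.stringify(payload)
  };

  var response = UrlFetchApp.fetch(url, options);
  Logger.log(response.getContentText());
}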

📌 Pros: Best for scanned PDFs with images.
📌 Cons: Requires Google Cloud API setup.
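
The Vision API returns JSON rather than plain text, with the recognized text in a fullTextAnnotation.text field. A minimal sketch that pulls the text out of a parsed response follows. extractVisionText is a hypothetical helper, and it assumes one of two response shapes: images:annotate (one result per request) or files:annotate (one nested result per page).

```javascript
// Hypothetical helper: collect recognized text from a parsed Vision response.
function extractVisionText(result) {
  var texts = [];
  (result.responses || []).forEach(function (item) {
    // files:annotate nests one response per page; images:annotate does not.
    var pages = item.responses || [item];
    pages.forEach(function (page) {
      if (page.error) {
        throw new Error("Vision API error: " + page.error.message);
      }
      if (page.fullTextAnnotation && page.fullTextAnnotation.text) {
        texts.push(page.fullTextAnnotation.text);
      }
    });
  });
  // Join pages with a newline; pages without text are skipped.
  return texts.join("\n");
}
```

In the script, it would be called as Logger.log(extractVisionText(JSON.parse(response.getContentText()))).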

🚀 Comparison of Methods

Method	Best For	Pros	Cons
Google Drive OCR (Manual)	Simple text extraction	No coding needed	Manual, loses formatting
Apps Script (Automated OCR)	Automating text extraction	Free, works in Drive	Struggles with images/tables
PDF Parsing API (Non-Scanned PDFs)	PDFs with selectable text	Works well for structured PDFs	Requires third-party API
Google Docs Conversion (Manual)	One-time use	No scripting needed	Formatting lost
Google Cloud Vision API	Extracting text from scanned PDFs	Best for images and complex PDFs	Requires Google API setup

🎯 Final Thoughts

If you need automation, use Apps Script with the Drive API advanced service.
If your PDF contains text-based content, use Google Docs or Apps Script.
If you need table extraction, use external APIs such as the Google Cloud Vision API or PDF.co.
