How to Extract Text from PDF Files with Google Apps Script
Google Apps Script can extract text from PDF files using DriveApp together with Optical Character Recognition (OCR) provided through the Google Drive API. This guide covers manual and automated methods for extracting text from PDFs stored in Google Drive.
✅ Methods to Extract Text from PDFs
🔹 1️⃣ Using Google Drive’s Built-in OCR (Manual Method)
🔹 2️⃣ Using Google Apps Script + Google Drive API (Automated OCR Extraction)
🔹 3️⃣ Using Apps Script to Extract Text from Non-Scanned PDFs
🔹 4️⃣ Using Google Docs for Manual PDF to Text Conversion
🔹 5️⃣ Using Third-Party APIs (For Complex PDFs with Tables & Images)
Each method serves a different purpose, so let’s explore them in detail.
📌 1️⃣ Manual Method: Convert PDF to Text Using Google Drive’s OCR
Google Drive has a built-in OCR (Optical Character Recognition) feature that extracts text from PDFs.
Steps:
- Upload the PDF to Google Drive.
- Right-click the PDF → Open with → Google Docs.
- Google Docs will convert the PDF to editable text.
- Copy & Save the extracted text.
📌 Pros: Simple, no coding needed.
📌 Cons: Does not preserve formatting (tables and images are lost).
🤖 2️⃣ Automated OCR Extraction Using Google Apps Script
This method automates PDF-to-text conversion using the Google Drive API, which Apps Script exposes as the Drive advanced service.
🔹 Step 1: Enable Google Drive API
- Open the Apps Script editor (Extensions → Apps Script).
- In the left sidebar, click Services (+), select Drive API, and click Add.
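A script that calls the Drive advanced service fails with a `ReferenceError` when the service has not been added to the project. A minimal sketch of a guard, assuming the service keeps its default identifier `Drive` (the function names `isDriveServiceEnabled` and `requireDriveService` are hypothetical):

```javascript
// Returns true when the Drive advanced service is available in this project.
// `Drive` is the default identifier Apps Script assigns to the service.
function isDriveServiceEnabled() {
  return typeof Drive !== "undefined";
}

// Hypothetical guard: call this at the top of a script that needs the service.
function requireDriveService() {
  if (!isDriveServiceEnabled()) {
    throw new Error(
      "Drive advanced service is not enabled. Add it under Services in the Apps Script editor."
    );
  }
}
```

Calling `requireDriveService()` first turns an obscure failure into a message that says what to fix.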
🔹 Step 2: Copy and Paste the Script
```javascript
function extractTextFromPDF() {
  var folderId = "YOUR_FOLDER_ID"; // Folder where PDFs are stored
  var folder = DriveApp.getFolderById(folderId);
  var files = folder.getFilesByType(MimeType.PDF);

  while (files.hasNext()) {
    var file = files.next();
    var blob = file.getBlob();

    // Convert the PDF to a Google Doc; Drive runs OCR during the conversion.
    // This uses the Drive advanced service (v3). On v2 of the service, use
    // Drive.Files.insert with { title: ..., mimeType: ... } and { ocr: true }.
    var resource = { name: file.getName(), mimeType: MimeType.GOOGLE_DOCS };
    var converted = Drive.Files.create(resource, blob);

    // Extract text from the converted Google Docs file
    var doc = DocumentApp.openById(converted.id);
    var text = doc.getBody().getText();
    Logger.log("Extracted text from " + file.getName() + ":\n" + text);

    // Optional: Save text as a new file
    DriveApp.createFile(file.getName() + ".txt", text, MimeType.PLAIN_TEXT);

    // Move the temporary Google Doc to the trash
    DriveApp.getFileById(converted.id).setTrashed(true);
  }
}
```
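The `folderId` value is the ID segment of the folder's URL in Drive. If pasting the full URL is more convenient, a small helper can pull the ID out. This is a sketch that assumes the common URL shapes `https://drive.google.com/drive/folders/<id>` and `https://drive.google.com/file/d/<id>/view`; the function name `extractDriveId` is hypothetical.

```javascript
// Returns the Drive ID found in a folder or file URL.
// If the input does not look like a Drive URL, it is returned unchanged,
// so a bare ID can be passed as well.
function extractDriveId(urlOrId) {
  var match = /\/(?:folders|d)\/([A-Za-z0-9_-]+)/.exec(urlOrId);
  return match ? match[1] : urlOrId;
}
```

The result can be passed straight to `DriveApp.getFolderById()` or `DriveApp.getFileById()`.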
🔹 Step 3: Run the Script
- Click Run ▶ in Apps Script.
- Grant permissions if asked.
- The script extracts the text from each PDF in the specified folder and saves it as a `.txt` file.
📌 Pros: Automated, works for multiple PDFs.
📌 Cons: May struggle with complex layouts (tables, images).
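The script names each output by appending `.txt` to the full PDF name, so `report.pdf` becomes `report.pdf.txt`. If a cleaner name is preferred, one option is to swap the extension first. A minimal sketch (the helper name `toTextFileName` is hypothetical):

```javascript
// Replaces a trailing ".pdf" (any letter case) with ".txt".
// Names without a ".pdf" extension simply get ".txt" appended.
function toTextFileName(pdfName) {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}
```

In the script, the first argument to `DriveApp.createFile` would then be `toTextFileName(file.getName())`.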
📑 3️⃣ Extract Text from Non-Scanned PDFs
If the PDF already contains selectable text (not a scanned document), OCR is unnecessary and a PDF parsing service can read the text directly.
🔹 Step 1: Choose a PDF Parsing Service
Google Apps Script has no built-in PDF parser, and Java libraries such as PDFBox cannot run inside it, so use a third-party API such as PDF.co.
🔹 Step 2: Use a PDF Parsing API
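```javascript
function extractTextFromNonScannedPDF() {
  var fileId = "YOUR_PDF_FILE_ID"; // Replace with your PDF file ID
  var file = DriveApp.getFileById(fileId);
  var blob = file.getBlob();

  // Sends the PDF as multipart form data. Check PDF.co's documentation for the
  // exact parameters; the endpoint may expect a hosted file URL instead of a
  // direct upload.
  var url = "https://api.pdf.co/v1/pdf/convert/to/text";
  var options = {
    method: "POST",
    headers: {
      "x-api-key": "YOUR_API_KEY" // Get an API key from PDF.co
    },
    payload: {
      file: blob
    },
    muteHttpExceptions: true
  };

  var response = UrlFetchApp.fetch(url, options);
  // With muteHttpExceptions, errors are returned rather than thrown, so log the status too.
  Logger.log(response.getResponseCode() + ": " + response.getContentText());
}
```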
📌 Pros: Works for text-based PDFs.
📌 Cons: Requires third-party API (PDF.co).
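Because the request sets `muteHttpExceptions: true`, a failed call does not throw; the script just logs whatever the service returned. A minimal sketch of a check that separates success from failure, assuming the service replies with JSON (the helper name `parseJsonResponse` is hypothetical, and the field names inside the parsed object depend on the API being called):

```javascript
// Throws on a non-2xx status; otherwise parses and returns the JSON body.
function parseJsonResponse(statusCode, bodyText) {
  if (statusCode < 200 || statusCode >= 300) {
    throw new Error("Request failed with HTTP " + statusCode + ": " + bodyText);
  }
  return JSON.parse(bodyText);
}
```

In Apps Script it would be called as `parseJsonResponse(response.getResponseCode(), response.getContentText())`.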
📄 4️⃣ Convert PDF to Text Using Google Docs (Manual Method)
If you don’t want to use scripts, Google Docs can extract text manually.
Steps:
- Upload the PDF to Google Drive.
- Right-click → Open with → Google Docs.
- Google Docs converts the PDF into an editable text file.
- Copy-paste or save it.
📌 Pros: No coding required.
📌 Cons: Manual process, loses formatting.
🛠 5️⃣ Using Third-Party APIs (For Complex PDFs with Tables & Images)
For structured data (tables, images, etc.), use external OCR APIs.
Popular APIs for PDF Text Extraction:
- Google Cloud Vision API (cloud.google.com/vision)
- Adobe PDF Extract API (developer.adobe.com)
- PDF.co (pdf.co)
Example: Using Google Cloud Vision API
```javascript
function extractTextWithVisionAPI() {
  var fileId = "YOUR_PDF_FILE_ID";
  var file = DriveApp.getFileById(fileId);
  var blob = file.getBlob();
  var apiKey = "YOUR_GOOGLE_CLOUD_VISION_API_KEY";

  // PDFs go to files:annotate; images:annotate accepts image formats only.
  var url = "https://vision.googleapis.com/v1/files:annotate?key=" + apiKey;
  var payload = {
    requests: [{
      inputConfig: {
        content: Utilities.base64Encode(blob.getBytes()),
        mimeType: "application/pdf"
      },
      features: [{ type: "DOCUMENT_TEXT_DETECTION" }]
      // Synchronous requests process at most 5 pages; add a `pages` array to choose them.
    }]
  };
  var options = {
    method: "POST",
    contentType: "application/json",
    payload: JSON.stringify(payload),
    muteHttpExceptions: true
  };

  var response = UrlFetchApp.fetch(url, options);
  Logger.log(response.getContentText());
}
```
📌 Pros: Best for scanned PDFs with images.
📌 Cons: Requires Google Cloud API setup.
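The Vision API replies with JSON, and the recognized text sits in `fullTextAnnotation.text`. For image requests the results are directly in `responses`; for PDF requests sent to `files:annotate`, each page's result is nested one level deeper. A minimal sketch that handles both shapes (the function name `collectVisionText` is hypothetical):

```javascript
// Collects recognized text from a parsed Vision API response.
// images:annotate puts results directly in `responses`; files:annotate
// nests one result per page under `responses[i].responses`.
function collectVisionText(result) {
  var parts = [];
  (result.responses || []).forEach(function (item) {
    if (item.responses) {
      var nested = collectVisionText(item);
      if (nested) parts.push(nested);
    } else if (item.fullTextAnnotation && item.fullTextAnnotation.text) {
      parts.push(item.fullTextAnnotation.text);
    }
  });
  return parts.join("\n");
}
```

In the script, `collectVisionText(JSON.parse(response.getContentText()))` would return the plain text instead of logging the whole JSON body.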
🚀 Comparison of Methods
Method | Best For | Pros | Cons |
---|---|---|---|
Google Drive OCR (Manual) | Simple text extraction | No coding needed | Manual, loses formatting |
Apps Script (Automated OCR) | Automating text extraction | Free, works in Drive | Struggles with images/tables |
PDF Parsing API (Non-Scanned PDFs) | PDFs with selectable text | Works well for structured PDFs | Requires third-party API |
Google Docs Conversion (Manual) | One-time use | No scripting needed | Formatting lost |
Google Cloud Vision API | Extracting text from scanned PDFs | Best for images and complex PDFs | Requires Google API setup |
🎯 Final Thoughts
- If you need automation, use Apps Script + Google Drive API.
- If your PDF contains text-based content, use Google Docs or Apps Script.
- If you need table extraction, use third-party APIs like the Google Cloud Vision API or PDF.co.