I searched the web and Stack Overflow but didn't find a solution. What I'm trying to do is the following: I receive certain attachments by mail that I would like to have as plain text for further processing. My script looks like this:
function MyFunction() {
  var threads = GmailApp.search('label:templabel');
  var messages = GmailApp.getMessagesForThreads(threads);
  for (i = 0; i < messages.length; ++i) {
    j = messages[i].length;
    var messageBody = messages[i][0].getBody();
    var messageSubject = messages[i][0].getSubject();
    var attach = messages[i][0].getAttachments();
    var attachcontent = attach.getContentAsString();
    GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
  }
}
Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?
Thank you very much in advance.
Best, Phil
Edit: Updated for DriveApp, as DocsList is deprecated.
I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.
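For the first problem, a minimal sketch might look like the following. The helper name getFirstPdfBlobs and the shape of the objects it returns are illustrative assumptions, not part of the original script. Note that getAttachments() returns an array, so the first attachment has to be picked out before anything can be read from it.

```javascript
/**
 * Hypothetical helper (not from the original post): for every thread that
 * matches a Gmail search, look at the first message only and, if its first
 * attachment is a PDF, keep the subject and a copy of the attachment Blob.
 */
function getFirstPdfBlobs(query) {
  var results = [];
  var threads = GmailApp.search(query);
  for (var i = 0; i < threads.length; i++) {
    var message = threads[i].getMessages()[0];
    var attachments = message.getAttachments(); // always an array, possibly empty
    if (attachments.length === 0) continue;
    var attachment = attachments[0];
    // Skip anything that is not a PDF.
    if (attachment.getContentType() !== 'application/pdf') continue;
    results.push({
      subject: message.getSubject(),
      blob: attachment.copyBlob()
    });
  }
  return results;
}
```

Calling getFirstPdfBlobs('label:templabel') would then give one entry per labeled thread whose first message starts with a PDF attachment.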
As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.
First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.
So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:
We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
pdfToText()
This utility is available as a gist. Several examples are provided there.
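The gist itself is not reproduced here. As a rough idea of the approach (OCR through the Drive service, then reading the resulting Google Doc), here is a minimal sketch. It is not the gist's implementation: the name pdfToTextSketch and its ocrLanguage parameter are hypothetical, it assumes the Drive advanced service is enabled and exposes the v2 API (Drive.Files.insert, Drive.Files.remove), and it always deletes the intermediate Google Doc instead of offering options such as keepTextfile.

```javascript
/**
 * Sketch only. Assumes the Drive advanced service (API v2) is enabled.
 * Uploads the PDF with OCR turned on, reads the text of the Google Doc
 * that Drive creates, then deletes that Doc so nothing is left behind.
 */
function pdfToTextSketch(pdfBlob, ocrLanguage) {
  var resource = {
    title: pdfBlob.getName(),
    mimeType: pdfBlob.getContentType()
  };
  // Ask Drive to OCR the upload and store the result as a Google Doc.
  var docFile = Drive.Files.insert(resource, pdfBlob, {
    ocr: true,
    ocrLanguage: ocrLanguage || 'en'
  });
  try {
    return DocumentApp.openById(docFile.id).getBody().getText();
  } finally {
    // Remove the intermediate Doc whether or not reading it succeeded.
    Drive.Files.remove(docFile.id);
  }
}

/**
 * Usage: mail yourself the text of the first attachment of the first
 * message in each thread labeled templabel.
 */
function myFunction() {
  var threads = GmailApp.search('label:templabel');
  for (var i = 0; i < threads.length; i++) {
    var message = threads[i].getMessages()[0];
    var attachments = message.getAttachments();
    if (attachments.length === 0) continue;
    var text = pdfToTextSketch(attachments[0].copyBlob());
    GmailApp.sendEmail(Session.getActiveUser().getEmail(),
                       message.getSubject(), text);
  }
}
```

The try/finally is a design choice for the sketch: the temporary Doc is cleaned up even if opening it fails, which matches the goal of leaving no residual files in Drive.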
A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.
pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the "pictures" of each page in the document - not much we can do about that. It then uses the regular DocumentService to extract the document body as plain text.
The conversion to DriveApp is helped with this utility from Bruce McPherson: