Creating an RTF file with a list of PDF annotations and links

I am still trying to find an optimal workflow for my research in DTPO. I found Bill DeVille’s suggestion of using an RTF file very relevant, yet I am implementing it in a very cumbersome way. Instead, I would prefer to make highlights, notes, etc. directly in Acrobat or Preview and then have a program generate an RTF file containing the annotations, including the highlighted text, each linked to the appropriate place in the DTPO document. This, I feel, would close “my digital research circle.”

My first question is whether this has been implemented before.

I have been playing around with PDF Kit. There is a very relevant example program here:

developer.apple.com/library/mac … lementID_2

I think the best way to pull out a list of annotations would be iterating through the pages.

MacRuby code


framework 'Foundation'
framework 'ScriptingBridge'
framework 'Cocoa'
framework 'Quartz'
DTPO = SBApplication.applicationWithBundleIdentifier("com.devon-technologies.thinkpro2")

load_bridge_support_file 'DevonThink.bridgesupport'
tw = DTPO.thinkWindows
rec = tw[0].contentRecord
path = rec.path

path_str = NSString.alloc.initWithString(path)
url = NSURL.alloc.initFileURLWithPath(path_str)
pdfDoc = PDFDocument.alloc.initWithURL(url)
pdfPage = pdfDoc.pageAtIndex(0)
annotations = pdfPage.annotations

This gets a PDFDocument instance which can then be manipulated. This is basically just Objective-C code with an interpreter for convenience. I will post my further work in this thread.

PDFDocument class reference:
developer.apple.com/library/mac … TP40003873

PDFPage class reference:
developer.apple.com/library/mac … TP40003875

I should add that I am quite new at Mac programming but am an experienced C/bash/Java/Python programmer, so if other people would find this program interesting or helpful I would encourage them to try and contribute.

developer.apple.com/library/mac … rence.html

There does not seem to be a way to get the text of a highlight directly. Rather, one would have to take the bounds of the markup and make a selection from them.

Otherwise, one can call the contents method on the textual annotation classes and get back the contents. It should be easy to take this string, add it to an RTF file, and then associate the reference with the page number in a DT link.
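
As a rough illustration of the linking idea, here is a minimal Python sketch that builds a page link and a plain-text entry for one annotation. The function names and the example reference URL are hypothetical. It assumes the `?page=` parameter is a zero-based page index, which is how the script later in this thread uses it.

```python
def page_link(reference_url, page_index):
    """Return a link to one page of a record.

    Assumption: '?page=' takes a zero-based index, like PDFKit's pageAtIndex.
    """
    if page_index < 0:
        raise ValueError("page_index must be zero or greater")
    return "%s?page=%d" % (reference_url, page_index)


def format_note(reference_url, page_index, kind, text):
    """Render one annotation as plain text: link, annotation type, then the text."""
    link = page_link(reference_url, page_index)
    return "%s\n%s\n%s\n\n" % (link, kind, text.strip())


# Hypothetical reference URL, for illustration only.
example = format_note("x-devonthink-item://EXAMPLE-UUID", 2, "Text", " Check this claim ")
print(example)
```

Keeping the link construction in one small function makes it easy to change later if the page numbering turns out to differ.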

Maybe Skim already does everything you’re looking to program? Lots of DEVONthink users seem to prefer Skim over DEVONthink for annotations (search the forum). E.g., Skim will produce the RTF you’re looking for. Skim is open source, widely used, and actively developed.

Alright, I have looked at Skim and written some code for it. The good news is that one can extract the text from a highlight. The bad news is that the Skim notes class does not see standard (Adobe/Preview/DT) annotations.

Are you aware of a program that extracts Skim notes and integrates them with DTPO?

For my personal use I would prefer to stay with standard annotations in Acrobat.

Skim > File > Convert Notes… will convert Acrobat and Preview annotations to Skim notes. I frequently open Acrobat-annotated PDFs in Skim, convert (but do not save), then print or export the annotations.

Robin Trew (@houthakker) has done an enormous amount of work on this.

(These answers are all available by searching the forum, BTW.)

Thanks for this. I will certainly look into that as an easier solution because Skim is quite accessible from Scripting Bridge.

I almost have it, though. *******argh! You can get a list of annotations from a page, and for all of the textual annotations, getting the contents is quite easy. For the highlights, it is quite easy to get a rectangle representing the bounds. However, MacRuby keeps crashing on me when I pass it the rectangle to get the selection text, and I can’t figure out how to read this method call:

selectionFromPoint:toPoint:
Returns the text between the two specified points in page space.

- (PDFSelection *)selectionFromPoint:(NSPoint)startPoint toPoint:(NSPoint)endPoint

That was just my stupidity for not knowing Objective-C. I still can’t get past the crashes using the Quartz framework, though (sigh). I guess using the Skim 2 Devonthink script is the way to go.
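
For anyone else puzzled by that signature: each colon in an Objective-C selector marks one argument, so selectionFromPoint:toPoint: takes a start point and an end point. In PyObjC, every colon becomes an underscore in the method name. The sketch below shows that naming rule plus one way to pick the two points from an annotation's bounds. The helper names are hypothetical. The corner choice assumes left-to-right text and PDF page space, where the origin is at the bottom left and y increases upward.

```python
def pyobjc_name(selector):
    """Translate an Objective-C selector into the PyObjC method name."""
    return selector.replace(":", "_")


def reading_order_corners(x, y, width, height):
    """Return (start, end) points for a rectangle given in page space.

    Assumption: text in the rectangle begins near the top-left corner
    and ends near the bottom-right corner.
    """
    start = (x, y + height)  # top-left
    end = (x + width, y)     # bottom-right
    return start, end


print(pyobjc_name("selectionFromPoint:toPoint:"))  # selectionFromPoint_toPoint_
print(reading_order_corners(72.0, 700.0, 200.0, 12.0))
```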

I could also look at using the Acrobat scripting library as a mechanism to extract annotations but that would limit such a program’s usefulness to those who have Acrobat.

One powerful feature that I would request in future versions of DTPO is the ability to search within the annotation layer of PDFs, with the option of searching only that layer. After playing with the Quartz framework I understand how difficult this would be to implement, but I’m sure the DevonTech engineers, being German after all, are much more capable than I.

The program Mendeley implements this (though they are using their own annotation library), so it is quite different. In fact, I am not aware of any program that allows searching through the annotation layer alone. Alas, maybe it is a job best left for Apple and their billions.

Quitters never win and winners never quit, so says the American idiom…

Anyway, I have basic functionality working for all types of PDF annotations. The one thing I need help with is creating a new rich text document in which I can store the annotation strings. I have browsed through the AppleScript dictionary and found records to be the most relevant mechanism, but I haven’t figured out how to create them in a given group…

I’ve converted my code to Python, which has dealt with the crashing issues. I am excited about the potential of this program as a stopgap until DevonTech adds annotation search functionality.

This is Python code BTW. I think new Macs ship with PyObjC built in.


from Foundation import *
from ScriptingBridge import *
from Cocoa import *
from Quartz import *

def get_selection(annotation, pdfPage):
	bounds = annotation.bounds()
	selection = pdfPage.selectionForRect_(bounds)
	return selection.string()

DTPO = SBApplication.applicationWithBundleIdentifier_("com.devon-technologies.thinkpro2")

#select the current database, this is used in creating a new record
db = DTPO.databases()[0]

group = db.currentGroup()

tw = DTPO.thinkWindows()
win1 = tw[0]
rec =  win1.contentRecord()
path = rec.path()
path_str = NSString.alloc().initWithString_(path)

url = NSURL.alloc().initFileURLWithPath_(path_str)

pdfDoc = PDFDocument.alloc().initWithURL_(url)
num_pages = pdfDoc.pageCount()
page_number = win1.currentPage() + 1

page_url = rec.referenceURL() + '?page=' + str(win1.currentPage())

pdfPage = pdfDoc.pageAtIndex_(0)
an = pdfPage.annotations()

new_record_filename = "Annotation-" + rec.name()
#new_rec = rec.createRecordWithIn_(db)





"""
for i in range(len(an)):
	if an[i].type() == 'Highlight' or an[i].type() == 'Square' or an[i].type() == 'Circle':
		contents = get_selection(an[i], pdfPage)
	else:
		contents = an[i].contents()

"""



OK, I have it working, in a way. The PDF needs to be in the frontmost window of DT, and TextEdit needs to be open with a single blank document.

The script iterates through the pages of the PDF, pulls out the annotations, and puts them in TextEdit along with the page URL of the PDF in DT. Currently highlights, squares, circles, and text annotations are supported.

The links generated are not clickable, but I hope to address this soon.

Run it with python.

Edit: This works best for annotations made with Acrobat; I don’t recommend it otherwise. It seems that when you highlight in DT or Preview you are actually making several annotations instead of one. Also, the script gets the highlighted text from a rectangular selection on the page, which can introduce some unwanted characters. The same goes for squares and circles. In sum, I would urge caution in using this script. The Skim to Devonthink script is certainly better for general purposes.
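
One possible way to soften the fragmented-highlight problem is to post-process the extracted strings before writing them out. The sketch below is only a heuristic. It assumes the notes arrive in reading order as hypothetical (page_index, kind, text) tuples, and it treats consecutive highlights on the same page as one highlight, which will wrongly join two separate highlights that happen to sit next to each other. It does not remove stray characters picked up by the rectangular selection. It only tidies whitespace.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def merge_highlights(notes):
    """Merge runs of consecutive 'Highlight' notes that share a page."""
    merged = []
    for page_index, kind, text in notes:
        text = clean_text(text)
        if not text:
            continue  # skip fragments whose selection was empty
        previous = merged[-1] if merged else None
        if (kind == "Highlight" and previous is not None
                and previous[0] == page_index and previous[1] == "Highlight"):
            # Same page, previous note was also a highlight: join the fragments.
            merged[-1] = (page_index, kind, previous[2] + " " + text)
        else:
            merged.append((page_index, kind, text))
    return merged


fragments = [
    (0, "Highlight", "the quick brown\n"),
    (0, "Highlight", "  fox jumps"),
    (0, "Text", "my comment"),
    (1, "Highlight", "next page"),
]
print(merge_highlights(fragments))
```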


"""
The purpose of this script is to extract the annotations from a PDF file and insert them into a rich
text document with the appropriate hyperlink to the PDF page in DTPO.  This applies to annotations
created with Preview or Acrobat compatible applications.

This requires that the PDF document be the open file in DTPO, i.e. be in think window 1

Furthermore, this requires that TextEdit be open and have a single blank document

Future releases of this script will be more robust so that these requirements are less of a concern

Robert L. Cloud, rcloud@gmail.com
http://www.robertlouiscloud.com
http://www.robertcloudphotography.com

Copyright (C) 2012 Robert L. Cloud
"""

from Foundation import *
from ScriptingBridge import *
from Cocoa import *
from Quartz import *

def get_selection(annotation, pdfPage):
	"""
	For certain types of annotations (highlights, squares, circles) we have to get the selection based on the
	bounding area of the annotation.
	"""
	bounds = annotation.bounds()
	selection = pdfPage.selectionForRect_(bounds)
	#selectionForRect_ can return None when the rectangle covers no text
	if selection is None:
		return ""
	return selection.string() or ""
	
#use scripting bridge framework to instantiate an object of SBApplication identified with DTPO bundle identifier
DTPO = SBApplication.applicationWithBundleIdentifier_("com.devon-technologies.thinkpro2")

#get the first open TextEdit document (it must already exist) so that we can set its text
TE = SBApplication.applicationWithBundleIdentifier_("com.apple.TextEdit")
doc = TE.documents()[0]
txt = doc.text()

tw = DTPO.thinkWindows()
win1 = tw[0]
rec =  win1.contentRecord()
path = rec.path()
path_str = NSString.alloc().initWithString_(path)

url = NSURL.alloc().initFileURLWithPath_(path_str)

pdfDoc = PDFDocument.alloc().initWithURL_(url)
num_pages = pdfDoc.pageCount()
page_number = win1.currentPage() + 1

output_str = ""

#Iterate through the pages of the PDF and pull out the annotations
#Currently Text, Highlights, Squares, and Circles are being handled
for j in range(num_pages):
	page_url = rec.referenceURL() + '?page=' + str(j) + "\n\n"
	output_str = output_str + page_url
	pdfPage = pdfDoc.pageAtIndex_(j)
	for annotation in pdfPage.annotations():
		kind = annotation.type()
		if kind in ('Highlight', 'Square', 'Circle'):
			contents = get_selection(annotation, pdfPage) or ""
		elif kind == 'Text':
			#contents() is None when the note is empty
			contents = annotation.contents() or ""
		else:
			continue
		output_str = output_str + kind + "\n" + contents + "\n\n"

txt.setTo_(output_str)
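
The script above writes each page URL as plain text, which is why the links are not clickable. One possible route is to generate the RTF markup directly and wrap each URL in an RTF HYPERLINK field. The following is a minimal sketch, not part of the original script. The function names and the example URL are hypothetical, the URL is assumed to contain no double quotes, and whether a given viewer renders the field as a clickable link should be checked.

```python
import struct


def rtf_escape(text):
    """Escape text for RTF: backslashes, braces, newlines, and non-ASCII."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\line ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit UTF-16 code unit plus a fallback character.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u%d?" % unit)
    return "".join(out)


def rtf_link(url, label):
    """Return an RTF HYPERLINK field showing `label` and pointing at `url`."""
    return '{\\field{\\*\\fldinst{HYPERLINK "%s"}}{\\fldrslt %s}}' % (
        rtf_escape(url), rtf_escape(label))


def build_rtf(entries):
    """Build an RTF document from (url, label, text) tuples."""
    body = []
    for url, label, text in entries:
        body.append(rtf_link(url, label) + "\\par " + rtf_escape(text) + "\\par\\par ")
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0 " + "".join(body) + "}"


# Hypothetical entry, for illustration only.
entries = [("x-devonthink-item://EXAMPLE-UUID?page=0", "Page 1",
            "Highlight: caf\u00e9 {draft}")]
rtf = build_rtf(entries)
print(rtf)
```

The resulting string can be written to a file with an .rtf extension and opened or imported from there, which avoids needing a blank TextEdit document.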


This is the gold standard for PDF workflow right now! Gotta have it!

The programming is way over my head and so I don’t think I’ll try to adopt RCloud’s script. But I commend the effort.

If only DTPO could get an extension like Zotfile 2.0 for Zotero. It seems like the key piece is pdf.js, whatever that is:

★ Sync PDFs with your iPad or Android tablet

To read and annotate PDF attachments on your mobile device, zotfile can sync PDFs from your Zotero library to your (mobile) PDF reader (e.g. an iPad, Android tablet, etc.). Zotfile sends files to a location on your PC or Mac that syncs with your PDF reader App (PDF Expert, iAnnotate, GoodReader etc.), allows you to configure custom subfolders for easy access, and even extracts the annotations and highlighted text to Zotero notes when you get the files back from your tablet. For instruction, click here.

★ Extract Annotations from PDF Files

After highlighting and annotating pdfs on your tablet (or with the PDF reader application on your computer), ZotFile can automatically extract the highlighted text and note annotations from the pdf. The extracted text is saved in a Zotero note. Thanks to Joe Devietti, this feature is now available on all platforms based on the pdf.js library.

columbia.edu/~jpl2136/zotfile.html
Any idea if this sort of thing is easily implementable for DTPO?

This is pdf.js: mozilla.github.com/pdf.js/extensions/firefox/pdf.js.xpi

addons.mozilla.org/en-US/firefox/addon/pdfjs/

I understand a lot of people use Skim for that purpose, but I don’t want to use it, as Skim has its own way of highlighting things and is not really compatible with other PDF software on Mac or iOS.
The better way to do it would, IMHO, be to use pdf.js’s access to annotations and create a file from it. Or maybe just using this part of pdf.js could make annotations in PDFs searchable in Sente.
Devonthink should really think about more powerful PDF management and annotation search.
If that’s not going to happen, a script relying on pdf.js to extract annotations from a PDF would be really useful.
Has anybody found a way to extract annotations from a PDF?
Is the JavaScript code from pdf.js usable in an AppleScript?

Automator has an action to extract annotations from a PDF, which could be interesting.
It seems limited, as it reports highlights but doesn’t list the highlighted text.
As the text is in the PDF, I think it’s not really a problem, since it can be searched, but the action could access the text of notes on the PDF, and DT would have to index that.
Has anybody put this Automator action to good use?