Automating Going Paperless with DevonThink

vanper · August 3, 2012, 11:55pm

I have used DevonThink Pro Office for my paperless workflow for the last 12 months or so, and I have to say that I am finding it very frustrating. It’s just not as simple or automatic as it could be, and I was hoping for some advice on improving it.

My current workflow:

Scan with ScanSnap to DT - but I cannot just press the blue button. I then need to manually name and tag each document and also specify which database it goes in. I have 3 databases: household info (for things like kids’ schoolwork, pamphlets, sport rosters etc), personal finances by financial year (bank statements, utility bills, expense receipts etc) and finances for our family company (a separate legal entity, but again bank statements, expense receipts etc).
Once all the documents are in the inbox of the correct database, I check to make sure they are all tagged and then move them into a single group called tagged documents.

I have this workflow because the AI used in See Also, Classify and Auto Classify seems to be incapable of sorting documents based on their titles, their tags and often even their contents. Seriously, I have over 12 months of nearly identical monthly telephone accounts, each with the name of the phone company in the title, the same tag and obviously very similar contents - and it still cannot correctly auto classify a phone bill.

My children’s school work is another example: a bunch of diverse information from newsletters, drawings and sport rosters, but each tagged with the child’s name. Why can’t it put all the documents with the same tag in the same place?

What I would like for my paperless workflow is to be able to press the blue button and have the document go from the ScanSnap to DT and be automatically named, tagged and filed UNLESS it’s a document that has unique characteristics. I should not have to manually intervene with phone bills, bank statements etc that are essentially the same document each time.

I recently read David Sparks’ book Paperless about how he uses Hazel to accomplish a similar feat. Hazel can rename files automatically and move them to a folder. However, I cannot see how I could integrate Hazel with DT because you cannot go past the global inbox - once documents are renamed and put there, I would have to manually move them to the correct database so no time is saved.

David has a nested folder hierarchy, with Hazel automatically able to group into subfolders by year, but this raises the question of how you would archive without losing the integrity of the folder system. Does your database just grow continuously forever, with 20 years of phone bills in it? I had the same dilemma with DT and have ended up having to create a new database for each year.

A further complication with financial information is that here in Australia, the financial year runs from 1 July - 30 June so a simple calendar year doesn’t help much.
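To make the financial-year point concrete, here is a minimal Python sketch that maps a date to its financial year. It is only an illustration, not something Hazel or DEVONthink provides; the function names are hypothetical, and labelling a financial year by the calendar year in which it ends (so 1 July 2012 to 30 June 2013 becomes "FY2013") is an assumption about the naming convention.

```python
from datetime import date


def financial_year(d, start_month=7):
    """Return the calendar year in which the financial year containing d ends.

    Assumes the financial year starts on the 1st of start_month
    (7 = July, as in Australia) and is labelled by its ending year.
    """
    return d.year + 1 if d.month >= start_month else d.year


def financial_year_folder(d):
    """Build a hypothetical folder name such as 'FY2013' for a document date."""
    return f"FY{financial_year(d)}"
```

With this convention, a bill dated 30 June 2012 lands in FY2012 and one dated 1 July 2012 lands in FY2013.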

In all the discussions of paperless filing systems, the workflow covers capturing, processing/sorting and storage, but it never talks about a system for removing/archiving old information. Do you just grow a more and more massive database? Surely this slows down how well it runs, makes it more prone to corruption and makes it harder to find information.

There is information, like sports rosters, that has a use by date. Or what about tax audits? Here in Australia you are legally required to retain the last 5 years of documents. If you got audited, would you rather have 5 years of info the auditors could dig through or 20?!!! I would rather have only the required minimum so I could shrug my shoulders if they tried to delve further (not that I am a tax dodger - just saying).

So how can I structure my DevonThink to make all this filing more automatic? Would love some advice from the ‘power users’ who swear DT is the way to go paperless.

PS: I have Joe Kissell’s book but don’t think it’s all that helpful on the automation side of things…

korm · August 4, 2012, 10:57am

Classify and Auto Classify will do nothing in a database that has no groups.

Both actions evaluate where a document could go based on the groupings you’ve already created. They will not group things based on tags, because the Tags are already groupings that you assigned to documents. For example, if you tagged documents “Telephone Bills” then having DEVONthink automatically create a group for “telephone bills” just because they have that tag would be redundant.

It’s possible to use David Sparks’ Hazel method to sort your documents into folders in the file system, and also Index those folders in DEVONthink - instead of importing everything into DEVONthink. (You’d of course want to adjust David’s techniques for your own date & time localizations.)

Have you also viewed the tutorials on Mastering the A.I., Grouping tips & tricks, Indexing data, and Classifying?

You might use David’s Hazel method and index a rolling five fiscal years of data in your DEVONthink financial database. In other words, if your file system has top-level folders for FY2008, FY2009, FY2010, FY2011 and FY2012, then this year you’d be indexing the FY2008 through FY2012 top-level folders - and that would automatically include all of their subfolders (months, categories, etc.). Next fiscal year, you’d create a top-level FY2013 folder in the file system, index it in DEVONthink, delete FY2008 from your financial database, and delete FY2008 from disk. Just roll this process forward every July 1 and you’re set.
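The rolling five-year window described above can be sketched in a few lines of Python. This is only a planning aid under stated assumptions: the folder names follow the FY2008-style pattern from the post, the function names are hypothetical, and the code only reports which folders fall outside the window. It does not touch the file system or DEVONthink.

```python
import re

# Matches folder names such as "FY2008"; anything else is ignored.
FY_PATTERN = re.compile(r"^FY(\d{4})$")


def rolling_window(current_fy, keep=5):
    """Return the folder names that should stay indexed, oldest first."""
    return [f"FY{year}" for year in range(current_fy - keep + 1, current_fy + 1)]


def folders_to_retire(existing, current_fy, keep=5):
    """Return the FY folders that are older than the rolling window."""
    oldest_kept = current_fy - keep + 1
    retire = []
    for name in existing:
        match = FY_PATTERN.match(name)
        if match and int(match.group(1)) < oldest_kept:
            retire.append(name)
    return sorted(retire)
```

For example, once FY2013 becomes the current year, `folders_to_retire` applied to the folders FY2008 through FY2013 returns only FY2008, which matches the step of deleting FY2008 from the database and from disk.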

See above - that’s not the way Classify works. But if you’ve created groups in your database with sufficient granularity, then Classify will produce results.

I’m with you on that. I hate spending time renaming and tagging scanned documents - so I just dump everything into a file system folder, and every once in a while I batch OCR the entire mess and spend an hour or two naming things. I think you posted this suggestion elsewhere: nicely automated rules to rename and classify based on content would be a welcome addition to DEVONthink. But developing that is a huge undertaking with little payback, so I won’t be expecting the developers to do this any time soon.

Kind of off topic, but personally I don’t use tagging much anymore - either in or outside of DEVONthink. I realized I had spent hundreds of hours over the years assiduously tagging thousands of documents but all that labor hadn’t saved me a second of work. The search features of DEVONthink are so good - and, used properly, Spotlight is pretty good too - so I have never needed tags to help my research or work. The whole point of metadata classification is to make future search better and faster. If a workflow technique isn’t accomplishing that, then I’d suggest forgetting it.

There is no substitute, IMO, for sifting and grouping documents as they come into the database. It only takes a second to assign a document to the right group (or create a new group or subgroup) when I create or import it. I use tools such as folder actions. My setting in Preferences > Import > Destination is Select Group so that DEVONthink always asks where something should go – avoiding the trap of dumping everything into the Global Inbox, which just defers the work. And I make frequent use of Tools > Show Groups & Tags.

Just observing the workflow over the years I’ve used DEVONthink, I’ve concluded that if I don’t spend five seconds assigning a document to the correct group (including making new groups) when that document first comes into the database, then I’ll spend 30 seconds cleaning up later on. The best rule of thumb for data management is do it right the first time. (BTW, because I’ve been very granular about groups over the years - preferring more groups than fewer - the DEVONthink classification tools are really useful for me.)

vanper · August 5, 2012, 1:57am

Yes I probably should have been clearer about this. In 2011, I used groups and got very frustrated with the amount of time I spent manually moving files into groups because I couldn’t get Classify to work. For 2012, I did away with groups altogether because it was quicker to add a tag when I named the file and then dump everything in a single group.

You know, this was my first thought and I began to build the nest of folders, but I realised that I actually really like the nice neat package that a DT database file gives me. Also, there appears to be no way to adjust Hazel for financial years without learning AppleScript. And I feel like if I can just harness DT better, it is worth keeping everything inside it.

I did watch these yesterday morning but didn’t find much to progress my ‘automation’ drive. I did discover the second edition of Joe Kissell’s Devonthink book and the fact that I had totally missed the change in how groups and tags work together and I think this is probably part of why my 2011 group structure was not working too well.

I had a similar idea where I would keep 2 databases - 1 that would cover the rolling 5 years and an archive - and every year I could just move the oldest year out of the current database and into the archive. If I make sure to group by year (or would a year tag work better?) as stuff goes in, it should be easy to drag and drop from one database to the other.

When you say ‘sufficient granularity’ what should I be aiming for? With the children’s work, I just made a group for each child by their name. Classify couldn’t make the connection and I am guessing this is because the content of the document did not reveal the child’s name very often - the child’s name was in the tag. But how could I get a group like this to work - I don’t want to have smaller groups because you would end up with 1 or 2 documents in each one - sport roster, newsletter, etc.

I think the only reason I’m tagging, or for future, the only way I will tag, is if it can be automated with Hazel because then it doesn’t add any work. I am definitely over manually adding tags to each document!

See this is the part where I choke. I should only have to sift and group a document that is somehow unique. I should not have to manually group anything that comes in on a monthly basis which I estimate is 2/3 of the total paper flow.

I think you said in another post that Hazel 3 could run an AppleScript telling DT which database and which group to move a document to. I think the time has come for me to learn some AppleScript!

My plan of attack:

Scan everything to a folder and have the ScanSnap do the OCR so that the files’ contents are available to Hazel. (Somehow I have to work out how I can make ScanSnap do this without prompting me to save each file - I love how when you use the scan to DT profile you can press the blue button and it will scan, OCR and transfer to the global inbox with no user intervention.)
Have Hazel rename and tag the file and then run the AppleScript to move it to the right database & group. (Apart from learning AppleScript, I need to find the right level of groups so that hopefully Classify will start to come into its own too.)
Either periodically manually process what is left from here into DT OR have a catchall Hazel rule that moves the remaining documents into the global inbox.

This post has helped me work out how to try and get things working a bit more easily, so thanks for your help.

korm · August 5, 2012, 10:14am

In a ScanSnap Profile’s settings, in the Application tab, the option “None (Scan to File)” will prevent ScanSnap from prompting for a name.

I don’t know your data. But say you have a single group “Children Stuff”, then Classify probably won’t do much. But if you have “Children > Sirius > Spells; Children > Sirius > Potions; Children > Bathsheda > Spells; Children > Crispin > Potions”, and so on, then with OCRd or text documents that contain relevant text, Classify might assign files to the right buckets. It takes manual work to get this set up, and many users have reported that shorter documents in the target groups make Classification more efficient than longer documents. The point is - there’s no one solution or right way for Classify to work – it all depends on your data. YMMV.

Before diving into scripting, be sure to ask Paul Kim over at Noodlesoft. He is extremely helpful.

Also, ScanSnap has an interesting feature that might help. If the scanned source document is B&W and you’ve highlighted sections – with standard color highlighters – then it can convert the highlighted text to keywords. For instructions, see ScanSnap Help for the “File option” tab in ScanSnap Manager’s profile settings. “Keywords” are PDF metadata – they are not tags. (In DEVONthink see Tools > Show Properties to view a document’s keywords.) There are scripts in DEVONthink – see Help > Support Assistant > Extras – that will make tags out of keywords. Keywords are a standard OS X attribute, and so Spotlight and Hazel can search for them if they are told to do so. There are many ways to use this ScanSnap feature – one of them is configuring ScanSnap to create only one keyword and then having Hazel name the file using that keyword. For example, highlight a section of the document that can serve as the document name.

Hugh · August 5, 2012, 10:43am

There’s a way to convert keywords to tags outside DTPO using Hazel, but Hazel has to be modified via the Terminal first. Details in the Hazel forum: as far as I remember, the search term is “Open Meta”.

However, a word of caution: I’ve found the highlighter/keyword procedure is not 100 per cent reliable. Of course it depends on both the legibility of the original document, and the salience of the highlighted text. On the recommendation of others I’ve found green highlighter works best.

vanper · August 6, 2012, 1:31am

I think I have the answer, and it was simpler than expected.

My new ‘automatic’ workflow:

I have a ScanSnap profile called ScantoActionFolder which uses None (Scan to File) for the Application, so there is no input past pressing the blue button on the scanner. The scan is OCRed by the ScanSnap and then goes straight to a folder called ActionFolder.
Hazel is watching the ActionFolder. So far I have just done my annual rates notice for the farm. When Hazel sees that the document contains all the words ‘Waratah-Wynyard’ and ‘Rates Notice’ it:

Renames the file with yyyy-mm Farm Rates.pdf
Adds a Comment ‘Farm Expenses’ - I can then convert this to a tag inside DT
Runs an Automator Workflow - this automator workflow just has 2 commands Set Current Group and Add Items to Current Group
Moves the file to the Trash

And that’s it - no scripting, no hard stuff, and the document goes straight from the scanner to the correct group in the correct database without me manually tagging, renaming, sorting, dragging or dropping. Obviously I have to create the Hazel/Automator rule set for each document type, but once it’s done, it’s done.
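For readers who think better in code, the matching and renaming logic of the rule above can be modelled roughly as follows. This is a Python sketch of the logic only, not Hazel's rule format or any Hazel API. The rule table, the function names and the choice to ignore upper/lower case when matching are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical rule table: each rule lists words that must ALL appear in the
# OCRed text, the label used in the new file name, and the comment to attach.
RULES = [
    {
        "all_words": ["Waratah-Wynyard", "Rates Notice"],
        "label": "Farm Rates",
        "comment": "Farm Expenses",
    },
]


def match_rule(text, rules=RULES):
    """Return the first rule whose words all occur in text, else None."""
    lowered = text.lower()
    for rule in rules:
        if all(word.lower() in lowered for word in rule["all_words"]):
            return rule
    return None


def new_filename(rule, scanned_on):
    """Build a 'yyyy-mm Label.pdf' name from the scan date."""
    return f"{scanned_on:%Y-%m} {rule['label']}.pdf"
```

A document that matches no rule returns `None`, which corresponds to the "unique" documents that still need manual filing.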

I experimented with the keyword thing yesterday too but I’ve only got a pink highlighter in my drawer and it didn’t work. Also, the rates notice is in colour. I might tinker further with this because it would be very useful to extract the exact date from within documents.

Technically this is the rates notice for the year ending 30 June 2013 - it would be awesome if I could name this file 2013-06-30 Farm Rates.pdf without touching it other than to highlight that date but I am willing to settle for it having the month it gets scanned for now.
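One possible way to get a name like 2013-06-30 Farm Rates.pdf is to pull a date out of the OCRed text. The Python sketch below is an assumption-laden illustration rather than a Hazel feature. It only recognises dates written as "30 June 2013", it takes the first valid one it finds (which may be the issue date rather than the period end), and the function names are hypothetical.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name.lower(): number for number, name in enumerate(MONTHS, start=1)}

# Day, full month name, four-digit year, e.g. "30 June 2013".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def first_date(text):
    """Return the first valid 'day month year' date in text, or None."""
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            return date(int(year), MONTH_NUMBER[month.lower()], int(day))
        except ValueError:
            # Impossible dates (for example from OCR errors) are skipped.
            continue
    return None


def dated_filename(text, label):
    """Build a 'yyyy-mm-dd Label.pdf' name, or None if no date is found."""
    found = first_date(text)
    if found is None:
        return None
    return f"{found.isoformat()} {label}.pdf"
```

When no date is found the function returns `None`, so a caller could fall back to the scan month as in the current workflow.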

Progress

edgley · August 6, 2012, 1:53am

From a post above:

“But if you have “Children > Sirius > Spells; Children > Sirius > Potions; Children > Bathsheda > Spells; Children > Crispin > Potions”, …”

I am glad that I have read this, as I was going a totally different way with DT; the above just didn’t seem logical to me when I considered it.

I was going Tag mad, wishing the AI was helping out more. So you are saying that it is quicker to drag items into groups, and make groups up almost like tags. Then when you have this many of them, the AI can do its best job.

Bugger, that’s an evening wasted! Oh well, at least it saves me time tomorrow, thanks!

Greg_Jones · August 6, 2012, 2:24am

There’s a lot of good discussion in this thread, but it is worth mentioning that one does not have to choose between using tags or using a comprehensive group structure. For all practical purposes, groups and tags behave the same in DEVONthink. If tagging is enabled for your groups, you’ll have the benefits of tags and groups. In my databases, I usually turn off group tags, but then I assign one or more tags to the group via the info panel (see the image of an example group info below). One reason for using this approach vs. enabling the group tag is if you have multiple groups in the database with the same name (Potions, Spells as example from above). This is a powerful way to use tags, and I’m not sure that a lot of people use this feature.

Tags are also inherited with nested groups, so I might assign the tag children to the group Children, Sirius to the group Sirius, spells to the Group Spells, etc. I can search or select the spells tag and see all the Spells documents for all children at once.
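The inheritance idea can be pictured with a small Python model. This is a toy illustration of the behaviour described in this post, not DEVONthink's implementation. The slash-separated group paths, the tag table and the function names are all hypothetical.

```python
# Hypothetical tags assigned to groups via the info panel.
GROUP_TAGS = {
    "Children": {"children"},
    "Children/Sirius": {"Sirius"},
    "Children/Sirius/Spells": {"spells"},
    "Children/Bathsheda": {"Bathsheda"},
    "Children/Bathsheda/Spells": {"spells"},
}


def inherited_tags(group_path, group_tags=GROUP_TAGS):
    """Collect the tags of a group and of every group above it."""
    tags = set()
    parts = group_path.split("/")
    for depth in range(1, len(parts) + 1):
        tags |= group_tags.get("/".join(parts[:depth]), set())
    return tags


def documents_with_tag(tag, documents):
    """Return the names of documents whose group path inherits tag."""
    return sorted(name for name, path in documents.items()
                  if tag in inherited_tags(path))
```

Selecting the spells tag in this model returns the Spells documents of every child, even though the two Spells groups share a name.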

edgley · August 6, 2012, 3:02am

Tags and groups are the same except that tags don’t actually store the file, so you would still need one group just as a dump; so long as I have understood all I have read tonight.

The other problem with tags is that adding them en masse is impossible, so folders are much quicker. But the deal sealer, for me, is if the AI cannot work with tags.

I am sorry, but I do not follow why you do not use groups. Could you explain it another way please? If there is a way not to have multiple folders you are going to make me very happy.

arnow · August 6, 2012, 7:48am

Although, in a technical sense, groups and tags may be the same kind of entity treated in different ways, from the user’s point of view groups and tags are very different (because DevonThink treats groups and tags in different ways, groups and tags behave differently)!

However, groups don’t store files.

DevonThink is very different from what you might think initially: the entries in a DevonThink database aren’t files, groups aren’t folders and replicants aren’t aliases.

The entries in DevonThink are references to files. The files themselves are (by default) stored in the .filesnoindex folder in the database (you can move them outside the database if you put their references in an indexed group but this is rather tricky – try it only when you know what you are doing).

It is the references that are organized into groups, not the files themselves! For that reason, it is important to make a clear distinction between what happens to the file and what happens to the reference.

When you move an entry you move the reference, not the file.
When you replicate an entry (that is a reference to a file) you create another reference to that file (not a reference to the original entry as when aliasing!).
When you delete an entry you delete the reference; when you delete the last reference to a file in the group hierarchy (that is: when you delete an entry that has no replicants), the file is deleted too.
When you duplicate an entry you duplicate the file to which it refers and create a reference to that duplicate in the group hierarchy of the database.
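The four rules above amount to a simple reference-counting model, which can be sketched in Python. This is a toy model of the description in this post, not DEVONthink's actual storage format. The class and method names are hypothetical.

```python
class Database:
    """Toy model: groups hold references (file ids), not the files themselves."""

    def __init__(self):
        self.files = {}    # file id -> content (stands in for the stored file)
        self.groups = {}   # group name -> list of file ids (the references)
        self._next_id = 0

    def add(self, group, content):
        """Store a new file and put a reference to it in group."""
        file_id = self._next_id
        self._next_id += 1
        self.files[file_id] = content
        self.groups.setdefault(group, []).append(file_id)
        return file_id

    def move(self, file_id, source, target):
        """Move the reference; the file itself is untouched."""
        self.groups[source].remove(file_id)
        self.groups.setdefault(target, []).append(file_id)

    def replicate(self, file_id, target):
        """Create another reference to the same file."""
        self.groups.setdefault(target, []).append(file_id)

    def duplicate(self, file_id, target):
        """Copy the file and reference the copy."""
        return self.add(target, self.files[file_id])

    def delete(self, file_id, group):
        """Remove one reference; drop the file when no references remain."""
        self.groups[group].remove(file_id)
        if not any(file_id in refs for refs in self.groups.values()):
            del self.files[file_id]
```

In this model, deleting one replicant leaves the file in place, while deleting the last remaining reference removes the file as well.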

From the user’s view point, groups are collections of references: if you delete a group those references will be removed from your database (and if an entry in that group has no replicants the corresponding file will be removed too!).

Tags, on the other hand, are, from the user’s point of view, properties of documents: if you delete a tag, this tag will be removed from all the documents but nothing happens to their entries (references) in the hierarchy.

Replicants provide a way to refer to a file at more than one place in your group hierarchy.

Tags provide a way to characterize files in a manner that does not fit with your group hierarchy.

For example, my group hierarchy sorts my journal articles according to their topic. When I design a course (a course typically covers different topics, but does not use all the literature I have about each topic), I tag the literature I want to use with the tag for that course. In this way, I can use my group hierarchy and the see also function to decide which articles to use and quickly access the articles I selected by means of the tag without making any changes in my group hierarchy.

korm · August 6, 2012, 9:55am

.

arnow · August 6, 2012, 10:38am

Korm, I completely agree with the quote above. Actually, it expresses the main reason why I stress the difference between groups and tags. In my experience (when helping colleagues and friends) the developers’ view that groups are tags and tags are groups is very confusing, precisely because groups and tags behave very differently (this is also clear from the irony in edgley’s nice “Tags and groups are the same except …”).

I have no idea whether or not I describe a deeper reality. I don’t care. In my experience, however, my description is more useful to my colleagues and friends (and hopefully to some users of this forum) than the standard description suggesting that tags are groups, entries files, groups folders and replicants aliases. The latter idea instigates all sorts of unnecessary worries (‘am I deleting the original or the alias?’ ‘help, the hierarchy isn’t backed up’) and leads to bad design of databases (I have colleagues who can’t use DevonThink’s AI because they group according to the courses they teach, while their subject classification is done by tags) and to serious problems and file loss when indexing (remember the case of lprint174?).

edgley · August 6, 2012, 10:59am

Thanks to all for the time taken to help me understand the basics.

The problem that I have with this “use groups” thinking is this:

I come across things I want to store, but don’t know what I might want to use them for some day in The Future, so I put them in a folder called Things for Some Use Some Day.

Things I come across that have an immediate use go directly into the appropriate project folder (group).

I start a new project, and wonder which of the things I have collected might be of use for it.
So I either search using terms relevant to the new project, or have to look through the Things for Some Use Some Day folder, or look through all the other project folders to see if I can find anything of use.

But if I then have a use for something, and move it from the Things for Some Use Some Day folder, it won’t be there for me to see if I start another project that might need it.

Am I making this a lot harder than it needs to be?
Lol, wouldn’t be the first time.

arnow · August 6, 2012, 11:02am

When you don’t index and as long as your database isn’t corrupted, it is not relevant where exactly the files are stored, but it is very important to know 1) that they are stored somewhere in the database, 2) that the group hierarchy contains references to the files rather than the files themselves, and 3) that further details concerning the location of the files are irrelevant.

A quick look at this forum will show that when someone asks where DevonThink’s files are stored, one of the people from DevonTechnologies will quickly answer that they are in the .filesnoindex folder. Apparently they don’t agree with you and me that this is irrelevant.

Greg_Jones · August 6, 2012, 11:14am

It’s worth pointing out that this is accurate from the point of view of some, perhaps even the majority, of users. However, over the years here I have observed many users using tags and groups in very ‘creative’ ways that don’t necessarily conform to this model. One way in particular: if tagging is turned on for all groups in the database, then tags become a semi-automated system to replicate documents in the database. Tags are the collections of references for these users, and they file their documents accordingly using the keyboard instead of a mouse or trackpad.

arnow · August 6, 2012, 11:51am

What are the ‘things’ you’re talking about? Articles that might be of use the next time you buy a new computer? Reviews of books you might want to read? An article that sounds interesting for a paper you might want to write when all the papers you’re writing now are ready?

There are many ways to solve your problem. Here is one:

Suppose you collect all kinds of information concerning environmental education. Then you might group the articles according to their subject. You might have groups for different kinds of landscape, animals, plants, teaching methods and so on. Some articles are relevant to more than one group, in which case you put replicants in all the relevant groups (an article about Birds in Ireland will have references (replicants) in the ‘birds’ and in the ‘Ireland’ group).

New articles arrive in the inbox and are sorted with the help of the classify pane.

If you can’t see a structure in what you collect, leave everything in the inbox or a ‘Things for Some Use Some Day’ group (if you do the latter, be sure to exclude it from classify in the info panel) and wait until you have at least a hundred things, after which you let DevonThink autogroup them.

When you start a new project, you decide which things you want to use and tag them ‘project 1’. You can find the things you need either by searching the database, or perhaps you have one or two groups in your hierarchy that contain most of the relevant entries (for example, if your project is to develop a program for a birding trip to Ireland, it makes sense to look for articles that are both in the Ireland and in the birds group). When you have found one or more very important articles, use ‘see also’ to find more.

For a second project you tag everything ‘project 2’ and so on.

No problem if there are articles that are tagged for more than one project.

If you like to work with project groups, rather than with subject groups and project tags, I recommend replicating (rather than moving) the items to project groups.

I myself strongly prefer subject groups and project tags over project groups because projects are usually too heterogeneous in content to make efficient use of DevonThink’s AI.

arnow · August 6, 2012, 11:59am

Thanks for the addition! I completely agree (and I have experimented myself with such deviant uses)! However, pointing to the possibility of creative uses is quite different from setting inexperienced users on the wrong footing by routinely claiming that groups are tags and tags are groups.

Greg_Jones · August 6, 2012, 12:35pm

If you agree that groups are not necessarily collections of references and tags are not necessarily properties of documents, explain why it would put “inexperienced users on the wrong footing by routinely claiming that groups are tags and tags are groups”. I find it more helpful to teach new users by using the same terminology that they are going to find in the DEVONthink manual, even if there are some subtle differences in group/tag behavior. These small differences will not make any difference initially to a new user that is struggling to grasp how to use all the power that DEVONthink has. Contradicting the manual just adds more confusion.

edgley · August 6, 2012, 12:41pm

I think I am overestimating what the AI can do to help.
I did write out a long post as to how I think I should now do things based on all your help, but I still didn’t quite get it.

Here is what I currently have:

RSS Stories
From various feeds, about lots of different things. Am converting stories I want to keep to pdf, adding tags like I would keyword a photo, then moving to a group called RSS Keep

Scanned Paperwork
Not very much. Was just going to tag and dump in a group called Scanned

Manuals
From software to things like my camera and telescope, pdf

Tutorials
For the same range as manuals, but video and pdf

Things I need to
Watch / Read / Listen / Examine / Download / Buy
These are either found directly, or are a variation on something already in DT

Things I might need
stories from RSS / webpages / images / video / anything I want to dump
No idea what they are for, or what they might be, just that I might want to use it one day

And here is what I am trying to achieve:

A place to store my paperwork on the slim chance I need it
A place to store manuals // tutorials so I can browse through them all
A place to store things that I like, no real use for them
A place to store things that I might need, for what is not known yet

And here is what I want to use all this stuff for:

I am developing a video game.
This needs a huge mass of information, ranging from stuff I find, to stuff I create.
When I find stuff, I will either need the whole thing (like an image) or a reference to one bit of information in it (but I might refer to it again somewhere else).
There will be notes I create detailing the specifics of the game as well.

There are currently two game concepts I am investigating, so there will be two of these mass information pots.

However, as well as information specific to the game, there is information that is specific to designing a game; creating levels, learning software, etc

I am planning on using Curio as the creative space for each concept and DT as the black hole, with built-in Spotlight.

Thanks again for helping me get all this, tis really appreciated.

korm · August 6, 2012, 1:21pm

.