Oct 3 2010
I’ve received a number of questions from readers about what to do with all the PDFs they’ve purchased over the years now that they are ready to migrate to a dedicated ereader. The fact is that there is no perfect conversion of PDFs -> 6″ screen. PDFs were made to be read on a larger screen and the best portable device out there to read PDFs is the iPad, hands down. However, if you can live with some formatting quirks, there are two free programs that will allow you to convert your PDFs to ePub or Mobi to be read on your favorite 6″ screen eink device.
The first program you need is Briss. You need a program like Briss to cut off the top of each page, which contains the title or author, and the bottom, which contains the page numbers or other artifacts.
There are plenty of other programs that do a PDF crop, but few of these create what is known as a destructive crop. A destructive crop is one that permanently reduces the margins of the PDF. Most programs (even Adobe Acrobat) do just a basic crop, which means your original document is preserved. Take, for example, Preview on Mac OS. A PDF page has a crop box and a media box. The crop box shows the cropped version but the original (media) version lurks behind it. When you go to convert, the conversion program reads the original version, not the cropped one.
1. Preview (Mac ONLY)
Under Preview (Mac ONLY), you can create a destructive crop by cropping your image, selecting “print” and then “print as PDF” in the bottom left-hand corner. Choose “PS”, which is PostScript. Then you have to open the PS file and print as PDF again.
2. Briss, platform agnostic
Or download Briss. Briss is a small program that does not require installation and runs on any machine that has Java. Simply download the folder and unpack it somewhere. Look for the briss-0.0.10.jar file and double-click it.
A small dialog box will open and you need to select “Load File”. From there, navigate to the PDF you wish to crop. For simple PDFs like our books, Briss will usually create an image cluster for odd pages and another cluster for even pages. Draw a box around the area of the text by clicking the mouse button and holding and dragging.
If you are unhappy with the box you’ve drawn, simply right-click on the purple box. It will disappear and you can redraw your box. If you are happy with your selection, click “Crop PDF” and a dialog box will open allowing you to save the cropped PDF.
The program automatically adds “cropped” to the name so you needn’t worry about overwriting your original PDF. Open your cropped PDF in your favorite PDF viewing program and make sure you have cropped the right area. From here, you can actually just transfer the PDF to your eink reader if your device reads PDFs. Sony and Kindle both do. This is what a cropped PDF looks like on the Kindle without conversion:
As you can see, the font size of the PDF on a 6″ screen is minuscule. It’s very hard to read. This is where the need for conversion comes into play. You should already have Calibre downloaded and installed but if you don’t, grab it here. Simply drag your cropped PDF onto the Calibre screen or use the Add Books button.
Once the book is in the library, select it with your mouse. You can choose to edit the metadata (author, title, publisher). When you are done editing the metadata, press the “Convert books” button. Here you have the option to select ePub (for Sony, nook, iThings) or Mobi (Kindle) as the converted format.
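If you would rather script the conversion than click through the screens, Calibre also installs a command-line converter called ebook-convert, which takes an input file and an output file and picks the format from the output file’s extension. The sketch below only builds that command and does not run it. The function name, the device dictionary and the file name are hypothetical and just for illustration.

```python
from pathlib import Path

# Device-to-format pairs as listed in the post; the dictionary itself is
# an illustrative assumption, not something Calibre defines.
FORMAT_FOR_DEVICE = {"sony": "epub", "nook": "epub", "kindle": "mobi"}


def build_convert_command(pdf_path, device):
    """Build (but do not run) an ebook-convert command line for a device."""
    source = Path(pdf_path)
    # ebook-convert infers the output format from the file extension.
    target = source.with_suffix("." + FORMAT_FOR_DEVICE[device.lower()])
    return ["ebook-convert", str(source), str(target)]


# Hypothetical file name for a PDF already cropped with Briss.
command = build_convert_command("book_cropped.pdf", "Kindle")
print(" ".join(command))
```

To actually run it you could pass the list to subprocess.run, assuming ebook-convert is on your path.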
Now, if all you want to do is remove the header and footer text, you can use Calibre’s “Structure Detection” and regular expressions. Sometimes there is hidden text in the PDF (like a footer or a header) and the destructive crop will NOT remove the hidden text.
You will then need to use the Structure Detection option to remove the hidden text. Structure Detection is an option on the conversion page. This was challenging even for me. There is a tutorial here. Basically, for the page numbers, I use this in the footer:
(\d+ <br> <hr>)
For the header, I used this code:
(<A name=\d+>\s*</a>)(<i>Anne Calhoun </i><br>)|(<A name=\d+>\s*</a>)(<i>Liberating Lacey </i><br>)
The parentheses set off each grouping of text you want to remove. The “|” is an “or” instruction. So here I want to remove the <A name=2></a> along with either the author’s name or the title. Use the “wand” to examine your PDF. You will want to pattern your regular expression off the PDF.
You can click “Test” to determine whether your expression is going to strip out the right text. The yellow highlighted text will be removed:
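Outside of Calibre, a rough stand-in for that test is to list what the expression matches in a snippet of the text. This is only a sketch: the sample text is made up, the helper name preview_matches is hypothetical, and Python’s re module is assumed to treat this expression the same way Calibre does.

```python
import re

# The header expression quoted earlier in the post.
HEADER = (
    r"(<A name=\d+>\s*</a>)(<i>Anne Calhoun </i><br>)"
    r"|(<A name=\d+>\s*</a>)(<i>Liberating Lacey </i><br>)"
)


def preview_matches(pattern, text):
    """Return every piece of text the pattern would strip, in order."""
    return [match.group(0) for match in re.finditer(pattern, text)]


# Made-up sample in the shape the post describes.
sample = (
    "<A name=2></a><i>Anne Calhoun </i><br>\n"
    "Some story text.<br>\n"
    "<A name=3></a><i>Liberating Lacey </i><br>\n"
)

for hit in preview_matches(HEADER, sample):
    print(repr(hit))
```

Anything printed is what would be removed; the story text in between is left alone.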
Regular expressions give me a huge headache so I prefer to use the destructive crop when I can. However, whenever there is a PDF with this hidden text, you will almost always have to use regular expressions to remove the header and footer. Here are some regular expression shortcuts that might help you:
- (<A name=\d+>\s*</a>) = This matches, and so removes, everything that starts with <A name= and ends with </a>. The \d+ matches one or more digits, so it doesn’t matter whether the code is <A name=1></a> or <A name=301></a>. The + means “one or more of the thing before it”, so any number sitting between <A name= and ></a> is covered.
- \n = end of line, used if you need to remove code that is on the next line. I.e.,
- \d+ = matches any run of one or more digits, so any number from 0 on up
- \s* = matches zero or more whitespace characters (the blank spaces between words and letters, usually created with the spacebar, plus tabs and line breaks)
- | = this is called a pipe or vertical bar. I use it to separate sets of regular expressions.
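If you want to see what these shortcuts match outside of Calibre, here is a minimal Python sketch. The sample text is made up for illustration, and Python’s re module may not behave identically to Calibre’s own search in every case.

```python
import re

# Made-up sample in the unquoted <A name=...> style shown above.
sample = "<A name=1></a>first\n<A name=301> </a>second"

# \d+ = one or more digits; \s* = zero or more whitespace characters.
anchor = re.compile(r"<A name=\d+>\s*</a>")

# The pipe means "or": either alternative may match.
either = re.compile(r"first|second")

stripped = anchor.sub("", sample)   # both anchors removed, words kept
words = either.findall(stripped)    # each alternative is found once
print(repr(stripped), words)
```

Note that the second anchor has a space before </a> and is still removed, because \s* allows for whitespace that may or may not be there.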
Anne Calhoun’s Liberating Lacey in PDF form from EC contained an alternating header with author name and title. Remember, the number changes every page:
<A name=6></a><i>Anne Calhoun </i><br>
<A name=7></a><i>Liberating Lacey </i><br>
and a footer with the page numbers:
My regular expression is as follows.
1. Remove the author name:
(<A name=\d+>\s*</a>)(<i>Anne Calhoun </i><br>)
I used the A name code from above and simply copied the <i>Anne Calhoun </i><br> directly from the PDF. Press test and it is all highlighted.
2. Remove Title:
|(<A name=\d+>\s*</a>)(<i>Liberating Lacey </i><br>)
I use the | to separate the sets of text I am removing, use the A name code from above and copy the <i>Liberating Lacey </i><br> directly from the PDF. You could use a \s* between “Lacey” and the </i> just to be on the safe side: <i>Liberating Lacey\s*</i><br>
3. Remove page numbers:
\d+ to remove the page number + \s* to remove any whitespace + <br> copied from the PDF + \n because we are moving to a new line + <hr> copied from the PDF. Put together, with nothing between the pieces, that is: \d+\s*<br>\n<hr>
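To check the three steps together before running a full conversion, you could try them on a snippet in Python. This is a sketch under assumptions: the page text is invented, the footer is assumed to be a page number, a <br>, a line break and an <hr> as described in step 3, and the function name strip_header_footer is hypothetical.

```python
import re

# Steps 1 and 2: author header or title header, joined with the pipe.
HEADER = re.compile(
    r"(<A name=\d+>\s*</a>)(<i>Anne Calhoun </i><br>)"
    r"|(<A name=\d+>\s*</a>)(<i>Liberating Lacey\s*</i><br>)"
)

# Step 3: page number, optional whitespace, <br>, new line, <hr>.
FOOTER = re.compile(r"\d+\s*<br>\n<hr>")


def strip_header_footer(html):
    """Remove the running header first, then the page-number footer."""
    return FOOTER.sub("", HEADER.sub("", html))


# Invented page: a header, one line of story text, and a footer.
page = (
    "<A name=6></a><i>Anne Calhoun </i><br>\n"
    "She had 2 tickets.<br>\n"
    "6 <br>\n"
    "<hr>\n"
)

print(repr(strip_header_footer(page)))
```

The “2” in the story text survives because it is not followed by <br>, a line break and <hr>, which is why copying those pieces from the PDF matters.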
I know. This is hard. It’s hard for me too. Generally, it takes me some trial and error to figure out the right regular expression code. I hope this helps to start you on the road to demystifying that. I’m not at all experienced in this but I thought I would share what little I do understand in hopes of helping others. Obviously the folks at MobileRead are far more experienced than I. The best thing to do is just break it down, line by line, letter/digit by letter/digit.
If you had Adobe Acrobat, you could simply use “Document > Header & Footer > Remove”. Adobe Acrobat, however, is $299. If you have a better suggestion for us PDF owners, I would love to hear it!