Calibre: PDF to ePub Conversion Tips
A couple of weeks ago, after this post about inserting the blurb at the front of an ebook, a reader emailed me asking about cleaning up PDF files when converting to ePub. Often there will be stray letters, numbers, or headers that affect the output of an ebook. There are some tips and tricks in Calibre’s PDF conversion engine that can be used to produce very clean and readable ebooks. I’m going to address three of the most common problems when converting a PDF to ePub and what you can do to fix them.
1. Line numbers.
Some PDFs have line numbers that are on a hidden layer. When you read the PDF you can’t see them, but when you convert to ePub or Mobi they appear within the text and render the converted book unreadable. There’s a very easy fix to this.
Under “Convert books” select “Search and Replace”:
Make sure you select PDF as your input and then either Mobi or ePub (or whatever format you prefer) as output. As the screenshot says, search and replace uses regular expressions. You can read more about it here. Don’t be afraid. You can do this!
The first thing you want to do is click on the wizard button. This will bring up a dialogue box that shows what the PDF looks like before conversion. There’s a number followed by <br>, then a line break. You don’t want any of that.
To remove the numbers and the <br>, which is the HTML tag for a line break, your code would be:
\d+<br>\n
- The \d+ tells the program that you want to remove every number, whether it is 01 or 501. In a regular expression, \d stands for a single character that is a digit. The “+” is a greedy quantifier meaning “one or more,” so it will match as many digits in a row as possible. In a fiction book, this is a good thing.
- <br> is the HTML line break tag.
- \n is the newline character that ends the line.
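If you want to try the expression outside Calibre first, here is a minimal Python sketch. The sample text is made up to resemble what the wizard shows, and it assumes Python's `re` module treats this pattern the same way Calibre's search and replace does.

```python
import re

# Hypothetical sample: hidden line numbers, each followed by a <br> tag
# and a newline, sitting between lines of real text.
sample = "She opened the door.\n12<br>\nThe hall was empty.\n501<br>\n"

# Same expression as above: one or more digits, a <br> tag, then a newline.
line_number = re.compile(r"\d+<br>\n")

# Substituting an empty string mirrors leaving the replace field blank.
cleaned = line_number.sub("", sample)
print(cleaned)
```

Only the number, the tag, and the newline after it are removed; the story text on the surrounding lines is left alone.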
You can click “Test” to determine whether your expression is going to strip out the right text. The yellow highlighted text will be removed:
As you scroll down, sometimes you will see line numbers that start with S or N.
Sometimes the line numbers will end with S or N instead.
To construct the search text, you would simply add the S or N before or after the number.
S\d+<br>\n
N\d+<br>\n
\d+S<br>\n
\d+N<br>\n
Or, if you’re really savvy, you’ll use the \w code. \w matches a single word character (a letter, a digit, or an underscore).
\d+<br>\n|\d+\w<br>\n|\w\d+<br>\n
The | is called a pipe and it separates alternative expressions; text that matches any one of them is removed. I’m sure there is a more sophisticated query but it works for me.
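Here is a small Python sketch of the combined expression at work. The sample lines are invented, and the second pattern, which uses the character class `[SN]`, is an alternative added here for comparison rather than anything Calibre requires. It is a bit stricter, because \w also matches digits, underscores, and every other letter.

```python
import re

# Hypothetical sample with a prefixed, a suffixed, and a plain line number.
sample = "S14<br>\nFirst line.\n15N<br>\nSecond line.\n16<br>\nThird line.\n"

# The three alternatives from the article, separated by pipes.
combined = re.compile(r"\d+<br>\n|\d+\w<br>\n|\w\d+<br>\n")

# Assumed tighter variant: an optional S or N on either side of the digits.
tighter = re.compile(r"[SN]?\d+[SN]?<br>\n")

cleaned_combined = combined.sub("", sample)
cleaned_tighter = tighter.sub("", sample)
print(cleaned_combined)
```

On this sample both patterns strip the same three line numbers. They can differ on real books, so it is worth checking the highlighted matches in the wizard before converting.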
Now when you scroll down, those pesky stray line numbers should be highlighted. Press “OK” and you’ll be sent back to the Search and Replace window. The replace term is left completely blank. Press “Add” or your conversion won’t include the search and replace you just constructed.
Your search expression appears on the left with the replacement text (blank in this case) on the right.
If that’s all you need to exclude, then press “OK” and the conversion will begin. Often, however, there will be headers that need to be removed.
2. Headers.
In the following book there was an alternating header with author name and title. Remember, the number and text will often change from page to page.
<a name="6"></a><em>Anne Calhoun </em>
<a name="7"></a><em>Liberating Lacey </em>
and a footer with the page numbers:
6 <br>
<hr />
Basically I copy any text that needs to be removed and then replace the number with \d+. My regular expression is as follows.
(<a name="\d+"></a>)(<i>Anne Calhoun </i><br>)
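Highlight the text to be removed and then replace the numbers with \d+. You could add a \s+ between “Lacey” or “Calhoun” and the </i> just to be on the safe side: <i>Liberating Lacey\s*</i><br>
The \s+ matches one or more whitespace characters. The \s* in the sample matches zero or more, so it also works when there is no space at all before the </i>.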
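To see the header expression in action, here is a hedged Python sketch. The sample text is invented, uses straight quotes and the <i> tags from the search expression, and assumes each header sits on its own line. The footer pattern is an assumption added here for illustration, since the post only shows what the footer looks like.

```python
import re

# Hypothetical converted text modelled on the header and footer shown above.
sample = (
    '<a name="6"></a><i>Anne Calhoun </i><br>\n'
    "Text from page six.\n"
    "6 <br>\n"
    "<hr />\n"
    '<a name="7"></a><i>Liberating Lacey </i><br>\n'
    "Text from page seven.\n"
)

# Anchor tag with any page number, then either the author or the title,
# optional whitespace before the closing tag, a <br>, and the newline.
header = re.compile(
    r'<a name="\d+"></a><i>(?:Anne Calhoun|Liberating Lacey)\s*</i><br>\n'
)

# Assumed footer pattern: page number, optional spaces, <br>, then <hr />.
footer = re.compile(r"\d+\s*<br>\n<hr />\n")

# Strip the headers first, then the footers.
cleaned = footer.sub("", header.sub("", sample))
print(cleaned)
```

In Calibre you would add the header and footer expressions as two separate search terms, each with a blank replacement.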
3. Line spaces
Calibre has a feature called Heuristic Processing which scans the ebook for common errors and tries to fix them. I use this function primarily to unwrap lines. In this example, you can see the paragraph is broken up by these weird line wraps and reading a book in this format would
be
impossi
ble.
Heuristic Processing is disabled by default, so check the box and the options will become available. The default for the line unwrap factor is 0.40 and frankly that setting rarely works for me.
You will often have to play around with this and convert again at different settings. In this example, I had to change the line unwrap factor to 0.5 to get the paragraphs to be readable.
A soft scene break is when there is an extra blank space between different scenes in a book rather than a wingding or some small graphic such as hashmarks (###) or bullets (•••). You can replace the soft scene breaks with your own text or graphic to further customize the look and feel of your ebooks.
Hope this helps!
And this is why I hate PDFs. There’s so much faffing about to put it on an ereader. I have to be really keen to even bother. Most of the time these days PDF = nope for me.
Nevertheless, thank you for this. It might make life a little easier for those rare times I do accept a PDF. The process I use now is ridiculously complicated and doesn’t always work.
OMG THANK YOU for these tips! I dislike PDFs because I mostly read via Kindle, and PDF files are only for when I sneak in some reading at the office. But converting them to EPUB or MOBI is a pain. So this helps and I will bookmark this for future use.
Thanks Jane. I’m another who hates pdf with a passion and I will be trying these tricks.
Thanks for an incredibly helpful post! I’ve accidentally ended up with PDFs a couple of times, and will have to try these tricks.
Thanks for this post, Jane. I don’t have many PDFs anymore, but when I do I use these techniques to get rid of the headers and footers. But I always have to go trawling through MobileRead to find the exact expressions and half the time I get them wrong. This is great.
Great post and very helpful. In the section on headers, is \s* or \s+ correct? Or are they both correct? You refer to it as \s+ in the explanation, but in the code sample you provide \s* is used.
@Kaetrin & @Jayne: Some PDFs convert better than others (at least to the kindle– I usually tell Netgalley to send them there). I’ve only run into a few that had so many issues I had to read them on my laptop screen. This post will be very helpful with those super annoying ones, though.
@Janine: I had one that was so annoying I ended up waiting to finish reading it until the book was released and bought an epub copy just to save my sanity.
OMG!!!! Thank you!!!!!!
@Janine. Either the asterisk or the plus sign should work. The plus sign requires at least one whitespace character, while the asterisk also matches when there is none.
Thank you! I’ve printed out these instructions and I can’t wait to try them out!
@Jane: Thanks for clarifying. The whole post is super helpful.
@Jayne: I was once in a similar situation with a critique partner’s book that I had to read before publication due to a deadline. I ended up printing it out — luckily I have a laser printer, but I felt bad about killing trees. This will be SO much better.
@Janine: The method I use now involves sending to Kindle and converting to AZW3 there and then downloading, converting again to epub – which is time consuming and annoying.
Thank you!!!!
Oh awesome, thank you for this post! I’m enough of a techie that this does make sense to me, and it’s fun to learn more about Calibre’s capabilities. I’ll be sharing this around. :)
Great post! Thanks for the tips…I’m not a tekkie so your post will definitely help. I remember an article I wrote on my blog about conversion tips for self-publishing which includes Calibre. Here’s the link: https://www.chatebooks.com/blog-Ebook-Converter-Tips-for-Self-Publishing
I gotta try this out, I don’t mind reading epubs or kindle books on my computer, but reading PDFs really is a pain… Thank god authors don’t send those too often anymore. You could also do a copy-paste of the book chapter by chapter in the Reedsy Book Editor (https://reedsy.com/write-a-book) and then export the epub file. Not a painless process, but quite simple.
I’ve been having a problem converting PDF to ePub. Calibre shows me that it’s busy converting the book, but after a few minutes when I check, the process is still going and it’s on 1%. It never goes further than that. Am I doing something wrong? I’ve been using Calibre for a long time and never had that problem.
Hi, I might be the first to say it, but everything I did showed no change in the output file. To be exact, I used copy and paste for the code and just changed the names. Where did I go wrong?
Thank you very much for posting this article. I am very thankful that you shared this, and I will try to learn more about regex for ebooks. Meanwhile, I have a Kindle Paperwhite 3rd gen and it seems to have an issue: if I convert PDF to Mobi with the new version, I can’t open the book on the Kindle, but this is fixed by using the old format. Does anyone else have the same problem?
Thank you so much for this! I’m not super techie, but this was still easy to understand. There’s one thing that’s not quite working though; I’ve got all the footers removed from the PDF no problem, but it’s only removing the first four of each of the alternating headers.
The header strings are:
AUTHOR NAME 4
alternating with
TITLE 1 1
So in the Regex field I’ve put (AUTHOR NAME \d+)
Is there something I’m missing that would cause it to only catch the first four instances each of the author and title headers?
Thanks!
@Kat: Sorry, I didn’t realize the comments field would take what I typed as code rather than just text. I’ll try another method:
The original was: htmlspecialchars(AUTHOR NAME 1 2) alternating with
htmlspecialchars(TITLE 1 1)
Which I tried replacing with htmlspecialchars((AUTHOR NAME \d+)) alternating with
htmlspecialchars((TITLE \d+))
@Kat: Argh! I give up. Never mind, and sorry to bother you!!
Nice article, and I learned a few things I didn’t know, even though I’ve been using calibre for this for years!
I’d just like to point out that when you changed “line unwrap factor” from “0.40” to “0.5”, you *increased* it, not *decreased* it. That’s “0.50”, or ½ :-) But yeah, that line unwrap factor is a pain.
Thank you so much!
What about removing “33 S”? The space between the 33 and S allows it to not be filtered out.
@Anders: jk got it. spaces are \s lol
Thank you SOOOOO much!
OMG, thank you for Line spacing tips :) :) :)