Taking a closer look at EPUB
05 Oct 2016 | by Scott Nesbitt
(Note: This post is adapted from part of a presentation I gave at FSOSS 2011)
I don’t have to tell you how publishing, and preparing documents for publication, have changed in the last few decades. Nowhere has this change been more pronounced than in the production of ebooks. Since the 1990s, ebook formats have bloomed like a thousand flowers. Many of them died on the vine, while others persisted or were overshadowed by newer formats.
One format that’s taken a dominant place among ebook formats is EPUB. You might be surprised, though, to learn that a number of people (including those in our profession) don’t really understand what EPUB is and how a file is structured.
I’ve been using EPUB for my ebooks and they generally outsell the PDF versions of those books by a margin of about 2 to 1. And I’ve found that a basic knowledge of the innards of an EPUB file can come in handy when troubleshooting problems with a book.
Let’s take a closer look at the EPUB format.
A bit of background
EPUB is an electronic book format that’s based on open standards. More on this in a few paragraphs. It became a standard in 2007, and has a number of useful features. These include the ability for an EPUB file to flow to fit the size of a screen, the ability to embed metadata, and support for audio and video. The latter, though, depends on the ebook reader software and hardware you’re using.
EPUB, as you’ve probably guessed, is supported by a number of e-readers and ebook reader apps. The notable exception is Amazon’s Kindle. You can, however, use Amazon’s KindleGen utility to convert an EPUB file to a format that’s more palatable to the Kindle.
Of course, there are a number of tools — both commercial and open source — that let you either natively author EPUB files or convert other formats to EPUB.
EPUB, although a young upstart, had the advantage of coming into its own in the right place at the right time — at the dawn of the latest surge in mobile devices. The format was built for viewing on a screen. I don’t think that print was even an afterthought when EPUB was created.
Now that that’s out of the way, let’s put on our x-ray specs and take a deeper look into an EPUB file.
Peering into an EPUB file
Remember when I wrote a few paragraphs back that EPUB is based on open standards? Well, those standards are XHTML, XML, and CSS.
The text of a book is XHTML. Yes, the same XHTML that’s used to create Web pages. So if you have existing content — for example, articles that have been published on your website or as blog posts — you can use that content as the basis of a book.
CSS, if you don’t know, is short for Cascading Style Sheets. Cascading Style Sheets let you apply formatting to a web page. Think of a CSS file as being like a template in a word processor. By changing attributes in a CSS file, you can change the look and feel of an EPUB file.
XML comes into play with an EPUB table of contents file (named toc.ncx) and a metadata file (named content.opf).
The table of contents file provides both the structure of your ebook and the navigation within the EPUB file. Yes, a true table of contents.
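To make that a bit more concrete, here's a minimal sketch of pulling the chapter labels and their targets out of a toc.ncx file using Python's standard library. The sample NCX content, the chapter names, and the read_toc function are all made up for illustration; a real toc.ncx will usually carry more detail than this.

```python
import xml.etree.ElementTree as ET

# Namespace used by NCX table of contents files.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# A hypothetical, stripped-down toc.ncx for demonstration only.
SAMPLE_NCX = """
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="ch1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
    <navPoint id="ch2" playOrder="2">
      <navLabel><text>Chapter 2</text></navLabel>
      <content src="chapter2.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""


def read_toc(ncx_text):
    """Return (label, target file) pairs for each navPoint, in document order."""
    root = ET.fromstring(ncx_text)
    entries = []
    for nav_point in root.iterfind(".//ncx:navPoint", NCX_NS):
        label = nav_point.find("ncx:navLabel/ncx:text", NCX_NS)
        content = nav_point.find("ncx:content", NCX_NS)
        if label is None or content is None:
            continue  # skip incomplete entries rather than guess
        entries.append(((label.text or "").strip(), content.get("src")))
    return entries


if __name__ == "__main__":
    for label, target in read_toc(SAMPLE_NCX):
        print(label, "->", target)
```

Each navPoint pairs a label that the reader sees with the file it points to, which is why one file can supply both the structure and the navigation.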
The metadata file, obviously, contains information about the book, such as its title, author, language, and the software used to publish it. This is information that readers rarely, if ever, see but which should be in an EPUB file to make it complete.
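If you're curious what that metadata looks like to a program, here's a rough sketch that reads the Dublin Core elements from a content.opf file. The sample OPF, the book title, the author name, and the read_metadata function are assumptions for the sake of the example, not something taken from a real book.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, used for title, creator, language and similar fields.
DC = "{http://purl.org/dc/elements/1.1/}"

# A hypothetical, minimal content.opf for demonstration only.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>A Sample Book</dc:title>
    <dc:creator>Jane Author</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest/>
  <spine/>
</package>
"""


def read_metadata(opf_text):
    """Collect Dublin Core fields from an OPF document into a dictionary."""
    root = ET.fromstring(opf_text)
    metadata = {}
    for element in root.iter():
        if element.tag.startswith(DC):
            field = element.tag[len(DC):]  # strip the namespace prefix
            metadata[field] = (element.text or "").strip()
    return metadata


if __name__ == "__main__":
    print(read_metadata(SAMPLE_OPF))
```

Note that this sketch keeps only the last value when a field repeats, for example when a book has more than one creator.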
EPUB files have the extension .epub. What a surprise … But it isn’t some esoteric and murky format like, say, .doc. It’s actually a ZIP file. You can open an EPUB file using any file compression utility. Just change the ebook’s extension to .zip and double-click.
At the top level of the file, you’ll find two folders: META-INF and OEBPS.
The folder META-INF contains some basic metadata for the book. The folder OEBPS is where the XML, XHTML, CSS, and any graphic files reside.
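You don't even need to rename the file to peek inside. Here's a small sketch that uses Python's zipfile module to list the top-level entries of an EPUB. The build_sample_epub helper creates a tiny stand-in archive in memory, and its file names (such as chapter1.xhtml) are hypothetical. The mimetype entry isn't mentioned above, but the EPUB container format expects one at the top level alongside the two folders.

```python
import io
import zipfile


def top_level_entries(epub_source):
    """Return the sorted top-level names inside an EPUB (ZIP) archive."""
    with zipfile.ZipFile(epub_source) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


def build_sample_epub():
    """Build a tiny, hypothetical EPUB-like archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/toc.ncx", "<ncx/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Pass a path such as "ebook_file.epub" instead to inspect a real book.
    print(top_level_entries(build_sample_epub()))
```

Run against the sample archive, this prints ['META-INF', 'OEBPS', 'mimetype'].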
Ebook readers expect this directory structure, and it becomes very important when validating a book. I’ll be discussing that next.
Validation and testing
So you’ve got a nicely formatted EPUB file. Now, all you have to do is let it loose into the wild. Not so fast. You can do that, but it’s not the best move. Before offering your EPUB for download or for sale, you should validate and test it.
Let’s take a look at both processes.
Validation is the process of making sure that your EPUB books contain all the elements that ebook readers expect. Like what? Here’s a partial list:
- Complete metadata
- The proper directory structure in the EPUB file
- Valid XHTML
- Working links and references to files in the EPUB file
- A table of contents
And a lot more. If you don’t validate your EPUB book, chances are it will still render properly in your ebook reader. But why take that chance? Validation isn’t difficult to do, and there are some good tools and services that let you do just that.
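To give a feel for the kind of thing a validator looks at, here's a very rough sketch that checks just a couple of the structural items in the list above. It's nowhere near a real validator. The function name, the messages, and the sample archive are all my own inventions for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

SAMPLE_CONTAINER = """
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def basic_structure_problems(epub_source):
    """Return a list of structural problems found (empty if none were spotted)."""
    problems = []
    with zipfile.ZipFile(epub_source) as archive:
        names = archive.namelist()
        if "mimetype" not in names:
            problems.append("missing mimetype file")
        elif archive.read("mimetype").strip() != b"application/epub+zip":
            problems.append("unexpected mimetype contents")
        if "META-INF/container.xml" not in names:
            problems.append("missing META-INF/container.xml")
            return problems
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            problems.append("container.xml names no package file")
        elif rootfile.get("full-path") not in names:
            problems.append("package file named in container.xml is missing")
    return problems


def build_sample_epub(include_package=True):
    """Build a hypothetical EPUB-like archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", SAMPLE_CONTAINER)
        if include_package:
            archive.writestr("OEBPS/content.opf", "<package/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(basic_structure_problems(build_sample_epub()))
    print(basic_structure_problems(build_sample_epub(include_package=False)))
```

The first call prints an empty list and the second reports the missing package file. Checking metadata, XHTML, and links is where a proper validator earns its keep.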
If you use the open source ebook authoring tool Sigil, you’ll find that it has a built-in validator. All you need to do is open your EPUB file in Sigil, click a button, and after a few seconds it points out any problems.
If you don’t want to do that, then download and install epubcheck. epubcheck is what powers the IDPF validator. It’s a command line Java application that’s quite easy to use. Just run the command:
java -jar epubcheck-0.9.2.jar ebook_file.epub
That seems simple enough, doesn’t it? There is one catch, though. Validators are great at finding problems. But in many cases, they’re lacking when it comes to explaining what those problems are, specifically. The validators assume that you have a certain level of knowledge and the ability to fix the problem. That’s not always the case.
When I was validating an ebook, I got an error message telling me that there was invalid HTML syntax in a particular file. I went to the line number that the validator pointed to in the file, and I didn’t see anything wrong. And I have a good knowledge of HTML. Well, it turned out that the validator was expecting paragraph tags around text surrounded by blockquote tags. I only figured that out by running the offending HTML file through an HTML validator.
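For what it's worth, that particular problem is easy to hunt for once you know what to look for. Here's a small sketch, assuming the issue is bare text sitting directly inside a blockquote element. The sample markup and the function name are hypothetical, and the parser used here will choke on XHTML that relies on named entities such as &nbsp;.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample: the first blockquote holds bare text, the second does not.
SAMPLE_XHTML = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <blockquote>Bare text, no paragraph tags.</blockquote>
    <blockquote><p>Text wrapped in a paragraph.</p></blockquote>
  </body>
</html>
"""


def blockquotes_with_bare_text(xhtml_text):
    """Return the text of each blockquote that has text outside any child element."""
    root = ET.fromstring(xhtml_text)
    offenders = []
    for quote in root.iter(XHTML + "blockquote"):
        # Text directly inside the blockquote lives in .text and in each child's .tail.
        bare_pieces = [quote.text] + [child.tail for child in quote]
        if any(piece and piece.strip() for piece in bare_pieces):
            offenders.append("".join(quote.itertext()).strip())
    return offenders


if __name__ == "__main__":
    print(blockquotes_with_bare_text(SAMPLE_XHTML))
```

With the sample above, only the first blockquote is reported.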
Like validation, testing is optional. But it’s worth doing, if only as a final quality check. Crossing “i”s, dotting “t”s, making sure that line and paragraph breaks are accurate. That sort of thing.
In a perfect world, someone publishing an ebook would have access to one of every device on which people read electronic books — ebook readers, tablets, and smartphones. Sadly, it’s not a perfect world.
So, what do you do? Use the devices that you have. They should give you a good idea of how your ebook will look when people read it. Also, consider using Calibre, an open source ebook management application for desktop and laptop computers. While it’s not (as some people believe) intended as a tool for reading ebooks, Calibre does have a solid ebook reading feature. One sneaky trick you can use is to resize Calibre’s ereader window to simulate how your ebook will look on screens of various sizes.
Chances are, you won’t find many (if any) problems.
I don’t know about you, but I find EPUB to be an interesting format. Not just from the perspective of someone who writes and publishes ebooks, but also from a slightly more technical perspective. While I don’t need, and don’t plan to acquire, a deep technical understanding of EPUB, knowing something about the format’s innards can be useful.