Sunday, April 17, 2005

Delving into DocBook

For more years than I'm prepared to admit (even to myself ... no, especially to myself) I've been pursuing a part-time PhD at the University of Liverpool. It has taken much longer, and been much harder work, than I anticipated, and I expected it to be quite hard work. Still, I met with my supervisor on Thursday last week, and we've agreed that I've done enough to start writing up now. The exact criteria for getting a PhD in the UK system are somewhat unclear (to me), but in essence the PhD is regarded as research training, and the successful candidate needs to show a thorough understanding of the research area and to have made a contribution to knowledge. The primary assessment is made on the thesis, and that's what I have to produce next. Breakthrough discoveries are not required, which is good because I feel I've opened up more questions than I've answered. Actual breakthroughs: none; contributions to knowledge: well, let's see when I've finished the thesis. Emotionally I'm not sure that I've got to where I wanted to be when writing up, but I suspect that's misplaced perfectionism and hubris at work.

This brings me to the actual subject of this posting: which writing tool to use? At the office (i.e. my day job) we generally use the MSOffice suite, but I'm not going to write my thesis in Word. Reasons: (i) I've had Word crash on me and lose content or formatting information, and it's just tedious to recreate lost work; (ii) Word does this more often on long, multi-part documents, which is exactly what I'm going to be writing; (iii) I want to generate decent HTML from the finished thesis so that I can put it up on the web, and Word's HTML output is vile; and (iv) I don't want to be locked in to a proprietary binary format forever. Many of my academic friends and colleagues use LaTeX for writing. I used to be a LaTeX user many, many years ago, but those skills have completely atrophied. Plus, I've seen the output of some LaTeX to HTML converters, and the results are simply horrible. So I've decided to try a brave experiment and use DocBook.

Reasons for choosing DocBook? Well, mostly the opposite of the strikes against Word. It's a well-tested, open, text-based format. DocBook is specifically designed to generate multiple output presentations (HTML, PDF, XML) from a single source. There are lots of DocBook tools, and assuming they tend towards some sort of normal distribution, some of them must be out on the right-hand tail! It is a problem, however, that there are just so many tools around. I'm finding it very daunting figuring out where to start. So, as a DocBook neophyte I'm going to try to capture some of my baby steps and discoveries as I go. Add some rocks to the way-markers as I pass.

Some of the features and issues I'm going to be looking out for:

  • Generating documents that have separate chapters, but allow me to cross reference between chapters;
  • Generating XHTML output as both page-per-chapter (or page-per-section) and one-page-per-document;
  • Inserting meta-data into the XHTML output, including id attributes so that I can cross-reference the XHTML content;
  • Using CSS stylesheets with the output XHTML;
  • Automated indexing;
  • Bibliography support (I will really miss ProCite as I move away from Word, and I need to find a way of migrating my large ProCite database to DocBook's world);
  • Equations and formulae;
  • Source editors, whether on the raw XML or a WYSIWYG view onto the document content.
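
As a rough sketch of the first item in the list above, here is what cross-referencing between chapters might look like in DocBook 4.x-style markup, together with a small Python check that every cross-reference points at something real. The chapter ids (ch-intro, ch-method), the titles, and the dangling_xrefs helper are all invented for illustration; this is not a description of any particular DocBook tool, just one way to catch broken links before running a real toolchain.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-chapter thesis skeleton in DocBook 4.x style (no namespace).
# The ids "ch-intro" and "ch-method" are made up for this example.
DOCBOOK_SOURCE = """\
<book>
  <title>Thesis</title>
  <chapter id="ch-intro">
    <title>Introduction</title>
    <para>The approach is described in <xref linkend="ch-method"/>.</para>
  </chapter>
  <chapter id="ch-method">
    <title>Method</title>
    <para>This builds on <xref linkend="ch-intro"/>.</para>
  </chapter>
</book>
"""


def dangling_xrefs(xml_text):
    """Return the linkend values that match no id attribute in the document."""
    root = ET.fromstring(xml_text)
    # Collect every id declared anywhere in the document.
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    # Collect the target of every xref element, in document order.
    targets = [xref.get("linkend") for xref in root.iter("xref")]
    return [target for target in targets if target not in known_ids]


print(dangling_xrefs(DOCBOOK_SOURCE))
```

Each xref names its target through the linkend attribute, and the target is any element carrying a matching id, so a reference from one chapter into another is written the same way as a reference within a chapter. In a real thesis the chapters would more likely live in separate files pulled into one book, which this sketch does not attempt.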

Usually when I learn a new technology my strategy is to hit Amazon for a good book. Not the For Dummies kind, but something that gives a pretty good map of the territory then starts getting into detail without too much flapping about. O'Reilly books are generally pretty good exemplars of the style I like. The print version of the O'Reilly DocBook book, however, is seriously out-of-date. You can read the up-to-date version online, but that ranks a poor second in my experience. As I find better learning resources, I'll try to remember to blog them here.
