Sunday, April 24, 2005

DocBook investigation: progress update

Quick update on the DocBook investigation. I've tried a number of schema-aware XML editors for generating DocBook sources, and didn't like any of them. The problem, I think, is one of familiarity: I don't know the DocBook schema very well, so vanilla schema-assisted editing doesn't give me enough support. I also tried saving OpenOffice documents as DocBook. This works, but doesn't seem to offer much in the way of fine-grained control over the generated XML, and I couldn't get it to round-trip nicely.

The best solution I've found so far is XMLMind. There's a free standard edition and a payware professional edition; I've only tried the standard edition so far. It's a Java/Swing application, with a slightly odd feel to the UI, but I quickly found myself adjusting. So far, it has been easily the most effective solution I've found for editing DocBook as XML, but with a WYSIWYG(-ish) presentation. I've had to step outside the editor and directly hack the XML once or twice, for example to insert XInclude instructions to modularise my thesis into one-chapter-per-file chunks (sketched below). XMLMind was easily able to cope with the XIncludes once I had entered them; there may be a way of doing XInclude from the interface, but I couldn't see it.

The standard edition of XMLMind doesn't generate PDF files: you need the professional edition for that. However, I yum install'ed fop from JPackage.org, and that works fine. It was also nice to see that XMLMind keeps very up-to-date with the DocBook XSL stylesheets from SourceForge.
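
For anyone wanting to try the same trick, here's a minimal sketch of the XInclude approach (the chapter file names are invented for illustration):

  <book xmlns:xi="http://www.w3.org/2001/XInclude">
    <title>My Thesis</title>
    <xi:include href="chapter1.xml"/>
    <xi:include href="chapter2.xml"/>
  </book>

Each chapterN.xml then contains a single chapter element as its document root.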

Next goal is bibliography processing. DocBook can handle references already; I just need my refs in the appropriate format. I have a large collection of existing reference data in ProCite for MS Windows. For managing the references on Linux, RefDB looks like a good choice. Unfortunately, I've not been able to install it so far due to incompatibilities with libdbi on Fedora Core 3. I've asked on the refdb list to see if anyone has a solution. It may also be that Fedora Core 4 will have the more up-to-date libraries when it ships (the problem isn't just libdbi, but conflicts with the MySQL and PostgreSQL versions installed by FC3). Fingers crossed. In the meantime, I haven't yet found a ProCite to RefDB translator. ProCite can export data as a comma-delimited file, but the meaning of each field depends on the reference type. I have a sinking feeling I may end up writing my own ProCite to XML converter. Sigh.

Final note: I've been using Bob Stayton's DocBook XSL: The Complete Guide, second edition, as one source of assistance in learning my way around DocBook's world. It's an excellent resource, thoroughly recommended.

del.icio.us: docbook

Friday, April 22, 2005

Jena tip: namespaces and the j.0 problem

A frequently asked question on the Jena list can be paraphrased as: "help - my output contains these weird j.0 namespaces, how do I get rid of them?". In the hope that Google will save future askers of this question some time, here's an explanation of what is happening, and what to do about it.

First, consider the following code snippet:

    // requires: import com.hp.hpl.jena.rdf.model.*;
    public static void main( String[] args ) {
        Model m = ModelFactory.createDefaultModel();
        Property p = m.createProperty( "p" );
        Resource r = m.createResource( "r" );
        r.addProperty( p, 42 );
        m.write( System.out, "RDF/XML" );
    }

This could be expected to write a representation of the simple RDF model r p "42". But in fact, it produces a Java exception. Exactly which exception depends on the version of Jena we are using; in my current test setup I get Exception in thread "main" com.hp.hpl.jena.rdf.arp.RelativeURIException: No scheme found in URI 'p'. The problem is that RDF (and RDFS, and OWL) expects the names of things to be URIs. The symbol p isn't a URI. So let's change the example slightly:

    public static void main( String[] args ) {
        Model m = ModelFactory.createDefaultModel();
        String NS = "http://example.com/foo#";
        Property p = m.createProperty( NS + "p" );
        Resource r = m.createResource( NS + "r" );
        r.addProperty( p, 42 );
        m.write( System.out, "RDF/XML" );
    }

OK, now this runs and produces the following output:

<rdf:RDF
    xmlns:j.0="http://example.com/foo#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="http://example.com/foo#r">
    <j.0:p>42</j.0:p>
  </rdf:Description>
</rdf:RDF>

So here's the mysterious j.0 appearing. What's going on? The j.0 is an XML namespace prefix, defined in the root element of the RDF file:

    xmlns:j.0="http://example.com/foo#"

To get the full URI, just replace the j.0: prefix with the namespace URI it is bound to in the declaration. But why was a prefix introduced at all? Consider the alternative. With RDF's striping XML syntax, elements are alternately resource and property names. Suppose we hadn't used a namespace for p:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="http://example.com/foo#r">
    <http://example.com/foo#p>42</http://example.com/foo#p>
  </rdf:Description>
</rdf:RDF>

<http://example.com/foo#p> isn't a legal XML element name: it contains characters (such as / and #) that are not syntactically permitted in an XML element name. So, using XML's namespace mechanism gets us out of a hole when we want RDF identifiers in XML to be URIs. It also has value in its own right, though: semantically, your p relation may denote something different to my p relation; if we put them in different namespaces there's much less chance of an accidental confusion of semantics.

So, now that we know why j.0 appears, what can we do? One solution is to not use XML output. The same example, written in N3 format instead of RDF/XML, becomes:

<http://example.com/foo#r>
      <http://example.com/foo#p>
              "42" .

No funny prefixes in sight. Alternatively, we can just ensure that we use a sensible prefix for the namespace instead of Jena's autogenerated j.0, j.1, etc. The key to this is the PrefixMapping interface, which is a super-interface of Model. The method setNsPrefix lets us assign a more meaningful (to human readers!) prefix:

    public static void main( String[] args ) {
        Model m = ModelFactory.createDefaultModel();
        String NS = "http://example.com/foo#";
        m.setNsPrefix( "eg", NS );
        Property p = m.createProperty( NS + "p" );
        Resource r = m.createResource( NS + "r" );
        r.addProperty( p, 42 );
        m.write( System.out, "RDF/XML" );
    }

Producing:

<rdf:RDF
    xmlns:eg="http://example.com/foo#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="http://example.com/foo#r">
    <eg:p>42</eg:p>
  </rdf:Description>
</rdf:RDF>

del.icio.us: jena, semantic web.

Sunday, April 17, 2005

Delving into DocBook

For more years than I'm prepared to admit (even to myself ... no, especially to myself) I've been pursuing a part-time PhD at the University of Liverpool. It has taken much longer, and been much harder work, than I anticipated, and I expected it to be quite hard work. Still, I met with my supervisor on Thursday last week, and we've agreed that I've done enough to start writing up now. The exact criteria for getting a PhD in the UK system are somewhat unclear (to me), but in essence the PhD is regarded as a research training, and the successful candidate needs to show a thorough understanding of the research area and to have made a contribution to knowledge. The primary assessment is made on the thesis, and that's what I have to produce next. Breakthrough discoveries are not required, which is good because I feel I've opened up more questions than I've answered. Actual breakthroughs: none; contributions to knowledge: well, let's see when I've finished the thesis. Emotionally I'm not sure that I've got to where I wanted to be when writing up, but I suspect that's misplaced perfectionism and hubris at work.

This brings me to the actual subject of this posting: which writing tool to use? At the office (i.e. my day job) we generally use the MSOffice suite, but I'm not going to write my thesis in Word. Reasons: (i) I've had Word crash on me and lose content or formatting information, and it's just tedious to recreate lost work; (ii) Word does this more often on long, multi-part documents, which is exactly what I'm going to be writing; (iii) I want to generate decent HTML from the finished thesis so that I can put it up on the web, and Word's HTML output is vile; and (iv) I don't want to be locked in to a proprietary binary format forever. Many of my academic friends and colleagues use LaTeX for writing. I used to be a LaTeX user many, many years ago, but those skills have completely atrophied. Plus, I've seen the output of some LaTeX to HTML converters, and the results are simply horrible. So I've decided to try a brave experiment and use DocBook.

Reasons for choosing DocBook? Well, mostly the opposite of the strikes against Word. It's a well-tested, open, text-based format. DocBook is specifically designed to generate multiple output presentations (HTML, PDF, XML) from a single source. There are lots of DocBook tools, and assuming they tend towards some sort of normal distribution, some of them must be out on the right-hand tail! It is a problem, however, that there are just so many tools around. I'm finding it very daunting figuring out where to start. So, as a DocBook neophyte I'm going to try to capture some of my baby steps and discoveries as I go. Add some rocks to the way-markers as I pass.

Some of the features and issues I'm going to be looking out for:

  • Generating documents that have separate chapters, but allow me to cross-reference between chapters (see the sketch after this list);
  • Generating XHTML output as both page-per-chapter (or page-per-section) and one-page-per-document;
  • Inserting meta-data into the XHTML output, including id attributes so that I can cross-reference the XHTML content;
  • Using CSS stylesheets with the output XHTML;
  • Automated indexing;
  • Bibliography support (I will really miss ProCite as I move away from Word, and I need to find a way of migrating my large ProCite database to DocBook's world);
  • Equations and formulae;
  • Source editors, whether on the raw XML or a WYSIWYG view onto the document content.
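
As a taster for the first and third of those items, DocBook handles cross-references with id attributes and xref elements; here's a minimal sketch (DocBook 4.x markup, ids invented for illustration):

  <chapter id="ch-background">
    <title>Background</title>
    <para>Material I'll want to refer back to.</para>
  </chapter>
  <chapter id="ch-results">
    <title>Results</title>
    <para>As argued in <xref linkend="ch-background"/>, further work was needed.</para>
  </chapter>

The XSL stylesheets expand each xref into generated text, and the id values should come through as anchors in the XHTML output.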

Usually when I learn a new technology my strategy is to hit Amazon for a good book. Not the For Dummies kind, but something that gives a pretty good map of the territory then starts getting into detail without too much flapping about. O'Reilly books are generally pretty good exemplars of the style I like. The print version of the O'Reilly DocBook book, however, is seriously out-of-date. You can read the up-to-date version online, but that ranks a poor second in my experience. As I find better learning resources, I'll try to remember to blog them here.

del.icio.us: docbook

Monday, April 11, 2005

Passionate about research?

I don't normally bother with the whole blog-rolling thing. There are plenty of smart people out there discovering and linking interesting things far more assiduously than me. An exception today, though.

Shelley recently linked to Kathy Sierra's creating passionate users blog. I've been reading for a few days now. Great stuff! I haven't read any of Kathy's team's books yet, but I certainly plan to. If they're anything like the quality of writing in the blog, they should be very, very good.

A recent post on Kathy's blog was You and your users: casual dating or marriage?. I won't repeat the stories here (go read them for yourself, you'll enjoy the experience), but the take-home is that making your users passionate about your product turns them from just customers to active advocates for your business. Great! I can really see how that applies in a commercial context. But. But I work in research. Corporate research, to be sure, not academia, but nonetheless I've been wondering how Kathy's ideas might apply in a research context. Because the essence of an academic-style research training is to be dispassionate. To take ideas, pull them apart with the surgeon's tools of statistics, peer-review and analytical cynicism, and lay them out on the slab for inspection. Reports written in the third-person passive voice, striving for the measured tones of the respected sage.

In one way the comparison is clear: when we try to get other parts of the company interested in the ideas we're working on in the lab (what my friends at BT Labs call down-streaming), I can see that our internal customers could get passionate about the research we're pushing. Even then it's slightly different because the stuff in the lab is usually not finished. There isn't a great pair of skis to try out, though we may have a completely new and half-finished shoe clamp (NB I know absolutely nothing about skiing, except that it involves gravity in some way). Worse, often what we're actually asked for is a set of slides to summarise our work. I don't believe that anyone ever got passionate about a PowerPoint slideset. Ever.

But even more of a puzzle is how to get passionate in peer-reviewed research. Or even whether that would be a good thing. I have to say, though, that most of the conferences and workshops I attend, even the good ones, are pretty dreary things. Maybe it's the format. Maybe it's the type of people who attend. Maybe it's some kind of cultural meme we all get inoculated with. But I often wonder how much value the delegates really get from such events. Especially when, as is all too common, the audience sits mutely through a presentation, says little in the Q&A, and then carps in the corridor afterwards about the poor assumptions the presenter made.

It certainly is possible to get excited about research ideas. A number of times I've had the mind-expanding experience of reading a paper and getting a real sense of new avenues of exploration, or products, being opened up. It's a rush, but it doesn't happen very often, more's the pity. Something to work on.

Friday, April 08, 2005

Jena tip: navigating a union or intersection class expression

One of the things I spend a lot of my time doing is answering Jena questions. Historically, the search capability at YahooGroups has been atrocious. For some time, I've been thinking that blogging some Jena tips for Google to find would be a good idea. I'm told that YahooGroups' search capability has been improved recently; nonetheless, I'm going to try blogging some of the more common issues and FAQs as they come up. Maybe it will save someone some time, and me some email effort.

One frequently asked question is how to get classes out of a union or intersection class expression. Suppose you have some OWL like this:

  <owl:Class rdf:ID='StateMachine'>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType='Collection'>
          <owl:Restriction>
            <owl:onProperty>
              <owl:ObjectProperty rdf:about='#state'/>
            </owl:onProperty>
            <owl:someValuesFrom>
              <owl:Class rdf:about='#State'/>
            </owl:someValuesFrom>
          </owl:Restriction>
          <owl:Class rdf:about='#Automaton'/>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>

A state machine is the intersection of the class Automaton with the class of things that have at least one state. It's just a synthetic example, don't sweat the details! First, here's some code to list the elements of the intersection:

  // assumes m is an OntModel that has already read in the OWL above
  OntClass stateMachine = m.getOntClass( NS + "StateMachine" );
  IntersectionClass ec = stateMachine.getEquivalentClass()
                                     .asIntersectionClass();

  for (Iterator i = ec.listOperands(); i.hasNext(); ) {
      OntClass op = (OntClass) i.next();

      if (op.isRestriction()) {
          System.out.println( "Restriction on property " + 
                              op.asRestriction().getOnProperty() );
      }
      else {
          System.out.println( "Named class " + op );
      }
  }

Two key points here: first, Jena uses .as() to convert between views, or facets, of RDF resources in the model. Since RDF resources can change type according to what's asserted in the model, ordinary Java casting doesn't work because it's too static. The .as() mechanism is really a form of dynamic polymorphism. The generic form of .as() takes the facet's Java class as a parameter, but the Ontology API classes provide various convenience methods with the pattern .asXYZ(). Hence .asIntersectionClass().
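
For reference, here's the generic form doing the same conversion (cls stands for any OntClass resource; in the Jena version I'm using, .as() is declared to return RDFNode, hence the cast):

  IntersectionClass ec = (IntersectionClass) cls.as( IntersectionClass.class );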

The second key point, and the one that many people don't seem to notice in the documentation, is that UnionClass and IntersectionClass are both extensions of BooleanClassDescription, which presents a variety of means for accessing the members of the intersection or union. listOperands() returns an iterator whose values are the classes in the intersection or union.

del.icio.us: jena semanticweb

Sunday, April 03, 2005

owl:minCardinality is not minUtility

The open world assumption can cause initial confusion to people trying to get to grips with the semantic web. The OWA states, in essence, that just because you don't know something to be true, you can't assume it to be false. For example, let's assume that Mary says her father is Fred (call this S1). She also says that her father is George (S2). If Fred and George actually referred to two different people, Mary's statements would be inconsistent, because people only have one father (well, under normal conditions). But, if we knew that Fred was known as George to his work-mates, for whatever reason, so that Fred owl:sameAs George is true, Mary is being consistent. Let's call that equality S3. The Open World Assumption states that knowing only S1 and S2, we can't assume the negation of S3 (written ¬S3). The Closed World Assumption (CWA) allows us to infer ¬S3 if we don't actually know whether S3 is true or false. The CWA is closely related to negation as failure, which will be familiar to anyone who has ever programmed in Prolog.

Note that there's a separate-but-related idea, also well-discussed in ontology design, called the unique names assumption. The UNA means that things with different names are always different, even under the open world assumption. If the UNA applied, statement S3 would automatically be a contradiction. OWL explicitly makes the open world assumption and not the unique names assumption. This entry, however, is about the OWA.

So far, so good. Now, let's suppose the following:

  <owl:Class rdf:ID="Person">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:ID="hasParent"/>
        <owl:minCardinality rdf:datatype="&xsd;int">1</owl:minCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
  <Person rdf:ID="mary"/>

For readers not familiar with OWL, this says, roughly, "the class Person is a sub-class of the class of all things that have at least one parent". That is, all Persons have at least one parent, but there may be some things that have at least one parent that are not Persons. Moreover, we note that Mary is a person. Many people, particularly those used to XML-Schema validation, would expect an OWL validator to complain that Mary doesn't have a declared parent, in violation of the class description. Indeed, this is a frequently asked question on the jena-dev list. But the OWA means that just because we don't know, in this local fragment of the knowledge base, that Mary has a parent, we can't assume that she doesn't have one at all. Mary's parent might be declared in some other KB that isn't currently visible to whoever or whatever is doing the reasoning. In fact, OWL reasoners (including Jena's built-in rule reasoner) will deduce that Mary does have at least one parent; we just don't know the identity of that parent yet.
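
To see this in action, here's a minimal sketch using Jena's ontology API (the namespace is invented for the example):

    // assumes imports from com.hp.hpl.jena.ontology.*, com.hp.hpl.jena.rdf.model.*
    // and com.hp.hpl.jena.reasoner.ValidityReport
    public static void main( String[] args ) {
        String NS = "http://example.com/family#";
        OntModel m = ModelFactory.createOntologyModel( OntModelSpec.OWL_MEM_RULE_INF );

        // Person is a sub-class of the restriction (hasParent minCardinality 1)
        ObjectProperty hasParent = m.createObjectProperty( NS + "hasParent" );
        OntClass person = m.createClass( NS + "Person" );
        person.addSuperClass( m.createMinCardinalityRestriction( null, hasParent, 1 ) );

        // mary is asserted to be a Person, with no hasParent value at all
        m.createIndividual( NS + "mary", person );

        // under the open world assumption, the missing parent is not an error
        ValidityReport report = m.validate();
        System.out.println( "valid? " + report.isValid() );   // expect: valid? true
    }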

Consequently, owl:minCardinality will rarely cause a validation error, and never on its own. So, does this mean that min cardinality has no value, or was put in by mistake, as some have suggested? No. The key point, I think, is that ontologies are not schema languages. Thinking of OWL as a complex data-description language leads to the wrong assumptions. One use for an OWL ontology is to let you make additional deductions about your instance data. In this case, min cardinality allows reasoners to infer class membership by classifying the instance data using the ontology. For example, in one of my ontologies I have:

  <owl:Class rdf:ID="AnyGoalStrategy">
    <rdfs:comment>A goal strategy in which any sub-goals can succeed</rdfs:comment>
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#GoalStrategy" />
          <owl:Restriction>
            <owl:onProperty rdf:resource="#any" />
            <owl:minCardinality rdf:datatype="&xsd;int">1</owl:minCardinality>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
    <owl:disjointWith rdf:resource="#SequenceGoalStrategy" />
    <owl:disjointWith rdf:resource="#PerformGoalStrategy" />
  </owl:Class>

An AnyGoalStrategy instance is recognised as a GoalStrategy resource that has an any relation to a sub-goal. So I don't have to explicitly declare the types of my strategy objects; I just let the reasoner figure them out for me. It's just a small example, but I think it points the way to the utility of owl:minCardinality and other constructs, even in the presence of the open world assumption.
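
To make that concrete, here's a rough sketch of the classification at work in Jena (the file name and namespace are invented; assume the ontology above has been saved as goal-strategies.owl):

    // assumes imports from com.hp.hpl.jena.ontology.* and com.hp.hpl.jena.rdf.model.*
    OntModel m = ModelFactory.createOntologyModel( OntModelSpec.OWL_MEM_RULE_INF );
    m.read( "file:goal-strategies.owl" );   // hypothetical file holding the ontology above

    String NS = "http://example.com/goals#";
    Individual s = m.createIndividual( NS + "s0", m.getOntClass( NS + "GoalStrategy" ) );
    s.addProperty( m.getProperty( NS + "any" ), m.createResource( NS + "g0" ) );

    // the reasoner should now classify s0 as an AnyGoalStrategy, with no asserted type
    System.out.println( s.hasRDFType( m.getOntClass( NS + "AnyGoalStrategy" ) ) );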

[Updated to correct a syntax error in the second example].

Call for papers: AWeSOMe '05

Among other workshops, I'm on the programme committee for AWeSOMe 2005 this year: The First International Workshop on Agents, Web Services and Ontologies Merging. Can't say I'm overly taken with the name, but I had no part in choosing it! I agreed to be on the PC because I do strongly believe (a) that the agent-computing model has something to offer, over and above SOA, and (b) that attempting to build business-focused agent systems today on anything other than a web-services platform is futile. Not that all is well in WS-* land either; other writers have covered in detail the current confusion of WS standards and goals. However, web services do have a strong degree of momentum, tool support, and mind-share. A goodly part of many agent toolkits is distributed-systems infrastructure: precisely the kind of capability that WS-* covers. While agent platforms may have invented some of those wheels, the SOA people have re-invented them, and better in some cases, and have better glossy brochures for their wheels. Time to accept it and move on.

In fact, I'd go so far as to say that a good research topic for someone is to pick a coherent subset of the WS-* collection that addresses the needs of multiagent systems, and write it up with a view to promoting a firm architectural foundation and principles for interop. Similar to what the Web Services Interoperability Organization has done for vanilla web services. Maybe this would be one role for FIPA as it re-invents itself with its new management structure. We'll see.

Saturday, April 02, 2005

Windows backup to DVD+RW

My main workhorse is a Linux workstation, but I have an HP Pavilion running WinXP that the kids use for games and homework. I've been feeling guilty that I don't have a decent backup strategy for that computer. Fortunately, a simple solution has presented itself: via Joel on Software I found a link to The Daily Grind at Larkware.com, and via Daily Grind no. 591 came news of FireStreamer DVD, which makes DVD+RW devices appear as tape drives that Windows Backup can see. Haven't tried it (yet), but it looks like a perfect solution. I have tried other payware tools for backing up my laptop to DVD+RW, but couldn't actually get them to work. At all. Fingers crossed for FireStreamer. I have a good feeling, however: they haven't tried to solve the whole problem of doing backup, they're just fixing the obvious breakage that the built-in backup tool can't write to DVD. Elegant.

Coming up for air

Boy, it has been a busy few months. I've been completely buried in work, so no time to blog. Not sure that I'm any less busy now, but I think I need to resurrect this blog anyway. Nuin presses on: the current CVS head has improved RDF and web-service integration, and I've re-written the interpreter to improve backtracking behaviour. Plus I've added a whole new section on storing goals and strategies, encoded in RDF. No documentation yet, sorry. Soon - promise.

What else? There's a new release of Jena coming very shortly, so that's keeping me busy too. Plus we have a new boss, though it's fair to say that all of the drama in the HP boardroom hasn't directly affected our group much - we just keep working on.

Other 'what else?' changes include migrating my development environment over to Fedora Core 3. I've been using RedHat/Fedora on and off for about three years now, but I've always used Windows as my main coding platform. No more. Windows has got less stable for me, and Fedora 3 is just the business compared to earlier incarnations of RH Linux. Of course, Firefox helps ... I never did get on with plain Mozilla, or Galeon.

Since I work from home a lot, I also wanted to upgrade my home office computer, and since I don't have a huge budget for indulging in new technology, price/performance was important. I ended up ordering a custom build from the fine folks at Phoenix PC's in Jarrow. Graeme at Phoenix PC's was great at helping me select a configuration, and did a good job on the build. My new Pentium IV beast is humming along just fine, though it took a surprisingly long time to re-create my user environment. It's perhaps just a little bit too easy to drop in new RPMs and not quite remember where they came from! Words of praise are due to the good people of jpackage.org for all their efforts in providing a consistent set of yum-able Java library RPMs, and to NVidia for their graphics drivers for Linux. My new machine has a GeForce FX5200; before I installed the NVidia drivers I was getting around 330 frames per second on glxgears, and Tux Racer was just unusable. After installing, which was a breeze, btw, I'm getting 760 fps in glxgears, and Tux is sliding his little penguin tummy into icy oblivion in impressive style. Much to the amusement of my kids.