Thursday, February 23, 2006

Installing ProCite 5 with MS Word 2003

I recently got myself a very nice Compaq nw8240 laptop from work. Of course, that means the tedious process of reinstalling everything. One of the tools I use a lot with MS Word documents is ProCite 5. I still prefer ProCite over the other citation tools I've tried, even though it hasn't been updated for ages. Having reinstalled from the original ProCite CD on my new machine, I then applied the Office XP/WP 10 patch. Even that, however, wasn't enough to get Cite While You Write (CWYW) working. From there, though, it's an easy step: just copy pc5wd32.wll and pc5wd8.dot from the cwyw sub-directory of the ProCite install dir (defaults to c:\program files\ProCite5) to %APPDATA%\Microsoft\Word\STARTUP. APPDATA is a user-adjustable location, but the default is c:\documents and settings\<yourlogin>\Application Data. Sorted.

Del.icio.us: procite, ms-word

Friday, February 17, 2006

Jena tip: optimising database load times

Loading lots of data into a persistent Jena model can often take quite a bit of time. There are, however, some tips for speeding things up.

Let's get the baseline established first. Assume that our data source is encoded in RDF/XML, and the load routine is loadData. I generally use a couple of helper methods to make things a bit smoother in my database code. In particular, I use a short name or alias for each database I'm working with, and store the connection URI, model name, user name, etc., in a table (usually in code, but it could be loaded from a file). I'm not going to dwell on this pattern in this blog entry, since it's not the point of the article. Suffice it to say that getDBUrl returns the connection URL for the database (i.e. the JDBC URL), and similarly for the other methods.
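In case it helps to see the shape of that pattern, here's a minimal sketch of such an alias table. The class and method names (DBConfig, DBRegistry) and the example values are my own invention, not from the original code; a real version might read the table from a properties file instead:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the alias-table pattern described above.
class DBConfig {
    final String url;
    final String modelName;
    final String user;
    final String password;

    DBConfig( String url, String modelName, String user, String password ) {
        this.url = url;
        this.modelName = modelName;
        this.user = user;
        this.password = password;
    }
}

class DBRegistry {
    private static final Map<String, DBConfig> CONFIGS = new HashMap<String, DBConfig>();

    static {
        // In-code table; could equally well be loaded from a file
        CONFIGS.put( "test", new DBConfig( "jdbc:mysql://localhost/jena_test",
                                           "testModel", "jena", "secret" ) );
    }

    static String getDBUrl( String dbAlias )       { return CONFIGS.get( dbAlias ).url; }
    static String getDBModelName( String dbAlias ) { return CONFIGS.get( dbAlias ).modelName; }
    static String getDBUserName( String dbAlias )  { return CONFIGS.get( dbAlias ).user; }
    static String getDBPassword( String dbAlias )  { return CONFIGS.get( dbAlias ).password; }
}
```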

Given that, the primary method here is loadData, which opens the named model from the database, then reads in the contents of a file or URI. source is the file name or URL pointing to the input document:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    FileManager.get().readModel( model, source );
}

private ModelMaker getRDBModelMaker( String dbAlias ) {
    return ModelFactory.createModelRDBMaker( getConnection( dbAlias ) );
}

private IDBConnection getConnection( String dbAlias ) {
    try {
        Class.forName( DBDRIVER_CLASS );
    }
    catch (ClassNotFoundException e) {
        throw new RuntimeException( "Failed to load DB driver " + DBDRIVER_CLASS, e );
    }
    return new DBConnection( getDBUrl( dbAlias ),
                             getDBUserName( dbAlias ),
                             getDBPassword( dbAlias ),
                             DB );
}

This works, but given any significant amount of data to read in, it will usually be very slow. The first tweak is always to do the work inside a transaction. This won't hurt if the underlying DB engine doesn't handle transactions, but will help greatly if it does:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}

In practice, there should be a try/catch block there to roll back the transaction if an exception occurs, but I'm leaving out clutter for educational purposes!
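For completeness, here's roughly what that fuller version might look like. This is my sketch, not code from the project, using the Model transaction methods begin, commit and abort:

```java
protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.begin();
    try {
        FileManager.get().readModel( model, source );
        model.commit();
    }
    catch (RuntimeException e) {
        model.abort();   // roll back the partial load
        throw e;
    }
}
```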

This probably still isn't fast enough, though. One reason is that, to fulfil the Jena model contract, the database driver checks that there are no duplicate triples as the data is read in. This requires testing for the existence of each statement prior to inserting it into the triple table. Clearly this is going to be a lot of work for a large set of triples. It's possible to turn off duplicate checking:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.setDoDuplicateCheck( false );
    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}

The problem with this is that it moves the responsibility for ensuring that there are no duplicates from the DB driver to the calling code. Now, it may well be that this is known from the context: the data may be generated in a way that guarantees it's free of duplicates. In which case, no problem. But what if that's not certain? One solution is to scrub the data externally, using commonly available tools on Unix (or Cygwin on Windows).

First we convert the data to the n-triple format. N-triple is a horrible format for humans to read, but ideal for machine processing: every triple is on one line, and there is no document-level structure. This means, for example, that cat can be used to join multiple documents together, something that can't be done with the RDF/XML or N3 formats. Jena provides a command line utility for converting between formats: rdfcat. Let's take a simple example. Here's a mini OWL file:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns="http://example.com/a#"
  xml:base="http://example.com/a">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://example.com/b" />
  </owl:Ontology>
  <owl:Class rdf:ID="AClass" />
</rdf:RDF>

Which we then convert to n-triples:

[data]$ java jena.rdfcat -out ntriple a.owl
<http://example.com/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Ontology> .
<http://example.com/a> <http://www.w3.org/2002/07/owl#imports> <http://example.com/b> .
<http://example.com/a#AClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

Assume that we have a collection of n-triple files (a.nt, b.nt, etc) and we want to remove all of the duplicate triples. Using common Unix utilities, this can be done as:

cat a.nt b.nt c.nt | sort -k 1 | uniq > nodups.nt

The sort utility sorts the input lines into lexical order. (By default sort uses the entire line as the key, so the -k 1 is strictly redundant here, but it makes the intent explicit.) uniq then condenses adjacent duplicate lines into one, which is where the duplicate triples get removed; sort -u would combine the two steps.
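If the Unix tools aren't to hand, the same scrub can be done in Java. Here's a rough equivalent of my own (not from the original code), using a sorted set, which discards duplicates as a side effect of insertion:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeSet;

public class DedupNTriples {
    /** Merge the given n-triple files, dropping blank and duplicate lines. */
    public static TreeSet<String> dedup( String... files ) throws IOException {
        TreeSet<String> triples = new TreeSet<String>();   // sorted, no duplicates
        for (String file : files) {
            BufferedReader in = new BufferedReader( new FileReader( file ) );
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().length() > 0) {
                        triples.add( line );
                    }
                }
            }
            finally {
                in.close();
            }
        }
        return triples;
    }
}
```

The result can then be written back out one line per triple, giving the same nodups.nt as the shell pipeline.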

Finally, what do we need to change in the original program to load n-triples instead of RDF/XML or OWL files? Happily, nothing! The Jena FileManager uses a file's extension to guess its content encoding: a .nt extension triggers the n-triple parser, and since we used that convention when naming the files, we're done.
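As an aside, if a file doesn't follow the naming convention, the syntax can be given explicitly; if I recall the API correctly, FileManager has a three-argument form of readModel for exactly this:

```java
// 'model' is the target model, as in loadData above. The third argument
// names the serialisation language explicitly, overriding the guess
// that would otherwise be made from the file extension.
FileManager.get().readModel( model, "data.txt", "N-TRIPLE" );
```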

On a recent project, loading a million-triple model into a MySQL 4 database took me just about 10 minutes using these tips, while before optimisation it was taking hours.
