Friday, February 17, 2006

Jena tip: optimising database load times

Loading lots of data into a persistent Jena model can often take quite a bit of time. There are, however, some tips for speeding things up.

Let's get the baseline established first. Assume that our data source is encoded in RDF/XML, and the load routine is loadData. I generally use a couple of helper methods to make things a bit smoother in my database code. In particular, I use a short name or alias for each database I'm working with, and store the connection URI, model name, user name, etc. in a table (usually in code, but it could be loaded from a file). I'm not going to dwell on this pattern in this blog entry, since it's not the point of the article. Suffice to say that getDBUrl returns the connection URL for the database (i.e. the JDBC URL), and so on for the other methods.
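
So that the code below has somewhere to get its constants and connection details from, here's a purely illustrative sketch of those helpers, with made-up MySQL values; the real versions depend entirely on your own setup:

// Illustrative only: in practice the per-alias details live in a lookup table or config file
private static final String DBDRIVER_CLASS = "com.mysql.jdbc.Driver";  // JDBC driver to load
private static final String DB             = "MySQL";                  // Jena's name for the database type

private String getDBUrl( String dbAlias )       { return "jdbc:mysql://localhost/" + dbAlias; }
private String getDBModelName( String dbAlias ) { return dbAlias; }
private String getDBUserName( String dbAlias )  { return "jena"; }
private String getDBPassword( String dbAlias )  { return "secret"; }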

Given that, the primary method here is loadData, which opens the named model from the database, then reads in the contents of a file or URI. source is the file name or URL pointing to the input document:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker =  getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    FileManager.get().readModel( model, source );
}

private ModelMaker getRDBModelMaker( String dbAlias ) {
    return ModelFactory.createModelRDBMaker( getConnection( dbAlias ) );
}

private IDBConnection getConnection( String dbAlias ) {
    try {
        Class.forName( DBDRIVER_CLASS );
    }
    catch (ClassNotFoundException e) {
        throw new RuntimeException( "Failed to load DB driver " + DBDRIVER_CLASS, e );
    }
    return new DBConnection( getDBUrl( dbAlias ),
                             getDBUserName( dbAlias ),
                             getDBPassword( dbAlias ),
                             DB );
}

This works, but given any significant amount of data to read in, it will usually be very slow. The first tweak is always to do the work inside a transaction. This won't hurt if the underlying DB engine doesn't handle transactions, but will greatly help if it does:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker =  getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}

In practice, there should be a try/catch block there to roll back the transaction if an exception occurs, but I'm leaving out clutter for educational purposes!
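
For reference, here's a minimal sketch of what that might look like, using Model.abort() to roll the transaction back and simply re-throwing the exception (what you actually do with it is up to your application):

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.begin();
    try {
        FileManager.get().readModel( model, source );
        model.commit();
    }
    catch (RuntimeException e) {
        // undo the partial load rather than leave the store half-populated
        model.abort();
        throw e;
    }
}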

This probably still isn't fast enough, though. One reason is that, to fulfil the Jena model contract, the database driver checks that there are no duplicate triples as the data is read in. This requires testing for the existence of each statement prior to inserting it into the triple table. Clearly this is going to be a lot of work for large sets of triples. It's possible to turn off duplicate checking:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker =  getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.setDoDuplicateCheck( false );
    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}

The problem with this is that it moves the responsibility for ensuring that there are no duplicates from the db driver to the calling code. Now, it may well be that this is known from the context: the data may be generated in a way that ensures that it's free of duplicates. In which case, no problem. But what if that's not certain? One solution is to scrub the data externally, using commonly available tools on Unix (or Cygwin on Windows).

First we migrate the data to the n-triple format. N-triple is a horrible format for humans to read, but ideal for machine processing: every triple is on one line, and there is no structure to the file. This means, for example, that cat can be used to join multiple documents together, something that can't be done with the RDF/XML or N3 formats. Jena provides a command-line utility for converting between formats: rdfcat. Let's take a simple example. Here's a mini OWL file:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns="http://example.com/a#"
  xml:base="http://example.com/a">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://example.com/b" />
  </owl:Ontology>
  <owl:Class rdf:ID="AClass" />
</rdf:RDF>

Which we then convert to n-triples:

[data]$ java jena.rdfcat -out ntriple a.owl
<http://example.com/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Ontology> .
<http://example.com/a> <http://www.w3.org/2002/07/owl#imports> <http://example.com/b> .
<http://example.com/a#AClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

Assume that we have a collection of n-triple files (a.nt, b.nt, etc) and we want to remove all of the duplicate triples. Using common Unix utilities, this can be done as:

cat a.nt b.nt c.nt | sort -k 1 | uniq > nodups.nt

The sort utility sorts the input lines into lexical order; -k 1 tells it to use the entire line as the sort key, not just the first field (sort splits lines into fields, using whitespace as the separator). uniq then condenses adjacent duplicate lines into one, which is where the duplicate triples actually get removed.
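
Incidentally, sort can do the de-duplication by itself: the -u flag makes it keep only the first of each run of equal lines, so the pipeline above collapses to a single command:

[data]$ sort -u a.nt b.nt c.nt > nodups.nt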

Finally, what do we need to change in the original program to load n-triples instead of RDF/XML or OWL files? Happily, nothing! The Jena FileManager uses the file's extension to guess the serialisation syntax: *.nt triggers the n-triple parser, so, since we used that convention when naming the files, we're done.
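
If for some reason the files can't be named with the .nt extension, the syntax can be given to the FileManager explicitly (at least in the Jena 2.x releases I've been using), for example:

FileManager.get().readModel( model, source, "N-TRIPLE" );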

On a recent project, loading a million-triple model into a MySQL 4 database took me just about 10 minutes using these tips, while before optimisation it was taking hours.

Updated: oops - forgot to tag this entry ...

8 comments:

Katie Portwin said...

We are currently loading lots of data into a persistent Jena model, and have had to grapple with load performance.

I like your idea of turning off doDuplicateCheck, and scrubbing another way.

However, am I right in thinking you're considering a one-off big load? I'm dealing with *ongoing* loading - i.e. I'd have to handle potential duplicates involving one triple in the existing store and one in the new batch. Any ideas?

Ian said...

Hi Katie,
Yes, I was thinking about a one-off load, not an ongoing process. You could try, as an experiment, writing the existing data out to n-triple (you could use a separate process to do this, since more than one process can see the same database), then use comm to locate those lines (i.e. triples) that are in one file and not in the other. These are the incremental triples to load. I can see why this should reduce the incremental load time, but whether it makes enough difference in practice is something you'd need to try.
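
Concretely, the sort of thing I have in mind (the file names are just placeholders, and note that comm needs its inputs sorted):

sort existing-dump.nt > existing-sorted.nt
sort new-batch.nt > new-sorted.nt
comm -13 existing-sorted.nt new-sorted.nt > delta.nt

delta.nt then holds just the triples that appear in the new batch but not in the existing store, and can be loaded as before.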

This of course assumes that (i) the database doesn't change while performing this process, and (ii) that the new data is purely additive (i.e. no deletes).

Patrick Paulson (patrick_paulson@yahoo.com) said...

My current plan is to
1) create a new unique index in the underlying database on the statement table
2) subclass Jena's ModelRDB to look for places where adding a duplicate might happen
3) catch the resulting database error from trying to insert a duplicate into the index, and ignore it.

Jena's RDB performance notes hint at this solution, and even point to the sql files in the distribution that are used to initialize the database -- this is where the new index definition would be added.

garrett said...

Interesting post. Jena must have deprecated or completely removed setDoDuplicateCheck, because that method no longer exists. Bummer.

Ian said...

Garrett -
It's still there:

setDoDuplicateCheck

Note that it's on ModelRDB, not Model.

Regards,
Ian

Revi S. said...

Any ideas on how to prevent Jena from creating duplicate triples during a load because of bNodes? They defeat the purpose of doDuplicateCheck anyway.

For example, if I were to load this RDF file (http://planetrdf.com/bloggers.rdf) into a persistent Jena model and did multiple loads of the same information, the number of triples in my store grows by about 50% each time, even though the source is the same.

Any ideas on how to prevent this?

Ian said...

Hello revi s,
You asked the same question on the Jena support list, and the answer I'm going to give is the same. A bNode is an existential variable: two triples with different bNode subjects but the same predicate and object are not the same triple. Think of it this way: suppose you have "there exists something, call it X, with colour white" and "there exists something, call it Y, with colour white". If those two triples were duplicates, it would mean that X and Y were the same resource. Which would be unfortunate if X was a glass of milk, and Y was the Antarctic.

If you know that the predicate of the triple is inverse functional, so that it unambiguously identifies one resource, then a reasoner is entitled to conclude that the two bNodes were owl:sameAs. But that happens at a different level of abstraction than duplicate suppression in the storage layer.

Ian

Revi said...

Thanks, Ian, for your response. I'll need to use rules to weed out what can be inferred to be the same bNodes based on IFPs. Makes sense that this is not part of the duplicate suppression layer.
