Tuesday, October 27, 2009

ISWC In Use Track - raw notes 4

Rapid: enabling scalable ad-hoc analytics on the semantic web - Sridhar et al

Motivation: rapid growth in RDF data. Progress on storage, but not analytics.

Analytical queries include multiple groupings and aggregations, e.g. for each month of the year, the average sales vs the sales in the preceding month. Hard to do in databases, even harder in RDF because of the absence of a schema and the combination of data and metadata.
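To make the example query concrete, here is a minimal Python sketch of "average sales per month vs the preceding month" computed directly over triples. This is my own illustration, not anything shown in the talk: the triples, the predicate names (`month`, `amount`) and the function `monthly_comparison` are all made up.

```python
from collections import defaultdict

# Hypothetical sales data as (subject, predicate, object) triples.
triples = [
    ("sale1", "month", 1), ("sale1", "amount", 100.0),
    ("sale2", "month", 1), ("sale2", "amount", 300.0),
    ("sale3", "month", 2), ("sale3", "amount", 150.0),
    ("sale4", "month", 3), ("sale4", "amount", 50.0),
    ("sale5", "month", 3), ("sale5", "amount", 250.0),
]


def monthly_comparison(triples):
    """Return {month: (average, previous month's average or None)}."""
    # Grouping 1: there is no schema, so first regroup the triples by subject.
    records = defaultdict(dict)
    for subject, predicate, obj in triples:
        records[subject][predicate] = obj

    # Grouping 2: collect the sale amounts per month.
    amounts_by_month = defaultdict(list)
    for record in records.values():
        if "month" in record and "amount" in record:
            amounts_by_month[record["month"]].append(record["amount"])

    # Aggregation: average per month.
    averages = {m: sum(v) / len(v) for m, v in amounts_by_month.items()}

    # Second pass over the aggregate: pair each month with the one before it.
    return {m: (avg, averages.get(m - 1)) for m, avg in sorted(averages.items())}


result = monthly_comparison(triples)
```

Even this toy version needs two groupings plus a self-comparison of the aggregated result, which is the kind of multi-step query the talk was concerned with.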

Goal: use map-reduce to do RDF analytics. There are high-level dataflow languages, e.g. Pig Latin, but these languages expect structured, not semi-structured, data.

RAPID uses Pig as a basis and extends Pig Latin with RDF primitives. Showed a raw Pig Latin program - about 10 steps. Q: how to automate/abstract this, to avoid the chance of user errors? [missed a bit here]

Expression types: class expression, path expression. Three key functions: generate fact dataset (GFD), generate base dataset (GBD), multi-dimensional join (MDJ). GFD re-assembles n-ary relationships from triples. GBD creates container tuples for each group for which aggregation is required. MDJ finds matches between base and fact tuples, and updates the base dataset.
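A rough Python sketch of how I understood the three functions to fit together. This is an in-memory illustration based only on the one-line descriptions above, not RAPID's actual operators or their Pig Latin syntax; the function names, signatures, the count/total aggregates and the sample triples are all my own assumptions.

```python
from collections import defaultdict


def generate_fact_dataset(triples, properties):
    """GFD (assumed shape): rebuild one n-ary tuple per subject from triples."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate in properties:
            by_subject[subject][predicate] = obj
    # Keep only subjects that have every requested property.
    return [rec for rec in by_subject.values()
            if all(p in rec for p in properties)]


def generate_base_dataset(facts, group_key):
    """GBD (assumed shape): one empty container per group to aggregate."""
    return {fact[group_key]: {"count": 0, "total": 0.0} for fact in facts}


def multi_dimensional_join(base, facts, group_key, measure):
    """MDJ (assumed shape): match facts to base containers and update them."""
    for fact in facts:
        container = base[fact[group_key]]
        container["count"] += 1
        container["total"] += fact[measure]
    return base


# Hypothetical input triples.
triples = [
    ("s1", "region", "EU"), ("s1", "amount", 10.0),
    ("s2", "region", "EU"), ("s2", "amount", 20.0),
    ("s3", "region", "US"), ("s3", "amount", 5.0),
    ("s4", "region", "US"),  # no amount, so dropped by GFD
]

facts = generate_fact_dataset(triples, ["region", "amount"])
base = generate_base_dataset(facts, "region")
summary = multi_dimensional_join(base, facts, "region", "amount")
```

In the real system these steps would presumably run as map-reduce jobs over much larger data; the sketch only shows the data flow from triples to fact tuples to updated base tuples.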

Reasonable results compared to non-optimised MapReduce applications. Comment from the audience: very slow (five orders of magnitude slower) compared to traditional data-warehousing.

[Saw comments on IRC via Twitter that this is just like early 90's BI applications. The example wasn't well chosen from that pov, but I think this is quite interesting. Doing analytics on large scale datasets is going to be a huge problem in my opinion]
