I was recently a key resource in migrating OLAP1 cubes from proprietary software to a FLOSS2 setup built on PostgreSQL and Pentaho BI. I learned that OLAP isn't really PostgreSQL's home turf, but even basic support for analytical queries definitely makes many tasks easier.
There are two main use cases for databases: online transaction processing and online analytical processing. The main difference between the two is that OLTP3 issues many relatively simple read and write operations and depends on ACID4/MVCC5, while OLAP runs complicated and very heavy read operations and isn't much concerned with concurrency issues.
Another difference is that OLTP uses the traditional, two-dimensional relational model, where you have tables and relations between them, while OLAP analysis is done in multi-dimensional cubes. Imagine a products table; a product's price is determined by two factors - productid (row) and price (column). So the price is an intersection of two dimensions. Now imagine your business runs across borders, so the price is determined by more than two factors - productid, price and currency - one and the same product can have a different price in GBP and EUR. So the price is an intersection of three dimensions. But the price of the same product in the same currency can also differ by country - the same product can cost X EUR in, say, Germany and Y EUR in, say, Ireland. So there can be even more than three factors. In OLAP these factors are called dimensions, and what sits at the intersection of all the dimensions is called a measure.
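To make the example concrete, here is a minimal sketch of how those dimensions and the price measure could be modelled relationally (the table and column names are illustrative, not from the original system):

```sql
-- Hypothetical fact table: each row is one intersection of the dimensions.
CREATE TABLE price_facts (
    product_id  integer NOT NULL,   -- dimension: product
    currency    char(3) NOT NULL,   -- dimension: currency (GBP, EUR, ...)
    country     char(2) NOT NULL,   -- dimension: country (DE, IE, ...)
    price       numeric NOT NULL,   -- measure: the value at the intersection
    PRIMARY KEY (product_id, currency, country)
);

-- Same product, same currency, different country => a different measure:
INSERT INTO price_facts VALUES (42, 'EUR', 'DE', 9.99);
INSERT INTO price_facts VALUES (42, 'EUR', 'IE', 11.49);
```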
During my adventure with moving the cubes my tool stack was quite simple; I used only PostgreSQL as the ROLAP6 storage and Pentaho BI as the engine to access it via MDX7 queries. It was working fine, but it was slow. Why is that?
Well, you might have noticed a new term. ROLAP differs from MOLAP8 in a very significant way. While MOLAP systems store pre-calculated values in pre-aggregated tables and keep their data files optimised for fast access, ROLAP systems depend on live SQL queries to fetch, calculate and aggregate whatever the user asks for. This imposes a serious overhead and makes working with ROLAP cubes a lot slower than with cubes stored in MOLAP systems.
It is worth mentioning, though, that ROLAP has some advantages over MOLAP; there's quite a good section on this in Wikipedia's article. One is that you can actually set up your OLAP schema as a relational database schema: facts in one table, dimensions in other tables, with foreign keys linking them together. You can arrange the tables in either a snowflake9 or a star10 schema.
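A star schema arrangement like the one described above could look roughly like this (a sketch with illustrative names, not the actual schema from the migration):

```sql
-- Star schema sketch: one fact table linked to dimension tables
-- via foreign keys.
CREATE TABLE dim_product (
    product_id integer PRIMARY KEY,
    name       text NOT NULL
);

CREATE TABLE dim_country (
    country_id integer PRIMARY KEY,
    name       text NOT NULL
);

CREATE TABLE fact_sales (
    product_id integer NOT NULL REFERENCES dim_product,
    country_id integer NOT NULL REFERENCES dim_country,
    sold_at    date    NOT NULL,
    amount     numeric NOT NULL   -- the measure
);
```

In a snowflake variant, dim_country could itself reference a further table such as dim_region, normalizing a multi-level dimension across several tables.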
So to run a ROLAP system with a good user experience you really need aggregate tables. Without them all the calculations and aggregations are done live, which with large data sets is just not acceptable. But creating aggregate tables is really tedious, especially if you have a lot of dimensions: you have to create one for each and every combination of dimensions that a user might want to use. And of course you have to repeat the whole process regularly so your cube stays up to date.
Technically, to implement the aggregate tables I decided to use materialized views11. It's a pretty obvious choice, and Pentaho allows use of them (though I had to create a little patch). Basically you declare your matviews in the OLAP schema file and the engine will consider them when answering users' queries. Great tool, really.
But creating the views demanded a lot of manual work and also a lot of computation time. Each CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW required a sequential scan of a quite large fact table and of many other, sometimes also large, dimension tables. This made my cube refresh process quite lengthy.
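Continuing with the hypothetical fact_sales table from earlier, one aggregate table per combination of dimensions looks roughly like this; note that each view, and each refresh, costs its own full scan of the fact table:

```sql
-- One aggregate for the (product, country) combination...
CREATE MATERIALIZED VIEW agg_sales_product_country AS
SELECT product_id, country_id, sum(amount) AS total_amount
FROM fact_sales
GROUP BY product_id, country_id;

-- ...and separate ones for (product) alone, (country) alone, and so on.
-- Keeping the cube up to date means re-scanning the fact table for each:
REFRESH MATERIALIZED VIEW agg_sales_product_country;
```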
With the recent addition of GROUPING SETS, CUBE and ROLLUP to PostgreSQL (in version 9.5), the task of creating aggregate tables becomes easy. Thanks to CUBE it is possible not only to create a whole set of aggregates in one go but, more importantly, to do it with only one sequential scan.
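Again using the illustrative fact_sales table, CUBE expands to every combination of the listed dimensions, and GROUPING() tells the aggregation levels apart - all computed in a single pass over the data:

```sql
-- All four grouping sets - (product, country), (product), (country)
-- and the grand total - materialized with one sequential scan:
CREATE MATERIALIZED VIEW agg_sales_cube AS
SELECT product_id,
       country_id,
       GROUPING(product_id, country_id) AS grouping_level,
       sum(amount) AS total_amount
FROM fact_sales
GROUP BY CUBE (product_id, country_id);
```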
This is really important functionality and has the potential to make a huge difference when using PostgreSQL as ROLAP storage. From what I can tell, Pentaho shouldn't have any trouble using materialized views created with the new keywords, but I haven't tested it.
While working on moving the cubes to a FLOSS-powered solution I found that MOLAP systems are much faster at giving the end user what she wants; the consensus in the community seems to be the same. That's why the patch is a big deal.
One thing to remember, though: this alone doesn't give you OLAP, since PostgreSQL won't process your MDX queries. Pentaho BI is another piece of the puzzle.
On-Line Analytical Processing ↩
Free/Libre/OpenSource Software ↩
On-Line Transaction Processing ↩
Atomicity, Consistency, Isolation, Durability - a set of rules guaranteeing reliable processing of database transactions ↩
MultiVersion Concurrency Control - concurrency control model allowing database transactions to see a consistent and stable database snapshot ↩
Relational OLAP - a way of accessing OLAP entities through an RDBMS ↩
MultiDimensional eXpressions - a query language similar to SQL used to access OLAP data ↩
Multidimensional OLAP - an OLAP data store using OLAP-optimised storage techniques ↩
in a snowflake schema a fact table is connected to many dimension tables, which in turn can be connected to further dimension tables, allowing even multi-level dimensions to be normalized ↩
a star schema is a special case of the snowflake schema where every dimension is represented by one table, even if the dimension has multiple levels ↩
a materialized view is a physical representation of the data returned by the view's underlying query; it has to be refreshed periodically ↩
If you like this article please consider sharing it with your friends.