Some time ago a colleague asked me to say a bit more about my adventure with Pentaho and how I made it play nice with PostgreSQL. So here it goes, here's my story.
So a decision had been made to try to migrate from SSAS (SQL Server Analysis Services) to some alternative which would preferably be cheaper. PostgreSQL was chosen and, as a consequence, I was hired. Well, they also needed a sysadmin, so it was a pretty sweet deal for them to get one person who actually had experience with both databases and systems.
The first thing I needed to do was to find out what tools I could use to actually make OLAP work with PostgreSQL. I remembered that Paweł, the colleague mentioned above, had done some ETL work with Pentaho's Kettle. So I asked him what he thought about using it for OLAP; he seemed to think it was possible but couldn't provide any details.
So I started my research and pretty soon found some comments about Mondrian, another part of Pentaho's suite, which is designed to run OLAP queries against PostgreSQL-backed databases. It could be used with the full suite to design and view OLAP cubes, and it also provided a web-service endpoint to access the data. But one major issue with it was that it couldn't be used with MS Excel, and one of the project requirements was that end users shouldn't notice any difference (maybe apart from performance, which was then expected to improve with the migration) when using their Excel-based dashboards.
I then had a talk with the guy maintaining the SSAS-backed cubes, a certified Microsoft specialist, and he wasn't surprised about it. Generally he was quite pessimistic about Microsoft allowing a third-party product to work with Excel, since it would be bad for their SSAS business. But one very positive outcome of this conversation was that I could structure my knowledge about how Excel and SSAS talk to each other. The industry standard for accessing OLAP is XML/A (XML for Analysis), and the query language used is MDX (Multidimensional Expressions). I had something to go on.
It was a breakthrough for me when I found XMLA Connect - a small, simple Excel plugin allowing the use of XML/A endpoints as a pivot table data source. Wait. XML/A endpoints? Hmm... I remembered that Mondrian provided a web endpoint, so I checked its name... and it was Xmla! So that was it, I had all the pieces of the puzzle - a backend (PostgreSQL), a middleware (Mondrian/Pentaho) and an interface (XMLA Connect + Excel).
The next step was to actually duplicate the ETL process and do some testing. The data was already there; most of it had been migrated from Oracle to PostgreSQL. So what the ETL actually did was copy data from PostgreSQL to MSSQL, copy some of the non-migrated parts from Oracle to MSSQL, and launch the cube-building processes. Since I was working with Postgres as my backend, I could skip most of the copying and use FDW (foreign data wrappers) to connect to Oracle. But it turned out not to be that simple.
The important thing to realise is that all Mondrian actually does is translate MDX queries into SQL queries, fetch the results and push them through the XML/A endpoint. Of course, you can utilise the Pentaho suite to design your OLAP schema, but you still need the actual data in the backend. You can try to use your regular OLTP database as the source, but I wouldn't advise doing that - it'll probably be very slow; you can find out more in my previous article on OLAP in PostgreSQL.
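To make the translation concrete, here is an illustrative sketch (cube, dimension and table names are all made up, and the generated SQL is only a rough approximation of what Mondrian would produce):

```
-- An MDX request coming from Excel over XML/A:
SELECT {[Measures].[Sales Amount]} ON COLUMNS,
       [Time].[Year].Members       ON ROWS
FROM   [Sales]

-- ...which Mondrian resolves into roughly this kind of SQL
-- against the relational backend:
SELECT dim_time.year, SUM(fact_sales.amount)
FROM   fact_sales
JOIN   dim_time ON fact_sales.time_id = dim_time.time_id
GROUP BY dim_time.year
```

The point is that every cell Excel asks for ultimately becomes an aggregate query like the one above, which is why the shape and speed of the backend schema matter so much.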
To avoid slowness and also to offload read-heavy queries from the primary database, I decided to set up two more databases - a read-only replica utilising streaming replication, and a Mondrian-dedicated DB for the OLAP schema. My first thought was that I could connect the OLAP-dedicated database to the replicated secondary one using FDW, transform the data into an OLAP-friendly schema utilising materialized views, and connect Mondrian to those. It worked, but fetching large amounts of data via the foreign data wrapper was really slow.
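For illustration, the FDW-plus-materialized-view approach I tried first can be sketched roughly like this (server names, credentials and the table layout are all hypothetical; `IMPORT FOREIGN SCHEMA` needs PostgreSQL 9.5+, on older versions you would declare the foreign tables by hand):

```sql
-- On the OLAP-dedicated database:
CREATE EXTENSION postgres_fdw;

CREATE SERVER replica_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.internal', dbname 'prod');

CREATE USER MAPPING FOR CURRENT_USER SERVER replica_srv
    OPTIONS (user 'olap_reader', password 'secret');

CREATE SCHEMA staging;
IMPORT FOREIGN SCHEMA public FROM SERVER replica_srv INTO staging;

-- Denormalise once into an OLAP-friendly shape, then refresh on a schedule:
CREATE MATERIALIZED VIEW mv_sales_cube AS
SELECT t.year, t.month, p.category, SUM(s.amount) AS amount
FROM   staging.sales s
JOIN   staging.dim_time t    USING (time_id)
JOIN   staging.dim_product p USING (product_id)
GROUP BY t.year, t.month, p.category;
```

The catch, as noted above, is that the initial `CREATE MATERIALIZED VIEW` and every refresh pull all the rows through the FDW, which is where the slowness came from.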
Since I have a good deal of experience with Slony-I and value this solution very much, I decided I could use it to feed my OLAP-dedicated DB with the data instead of using the slow FDW. It turned out to be a good decision. So I had a working ETL process consisting of two types of replication between three databases, materialized views, and a custom Perl script I wrote to refresh the MVs in parallel.
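The parallel-refresh step can be sketched roughly as follows. My original script was Perl; this is a hedged Python equivalent with the SQL runner injected as a callable, so the parallelism logic is visible on its own (view names and connection handling are hypothetical - in practice you'd give each worker thread its own database connection, since refreshes of different MVs can run concurrently in separate sessions):

```python
from concurrent.futures import ThreadPoolExecutor

def refresh_views(views, run_sql, workers=4):
    """Refresh a list of materialized views in parallel.

    `run_sql` is any callable that executes one SQL statement,
    e.g. a thin wrapper around a per-thread psycopg2 connection.
    """
    def refresh(view):
        run_sql(f'REFRESH MATERIALIZED VIEW {view};')
        return view

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in the returned results
        return list(pool.map(refresh, views))

# Example with a stub runner that just records the issued statements:
issued = []
done = refresh_views(['mv_sales_cube', 'mv_stock_cube'], issued.append, workers=2)
print(done)  # ['mv_sales_cube', 'mv_stock_cube']
```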
The next big thing on my list was to actually create the cubes using Pentaho's Schema Designer. It's a pretty nice tool that let me do the groundwork, though I had to write some of the more complex cubes by hand. The good thing was that I already had the XML files generated by Schema Designer, so altering them was quite simple. Actually, at this point I was almost certain I could complete the project, so I was determined to make it work.
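For a sense of what those XML files look like, here is a minimal Mondrian schema of the kind the designer generates (cube, table and column names are invented for illustration; a real schema would map onto the materialized views described earlier):

```xml
<Schema name="SalesSchema">
  <Cube name="Sales">
    <Table name="mv_sales_cube"/>
    <Dimension name="Time" foreignKey="time_id">
      <Hierarchy hasAll="true" primaryKey="time_id">
        <Table name="dim_time"/>
        <Level name="Year" column="year" type="Numeric"/>
        <Level name="Month" column="month" type="Numeric"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Amount" column="amount" aggregator="sum" formatString="#,###"/>
  </Cube>
</Schema>
```

Since it's plain XML, hand-editing the trickier cubes mostly meant copying a generated fragment and adjusting dimensions and measures.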
Finally I had a working solution that duplicated all the features of the SSAS-backed cubes. The only problem was that it was quite slow in cases where Mondrian couldn't find any pre-aggregated materialized views and had to issue complex SQL queries. This I also covered in a bit more detail in the previous article.
So it was quite a long road making Pentaho/Mondrian work with PostgreSQL and provide data for MS Excel-based dashboards. It took me around 5 months, from figuring out how the cubes worked on SSAS, through research on OLAP in Postgres, to the actual PoC and the final solution; but obviously I had a lot of other responsibilities and could only spend part of my time on this. It was a great experience, though, and I learned a lot. Judging from Google's answers, nobody had ever done such a big project connecting all these dots together - but maybe I just didn't find it, or it was done but not published.
Well, here it is published now and I sincerely hope it will be of use to the community.
If you like this article please consider sharing it with your friends.