Pentaho and Vertica as a Business Intelligence / Data Warehousing Solution

Introduction

I recently wrapped up a BI/Data Warehouse implementation project where I was responsible for helping a rapidly expanding international e-commerce company replace their aging BI reporting tool with a new, more flexible solution. The old BI reporting tool was based on an “in-memory” reporting engine, was more of a “departmental solution” than an enterprise-grade one, and was not optimally designed. For example, users found themselves downloading data from different canned reports to Excel, where they ran VLOOKUPs and pivot tables to compute simple metrics such as average order value and average unit retail. Despite the best of intentions, there had clearly been a communication gap between business users and IT developers on reporting requirements during the implementation of the original BI tool.
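
To make those two metrics concrete, here is a minimal Python sketch. It assumes the common retail definitions (average order value = revenue / number of orders, average unit retail = revenue / units sold). The `OrderLine` type, the function names, and the sample figures are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    """One line of an order (hypothetical structure, for illustration only)."""
    order_id: int
    units: int
    unit_price: float


def total_revenue(lines):
    # Revenue is units times unit price, summed over all order lines.
    return sum(line.units * line.unit_price for line in lines)


def average_order_value(lines):
    # Assumed definition: total revenue divided by the number of distinct orders.
    orders = {line.order_id for line in lines}
    return total_revenue(lines) / len(orders)


def average_unit_retail(lines):
    # Assumed definition: total revenue divided by the total units sold.
    units = sum(line.units for line in lines)
    return total_revenue(lines) / units


# Made-up sample data: two orders, eight units, 100.0 in revenue.
lines = [
    OrderLine(order_id=1, units=2, unit_price=10.0),
    OrderLine(order_id=1, units=1, unit_price=30.0),
    OrderLine(order_id=2, units=5, unit_price=10.0),
]

print(average_order_value(lines))  # 50.0
print(average_unit_retail(lines))  # 12.5
```

Both metrics are simple ratios over order-line data, which is why having to assemble them by hand in Excel was a sign that the reporting layer was not exposing the right grain of data.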

In designing and implementing the new solution, I set the following strategic tenets / guiding principles:

  • leverage commercial off-the-shelf (COTS) software; minimize customization and emphasize configuration instead (i.e., chose to buy instead of build, and made sure not to build too much after buying)
  • involve all stakeholders and business users throughout the process
  • enable business users to use self-service BI tools as much as possible
  • train as needed; up-skilling the user base on self-service tools is better than hiring an army of BI analysts
  • leverage data warehouse for both internal and external reporting
  • minimize amount of aggregation in Data Warehouse (we did almost no aggregation)
  • maximize the processing power of the ROLAP engine by pairing it with a high-performance analytical database (i.e., columnar MPP database)
  • stick to Kimball data warehouse design approach as much as possible, but be pragmatic where needed; Star Schema, Star Schema, Star Schema! (no snowflakes here)
  • take an iterative approach where possible – need to “ship” on time – understand that 1st release will not be “perfect” but does need to meet business requirements
  • for external reporting, provide canned reports only initially; test user adoption and work with clients to understand and address reporting needs over time
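
The star schema and "almost no aggregation" principles above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` module stands in for the analytical database, and every table name, column name, and data value is hypothetical rather than taken from the actual warehouse.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: flat and denormalized (star schema, no snowflaking).
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);

-- Fact table at the order-line grain; no pre-aggregated summary tables.
CREATE TABLE fact_order_line (
    order_id INTEGER,
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units INTEGER,
    revenue REAL
);

-- Made-up sample rows.
INSERT INTO dim_date VALUES (1, '2013-01'), (2, '2013-02');
INSERT INTO dim_product VALUES (1, 'Shoes'), (2, 'Bags');
INSERT INTO fact_order_line VALUES
    (100, 1, 1, 2, 80.0),
    (100, 1, 2, 1, 40.0),
    (101, 1, 1, 1, 40.0),
    (102, 2, 2, 3, 120.0);
""")


def sales_by_month_and_category(connection):
    # Aggregation happens at query time: join the fact table to its
    # dimensions and group by the attributes the user asked for.
    query = """
        SELECT d.month, p.category, SUM(f.units), SUM(f.revenue)
        FROM fact_order_line AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY d.month, p.category
        ORDER BY d.month, p.category
    """
    return connection.execute(query).fetchall()


rows = sales_by_month_and_category(conn)
for row in rows:
    print(row)
```

A ROLAP engine issues queries of roughly this shape against the detail rows on every request, which is why pairing it with a fast columnar MPP database matters more than building summary tables.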

We looked at traditional players, open source, emerging technologies, and Cloud BI SaaS providers. I made sure business and IT stakeholders were part of the vendor selection process, ensuring they attended demos and vendor presentations. In the end, Pentaho best matched all our needs, providing us with both a solid ETL tool and a solid BI reporting engine. Since we were looking at providing both internal and external reporting with this solution, traditional BI vendors were prohibitively expensive, and “cloud offerings” were not compatible with our current IT capabilities and architecture (our data was not in the cloud).

Solution Description – Vertica + Pentaho BI/PDI

I proposed and received approval from our senior management and company board of directors to use Pentaho and Vertica as our Business Intelligence (BI) / Data Warehouse (DW) solution.

Vertica

HP Vertica is a columnar MPP database that can be 20-100 times faster than Oracle on analytical workloads. It is available in a Community Edition, allowing organizations to use all the features of the database for free for data up to 1TB on three nodes. You can also install the database on a single node, though for a true proof of concept you should use at least three nodes. We started with Vertica 6.1 Community Edition for our proof of concept (POC) and later upgraded to an enterprise license when we went live in production.

Pentaho

Pentaho is an open source BI platform and ETL tool. I liked the fact that it was open source, which allowed us to highly customize the BI implementation if we chose to, as well as develop our own ETL connectors and routines. Some of the client tools are a bit quirky, but I do not know what BI/ETL software isn’t, given the complexity of these tools. Overall the product is solid and delivers as expected. We got the enterprise edition for the additional features and product support from Pentaho. One annoyance is that the configuration files are spread all over the place. To be fair, this is probably more of a Java application configuration issue than a Pentaho issue.

When I tell people that I’m using Pentaho, they are usually surprised; then I find out they were using Pentaho 3.x, and I’m no longer surprised by their reaction. Pentaho 4.x is a big step up from previous major releases, and Pentaho 5.0 is looking really good (I like their UI redesign). The product has matured, and I encourage anyone who looked at an early version of Pentaho to take another look.

When I was selecting a BI vendor, the thought “no one ever got fired for choosing IBM (Cognos)” crossed my mind. I could have gone the “safe” route and used one of the traditional vendors' tools. However, I believe the combination of Vertica + Pentaho has delivered more value to the organization in a shorter amount of time than we would have realized with those other vendors. For our organization, for our business needs, and for the realities of our IT capabilities at the time, Pentaho + Vertica was the way to go. We delivered the project on time and within budget (and without astronomical first-year costs). We have 100% user adoption internally, and are getting very positive feedback from our merchant clients.

Results

  • Recognized by CEO for on-time, on-budget implementation; received “A” grade on end-of-year Enterprise-wide Strategic Initiatives Scorecard
  • Excellent user adoption
  • Positive feedback from external clients
  • Reduced manual reporting tasks by over 50% (and by over 80% in certain departments)