Tuesday, November 16, 2010

Aggregation for Pentaho/InfoBright

Infobright is a column-oriented database, so it is efficient for Business Intelligence (InfiniDB is worth checking too). But my current BI project has 24 dimensions so I also needed aggregates to reach a good level of performance.

Infobright is usually more or less supported by most of the Pentaho suite, but sure not by the Aggregation Designer. I must be one of the first persons who tried, I have found many bugs and workarounds. It was a tough ride, but finally I have found a way to make it work, so here it is!

1) Open PAD and design your aggregate the usual way. Ignore the primary key errors.
2) When you're done, in "Export and Publish", execute the DDL.
3) Try executing the DML, but chances are it will fail. This is probably because Infobright does not support INSERT well.
4) So instead, click on "Preview" and copy-paste the DML SQL code.
5) Open PDI and create a "Table Input" step with the copied SQL's select-from-groupby portion.
6) Connect it to an Infobright Loader step that will write the data into the table created by the DDL.
7) Run the transformation. You can use cron and pan.sh to run it automatically every night.
8) Back to PAD's "Export and Publish", you would normally publish your updated schema, but it results in a NullPointerException.
9) So, export your updated schema
10) Open it with Schema Workbench, ignore the primary key errors, and publish from there.
11) That's all!

Tested with: pad-ce-1.2.1.RC1 biserver-ce-3.7.0.RC1 pdi-ce-4.1.0-RC1 infobright-3.4.2-0

Friday, November 5, 2010

Automatic report generation now possible in Pentaho Data Integration

Last Friday, Pentaho Data Integration (PDI) developer Matt Casters posted a preview of a new tool that allows Business Intelligence designers to include report generation (PRD) as a step of PDI. This is extremely useful, because the obvious step after ETL is often to generate reports.

Let's say a retail chain wants to send, everyday, to every shop manager, a report detailing this shop's performance and trends.
Imagine you have a data warehouse that contains all sales records, and a PRD report template. Then here is how to create a system that will automatically generate and send the reports everyday:
  1. Install the bleeding-edge PDI 4.1.0 RC1
  2. Add Matt's plugin as explained here
  3. Open PDI "Spoon" and create a new Transformation
  4. First, create a "Table Input" step to get for each shop it's code, name and email address.
  5. Second, create a minimal JavaScript step to compute the output's filename and set the PRPT file's name.
  6. Third, use the new "Pentaho Reporting step", and configure it to use your freshly computed PRPT and output filenames, as well as the shop codes.
  7. Finally, create an "Send mail" step and set it to use each shop's email address, putting the generated report as an attachment.
  8. Save and configure your cron to launch this transformation every night via the pan.sh command-line tool.