The problems you may face when upgrading PDI 4.3 to 5.0.1

Upgrading PDI from one version to another is a pretty easy task. You just have to download the latest version and overwrite your existing “data-integration” folder (in fact, I don’t recommend overwriting: create a new copy beside the existing one!). There are a few blog posts and official documentation explaining how to do it properly.

However, what none of these blogs tell you is what kind of problems you can expect after the upgrade. Almost every step from the newer version is compatible with the older one. That is true. But there are a few steps you should pay special attention to.

These are the problems I personally faced.

The first one is in Spoon, with the Zip File step: I wasn’t able to open the dialog to edit the step.

[Screenshot: the Zip File step dialog]

There is no workaround: you have to delete the step and recreate it. I reported the bug: “Zip File step from PDI 5.0.1 is not compatible with older versions of PDI”. It’s not related to the upgrade, but I also found a bug when appending a file to an existing zip with 5.0.1.

 

The second problem was with the “SQL” step of a job. If the “Send SQL as single statement” option is checked, the step produces this error (and nothing more!):

Make sure you have a semicolon at the end of your statement and uncheck “Send SQL as single statement”. I didn’t file a bug report for this one.

 

I sometimes get the following error in the log. It appears randomly and does not cause any obvious problems, so it’s a bit annoying to debug.

I still haven’t found why.

 

The last thing I noticed is a bit disappointing. I explained it in a previous post: PDI 5.0.1 is a little bit slower than 4.3.

 


Pentaho PDI version 5.0.1 is slower than 4.3

After upgrading Pentaho PDI 4.3 to 5.0.1, I noticed that my ETL process took 45 minutes longer to execute. So I asked myself whether the new version is really slower than the old one.

I compared the two Kettle versions with the same transformation (a single Table Input feeding a Table Output). Same database, same table, same driver, same everything, except the PDI release (4.3 vs 5.0.1).

[Screenshot: the Table Input to Table Output transformation]

Here is the result:

           5.0.1                  4.3
           Rows/sec   Total sec   Rows/sec   Total sec
Run #1     33712      50          35864      47
Run #2     31804      53          36643      46
Run #3     33051      51          37458      45
Run #4     33712      50          37458      45
Run #5     33712      50          36643      46
Run #6     31804      53          38309      44
Run #7     33712      50          36643      46
Run #8     32415      52          36643      46
Run #9     31804      53          37458      45
Run #10    31215      54          36643      46

[Chart: Pentaho Kettle 5.0.1 vs 4.3 benchmark results]

Yes, Pentaho Kettle 5.0.1 is slower than 4.3. Sad but true, although it is not a deal breaker considering the new features that come with it.

Tuning tips every PDI user using MySQL should know: increase table output speed by 7000%

I found a thread on the Pentaho forum about Table Output performance. It describes a really easy way to speed up your Table Output, a lot! Some people have been able to reach 84k rows/sec using this method. It is not a joke: it works. It really does.

All you need to do is give the MySQL JDBC driver three parameters:
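
The options in question are most likely the following; the batched-insert rewrite and useCompression are the ones described just below, while useServerPrepStmts=false is an assumption on my part (it is the setting usually paired with the other two):

    # Set these on the MySQL connection in Spoon (Options tab of the
    # database connection) or append them to the JDBC URL.
    useServerPrepStmts=false        # assumed: forces client-side prepared statements
    rewriteBatchedStatements=true   # rewrites batched inserts into multi-row INSERT statements
    useCompression=true             # compresses the traffic between PDI and MySQL

    # The same thing as a JDBC URL (host and database are placeholders):
    # jdbc:mysql://dbhost:3306/mydb?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true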

Together, these options let the driver insert many rows with a single statement instead of one statement per row, and useCompression reduces the traffic between the server and Pentaho.

I tested it myself.

I used an 869 MB CSV file (5,248,714 rows) on my laptop and transferred it to an out-of-the-box MySQL installation on a cheap VPS (1 CPU, 1 GB RAM) with a poor network connection in Europe (yep, across the ocean). The expected performance was a very slow transfer.

I set the number of copies to start of my Table Output step to 4 and the commit size to 1000.

Here is the result without the options: I reached 52 rows/second.

[Screenshot: Table Output metrics without the JDBC options]

The same test, but with the 3 connection parameters: 3674 rows/second! Wow! 7000% faster!

[Screenshot: Table Output metrics with the JDBC options]

Impressive, isn’t it?!

 


Three different ways to create a CSV with MySQL

There are two well-known ways to create a CSV file using MySQL. But there is a third technique I’ve been using for many years that most blogs and forums don’t talk about, and I believe it’s worth showing since it’s faster.

First, there is the “standard” way:
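
That is, a SELECT ... INTO OUTFILE, roughly like this (table name and file path are placeholders):

    -- Export straight from the MySQL server into a CSV file
    -- (requires the FILE privilege; the file is written on the server)
    SELECT *
    FROM my_table
    INTO OUTFILE '/tmp/my_table.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';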

The second technique uses the mysql client and sed:
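
Something along these lines (credentials, database and table names are placeholders; note that this simple sed does not quote fields that contain commas):

    # --batch makes the client print tab-separated rows; sed turns the tabs into commas
    mysql -u myuser -p --batch -e "SELECT * FROM mydb.my_table" \
        | sed 's/\t/,/g' > my_table.csv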

And the technique I use relies on the CSV storage engine:
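
A sketch with hypothetical table and column names (every column must be declared NOT NULL):

    -- Create a table backed by the CSV storage engine and copy the data into it;
    -- MySQL keeps its rows in a plain CSV file inside the datadir
    CREATE TABLE test_csv_table (
        id   INT NOT NULL,
        name VARCHAR(100) NOT NULL
    ) ENGINE=CSV;

    INSERT INTO test_csv_table
    SELECT id, name
    FROM my_table;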

I now have a file named test_csv_table.CSV in my MySQL datadir! If you don’t know where your datadir is, use this statement:
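
For example:

    SHOW VARIABLES LIKE 'datadir';
    -- or
    SELECT @@datadir;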

Obviously, this requires having the privileges to read the MySQL datadir. Also, keep in mind that the CSV engine doesn’t support nullable columns.

What I like about this third technique is that I don’t have to remember the more complex syntax of the other two. It’s pretty straightforward.

Oracle SQL Update Statement with INNER JOIN

Oracle (the database) does not support joins in the UPDATE statement. This is one of many tiny annoying facts about Oracle. Here are two different ways to work around that limitation.

The subqueries
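
For example, with hypothetical employees and departments tables (updating employees.dept_name from departments.name):

    -- Update employees.dept_name from departments without a join in the UPDATE itself
    UPDATE employees e
    SET    e.dept_name = (SELECT d.name
                          FROM   departments d
                          WHERE  d.id = e.dept_id)
    WHERE  EXISTS (SELECT 1
                   FROM   departments d
                   WHERE  d.id = e.dept_id);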

 

The updatable view
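
The same update, written against an inline view over the join; departments.id has to be a primary or unique key so that Oracle treats the view as key-preserved:

    -- Update through an inline join view instead of the table itself
    UPDATE (SELECT e.dept_name AS old_name,
                   d.name      AS new_name
            FROM   employees e
                   JOIN departments d ON d.id = e.dept_id)
    SET old_name = new_name;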

 

MySQL provides, in my opinion, a clearer syntax for this kind of update:
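
For example, with the same hypothetical tables:

    -- MySQL accepts the join directly in the UPDATE statement
    UPDATE employees e
           INNER JOIN departments d ON d.id = e.dept_id
    SET    e.dept_name = d.name;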

 

I can now refer to my blog as an extension of my brain :)

Is BI about content or container ?

What is business intelligence in everyday life? Does the BI team create content, or does it create the container that distributes content?

Certain teams develop performance reports using data from specific business applications. They receive requests from every department of the company and play a data-centralization role. Interpretation is left to power users, and reports are limited to exactly what is required. These teams develop a container: they provide views on the information in the application.

Other teams are mandated to develop analyses and to provide discoveries. They centralize data from multiple sources with a common goal. Business analysts and power users may be part of the team. These teams create new content. This is especially true if they do predictive analytics.

Have you noticed that I used the word “report” for the teams that develop the container and “analysis” for those who develop content? That can give a good hint about which kind of team you are on.

However, this observation is insufficient to answer the question. We must look at the team’s operating model rather than the result it produces. Whether the team develops reports or analyses, it must go through a development process: analysis, documentation, development, test/quality assurance and deployment. A project management model has to be adopted to avoid total chaos. The development itself is not limited to creating a simple SQL query to display a result; there are a lot of architecture activities (the schema, the ETL process, etc.). These activities are the same as in a traditional software development team.

With that in mind, is BI about content or container? Sometimes, BI teams have to create big containers to produce tiny, but so valuable, content.

I still don’t know the answer…

Calling a (sub)transformation inside a transformation with Pentaho PDI

I recently discovered a great and powerful feature in Pentaho Kettle: the Mapping step. It allows you to call a transformation inside another transformation.

It is not intended to call a transformation the same way you do with a job. The way I see it, it is a way to create a new step. If you have a developer background, think of it as a function. If you have a sequence of steps you repeat often, create a mapping with that sequence! It’s a really nice shortcut.

Let’s take a closer look at how it works.

Suppose you have a transformation that retrieves the total and free RAM of the server running the transformation, calculates the quantity of used RAM (total - free = used) and writes it to the log. The transformation should look like this:

[Screenshot: the getRamTransformation transformation]

To create the mapping, you have to create a new transformation with two specific steps: the Mapping Input Specification and the Mapping Output Specification. These steps allow the parent transformation to pass values to the sub-transformation (the mapping) and to get the results back as output fields. In this example, I will create a mapping that takes the total and free RAM as input, calculates the used memory, converts it to MB and sends it back to the parent transformation. The transformation should look like this:

[Screenshot: the calcUsedRamTransformation sub-transformation]

As you can see, the exact same two steps from the first transformation have been duplicated in the sub-transformation. The Mapping Input Specification declares the two fields that will receive the values from the parent transformation. The field names don’t have to be the same as in the parent transformation.

[Screenshot: the Mapping Input Specification dialog]

In the first transformation, the two steps can now be replaced by the Mapping step.

[Screenshot: getRamTransformation using the Mapping step]

In the Input tab of the step, the fields from the parent transformation have to be mapped to the fields declared in the sub-transformation.

[Screenshot: the Mapping step Input tab]

There we go! It’s that easy!

You can download the transformation used in the screenshots.

 

Don’t try to fool the transformation execution.

All the rules regarding the execution sequence still apply!

  • It is not possible to set a variable in the sub-transformation and read it back in the parent transformation.
  • All the input steps start at the same time, regardless of whether the input is in the parent or the sub-transformation.

Since I discovered the Mapping step, I have been using it a lot to make my ETL processes much more modular.