[nQSError: 42039] Columns in BY clause of REPORT totalling function must be in select list. (HY000)

I got this error today with OBIEE 11.1.1.7.0. Obviously, all my columns were in the select list. I found a few blogs and articles with people asking for help with that bug, but none of them provided a solution. Here is mine.

The Error

[nQSError: 42039] Columns in BY clause of REPORT totalling function must be in select list. (HY000)

The symptoms

The error is thrown when sorting one specific column (a measure) of a table with 14 columns (5 measures, 7 dimension columns, including 2 hierarchies). All the other columns could be sorted without any problem; only that one failed.

The solution

Uncheck “Report-Based Total (when applicable)” in the aggregation rule of your measure. That’s it !

OBIEE 11G Aggregation Rule

 

It worked for me !

Query Twitter API with Pentaho PDI (Kettle)

There are a lot of libraries and tools to query the Twitter API using Python, PHP, JavaScript and other programming languages, but there is nothing to query it using Pentaho PDI (aka Kettle). So with a little research and time, I successfully implemented the OAUTH authentication required to use the Twitter REST API using native Kettle steps. I can now query Twitter with Pentaho Data Integration !

1 – Register an application with Twitter to get your API keys and Access Token

Before stepping into Kettle, you will need to register an application to obtain your OAUTH access tokens. Create an application by following “tokens from dev.twitter.com” (the steps from that link are pretty straightforward). Once the application is registered, you have the minimum information needed to build the OAUTH request:

  • API Key
  • API Secret
  • Access Token
  • Access Token Secret

The information can be found in the API Keys tab of your application.

Twitter Access Token

2 – Understanding how it works

The Twitter API is based on the HTTP protocol, and authentication relies on the OAUTH protocol. So basically, every HTTP request has to provide the OAUTH information in the HTTP Authorization header, as per the image below:

Twitter API HTTP Request
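To give an idea of what has to be produced, an Authorization header for API 1.1 looks roughly like this (all values below are placeholders, not real credentials):

Authorization: OAuth oauth_consumer_key="xxxxxxxx", oauth_nonce="xxxxxxxx",
  oauth_signature="xxxxxxxx", oauth_signature_method="HMAC-SHA1",
  oauth_timestamp="1390000000", oauth_token="xxxxxxxx", oauth_version="1.0"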

I suggest reading the Authorization Request page of the API documentation to understand the specifications. Read it twice. The challenge with Kettle is to generate that HTTP header.

3 – Create a transformation

This is where the fun starts! Now that you have read the Authorization Request page twice, you know that some values, like OAUTH_NONCE and OAUTH_TIMESTAMPS, have to be generated for each request. Before stepping into the details, let’s see the big picture with an example.

I will do a request using the Search/Tweets call and search for the word “#pentaho”.

  1. I need the URL to send the request to
  2. I need the parameters specific to the Search/Tweets call
  3. I need to generate the HTTP header
  4. I need to send the HTTP request
  5. I need to extract the data from the JSON result

The transformation should look like this:

Pentaho Kettle Twitter API Call

Let’s see each step in detail.

3.1 – Data Grid

I use 2 different Data Grids: the first holds all the static data needed for the OAUTH authentication, and the second holds the API URI and the HTTP method used for the call. The OAUTH data grid is always the same whatever the call is, so it can be copied/pasted into other transformations if needed. It also makes it possible to build a more advanced design using sub-transformations.

Pentaho PDI Twitter DataGrid

3.2 – Modified Java Script Value – Generate Header

It is possible to achieve almost anything with the Java Script step. I used it to generate the HTTP header with 2 libraries I found on the Internet. All the credit goes to Netflix Inc and Paul Johnston for publishing the libraries that do the magic. I just added a few lines at the end of the script.

Make sure to edit the variable URL_PARM at line 763 to reflect the specific parameters corresponding to the API Call.

For the Search/Tweets call, the parameters are
https://api.twitter.com/1.1/search/tweets.json?q=%23pentaho&count=4&result_type=mixed

So line 763 will assign that parameter string (everything after the “?”) to URL_PARM, with encodeURIComponent used to URL-encode the parameter values.

Here is the full script.

 

The oauth_xxxxxx variables come from the previous Data Grids. The final result is REST_HEADER and REST_URL; both fields are sent to the next step of the transformation (REST Client).

3.3 – Rest Client

The Rest Client step is pretty simple because all the parameters have been defined in the previous steps. The Headers tab lets you define the content of any HTTP header from an existing field. This is where the REST_HEADER field is mapped to the Authorization header.

Pentaho Kettle Twitter API RestClient

Pentaho Kettle Twitter API Rest Client Header

 

The answer from Twitter is written to the “result” field. It could be the expected answer, but it could also be an error message, so a proper error-handling mechanism should be developed. For the sake of the demo, I assume we get the expected answer.

3.4 – JSON and Text Output

Twitter sends the result in JSON format. The first “Text Output Raw Json” step is a debugging step I use while developing the transformation. It’s very useful to have the original JSON answer when building a good JsonPath expression in the JSON step.

For the demo, I created 2 CSV files: the first for the tweet messages and the second for the hashtags of each tweet.
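As a hint, based on the structure of the search/tweets response (tweets live under the statuses array), the two JsonPath expressions would look something like this:

$.statuses[*].text
$.statuses[*].entities.hashtags[*].text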

Conclusion

As you can see, it’s pretty easy to retrieve data from Twitter. You can call any other API function by changing the URI (in the Data Grid), the parameter string (in the JavaScript step) and the JsonPath (in the JSON step). And there you go! You now have data to build analyses!
You can download the demo transformation from my github repo: https://github.com/patlaf/pdi_scripts

Problems with Pentaho Spoon and libgobject

I got some problems with Spoon after upgrading Kubuntu 13.04 to 13.10. Nothing really noticeable at first, but suddenly it started to crash randomly. The same crashes happened with different versions of Java.

I found that starting Spoon with a non-GTK3 theme fixes the problem. It doesn’t crash anymore, but I do get a lot of warnings related to libgobject in the log.

I tried swapping swt.jar for different versions, but it didn’t change anything. I’m stuck with an insane amount of warnings, but at least I can use Spoon !

If you have a solution, please let me know !


Quick tips to troubleshoot the first installation of Pentaho BI 5.0.1 CE with MySQL

Last week, I successfully installed Pentaho BI Server Community Edition. It wasn’t as easy as I expected. I followed a really good tutorial from David Fombella at http://anonymousbi.wordpress.com/2013/12/15/pentaho-bi-server-5-0-1ce-mysql-installation-guide/ Like many people, I ran into the famous 404 error. It’s not a big deal; there is no error that can’t be explained by the log files, right?

Pentaho Logo

Consider the following tips as an extension of that tutorial.

Tip #1

Be careful at step #5 and don’t forget to delete tomcat/conf/Catalina/localhost/pentaho.xml. Keep in mind that every time you modify tomcat/webapps/pentaho/META-INF/context.xml, pentaho.xml has to be deleted, because it is an exact copy of context.xml.

Tip #2

Don’t copy/paste the configuration provided in the tutorial for pentaho-solutions/system/jackrabbit/repository.xml (step #6), because the class attribute of all the nodes is missing! The XML file already contains examples that you can copy/paste, with all the nodes and the proper attributes.

Pentaho.xml is not the only config file that is duplicated. I spent HOURS searching for the node with the missing class attribute in Jackrabbit’s repository.xml (yup, I did copy/paste from the tutorial..). I ended up reverting all my modifications and I still had the missing-class error in the log file, over and over. The fact is: if you start Pentaho with the missing class attribute in repository.xml, it creates a folder named pentaho-solutions/system/jackrabbit/repository/ which contains files generated/copied from that configuration. The problem is that it generates a file with the missing class attribute! It took me too long to realize that pentaho-solutions/system/jackrabbit/repository/ IS NOT PART OF THE DEFAULT INSTALLATION. It’s generated. If you screw up the config like me, DELETE THE ENTIRE FOLDER.

Tip #3

Another thing you have to know is about the folders tomcat/work/Catalina and tomcat/temp. Both folders contain temporary files that can be removed, but the folders themselves must exist! For an unknown reason, my tomcat/temp folder was missing, and that prevented Jackrabbit from starting properly, with a not-so-useful error.

Finally

So, if you are running into problems with the installation like many people (including myself) have, clean up the generated files every time you change a config file: delete tomcat/conf/Catalina/localhost/pentaho.xml, empty tomcat/work/Catalina and tomcat/temp (but keep the folders), and remove the generated pentaho-solutions/system/jackrabbit/repository/ folder if you touched repository.xml. It lets you restart with a clean environment.

I hope this saves you some time !

Oh! One last thing.. don’t try to run it on a server with 1 GB of RAM.. it won’t work. :)

 


How to search for a specific column in all Oracle schemas

As an ETL developer, this is something I have to do when the developers of a source database application extend the length of a field. Because that field can be used in so many different projects and/or datamarts, I need to find all the tables and views, across all the schemas of my database, that contain the field.

Here is how I do it. Suppose I’m looking for a column named “awesome_id”. All I have to do is execute this query with a user that has the privileges to read the data dictionary of all schemas.
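A sketch of that query, using the DBA_TAB_COLUMNS dictionary view (ALL_TAB_COLUMNS works too if you only have access to the objects you are granted on):

SELECT owner, table_name, column_name, data_type, data_length
  FROM dba_tab_columns
 WHERE LOWER(column_name) LIKE '%awesome_id%'
 ORDER BY owner, table_name;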

It returns all the tables having a column whose name contains “awesome_id”.

But it doesn’t look into the queries used to build the views. For those, I search the view definitions instead.
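The exact form can vary; here is a sketch of one way to do it. DBA_VIEWS.TEXT is a LONG column, so it cannot be filtered with LIKE directly; copying it into a CLOB first makes it searchable (the temporary table name is illustrative):

CREATE TABLE tmp_view_defs AS
  SELECT owner, view_name, TO_LOB(text) AS text_clob
    FROM dba_views;

SELECT owner, view_name,
       DBMS_LOB.SUBSTR(text_clob, 4000, 1) AS definition_preview
  FROM tmp_view_defs
 WHERE LOWER(text_clob) LIKE '%awesome_id%';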

Unfortunately, it only returns the first 4000 characters of each view definition, but the lookup itself is done on the complete query text.

This WordPress blog runs on MariaDB

Recently, my hosting company Funio replaced MySQL with MariaDB on their shared hosting infrastructure. Some friends working there told me they get better performance with MariaDB than with their previous MySQL installation.

MariaDB is a drop-in replacement for MySQL. This WordPress blog was first installed with MySQL and now, suddenly and automagically, runs on MariaDB. As a DBA, I was excited to use all the new features… but I realized that I’m pretty limited on Funio’s database infrastructure since I’m just a simple user…! So, just for fun, I changed the engine of all my tables to Aria. It’s the least I could do.
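A minimal sketch of the change, one ALTER per table (wp_posts is just an example); the second statement generates the ALTER statements for every remaining MyISAM table in the current database:

ALTER TABLE wp_posts ENGINE=Aria;

SELECT CONCAT('ALTER TABLE ', table_name, ' ENGINE=Aria;')
  FROM information_schema.tables
 WHERE table_schema = DATABASE()
   AND engine = 'MyISAM';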

Aria is a crash-safe alternative to MyISAM and, according to the plans, it will become transactional in the next major release. I don’t expect it to be as fast, reliable and robust as InnoDB (or XtraDB, the enhanced version of InnoDB developed by Percona and included in MariaDB), but it’s definitely a good “hybrid” solution between MyISAM and InnoDB.

I haven’t experienced any issues since the change, so if you are a WordPress user, don’t be afraid to give it a try !

MariaDB

 

 

 

The problems you may face when upgrading PDI 4.3 to 5.0.1

Upgrading PDI from one version to another is a pretty easy task: you just have to download the latest version and overwrite your existing “data-integration” folder (in fact, I don’t recommend overwriting; extract the new version alongside the old one!). There are a few blog posts and an official documentation page explaining how to do it properly.

However, what none of these blogs tell you is what kind of problems you can expect after the upgrade. Almost every step from the newer version is compatible with the older one. That is true. But there are a few steps you should pay special attention to.

These are the problems I personally faced.

The first one is in Spoon, with the Zip File step: I wasn’t able to open the dialog to edit the step.

Pentaho ZipFile Dialog

There is no workaround: you have to delete the step and recreate it. I reported the bug: “Zip file step from PDI 5.0.1 is not compatible with older versions of PDI”. It’s not related to the upgrade, but I also found a bug when you append a file to an existing zip with 5.0.1.

 

The second problem was with the “SQL” step of a job. If “Send SQL as single statement” is checked, the step fails with an error (and nothing more!).

Make sure you have a semicolon at the end of your statement and uncheck “Send SQL as single statement”. I didn’t file a bug report for this one..

 

I sometimes get another error in the log. It appears randomly and does not cause any obvious problem, so it’s a bit annoying to debug..

I still haven’t found out why.

 

The last thing I noticed is a bit disappointing. I explained it in a previous post: PDI 5.0.1 is a little bit slower than 4.3.

 


Pentaho PDI version 5.0.1 is slower than 4.3

After upgrading Pentaho PDI 4.3 to 5.0.1, I noticed that my ETL process took 45 minutes longer to execute. So I asked myself whether the new version is really slower than the old one..

I compared the two Kettle versions with the same transformation (a single Table Input feeding a Table Output). Same database, same table, same driver, same everything except the PDI release (4.3 and 5.0.1).

Pentaho PDI Transformation

Here is the result:

          5.0.1 Rows/sec   5.0.1 Total sec   4.3 Rows/sec   4.3 Total sec
Run #1    33712            50                35864          47
Run #2    31804            53                36643          46
Run #3    33051            51                37458          45
Run #4    33712            50                37458          45
Run #5    33712            50                36643          46
Run #6    31804            53                38309          44
Run #7    33712            50                36643          46
Run #8    32415            52                36643          46
Run #9    31804            53                37458          45
Run #10   31215            54                36643          46

Pentaho Kettle 5.0.1 Benchmark

Yes, Pentaho Kettle 5.0.1 is slower than 4.3. Sad but true, but this is not a deal breaker considering the new features that come with it.

Tuning tips every PDI user using MySQL should know: increase table output speed by 7000%

I found a thread on the Pentaho forum about Table Output performance. It gives a really easy way to speed up your Table Output step.. A LOT! Some people have been able to reach 84k rows/sec using this method. It is not a joke. It works. It really does.

All you need to do is to give the MySQL JDBC driver 3 parameters:
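The three Connector/J options in question are most likely the following (useCompression is the one named below; rewriteBatchedStatements is what turns the row-by-row inserts into multi-row statements, and useServerPrepStmts=false is usually set alongside it so the rewrite can happen):

useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true

In Spoon, they go in the Options tab of the database connection (or can be appended to the JDBC URL).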

This allows many rows to be inserted with a single statement instead of one row per statement. Also, useCompression reduces the traffic between the server and Pentaho.

I tested it myself.

I used an 869 MB CSV file (5,248,714 rows) on my laptop and transferred it to an out-of-the-box MySQL installation on a cheap VPS (1 CPU, 1 GB of RAM) with a poor network connection, located in Europe (yep.. across the ocean). The expected result was a very slow transfer.

I set the number of copies to start of my Table Output step to 4 and the commit size to 1000.

Here is the result without the options: I reached 52 rows/second.

NoOptimisation4TableOutput

The same test, but with the 3 connection parameters: 3674 rows/second ! Wow ! 7000% faster !

Optimisation4TableOutput

Impressive isn’t it !?

 


Three different ways to create a CSV with MySQL

There are 2 well-known ways to create a CSV file using MySQL. But there is a technique I’ve been using for many years that most blogs and forums don’t talk about, and I believe it’s worth showing since it’s faster.

First, there is the “standard” way:
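That is SELECT ... INTO OUTFILE, which writes the file on the database server; a minimal sketch (file path and table name are illustrative):

SELECT *
  INTO OUTFILE '/tmp/test_csv_table.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM test_csv_table;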

The second technique is to use the mysql command-line client and pipe its tab-separated output through sed to turn it into CSV.

And the technique I use relies on the CSV storage engine.
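A minimal sketch, assuming a source table called my_table (the column list is illustrative): the CSV engine requires every column to be NOT NULL, so the table is declared explicitly before copying the data in.

CREATE TABLE test_csv_table (
  id   INT          NOT NULL,
  name VARCHAR(100) NOT NULL
) ENGINE=CSV;

INSERT INTO test_csv_table
SELECT id, name
  FROM my_table;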

I now have a file named test_csv_table.CSV in my MySQL datadir! If you don’t know where your datadir is, use this statement:
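-- Shows the server's data directory, where the .CSV file is written:
SHOW VARIABLES LIKE 'datadir';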

Obviously, it requires having the privileges to read the MySQL datadir. Also, keep in mind that the CSV engine doesn’t support nullable columns.

What I like about this third technique is that I don’t have to remember the more complex syntax of the other two. It’s pretty straightforward.