Query Twitter Streaming API with Pentaho Data-Integration and R

In one of my previous posts (Query Twitter API with Pentaho PDI), many people asked for a way to use the Twitter Streaming API with Pentaho PDI. Implementing OAuth and API calls with Kettle can be very difficult compared to the few lines of code required in many programming languages. Even though I'm pretty sure it's possible with native steps, I decided to use R to make the call and then parse the results with Kettle. This is much faster and easier. Here is how it works.

Kettle Twitter Streaming API

The Transformation itself is very simple.

  1. The Data Grid step provides the name of the R script to execute, along with a few parameters such as the searched keyword and the stream timeout.
  2. The Concat Fields step concatenates all parameters into one field (the command line).
  3. The Execute a Process step runs the R script that calls the Streaming API.
  4. The result is sent back to Kettle and parsed with the JSON Input step.
  5. Finally, the tweets are saved to a CSV file.

Once triggered, the "Execute a process" step executes the R script and uses stdout and stderr to share the data. In our case, the R script writes a JSON string (the result of the Streaming API) to stdout.
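
For illustration, such a script could look roughly like the sketch below. This is not the actual script from my repository: the argument handling, the my_oauth object name and the credentials file name (pdi_R_twitter_oauth.Rdata, described further down) are assumptions.

    # Minimal sketch: stream tweets matching a keyword and print the raw JSON to stdout.
    library(streamR)
    library(ROAuth)

    args    <- commandArgs(trailingOnly = TRUE)   # e.g. keyword and timeout passed by PDI
    keyword <- args[1]
    timeout <- as.numeric(args[2])

    load("pdi_R_twitter_oauth.Rdata")             # restores the saved OAuth object (my_oauth)

    # With file.name = "", streamR keeps the tweets in memory and returns them
    # as a character vector (one JSON document per tweet).
    tweets_json <- filterStream(file.name = "", track = keyword,
                                timeout = timeout, oauth = my_oauth)

    cat(tweets_json, sep = "\n")                  # write the JSON to stdout for Kettle

The command line built by the Concat Fields step would then be something like Rscript /path/to/script.R pentaho 30 (again, an assumption about how the parameters are passed).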

This transformation allows the use of any other process/script able to query the Streaming API and write the result to stdout. You can find the transformation and the R script in my github repository.

The R Script

R is a free programming language and software environment for statistical computing. You can download it from the official website, or install it from the Ubuntu repositories with apt-get.

The script uses the streamR and ROAuth libraries. As their names suggest, streamR provides functions to access Twitter's filter, sample, and user streams, and ROAuth provides generic functions to handle the OAuth handshake and signature.

You need your consumer key and consumer secret from Twitter. If you don't already have them, see my previous post.

You need to run the script at least once before using it with PDI, because the first run generates the OAuth signature. In R, you can call the script with the source('/path/to/script.R') function.

Kettle Twitter Streaming API OAUTH

The handshake() function opens a page asking you to authorize "the application" to access the data. Back in R, you will be prompted to enter the PIN that Twitter gives you once you have authorized it.

Twitter Streaming API Authorize

Twitter Streaming API Authorize PIN

The generated OAuth signature is saved into a file called pdi_R_twitter_oauth.Rdata. The saved data is reused for further API calls, so you don't need to authorize the application every time. By default, the file is saved in the working directory of R, which you can get with the getwd() function.

R getwd function
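
Putting those pieces together, the OAuth part of the script might look roughly like this. It is a minimal sketch, not the exact code from the repository: the endpoint URLs are the standard Twitter OAuth endpoints, and my_oauth and the key placeholders are assumed names.

    # Minimal sketch: one-time OAuth handshake, then save the credentials for reuse.
    library(ROAuth)

    consumerKey    <- "YOUR_CONSUMER_KEY"      # from your Twitter application
    consumerSecret <- "YOUR_CONSUMER_SECRET"

    my_oauth <- OAuthFactory$new(consumerKey    = consumerKey,
                                 consumerSecret = consumerSecret,
                                 requestURL     = "https://api.twitter.com/oauth/request_token",
                                 accessURL      = "https://api.twitter.com/oauth/access_token",
                                 authURL        = "https://api.twitter.com/oauth/authorize")

    my_oauth$handshake()    # opens the authorization page; paste the PIN back into R

    # Saved in getwd(); this is the file that PDI will need.
    save(my_oauth, file = "pdi_R_twitter_oauth.Rdata")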

However, to make it work with Pentaho PDI, you need to find the file and move (or copy) it to your data-integration folder.

Pentaho Data-Integration-Folder

A Quick Recap

  1. Download and install R
  2. Run the script at least once to generate pdi_R_twitter_oauth.Rdata
  3. Copy pdi_R_twitter_oauth.Rdata to your Pentaho data-integration folder

The PDI transformation should now be able to use the R script to query the Twitter Streaming API! You can download it from my github repository. Note that it was developed on Linux, but it should work on Windows too!

References:

Query Twitter API with Pentaho PDI (Kettle)

There are a lot of libraries and tools to query the Twitter API using Python, PHP, JavaScript and other programming languages, but there is nothing to query it using Pentaho PDI (aka Kettle). So with a little research and time, I successfully implemented the OAuth authentication required to use the Twitter REST API using native Kettle steps. I can now query Twitter with Pentaho Data Integration!

1 – Register an application with Twitter to get your API keys and access tokens

Before stepping into Kettle, you will need to register an application to obtain your OAuth access tokens. Create an application using the "Tokens from dev.twitter.com" documentation (follow the steps from the previous link; it is pretty straightforward). Once the application is registered, you have the minimum information required to build the OAuth request:

  • API Key
  • API Secret
  • Access Token
  • Access Token Secret

The information can be found in the API Keys tab of your application.

Twitter Access Token

2 – Understanding how it works

The Twitter API is based on the HTTP protocol and authentication relies on the OAuth protocol. So basically, every HTTP request has to provide the OAuth information in the HTTP Authorization header, as in the image below:

Twitter API HTTP Request
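
For reference, an OAuth 1.0a Authorization header has roughly this shape (placeholder values, wrapped over several lines for readability; in the real request it is a single header line):

    Authorization: OAuth oauth_consumer_key="xxxxxxxxxx",
                         oauth_nonce="xxxxxxxxxx",
                         oauth_signature="xxxxxxxxxx",
                         oauth_signature_method="HMAC-SHA1",
                         oauth_timestamp="1318622958",
                         oauth_token="xxxxxxxxxx",
                         oauth_version="1.0"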

I suggest reading the Authorization Request page of the API documentation to understand the specifications. Read it twice. The challenge with Kettle is to generate this HTTP header.

3 – Create a transformation

This is where the fun starts! Now that you have read the Authorization Request page twice, you know that some information, like oauth_nonce and oauth_timestamp, has to be generated for each request. Before stepping into details, let's look at the big picture with an example.

I will make a request using the Search/Tweets call and search for the word "#pentaho".

  1. I need the URL where the request is sent
  2. I need the parameters specific to the Search/Tweets call
  3. I need to generate the HTTP header
  4. I need to send the HTTP request
  5. I need to extract the data from the JSON result

The transformation should look like:

Pentaho Kettle Twitter API Call

Let's look at each step in detail.

3.1 – Data Grid

I use 2 different Data Grids: the first holds all the static data needed to create the OAuth authentication, and the second holds the API URI and the HTTP method used to make the call. The OAuth data grid is always the same whatever the call is, so it can be copied/pasted into other transformations if needed. It also makes it possible to build a more advanced setup using a sub-transformation.

Pentaho PDI Twitter DataGrid
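
Based on what the generated header needs (and on the oauth_xxxxxx field names used later in the JavaScript step), the static OAuth grid presumably holds values along these lines; the exact field names in the transformation may differ:

    oauth_consumer_key        <your API Key>
    oauth_consumer_secret     <your API Secret>
    oauth_token               <your Access Token>
    oauth_token_secret        <your Access Token Secret>
    oauth_signature_method    HMAC-SHA1
    oauth_version             1.0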

 3.2 – Modified Java Script Value – Generate Header

It is possible to achieve almost anything with the Java Script step. I used it to generate the HTTP header with 2 libraries I found on the Internet. All the credit goes to Netflix Inc. and Paul Johnston for publishing the libraries that do the magic; I just added a few lines at the end of the script.

Make sure to edit the URL_PARM variable at line 763 to reflect the specific parameters of the API call.

For the Search/Tweets call, the full request is:
https://api.twitter.com/1.1/search/tweets.json?q=%23pentaho&count=4&result_type=mixed

Line 763 therefore has to build that parameter string; encodeURIComponent is used to URL-encode the values (which is why "#pentaho" appears as "%23pentaho" in the URL).

The full script is embedded in the Modified Java Script step of the demo transformation (downloadable from my github repo, linked at the end of this post).

The oauth_xxxxxx variables come from the previous Data Grids. The final result is the REST_HEADER and REST_URL fields. Both are sent to the next step of the transformation, the REST Client.

3.3 – Rest Client

The REST Client step is pretty simple because all its parameters have been defined in the previous steps. The Headers tab lets you set the content of any HTTP header from an existing field; this is where the REST_HEADER field is mapped to the Authorization header.

Pentaho Kettle Twitter API RestClient

Pentaho Kettle Twitter API Rest Client Header

The answer from Twitter is stored in the "result" field. It could be the expected answer, but it could also be an error message, so a proper error-handling mechanism should be added. For the sake of the demo, I assume the expected answer is returned.

3.4 – JSON and Text Output

Twitter sends the result in JSON format. The first step, "Text Output Raw Json", is a debugging step I use while developing the transformation: it's very useful to have the original JSON answer when building a good JsonPath expression for the JSON Input step.

For the demo, I created two CSV files: the first for the tweet messages and the second for the hashtags of each tweet.
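
For a Search/Tweets response, which wraps the tweets in a "statuses" array, the JsonPath expressions would be along these lines (an illustration based on the standard v1.1 response shape, not necessarily the exact paths used in the demo files):

    $.statuses[*].text                          (the tweet message)
    $.statuses[*].entities.hashtags[*].text     (the hashtags of each tweet)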

Conclusion

As you can see, it's pretty easy to retrieve data from Twitter. You can call other API functions by changing the URI (in the Data Grid), the parameter string (in the JavaScript step) and the JsonPath expressions (in the JSON Input step). And there you go! You now have data to build analyses on!
You can download the demo transformation from my github repo: https://github.com/patlaf/pdi_scripts

Problems with Pentaho Spoon and libgobject

I got some problems with Spoon after upgrading Kubuntu 13.04 to 13.10. Nothing really noticeable at first, but then it started to crash randomly. The same thing happened with different versions of Java.

I found that starting Spoon with a non-GTK3 theme fixes the problem: it doesn't crash anymore, but I do get a lot of warnings related to libgobject in the log.

I tried swapping swt.jar for different versions, but it didn't change anything. I'm stuck with the insane amount of warnings, but at least I can use Spoon!

If you have a solution, please let me know !

References:

The problems you may face when upgrading PDI 4.3 to 5.0.1

Upgrading PDI from one version to another is a pretty easy task: you just download the latest version and overwrite your existing "data-integration" folder (in fact, I don't recommend overwriting; create a new copy alongside the old one!). There are a few blog posts and an official documentation page explaining how to do it properly.

However, what all these blogs do not tell you is what kind of problems you can expect after the upgrade. Almost every step from the newer version is compatible with the older one. That is true. But there are a few steps you should pay special attention to.

These are the problems I personally faced.

The first is in Spoon, with the Zip File step: I wasn't able to open the dialog to edit the step.

Pentaho ZipFile Dialog

There is no workaround: you have to delete the step and recreate it. I reported the bug: "Zip file step from PDI 5.0.1 is not compatible with older versions of PDI". It's not related to the upgrade, but I also found a bug when appending a file to an existing zip with 5.0.1.

The second problem was with the "SQL" step of a job. If "Send SQL as single statement" is checked, the step fails with an unhelpful error (and nothing more!).

Make sure you have a semicolon at the end of your statement and uncheck "Send SQL as single statement". I didn't file a bug report for this one.

I also sometimes get an error in the log. It appears randomly and does not cause any obvious problems, which makes it a bit annoying to debug. I still haven't found the cause.

The last thing I noticed is a bit disappointing, and I explained it in a previous post: PDI 5.0.1 is a little bit slower than 4.3.

References:

Pentaho PDI version 5.0.1 is slower than 4.3

After upgrading Pentaho PDI 4.3 to 5.0.1, I noticed that my ETL process took 45 minutes longer to execute. So I asked myself: is the new version really slower than the old one?

I compared the two Kettle versions with the same transformation (a single Table Input feeding a Table Output). Same database, same table, same driver, same everything except the PDI release (4.3 vs 5.0.1).

Pentaho PDI Transformation

Here is the result:

            5.0.1 Rows/Sec   5.0.1 Total Sec   4.3 Rows/Sec   4.3 Total Sec
Run #1           33712              50              35864            47
Run #2           31804              53              36643            46
Run #3           33051              51              37458            45
Run #4           33712              50              37458            45
Run #5           33712              50              36643            46
Run #6           31804              53              38309            44
Run #7           33712              50              36643            46
Run #8           32415              52              36643            46
Run #9           31804              53              37458            45
Run #10          31215              54              36643            46

Pentaho Kettle 5.0.1 Benchmark

Yes, Pentaho Kettle 5.0.1 is slower than 4.3. Sad but true, but this is not a deal breaker considering the new features that come with it.

Tuning tips every PDI user using MySQL should know: increase table output speed by 7000%

I found a thread on the Pentaho Forum about Table Output performance. It describes a really easy way to speed up your table output... A LOT! Some people have been able to reach 84k rows/sec using this method. It is not a joke. It works. It really does.

All you need to do is to give the MySQL JDBC driver 3 parameters:
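
These are presumably the standard MySQL JDBC batching options discussed in that forum thread (check the thread for the exact set used there):

    useServerPrepStmts        false
    rewriteBatchedStatements  true
    useCompression            true

In PDI you add them as parameter/value pairs in the Options tab of the database connection; on a raw JDBC URL they would appear as jdbc:mysql://host:3306/mydb?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true.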

They allow the driver to insert many rows with a single statement instead of one row per statement, and useCompression reduces the traffic between the server and Pentaho.

I tested it myself.

I took an 869 MB CSV file (5,248,714 rows) on my laptop and transferred it to an out-of-the-box MySQL installation on a cheap VPS (1 CPU, 1 GB RAM) over a poor network connection to Europe (yep, across the ocean). The expected performance is a very slow transfer.

I set the "number of copies to start" of my Table Output step to 4 and the commit size to 1000.

Here is the result without the options: I reached 52 rows/second.

NoOptimisation4TableOutput

The same test, but with the 3 connection parameters: 3674 rows/second ! Wow ! 7000% faster !

Optimisation4TableOutput

Impressive, isn't it?

References:

Calling a (sub)transformation inside a transformation with Pentaho PDI

I recently discovered a great and powerful feature in Pentaho Kettle: the Mapping step. It allows you to call a transformation inside another transformation.

It is not intended to call a transformation the same way you do with a Job. The way I see it, it's a way to create a new step; if you have a developer background, think of it as a function. If you have a sequence of steps you repeat often, create a mapping with that sequence! It's a really nice shortcut.

Let’s take a closer look at how it works.

Suppose you have a transformation that retrieves the total and free RAM of the server running the transformation, calculates the amount of used RAM (total – free = used) and writes it to the log. The transformation should look like this:

getRamTransformation

To create the mapping, you have to create a new transformation with 2 specific steps: the Mapping Input Specification and the Mapping Output Specification. These steps allow the parent transformation to pass values to the sub-transformation (the mapping) and to get the results back as output fields. In this example, I will create a mapping that takes the total and free RAM as input, calculates the used memory, converts it to MB and sends it back to the parent transformation. The transformation should look like this:

calcUsedRamTransformation

As you can see, the exact same 2 steps from the first transformation have been duplicated in the sub-transformation. The Mapping Input Specification step declares the 2 fields that will receive the values from the parent transformation. The field names don't have to match those of the parent transformation.

mappingInput

In the first transformation, those 2 steps can now be replaced by the Mapping step.

getRamTransformationMapping

In the Input tab of the step, the fields from the parent transformation have to be mapped to the fields declared in the sub-transformation.

mappingSubTransformation

There we go ! It’s that easy !

You can download the transformation used in the screenshots.

Don’t try to fool the transformation execution.

All the rules regarding the execution sequence still apply!

  • It is not possible to set a variable in the sub-transformation and read it in the parent transformation.
  • All the input steps start at the same time, regardless of whether the input is in the parent or the sub-transformation.

Since I discovered the mapping, I've been using it a lot to make my ETL processes much more modular.