Query Twitter Streaming API with Pentaho Data-Integration and R

In one of my previous posts (Query Twitter Api With Pentaho PDI), many people asked for a way to use the Twitter Streaming API with Pentaho PDI. Implementing OAuth and API calls with Kettle can be very difficult compared to the few lines of code required in many programming languages. Even though I'm pretty sure it's possible to do it with native steps, I decided to use R to make the call and to parse the results with Kettle. This is much faster and easier. Here is how it works.

Kettle Twitter Streaming API

The Transformation itself is very simple.

  1. The Data Grid step provides the name of the R script to execute, along with a few parameters such as the search keyword and the stream timeout.
  2. The Concat Fields step concatenates all parameters into one variable (a single command line).
  3. The Execute a Process step executes the R script that calls the Streaming API.
  4. The result is sent back to Kettle and parsed with the JSON Input step.
  5. Finally, the tweets are saved to a CSV file.
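
To make steps 2, 4 and 5 more concrete, here is a minimal Python sketch of the same logic outside Kettle. It is only an illustration, not the transformation itself. The script name twitter_stream.R, the use of Rscript as the launcher, the one-JSON-object-per-line output format and the choice of columns are all assumptions; id_str, created_at and text are standard fields of a tweet object.

```python
import csv
import io
import json
import shlex


def build_command(row):
    """Mimic Concat Fields: join the Data Grid fields into one command line."""
    # "Rscript" as launcher and the field names are assumptions for this sketch.
    parts = ["Rscript", row["script"], row["keyword"], str(row["timeout"])]
    return " ".join(shlex.quote(part) for part in parts)


def tweets_to_csv(stdout_text, out):
    """Mimic JSON Input + CSV output, assuming one JSON tweet per line."""
    writer = csv.writer(out)
    writer.writerow(["id_str", "created_at", "text"])
    count = 0
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        tweet = json.loads(line)
        writer.writerow(
            [tweet.get("id_str"), tweet.get("created_at"), tweet.get("text")]
        )
        count += 1
    return count


# Hypothetical Data Grid row
command = build_command(
    {"script": "twitter_stream.R", "keyword": "pentaho", "timeout": 60}
)
```

Quoting each field with shlex.quote keeps a keyword that contains spaces from being split into several arguments once the fields are joined.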

Once triggered, the "Execute a process" step executes the R script and uses "stdout" and "stderr" to share the data. In our case, the R script outputs a JSON string (the result of the Streaming API) to stdout.
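
The stdout/stderr contract can be sketched in a few lines of Python. This is a hedged illustration of the idea, not how Kettle implements the step: the child process below is a hypothetical stand-in for the R script, and it simply prints one JSON object to stdout and a log message to stderr.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the R script: data goes to stdout,
# diagnostics go to stderr.
CHILD_CODE = (
    "import json, sys\n"
    "print(json.dumps({'text': 'hello from the stream'}))\n"
    "print('stream closed after timeout', file=sys.stderr)\n"
)


def run_process(argv):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    done = subprocess.run(argv, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr


exit_code, out, err = run_process([sys.executable, "-c", CHILD_CODE])
tweet = json.loads(out)  # only stdout carries the JSON payload
```

Keeping the two streams apart matters: anything the script logs to stderr stays out of the JSON that the next step has to parse.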

The same transformation works with any other process or script that can query the Streaming API and write the result to stdout. You can find the Transformation and the R script in my GitHub repository.

The R Script

R is a free programming language and software environment for statistical computing. You can download it from the official website or install it from the Ubuntu repositories with apt-get.

The script uses the libraries streamR and ROAuth. As their names suggest, streamR provides functions to access Twitter's filter, sample, and user streams, and ROAuth provides generic functions to handle the OAuth handshake and signature.

You need your consumer key and consumer secret from Twitter. If you don't already have them, see my previous post.

You need to run the script at least once before using it with PDI. In R, you can call the script using the source('/path/to/script.R') function. This first run is required to generate the OAuth signature.

Kettle Twitter Streaming API OAUTH

The handshake() function will open a page where you authorize "the application" to access the data. Then, back in R, you will be prompted to enter the key (PIN) that Twitter gave you when you authorized it.

Twitter Streaming API Authorize

Twitter Streaming API Authorize PIN

The generated OAuth signature will be saved into a file called pdi_R_twitter_oauth.Rdata. The saved data is reused for further API calls, so you don't need to authorize the application each time. By default, the file is saved in R's working directory, which you can get with the getwd() function.

R getwd function

However, to make it work with Pentaho PDI, you need to find the file and move (or copy) it to your data-integration folder.
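
If you would rather script this step than do it by hand, a minimal Python sketch could look like the one below. It assumes you already know R's working directory (the value returned by getwd()) and the location of your data-integration folder. The function name is hypothetical; only the file name comes from this post.

```python
import os
import shutil

OAUTH_FILE = "pdi_R_twitter_oauth.Rdata"


def copy_credentials(r_working_dir, pdi_dir, name=OAUTH_FILE):
    """Copy the saved OAuth file from R's working directory to the PDI folder."""
    source = os.path.join(r_working_dir, name)
    if not os.path.isfile(source):
        # The file only exists after the R script has been run once.
        raise FileNotFoundError(
            f"{name} not found in {r_working_dir}; run the R script once first"
        )
    target = os.path.join(pdi_dir, name)
    shutil.copy2(source, target)  # copy rather than move, so R keeps its own copy
    return target
```

For example, copy_credentials("/home/me", "/opt/pentaho/data-integration") would copy the file between those two folders; both paths are placeholders for your own.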

Pentaho Data-Integration-Folder

A Quick Recap

  1. Download and install R
  2. Run the script at least once to generate pdi_R_twitter_oauth.Rdata
  3. Copy pdi_R_twitter_oauth.Rdata to your Pentaho data-integration folder

The PDI Transformation should now be able to use the R script to query the Twitter Streaming API! You can download it from my GitHub repository. Note that it was developed on Linux, but it should work on Windows too!