Query Twitter Streaming API with Pentaho Data-Integration and R

In one of my previous posts (Query Twitter API with Pentaho PDI), many people asked for a way to use the Twitter Streaming API with Pentaho PDI. Implementing OAuth and API calls with Kettle can be very difficult compared to the few lines of code required in many programming languages. Even though I’m pretty sure it’s possible to do it with native steps, I decided to use R to make the call and parse the results with Kettle. This is way faster and easier. Here is how it works.

[Image: Kettle Twitter Streaming API transformation]

The Transformation itself is very simple.

  1. The Data Grid step provides the R script name to execute, along with a few parameters like the searched keyword and the stream timeout.
  2. The Concat Fields step concatenates all parameters into one field, the command line (see the example below).
  3. The Execute a Process step executes the R script that calls the Streaming API.
  4. The result is sent back to Kettle and parsed with the JSON Input step.
  5. Finally, the tweets are saved to a CSV file.
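
Concatenated together, the command line produced by the first two steps could look like this (the script name and the argument order here are just an assumption for illustration):

```
Rscript /path/to/twitter_stream.R pentaho 30
```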

Once triggered, the “Execute a process” step executes the R script and uses stdout and stderr to share the data. In our case, the R script outputs a JSON string (the result of the Streaming API) to stdout.
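
The core of such a script boils down to something like this minimal sketch (the argument handling and the my_oauth object name are my assumptions; the real script is in the github repository):

```r
library(streamR)

# arguments passed on the command line by Kettle,
# e.g. the searched keyword and the stream timeout in seconds
args    <- commandArgs(trailingOnly = TRUE)
keyword <- args[1]
timeout <- as.numeric(args[2])

# restore the OAuth credentials generated during the first interactive run
load("pdi_R_twitter_oauth.Rdata")   # assumed to contain an object named my_oauth

# capture matching tweets for `timeout` seconds; with file.name = ""
# streamR keeps the tweets in memory and returns them as JSON strings
tweets <- filterStream(file.name = "", track = keyword,
                       timeout = timeout, oauth = my_oauth)

# write the raw JSON to stdout so the "Execute a process" step can read it
cat(tweets, sep = "\n")
```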

This transformation allows the use of any other process or script able to query the Streaming API and output the result to STDOUT. You can find the Transformation and the R script in my github repository.

The R Script

R is a free programming language and software environment for statistical computing. You can download it from the official website, or install it from the Ubuntu repositories with apt-get.

The script uses the streamR and ROAuth libraries. As their names suggest, streamR provides functions to access Twitter’s filter, sample, and user streams, and ROAuth provides generic functions to handle the OAuth handshake and signature.
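
If they are not installed yet, both libraries are available on CRAN:

```r
# one-time install, then load
install.packages(c("streamR", "ROAuth"))
library(streamR)
library(ROAuth)
```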

You need your consumer key and consumer secret from Twitter. If you don’t already have them, see my previous post.

You need to run the script at least once before using it with PDI. In R, you can call the script with the source('/path/to/script.R') function. This first run is required to generate the OAuth signature.

[Image: Kettle Twitter Streaming API OAuth]

The handshake() function will open a page asking you to authorize “the application” to access the data. Then, back in R, you will be prompted to enter the PIN that Twitter gave you when you authorized it.

[Image: Twitter Streaming API Authorize]
[Image: Twitter Streaming API Authorize PIN]

The generated OAuth signature will be saved into a file called pdi_R_twitter_oauth.Rdata. The saved data is reused for further API calls, so you don’t need to authorize the application each time. By default, the file is saved in the working directory of R, which you can get with the getwd() function.

[Image: R getwd() output]
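
Put together, the one-time authorization part looks roughly like this (the endpoint URLs are Twitter’s standard OAuth endpoints; my_oauth is just the object name I use in these sketches):

```r
library(ROAuth)

# build the credentials object with your app's consumer key and secret
my_oauth <- OAuthFactory$new(
  consumerKey    = "YOUR_CONSUMER_KEY",
  consumerSecret = "YOUR_CONSUMER_SECRET",
  requestURL     = "https://api.twitter.com/oauth/request_token",
  accessURL      = "https://api.twitter.com/oauth/access_token",
  authURL        = "https://api.twitter.com/oauth/authorize"
)

# opens the authorization page; paste the PIN Twitter displays back into R
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                                        package = "RCurl"))

# persist the signature so later non-interactive runs can reuse it
save(my_oauth, file = "pdi_R_twitter_oauth.Rdata")
```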

However, to make it work with Pentaho PDI, you need to find the file and move (or copy) it to your data-integration folder.
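
On Linux, that is a single copy (adjust both paths to your machine):

```
cp ~/pdi_R_twitter_oauth.Rdata /path/to/data-integration/
```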

[Image: Pentaho data-integration folder]

A Quick Recap

  1. Download and install R
  2. Run the script at least once to generate pdi_R_twitter_oauth.Rdata
  3. Copy pdi_R_twitter_oauth.Rdata to your Pentaho data-integration folder

The PDI Transformation should now be able to use the R script to query the Twitter Streaming API! You can download it from my github repository. Note that it was developed on Linux, but it should work on Windows too!

References:

Query Twitter API with Pentaho PDI (Kettle)

There are a lot of libraries and tools to query the Twitter API using Python, PHP, Javascript and other programming languages, but there is nothing to query it using Pentaho PDI (aka Kettle). So with a little research and time, I successfully implemented the OAuth authentication required to use the Twitter REST API using native Kettle steps. I can now query Twitter with Pentaho Data Integration!

1 – Register an application with Twitter to get your API Keys and Access Tokens

Before stepping into Kettle, you will need to register an application to obtain your OAuth access tokens. Create an application on dev.twitter.com (follow the steps from the previous link; this is pretty straightforward). Once the application is registered, you have the minimum information needed to build the OAuth request:

  • API Key
  • API Secret
  • Access Token
  • Access Token Secret

The information can be found in the API Keys tab of your application.

[Image: Twitter Access Token]

2 – Understanding how it works

The Twitter API is based on the HTTP protocol, and authentication relies on the OAuth protocol. So basically, every HTTP request has to provide the OAuth information in the HTTP Authorization header, as per the image below:

[Image: Twitter API HTTP request]

I suggest reading the Authorizing a request page of the API documentation to understand the specifications. Read it twice. The challenge with Kettle is to generate the HTTP header.
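
For reference, the header the transformation has to produce has this general shape, shown here wrapped for readability with the sample values from Twitter’s documentation (the real header is a single line):

```
Authorization: OAuth oauth_consumer_key="xvz1evFS4wEEPTGEFPHBog",
    oauth_nonce="kYjzVBB8Y0ZFabxSWbWovY3uYSQ2pTgmZeNu2VS4cg",
    oauth_signature="tnnArxj06cWHq44gCs1OSKk%2FjLY%3D",
    oauth_signature_method="HMAC-SHA1",
    oauth_timestamp="1318622958",
    oauth_token="370773112-GmHxMAgYyLbNEtIKZeRNFsMKPR9EyMZeS9weJAEb",
    oauth_version="1.0"
```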

3 – Create a transformation

This is where the fun starts! Now that you have read the Authorizing a request page twice, you know that some information, like oauth_nonce and oauth_timestamp, has to be generated for each request. Before stepping into the details, let’s see the big picture with an example.

I will do a request using the search/tweets call and search for the word “#pentaho”.

  1. I need the URL to send the request to
  2. I need the parameters specific to the search/tweets call
  3. I need to generate the HTTP header
  4. I need to send the HTTP request
  5. I need to extract the data from the JSON result

The transformation should look like:

[Image: Pentaho Kettle Twitter API call]

Let’s see each step in detail.

3.1 – Data Grid

I use two different Data Grids: the first holds all the static data needed for the OAuth authentication, and the second holds the API URI and the HTTP method used to make the call. The OAuth Data Grid is always the same whatever the call is, so it can be copied/pasted into other transformations if needed. It also makes it possible to build more advanced transformations using sub-transformations.

[Image: Pentaho PDI Twitter Data Grid]

3.2 – Modified Java Script Value – Generate Header

It is possible to achieve almost anything with the Java Script step. I used it to generate the HTTP header with two libraries I found on the Internet. All the credit goes to Netflix Inc and Paul Johnston for publishing the libraries that do the magic; I just added a few lines at the end of the script.

Make sure to edit the variable URL_PARM at line 763 to reflect the specific parameters corresponding to the API call.

For the search/tweets call, the parameters are:
https://api.twitter.com/1.1/search/tweets.json?q=%23pentaho&count=4&result_type=mixed

So line 763 will be something like this (a reconstruction based on the parameters above, not the verbatim line from the script):
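
```javascript
// hypothetical reconstruction of line 763, not the verbatim script line:
// the query string of the call, URL-encoded for the signature base string
var URL_PARM = encodeURIComponent("q=#pentaho&count=4&result_type=mixed");
```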

encodeURIComponent is used to “URL ENCODE” the string.

The full script is available in my github repository.

The oauth_xxxxxx variables come from the previous Data Grids. The final result is the REST_HEADER and REST_URL fields. Both are sent to the next step of the transformation (REST Client).

3.3 – Rest Client

The REST Client step is pretty simple because all the parameters have been defined in the previous steps. The Headers tab enables you to define the content of any HTTP header using an existing field. This is where the REST_HEADER field is set as the Authorization header.

[Image: Pentaho Kettle Twitter API REST Client]
[Image: Pentaho Kettle Twitter API REST Client Header]

The answer from Twitter is stored in the “result” field. It could be the expected answer, but it could also be an error message, so a good error-handling mechanism has to be developed. For the sake of the demo, I assume we get the expected answer.

3.4 – JSON and Text Output

Twitter sends the result in JSON format. The first step, “Text Output Raw Json”, is a debugging step I use while developing the transformation. It’s very useful to have the original JSON answer at hand to build a good JsonPath expression in the JSON step.

For the demo, I created two CSV files: the first for the tweet messages and the second for each hashtag of each tweet.
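
For reference, with the search/tweets response (the tweets are wrapped in a statuses array), the JsonPath expressions would look something like this:

```
$.statuses[*].text                        (the tweet message)
$.statuses[*].entities.hashtags[*].text   (the hashtags of each tweet)
```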

Conclusion

As you can see, it’s pretty easy to retrieve data from Twitter. You can call another API function by changing the URI (in the Data Grid), the parameter string (in the Javascript step) and the JsonPath (in the JSON step). And there you go! You now have data to build your analysis!
You can download the demo transformation from my github repo: https://github.com/patlaf/pdi_scripts