Query Twitter API with Pentaho PDI (Kettle)

There are a lot of libraries or tools to query the Twitter API using Python, PHP, Javascript and other programming languages, but there is nothing to query it using Pentaho PDI (aka Kettle). So with a little research and time, i successfully implemented the OAUTH authentication required to use the Twitter REST API using native Kettle steps. I can now query Twitter with Pentaho Data Integration !

1 – Register an application with Twitter to get the your API and Consumer Key

Before stepping into Kettle, you will need to register an Application to obtain your OAUTH Access Token. Create an application using the “tokens from dev.twitter.com” (follow the steps from the previous link.. this is pretty straight forward). Once the application registered, you have the minimum information to build the OAUTH request:

  • API Key
  • API Secret
  • Access Token
  • Access Token Secret

The information can be found in the API Keys tab of your Application.Twitter Access Token

2- Understanding how it works

The Twitter API is based on HTTP Protocol and authentication relies on OAUTH Protocol. So basically, all the HTTP requests have to provide OAUTH information in the HTTP Authorization Header as per the image below:

Twitter API HTTP Request

 I suggest to read the Authorization Request page of the API Documentation to understand the specifications. Read it twice. The challenge with Kettle is to generate the HTTP Header.

3- Create a transformation

This is where the fun starts! Now that you have read Authorization Request page twice, you know that some information like OAUTH_NONCE and OAUTH_TIMESTAMPS have to be generated for each request. Before stepping into details, let’s see the big picture with an example.

I will do a request using the Search/Tweet call and search for the word “#pentaho”.

  1. I need the URL where to send the request
  2. I need specific parameters to the Search/Tweet call
  3. I need to generate the HTTP Header
  4. I need to send the HTTP request
  5. I need to extract the data from the JSON result

The transformation should look like:

Pentaho Kettle Twitter API Call

Let’s see each step in details

3.1 – Data Grid

I use 2 different Data Grids, the first is all the static data needed to create OAUTH authentication. The second is the API URI and HTTP Method used to do the call. The OAUTH data grid is always the same whatever the call is, so it can be copied/pasted in other transformation if needed. It also allows creating a more advanced transformation using subtransformation.

Pentaho PDI Twitter DataGrid

 3.2 – Modified Java Script Value – Generate Header

It is possible to achieve almost anything with the Java Script step. I used it to generate the HTTP Header with 2 libraries I found on Internet. All the credit goes to Netflix Inc and Paul Johnston for publishing the  libraries that do the magic. I just added a few lines at the end of the script.

Make sure to edit the variable URL_PARM at line 763 to reflect the specific parameters corresponding to the API Call.

For the Search/Tweets call, the parameters are
https://api.twitter.com/1.1/search/tweets.json?q=%23pentaho&count=4&result_type=mixed

So the line 763 will be

encodeURIComponent is used to “URL ENCODE” the string.

Here is the full script.

 

The oauth_xxxxxx variables come from previous Data Grids. The final result is the REST_HEADER and REST_URL. Both variables are sent to next transformation’s step (REST Client).

3.3 – Rest Client

The Rest Client step is pretty simple because all parameters have been defined in the previous steps. The Headers tab enables you to define the content of any HTTP headers using an existing field. This is where the REST_HEADER variable is defined as Authentification header.

Pentaho Kettle Twitter API RestClient Pentaho Kettle Twitter API Rest Client Header

 

The answer from Twitter is sent to the “result” variable. It could be the expected answer but it could be an error message too. A good error handling mechanism has to be developed. For the sake of the demo, I assume getting the expected answer.

3.4 – JSON and Text Output

Twitter sends the result with a JSON format. The first “Text Output Raw Json” step is a debugging step I use to develop the transformation. It’s very useful to have the original JSON answer to build a good JsonPath expression in the JSON step.

For the demo, I created 2 CSV, the first for the twit message and the second for each hashtags of each twit message.

Conclusion

As you can see, it’s pretty easy to retrieve data from Twitter. You can call another API’s function by changing the URI (in the datagrid), the parameter string (in Javascript) and the Jsonpath (JSON Step). And there you go ! You now have data to create analysis!
You can download the demo transformation from my github repo: https://github.com/patlaf/pdi_scripts