Twitter is a great source of live data that can be used in many different areas, all of which draw on the core of its functionality: social data. In this exploratory analysis I will go from extracting Twitter data to doing some basic analysis on it.

For a live exploration of Twitter data, the stream must be accessed. For more information on the Twitter stream, take a look at the documentation.

For this study, the StreamListener class from the tweepy library will be used. I will also be using TextBlob for sentiment analysis, json and pandas for handling data, and matplotlib for plotting.

from tweepy.streaming import StreamListener
from tweepy import Stream
from tweepy import OAuthHandler

from textblob import TextBlob

import json
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

Import keys

I usually keep a git-ignored file with personal configuration in my repos. In this case the file is myKeys.py and contains my OAuth keys. These are my personal keys; you can get yours here.

import myKeys

api_key = myKeys.api_key
api_secret = myKeys.api_secret
access_token_key = myKeys.access_token_key
access_token_secret = myKeys.access_token_secret
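
If keeping a separate Python file is not convenient, the same four values could be read from environment variables instead. The sketch below is one possible alternative, not what this post uses; the load_keys helper and the TWITTER_* variable names are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
KEY_NAMES = {
    'api_key': 'TWITTER_API_KEY',
    'api_secret': 'TWITTER_API_SECRET',
    'access_token_key': 'TWITTER_ACCESS_TOKEN_KEY',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_keys(environ=None):
    """Return the four OAuth values as a dict, failing loudly if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in KEY_NAMES.values() if var not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {name: environ[var] for name, var in KEY_NAMES.items()}
```

Calling load_keys() with no argument reads from the real environment; passing a plain dict makes the function easy to try without exporting anything.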

Create a listener to handle every tweet

This class defines a handler for the tweet event and is the backbone of this exploration’s processing. It receives every tweet, parses it from JSON, and stores its text together with its sentiment score in a pandas DataFrame. It also adds the score of every tweet to a sentimentIntegral variable, which holds the running total score for the terms being tracked.

class ColorListener(StreamListener):

    def __init__(self):
        super(ColorListener, self).__init__()  # let StreamListener set up its own state
        self.sentimentIntegral = 0  # running sum of sentiment scores
        self.tweets = pd.DataFrame(columns=('tweet', 'sentiment'))

    def on_data(self, data):
        try:
            tweet = json.loads(data)
            blob = TextBlob(tweet['text'])
            polarity = blob.sentiment.polarity  # sentiment score in [-1, 1]
            self.sentimentIntegral += polarity
            # Print the score of this tweet and the running total
            print("{0:.2f} {1:.2f}".format(polarity, self.sentimentIntegral))
            # Add the tweet as a new row at the end of the DataFrame
            self.tweets.loc[len(self.tweets)] = [tweet['text'], polarity]
        except UnboundLocalError:
            raise
        except Exception:
            # Skip anything that cannot be processed, e.g. messages without a 'text' field
            pass
        return True

    def getTotalScore(self):
        return self.sentimentIntegral

    def on_error(self, status):
        print("Error: {0}".format(status))
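
The bookkeeping inside on_data can be tried offline, without a Twitter connection. The sketch below feeds hand-written JSON payloads through the same steps: parse, score, accumulate. The toy_score function and its word lists are a hypothetical stand-in for TextBlob's polarity, used only so the example has no external dependencies, and the payloads are made up as well.

```python
import json

# Hypothetical stand-in for TextBlob polarity:
# +1 for every positive word, -1 for every negative word.
POSITIVE = {'love', 'great'}
NEGATIVE = {'hate', 'awful'}

def toy_score(text):
    words = text.lower().split()
    return float(sum((w in POSITIVE) - (w in NEGATIVE) for w in words))

def accumulate(payloads, score=toy_score):
    """Mimic the listener: return (rows, running_total) for raw JSON strings."""
    rows, total = [], 0.0
    for raw in payloads:
        message = json.loads(raw)
        if 'text' not in message:  # skip messages that carry no tweet text
            continue
        value = score(message['text'])
        total += value
        rows.append((message['text'], value))
    return rows, total

# Made-up payloads; the second one has no 'text' field and is skipped.
payloads = [
    json.dumps({'text': 'I love the red one'}),
    json.dumps({'limit': {'track': 1}}),
    json.dumps({'text': 'awful green paint'}),
    json.dumps({'text': 'great blue sky, love it'}),
]
rows, total = accumulate(payloads)
print(len(rows), total)
```

Checking for the 'text' field explicitly, instead of relying on a broad except clause, makes it clearer which messages are dropped and why.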

Instantiate and run

The listener object is created and hooked to the Twitter stream with the proper authentication. The stream is then filtered by specific search terms, selected to represent a specific social phenomenon.

cListener = ColorListener()
auth = OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token_key, access_token_secret)

stream = Stream(auth, cListener)

# Start reading the stream for English tweets containing the color words
stream.filter(languages=['en'], track=['red', 'green','blue'])
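
The call to stream.filter keeps running until it is interrupted. One way to end the collection automatically is to have the handler return False once enough tweets have arrived; in tweepy 3.x, as far as I know, a False return value from on_data closes the stream. The sketch below shows only the counting logic in a plain class, so it runs without tweepy. LimitedHandler and max_tweets are hypothetical names, and the behaviour should be checked against the tweepy version in use.

```python
class LimitedHandler(object):
    """Counts incoming messages and signals when to stop (hypothetical helper)."""

    def __init__(self, max_tweets):
        self.max_tweets = max_tweets
        self.seen = 0

    def on_data(self, data):
        self.seen += 1
        # True means "keep going"; False is the stop signal.
        return self.seen < self.max_tweets

handler = LimitedHandler(max_tweets=3)
results = [handler.on_data('{}') for _ in range(3)]
print(results)  # [True, True, False]
```

The same counter could live inside ColorListener.on_data, replacing the unconditional return True.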

Analyze the DataFrame

Now we have a pandas DataFrame inside the cListener object that we can analyze.

df = cListener.tweets
print(len(df.index)) # Number of rows
1115
df.head() # What the data looks like
   tweet                                              sentiment
0  RT @5REDVELVET: [OFFICIAL] 160315 RED VELVET #...  -0.375000
1  RT @WampsBraintree: Wamp Train scheduled for 5...   0.000000
2  RT @thickred3x: Order Red & @JovanJordanXX...       0.000000
3  @lipdistrikt Thks 4 following! - Please vote f...   0.166667
4  RT @UFCONFOX: Dustin Poirier vs. Bobby Green j...  -0.200000
df.plot(figsize=(16, 8)) # Plot the sentiment as a time series

df['sentiment'].plot.kde(figsize=(16, 8))

Mining text

The regular expressions library will be used to mine the text.

import re

A function is then created to detect each of the tracked terms in a tweet.

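def wordInText(word, text):
    # Return 1 if word occurs anywhere in text (case-insensitive), else 0.
    # Note: this is a substring match, so 'red' also matches 'delivered'.
    word = word.lower()
    text = text.lower()
    match = re.search(re.escape(word), text)  # escape so word is matched literally
    if match:
        return 1
    return 0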
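
Because re.search looks for the term anywhere in the text, a short word such as 'red' is also found inside longer words. The sketch below contrasts that substring behaviour with a whole-word variant built on the \b word-boundary anchor. Both function names are hypothetical, and wholeWordInText is not the function used for the counts in this post.

```python
import re

def substringInText(word, text):
    # Same idea as wordInText: match anywhere, ignoring case.
    return 1 if re.search(re.escape(word), text, re.IGNORECASE) else 0

def wholeWordInText(word, text):
    # \b requires a word boundary on both sides of the term.
    pattern = r'\b' + re.escape(word) + r'\b'
    return 1 if re.search(pattern, text, re.IGNORECASE) else 0

print(substringInText('red', 'Order delivered today'))  # 1, matches inside "delivered"
print(wholeWordInText('red', 'Order delivered today'))  # 0
print(wholeWordInText('red', 'RED VELVET'))             # 1
```

Which variant is appropriate depends on the question being asked: the substring match stays closer to whatever the stream returned, while the whole-word match is stricter about what counts as a mention.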

Create a new column for each term, indicating whether the term appears in the tweet.

# Each new column is the result of applying wordInText to every element of the tweet column
df['red'] = df['tweet'].apply(lambda tweet: wordInText('red', tweet))
df['green'] = df['tweet'].apply(lambda tweet: wordInText('green', tweet))
df['blue'] = df['tweet'].apply(lambda tweet: wordInText('blue', tweet))
df.head() # What the data looks like
   tweet                                              sentiment  red  green  blue
0  RT @5REDVELVET: [OFFICIAL] 160315 RED VELVET #...  -0.375000    1      0     0
1  RT @WampsBraintree: Wamp Train scheduled for 5...   0.000000    1      0     0
2  RT @thickred3x: Order Red & @JovanJordanXX...       0.000000    1      0     0
3  @lipdistrikt Thks 4 following! - Please vote f...   0.166667    1      0     0
4  RT @UFCONFOX: Dustin Poirier vs. Bobby Green j...  -0.200000    0      1     0
print(df['red'].value_counts())
0    654
1    461
Name: red, dtype: int64
df[['red','green','blue']].sum()
red      461
green    249
blue     410
dtype: int64
df[['red','green','blue']].sum().plot(kind='bar',color=['r','g','b'],figsize=(16, 8))

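
With the indicator columns in place, the sentiment can also be compared across the tracked terms. The sketch below averages the sentiment of the tweets matching each color, using only the five sample rows printed by df.head() above rather than the full data set, so the numbers say nothing about the complete stream. mean_sentiment_by_color is a hypothetical helper.

```python
import pandas as pd

# The five sample rows shown by df.head(), without the tweet text.
sample = pd.DataFrame({
    'sentiment': [-0.375, 0.0, 0.0, 0.166667, -0.2],
    'red':       [1, 1, 1, 1, 0],
    'green':     [0, 0, 0, 0, 1],
    'blue':      [0, 0, 0, 0, 0],
})

def mean_sentiment_by_color(frame, colors=('red', 'green', 'blue')):
    """Average sentiment of the rows flagged for each color (None if no rows match)."""
    means = {}
    for color in colors:
        matched = frame.loc[frame[color] == 1, 'sentiment']
        means[color] = float(matched.mean()) if len(matched) else None
    return means

print(mean_sentiment_by_color(sample))
```

On the full DataFrame the call would be mean_sentiment_by_color(df). A tweet that mentions several colors contributes to each of their averages.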