Twitter Stream Exploration
Twitter is a great source of live data that can be used in many different areas, all of which build on its core functionality: social data. In this exploratory analysis I will go from extracting Twitter data to running some basic analysis on it.
For a live exploration of Twitter data, the stream must be accessed. For more information on the Twitter stream, take a look at the documentation.
For this study the StreamListener from the tweepy library will be used. I will also be using TextBlob for sentiment analysis and pandas, json and matplotlib for handling data.
from tweepy.streaming import StreamListener
from tweepy import Stream
from tweepy import OAuthHandler
from textblob import TextBlob
import json
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
Import keys
I usually keep a git-ignored file with personal configuration in my repos. In this case the file is myKeys.py and contains my OAuth keys. These are my personal keys; you can get yours here.
import myKeys
api_key = myKeys.api_key
api_secret = myKeys.api_secret
access_token_key = myKeys.access_token_key
access_token_secret = myKeys.access_token_secret
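For readers setting this up themselves, a myKeys.py along the following lines would satisfy the import above. This is only a sketch of one possible layout: the four variable names come from the code above, while the environment variable names and the placeholder strings are assumptions made for illustration.

```python
# myKeys.py (hypothetical layout; keep this file out of version control)
import os

# Each value is read from an environment variable (names assumed here)
# and falls back to an obviously fake placeholder when it is not set.
api_key = os.environ.get("TWITTER_API_KEY", "YOUR_API_KEY")
api_secret = os.environ.get("TWITTER_API_SECRET", "YOUR_API_SECRET")
access_token_key = os.environ.get("TWITTER_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN")
access_token_secret = os.environ.get("TWITTER_ACCESS_SECRET", "YOUR_ACCESS_SECRET")
```

Reading from the environment keeps the real values out of the file itself, so the file can stay harmless even if it is committed by accident.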
Create a listener to handle every tweet
This class defines a handler for the tweet event and is the backbone of this exploration’s processing. It receives every tweet, parses it from JSON, and stores its text together with its sentiment score in a pandas DataFrame. It also adds the score of every tweet to a sentimentIntegral variable, which holds the running total score for the terms being tracked.
class ColorListener(StreamListener):
    def __init__(self):
        self.sentimentIntegral = 0  # running sum of tweet polarities
        self.tweets = pd.DataFrame(columns=('tweet', 'sentiment'))

    def on_data(self, data):
        try:
            tweet = json.loads(data)
            blob = TextBlob(tweet['text'])
            polarity = blob.sentiment[0]  # polarity, in [-1.0, 1.0]
            self.sentimentIntegral += polarity
            # Print this tweet's polarity and the running total
            print("{0:.2f} {1:.2f}".format(polarity, self.sentimentIntegral))
            row = pd.Series([tweet['text'], polarity], index=['tweet', 'sentiment'])
            self.tweets = self.tweets.append(row, ignore_index=True)
        except UnboundLocalError:
            raise
        except Exception:
            # Skip messages that are not tweets (no 'text' field) or fail to parse
            pass
        return True

    def getTotalScore(self):
        return self.sentimentIntegral

    def on_error(self, status):
        print("Error: {0}".format(status))
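The accumulation step can be tried without a network connection. The snippet below is a minimal sketch and not the listener itself: toy_polarity is a hypothetical stand-in for the TextBlob polarity, accumulate is a name introduced here, and the sample messages are made up. It also illustrates why the handler needs a guard: the stream interleaves messages that are not tweets (deletion notices, for example), and those carry no text field.

```python
import json


def toy_polarity(text):
    """Hypothetical stand-in for TextBlob polarity: +1 per 'good', -1 per 'bad'."""
    words = text.lower().split()
    return float(words.count("good") - words.count("bad"))


def accumulate(raw_messages, score=toy_polarity):
    """Return (rows, running_total) for the messages that carry a 'text' field."""
    rows = []
    total = 0.0
    for raw in raw_messages:
        message = json.loads(raw)
        text = message.get("text")
        if text is None:
            # Not a tweet (e.g. a deletion notice): skip it.
            continue
        polarity = score(text)
        total += polarity
        rows.append({"tweet": text, "sentiment": polarity})
    return rows, total


# Made-up messages, for illustration only.
sample = [
    '{"text": "red is good"}',
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "blue is bad bad"}',
]
rows, total = accumulate(sample)
print(len(rows), total)
```

Checking for the missing field explicitly, rather than catching every exception, keeps genuine bugs visible while still ignoring the non-tweet messages.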
Instantiate and run
The listener object is created and hooked to the Twitter stream with the proper authentication. The stream is then filtered by specific search terms, selected to represent a specific social phenomenon.
cListener = ColorListener()
auth = OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token_key, access_token_secret)
stream = Stream(auth, cListener)
# Start reading the stream for English tweets containing the color words.
# This call blocks until the stream is interrupted.
stream.filter(languages=['en'], track=['red', 'green', 'blue'])
Analyze the DataFrame
Now we have a pandas DataFrame inside the cListener object that we can analyze.
df = cListener.tweets
print(len(df.index))  # Number of rows
1115
df.head()  # What the data looks like
|   | tweet | sentiment |
|---|---|---|
| 0 | RT @5REDVELVET: [OFFICIAL] 160315 RED VELVET #... | -0.375000 |
| 1 | RT @WampsBraintree: Wamp Train scheduled for 5... | 0.000000 |
| 2 | RT @thickred3x: Order Red & @JovanJordanXX... | 0.000000 |
| 3 | @lipdistrikt Thks 4 following! - Please vote f... | 0.166667 |
| 4 | RT @UFCONFOX: Dustin Poirier vs. Bobby Green j... | -0.200000 |
df.plot(figsize=(16, 8)) # Plot the sentiment as a time series
df['sentiment'].plot.kde(figsize=(16, 8))
Mining text
The regular expressions library will be used to mine the text.
import re
A function is then created to check whether a tracked term appears in a tweet.
def wordInText(word, text):
    # Case-insensitive substring match: returns 1 if word occurs anywhere in text
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return 1
    return 0
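Because re.search(word, text) looks for the term anywhere in the text, 'red' also matches inside longer tokens such as the handle @thickred3x. If whole-word matching is wanted instead, a variant along these lines could be used. It is a sketch, not the function used for the results in this post: word_in_text is a name introduced here, and switching to it would change the counts reported further down.

```python
import re


def word_in_text(word, text):
    """Return 1 if `word` appears in `text` as a whole word, ignoring case."""
    # re.escape guards against terms containing regex metacharacters;
    # \b anchors the match at word boundaries on both sides.
    pattern = r"\b" + re.escape(word) + r"\b"
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0


print(word_in_text("red", "Order Red today"))
print(word_in_text("red", "hello @thickred3x"))
```

The first call matches the standalone word, while the second does not, because the letters surrounding "red" in the handle leave no word boundary.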
Create a new column for each tracked term, flagging whether the term appears in the tweet.
# Each new column is the result of applying wordInText to every element of the 'tweet' column
df['red'] = df['tweet'].apply(lambda tweet: wordInText('red', tweet))
df['green'] = df['tweet'].apply(lambda tweet: wordInText('green', tweet))
df['blue'] = df['tweet'].apply(lambda tweet: wordInText('blue', tweet))
df.head()  # What the data looks like
|   | tweet | sentiment | red | green | blue |
|---|---|---|---|---|---|
| 0 | RT @5REDVELVET: [OFFICIAL] 160315 RED VELVET #... | -0.375000 | 1 | 0 | 0 |
| 1 | RT @WampsBraintree: Wamp Train scheduled for 5... | 0.000000 | 1 | 0 | 0 |
| 2 | RT @thickred3x: Order Red & @JovanJordanXX... | 0.000000 | 1 | 0 | 0 |
| 3 | @lipdistrikt Thks 4 following! - Please vote f... | 0.166667 | 1 | 0 | 0 |
| 4 | RT @UFCONFOX: Dustin Poirier vs. Bobby Green j... | -0.200000 | 0 | 1 | 0 |
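The same 0/1 flags can also be computed with pandas' vectorized string methods instead of calling a Python function per row. The sketch below uses Series.str.contains on a small made-up DataFrame; the sample rows are for illustration only and are not taken from the collected data.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
df = pd.DataFrame(
    {"tweet": ["Order Red today", "Bobby Green wins", "Blue and green skies"]}
)

for color in ["red", "green", "blue"]:
    # case=False mirrors the lower() calls in wordInText;
    # regex=False treats the term as a literal substring.
    df[color] = df["tweet"].str.contains(color, case=False, regex=False).astype(int)

print(df[["red", "green", "blue"]].sum())
```

Like wordInText, this is a substring match, so it should give the same flags as the apply-based version for plain terms such as these.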
print(df['red'].value_counts())
0 654
1 461
Name: red, dtype: int64
df[['red','green','blue']].sum()
red 461
green 249
blue 410
dtype: int64
df[['red','green','blue']].sum().plot(kind='bar',color=['r','g','b'],figsize=(16, 8))
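The three totals add up to 1120, which is more than the 1115 tweets collected, so some tweets must match more than one term. A natural next step is to count those overlaps and to compare the mean sentiment of the tweets matching each term. The snippet below is a sketch of that idea on a small made-up DataFrame with the same column layout; the numbers in it are illustrative and say nothing about the real data.

```python
import pandas as pd

# Made-up rows in the same shape as the DataFrame above, for illustration only.
df = pd.DataFrame(
    {
        "sentiment": [-0.5, 0.0, 0.5, 1.0],
        "red": [1, 1, 0, 0],
        "green": [0, 1, 1, 0],
        "blue": [0, 0, 0, 1],
    }
)
colors = ["red", "green", "blue"]

# Number of tweets that matched more than one tracked term.
multi = int((df[colors].sum(axis=1) > 1).sum())

# Mean sentiment of the tweets matching each term.
mean_by_color = {c: df.loc[df[c] == 1, "sentiment"].mean() for c in colors}

print(multi)
print(mean_by_color)
```

A tweet that matches several terms contributes to the mean of each of them, which is worth keeping in mind when comparing the per-color values.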