The Ontario Progressive Conservative Party Leadership Race On Twitter
The Ontario Progressive Conservative Party is putting on quite a show these days. As a side project, a colleague and I decided to capture the tweets to the dominant hashtags #pcpo and #pcpoldr, for history’s sake. Neither of us is an expert at sentiment analysis, but we have done a little bit and this is just too juicy.
One note: I made a first stab at getting tweets using the twitteR package, but this proved to be less than optimal because it does not download the full text of tweets with more than 140 characters. So I switched to the rtweet package, which, honestly, looks like a substantial improvement.
After some fiddling getting my access tokens set up, one command was enough to get the tweets from Sunday, February 18th to today. Going forward, I’ll add to these.
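For completeness, here is a rough sketch of the token setup, assuming rtweet’s create_token() function; the app name and keys are placeholders, not my actual credentials.
#Set up an access token; the app name and keys are placeholders
library(rtweet)
token <- create_token(app = 'my_twitter_app',
                      consumer_key = 'YOUR_CONSUMER_KEY',
                      consumer_secret = 'YOUR_CONSUMER_SECRET')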
Here’s how you grab all the tweets with the hashtags #pcpoldr or #pcpo. If you ask for more than 18,000 tweets, you have to set retryonratelimit to TRUE. This takes a while to run, so I’m not executing it here; the sketch below shows what the call looks like.
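Something along these lines should work, assuming the rtweet defaults of the time; the query string and the value of n are illustrative rather than my exact call.
#A sketch of the search; both hashtags go in one query
pc_tweets <- search_tweets('#pcpo OR #pcpoldr',
                           n = 50000,                 #more than 18,000, so rate limits will be hit
                           retryonratelimit = TRUE,   #wait out the rate limit and keep going
                           include_rts = TRUE)        #keep retweets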
You can also work with the results of the search_tweets function right away. However, because I’m downloading tweets every couple of days, I prefer to save them as a CSV file, redo the search in a day or two, and then read everything in and combine it. Here’s how I did that.
The save_as_csv command is a nice one that produces two separate but paired CSV files. One has each tweet and the data associated with it (e.g. the date created, the text, whether it’s a retweet) and the other has the user data of the person who generated the tweet (e.g. screen name, follower count, etc.).
Note that I’m saving these in a subfolder of my working directory and I’m naming the saved files with the date and time stamp of the most recent tweet. That way I’ll know where the file stops.
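A sketch of that saving step, assuming the pc_tweets data frame from the search sketched above; the Rtweets folder and the exact timestamp formatting are just my choices here.
#Time stamp of the most recent tweet, formatted for a file name
last_tweet <- format(max(pc_tweets$created_at), '%Y-%m-%d %H-%M-%S')
#save_as_csv writes two paired files: <name>.tweets.csv and <name>.users.csv
save_as_csv(pc_tweets, file_name = file.path('Rtweets', last_tweet))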
After I’ve conducted the search a couple of times, I’m ready to read in all the files and combine them.
First, I list the files that are in the Rtweets subfolder.
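The listing chunk isn’t shown above, but something like this produces the output below and the files object used later.
#List the saved CSV files, keeping the folder in each path
files <- list.files('Rtweets', full.names = TRUE)
files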
## [1] "Rtweets/2018-02-28-tweets.csv"
## [2] "Rtweets/2018-02-28-users.csv"
## [3] "Rtweets/2018-03-02 17-50-09.users.csv"
## [4] "Rtweets/2018-03-02-17-50-09.tweets.csv"
## [5] "Rtweets/2018-03-09 17-19-49.tweets.csv"
## [6] "Rtweets/2018-03-09 17-19-49.users.csv"
## [7] "Rtweets/2018-03-09 17-59-39.tweets.csv"
## [8] "Rtweets/2018-03-09 17-59-39.users.csv"
## [9] "Rtweets/2018-03-11 16-59-42.tweets.csv"
## [10] "Rtweets/2018-03-11 16-59-42.users.csv"
## [11] "Rtweets/2018-03-12 16-59-38.tweets.csv"
## [12] "Rtweets/2018-03-12 16-59-38.users.csv"
## [13] "Rtweets/2018-03-20 23-55-04.tweets.csv"
## [14] "Rtweets/2018-03-20 23-55-04.users.csv"
Then, I read each file in and bind them together.
#Read libraries
library(readr)
library(dplyr)
#Read in the tweet files and bind them into one data frame
tweets <- lapply(files[grep('tweets.csv', files)], read_csv) %>%
  bind_rows()
Part of what is returned is a variable called created_at. We just need to turn it into a datetime variable and round it to the hour to plot it hourly.
#load lubridate library
library(lubridate)
#Format date variable
tweets$Date<-ymd_hms(tweets$created_at, tz='UTC')
#Round to the hour
tweets$Date<-round_date(tweets$Date, 'hour')
Then I group the tweets by date, count how many there are and plot the counts.
#Load ggplot2 for plotting
library(ggplot2)
tweets %>%
  group_by(Date) %>%
  summarize(freq = n()) %>%
  ggplot(., aes(x = Date, y = freq)) + geom_line() +
  scale_x_datetime(date_breaks = '1 days', date_minor_breaks = '6 hours', date_labels = '%m %d %H', timezone = 'EST') +
  theme(axis.text.x = element_text(angle = 90)) + ylim(c(0, 500))
So obviously things were pretty steady and cyclical on a daily basis through the first days of the leadership race, until February 26th, when Patrick Brown, who had resigned as leader and then entered the race to succeed himself, withdrew. For anyone reading this who is not from Canada: yeah, it’s been pretty wild. Best just to read the Wikipedia entry.
There are a few things that are interesting here to me though beyond that.
First, it looks to me like the Twitter conversation about a leadership race has a daily news cycle to it, with one peak every day. However, every day has a different peak. Sometimes it’s early morning, sometimes it’s late at night. I had sort of thought that there might be a little more regularity in this, that you might see people going online and posting something routinely, like just before they left for work or just after supper or on their lunch break. But it seems like people go on Twitter as events warrant, once a day.
The other thing to notice is that the biggest peak was during the debate on March 1st. A little more analysis will tell us what drove that peak: was it partisans promoting their candidates? Journalists weighing in? Or, god forbid, regular people talking to other regular people? So the debate actually garnered more Twitter activity than the bombshell of Brown’s withdrawal, and so did the post-debate activity on March 2nd.
One other quick thing we can do is to find who the most frequent tweeters are on the hashtags.
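That chunk isn’t shown either; my guess is a simple dplyr count of screen names, along these lines.
#Count tweets by screen name and keep the 20 most frequent posters
tweets %>%
  count(screen_name, sort = TRUE) %>%
  top_n(20, n)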
## # A tibble: 20 x 2
## screen_name n
## <chr> <int>
## 1 5BobbyArmstrong 1426
## 2 KrankyKanuck 1037
## 3 AORRHLiberals 765
## 4 PCPONewsWatch 734
## 5 AwuniAlbGene 727
## 6 EdFlint2 654
## 7 Madhattersdes 622
## 8 pqpolitics 596
## 9 dneilmckay 591
## 10 nrbruns 550
## 11 WaldingerTrudy 525
## 12 dennisfurlan 494
## 13 LyndaE222 480
## 14 MuskokaMoneybag 455
## 15 geoff_bernz 433
## 16 BobBaileyPC 412
## 17 Uranowski 409
## 18 happylou12 389
## 19 TGranicAllen 388
## 20 mdr1607 361
My favourite is probably MuskokaMoneybag. Classy. Of course it’s moderately interesting that Tanya Granic Allen is the only leadership contender to appear in the top 20. There might be something to be said about Twitter being a crucial resource for the outsider candidate.
I’ll be playing around with this in the next few weeks.