The Ontario Progressive Conservative Party Leadership Race On Twitter

The Ontario Progressive Conservative Party is putting on quite a show these days. As a side project, a colleague and I decided to capture the tweets on the dominant hashtags, #pcpo and #pcpoldr, for history’s sake. Neither of us is an expert at sentiment analysis, but we have done a little bit and this is just too juicy.

One note: I made a first stab at getting tweets using the twitteR package, but this proved to be less than optimal because it does not download the full text of tweets longer than 140 characters. So I switched to the rtweet package which, honestly, looks like a substantial improvement.

After some fiddling to get my access tokens set up, one command was enough to get the tweets from Sunday, February 18th to today. Going forward, I’ll add to these.

Here’s how you grab all the tweets that use the hashtags #pcpoldr or #pcpo. If you ask for more than 18000 tweets, you have to set retryonratelimit to TRUE. This takes a while to run, so I’m not executing this.
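The search described above might look roughly like the sketch below. The query string and the retryonratelimit argument come from the text; the value of n and the wrapper function name fetch_tweets are my own assumptions for illustration, and the call only works once a Twitter API token is configured.

```r
# Query: tweets that contain either hashtag.
query <- "#pcpoldr OR #pcpo"

# Hypothetical wrapper so that nothing hits the API until it is called.
# n = 50000 is an assumed value; anything above 18000 needs retryonratelimit = TRUE.
fetch_tweets <- function(q = query, n = 50000) {
  rtweet::search_tweets(q, n = n, include_rts = TRUE, retryonratelimit = TRUE)
}

# Not run: this takes a while and needs valid access tokens.
# tweets <- fetch_tweets()
```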

You can also work with the results of the search_tweets function right away. However, because I’m downloading tweets every couple of days, I prefer to save them as CSV files, redo the search in a day or two, then read everything in and combine it. Here’s how I did that.

The save_as_csv command is a nice one that produces two separate but paired CSV files. One has each tweet and the data associated with it (e.g. the date created, the text, whether it’s a retweet) and the other has the user data of the person who generated the tweet (e.g. screen name, follower count, etc.).

Note that I’m saving these in a subfolder of my working directory and I’m naming the saved files with the date and time stamp of the most recent tweet. That way I’ll know where each file stops.
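As a rough sketch of that naming scheme: the helper make_stem below is hypothetical, and the timestamp format is only inferred from file names such as "2018-03-09 17-19-49.tweets.csv". It also assumes the rtweet version in use appends the .tweets.csv and .users.csv suffixes itself, which may not hold for other versions.

```r
# Build a file name stem from the most recent tweet's timestamp,
# e.g. "Rtweets/2018-03-09 17-19-49".
make_stem <- function(created_at, dir = "Rtweets") {
  stamp <- format(max(created_at), "%Y-%m-%d %H-%M-%S", tz = "UTC")
  file.path(dir, stamp)
}

# Not run: needs the `tweets` data frame returned by search_tweets().
# rtweet::save_as_csv(tweets, make_stem(tweets$created_at))
```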

After I’ve conducted the search a couple of times, I’m ready to read in all the files and combine them.

First, I list the files that are in the Rtweets subfolder.

##  [1] "Rtweets/2018-02-28-tweets.csv"         
##  [2] "Rtweets/2018-02-28-users.csv"          
##  [3] "Rtweets/2018-03-02 17-50-09.users.csv" 
##  [4] "Rtweets/2018-03-02-17-50-09.tweets.csv"
##  [5] "Rtweets/2018-03-09 17-19-49.tweets.csv"
##  [6] "Rtweets/2018-03-09 17-19-49.users.csv" 
##  [7] "Rtweets/2018-03-09 17-59-39.tweets.csv"
##  [8] "Rtweets/2018-03-09 17-59-39.users.csv" 
##  [9] "Rtweets/2018-03-11 16-59-42.tweets.csv"
## [10] "Rtweets/2018-03-11 16-59-42.users.csv" 
## [11] "Rtweets/2018-03-12 16-59-38.tweets.csv"
## [12] "Rtweets/2018-03-12 16-59-38.users.csv" 
## [13] "Rtweets/2018-03-20 23-55-04.tweets.csv"
## [14] "Rtweets/2018-03-20 23-55-04.users.csv"
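A listing like the one above can be produced with base R's list.files. This is a minimal sketch: the function name list_tweet_files is made up for illustration, and it assumes the CSV files sit in a subfolder called Rtweets.

```r
# List every CSV file in the folder, keeping the folder name in each path.
list_tweet_files <- function(dir = "Rtweets") {
  list.files(dir, pattern = "\\.csv$", full.names = TRUE)
}

files <- list_tweet_files()
files
```

Keeping full.names = TRUE means the paths can be handed straight to a CSV reader without pasting the folder name back on.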

Then, I read each file in and bind them together.

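#load libraries
library(readr)
library(dplyr)

#Read in only the tweet files (not the user files) and stack them
#files is the vector of file paths listed above
tweets<-lapply(files[grep('tweets.csv', files, fixed=TRUE)], read_csv) %>% 
  bind_rows()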

Part of what is returned is a variable created_at. We just need to turn it into a datetime variable and round it to the hour to plot it hourly.

#load lubridate library
library(lubridate)
#Format date variable
tweets$Date<-ymd_hms(tweets$created_at, tz='UTC')
#Round to the hour
tweets$Date<-round_date(tweets$Date, 'hour')

Then I group the tweets by date, count how many there are and plot the counts.

#load ggplot2 for plotting
library(ggplot2)
#Count tweets per hour and plot the counts
tweets %>% 
  group_by(Date) %>% 
  summarize(freq=n()) %>% 
  ggplot(., aes(x=Date, y=freq))+geom_line()+
  scale_x_datetime(date_breaks='1 days',date_minor_breaks='6 hours', date_labels='%m %d %H', timezone='EST')+
  theme(axis.text.x = element_text(angle=90))+ylim(c(0,500))

So obviously things were pretty steady and cyclical on a daily basis through the first days of the leadership race until February 26th, when Patrick Brown withdrew after resigning and then reentering the race. For anyone reading this who is not from Canada, yeah, it’s been pretty wild. Best just to read the Wikipedia entry.

Beyond that, though, a few things here are interesting to me.

First, it looks to me like the Twitter conversation about a leadership race has a daily news cycle to it, with one peak every day. However, the peak falls at a different time each day. Sometimes it’s early morning, sometimes it’s late at night. I had sort of thought that there might be a little more regularity in this, that you might see people going online and posting something routinely, like just before they left work for home or just after supper or on their lunch break. But this seems like people go on to Twitter as events warrant, once a day.
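One way to check that impression would be to pull out the busiest hour of each day. The sketch below is only an illustration: the function name peak_hours is hypothetical, it assumes a data frame with a datetime column named Date as created earlier, and it uses the same 'EST' time zone as the plot.

```r
library(dplyr)
library(lubridate)

# For each local calendar day, return the hour with the most tweets.
peak_hours <- function(df, tz = "EST") {
  df %>%
    mutate(local = with_tz(Date, tz),
           day = as.Date(local, tz = tz),
           hour = hour(local)) %>%
    count(day, hour) %>%
    group_by(day) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%
    ungroup()
}

# Usage, once the combined data frame exists:
# peak_hours(tweets)
```

If the peak hour jumps around from day to day, that would support the idea that activity follows events rather than routine.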

The other thing to notice is that the biggest peak was during the debate on March 1st. So the debate actually garnered more Twitter activity than the bombshell of Brown’s resignation, and so did the post-debate activity on March 2nd. A little more analysis will tell us what drove that peak, i.e. was it partisans promoting their candidates? Journalists weighing in? Or, god forbid, regular people talking to other regular people?

One other quick thing we can do is to find who the most frequent tweeters are on the hashtags.

## # A tibble: 20 x 2
##    screen_name         n
##    <chr>           <int>
##  1 5BobbyArmstrong  1426
##  2 KrankyKanuck     1037
##  3 AORRHLiberals     765
##  4 PCPONewsWatch     734
##  5 AwuniAlbGene      727
##  6 EdFlint2          654
##  7 Madhattersdes     622
##  8 pqpolitics        596
##  9 dneilmckay        591
## 10 nrbruns           550
## 11 WaldingerTrudy    525
## 12 dennisfurlan      494
## 13 LyndaE222         480
## 14 MuskokaMoneybag   455
## 15 geoff_bernz       433
## 16 BobBaileyPC       412
## 17 Uranowski         409
## 18 happylou12        389
## 19 TGranicAllen      388
## 20 mdr1607           361
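A table like the one above can be produced with dplyr's count. This is a sketch rather than the exact code used: top_tweeters is a hypothetical helper, and the toy data frame with made-up screen names only stands in for the combined tweets data.

```r
library(dplyr)

# Toy stand-in for the combined tweets data frame (made-up screen names).
toy <- data.frame(screen_name = c("a", "b", "a", "c", "a", "b"),
                  stringsAsFactors = FALSE)

# Count tweets per account, most frequent first, and keep the top k.
top_tweeters <- function(df, k = 20) {
  df %>%
    count(screen_name, sort = TRUE) %>%
    head(k)
}

top_tweeters(toy, k = 2)
# With the real data: top_tweeters(tweets)
```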

My favourite is probably MuskokaMoneybag. Classy. It’s also moderately interesting that Tanya Granic Allen is the only leadership contender to appear in the Top 20. There might be something to be said about Twitter being a crucial resource for the outsider candidate.

I’ll be playing around with this in the next few weeks.
