
Comparison of male and female sentiment in tweets

In this post we will pull some tweets based on a keyword, look up the users’ real names to infer their gender, and then run a sentiment analysis to see if it varies between men and women.

We’ll be using the twitteR and text mining packages as well as the gender and ggplot2 packages.

I got the idea from examples used in the qdap package.

Let’s load the packages, authenticate and pull some tweets.

libs <- c("gender", "twitteR", "qdap", "ggplot2", "tm")
lapply(libs, library, character.only=TRUE)

#Register Twitter OAuth
source ("D:/Twitter_OAuth_1_1.R")

#Define search terms
search_terms <- c("lobster", "heaven")

#Pull tweets
tw <- data.frame()
for (i in seq_along(search_terms)){
  print(paste("Searching for: ",search_terms[i]))
  #'res' rather than 't', so the base function t() is not masked
  res <- searchTwitter(search_terms[i], n=round(2000/length(search_terms)), lang="en")
  print(paste(length(res), " results found for the keyword: ", search_terms[i], sep=""))
  tw <- rbind(twListToDF(res), tw)
}

Most of the time I add a print() call, as it’s always good to know what is going on while a for() loop is running.

Now we will get the real names of the users involved in the conversation. Note that this takes some time due to API limitations: we can query 180 real names per 15 minutes (check getCurRateLimitInfo()). This is why I put in a Sys.sleep() of five seconds.

1 query / 5 seconds -> ± 2h47 min for 2000 names.
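
The estimate can be checked with a few lines of arithmetic. This is only a sketch: the 180 lookups per 15 minutes is the limit quoted above, and the variable names are made up for the illustration.

```r
# Rough estimate of how long the name lookups take at the rate limit
calls_per_window <- 180        # lookups allowed per window, as quoted above
window_seconds   <- 15 * 60    # one 15-minute window, in seconds
n_names          <- 2000       # assuming one lookup per tweet

seconds_per_call <- window_seconds / calls_per_window
total_seconds    <- n_names * seconds_per_call

hours   <- total_seconds %/% 3600
minutes <- round((total_seconds %% 3600) / 60)

cat(sprintf("%g s per call, about %d h %d min in total\n",
            seconds_per_call, as.integer(hours), as.integer(minutes)))
```

Removing duplicated screen names before the lookups would shorten this, since the same user may appear in several tweets.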

This will return a lot of rubbish. I’m afraid people are not keen on giving their real name, and some accounts might be “protected”.

#Add Name
tw$name <- 0

#get users' real names
user <- tw$screenName

#try() as some lookups fail due to privacy issues.
#usr is reset on every pass so a failed lookup cannot reuse the previous user
for (i in seq_along(user)) {
  usr <- NULL
  try(usr <- getUser(as.character(user[i])), silent=TRUE)
  if (!is.null(usr) && length(usr) > 0) {
    n <- usr$getName()
    print(paste(user[i], "'s real name is", n, sep=" "))
    tw$name[i] <- n
  }
  #wait five seconds to stay under the rate limit
  Sys.sleep(5)
}
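
The error handling is easier to see without the API involved. Here is a minimal sketch in which lookup_name() is a hypothetical stand-in for the real lookup (it is not part of twitteR) and fails for some accounts. It uses tryCatch() instead of try(), which is a swap on my part: a failed lookup then yields NA directly.

```r
# Hypothetical stand-in for an API lookup: fails for "protected" accounts
lookup_name <- function(screen_name) {
  if (grepl("^protected", screen_name)) stop("account is protected")
  paste("Real", screen_name)
}

screen_names <- c("alice", "protected_bob", "carol")
real_names   <- rep(NA_character_, length(screen_names))

for (i in seq_along(screen_names)) {
  # tryCatch() returns NA on error, so a failed lookup can never
  # silently reuse the result of the previous iteration
  real_names[i] <- tryCatch(lookup_name(screen_names[i]),
                            error = function(e) NA_character_)
}

print(real_names)
```

Filling the vector with NA up front also means unknown names can be dropped later with is.na() rather than by comparing against 0.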

Now that we have the users’ real names we can infer the gender. It is pretty straightforward: we can pass a range of years to the function, which calculates the proportion of males and females for each name over that time period.

We also have to split the real name with strsplit(), since we infer gender from first names and not from surnames.

#Remove unknown names = 0
df2 <- tw[tw$name != 0,]

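# Split Name - Only take first name
#sapply() gives a plain character vector instead of a list column
df2$fn <- sapply(strsplit(as.character(df2$name), " "), "[", 1)
g <- gender(as.character(df2$fn), years = c(1960, 2012))
#match on the name rather than on position, in case gender() returns one row per distinct name
df2$gender <- g$gender[match(tolower(df2$fn), tolower(g$name))]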
df2 <- df2[!is.na(df2$gender),]
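
The first-name extraction can be tried on its own with a few made-up names. This is a small base-R sketch, not the exact code used above: it trims stray spaces first and splits on any run of whitespace, so an empty name comes out as NA.

```r
# Made-up display names, including some awkward cases
full_names <- c("Jane Doe", "  john  smith", "Madonna", "Mary Ann Lee", "")

# Trim leading/trailing spaces, then split on runs of whitespace
parts <- strsplit(trimws(full_names), "\\s+")

# Keep the first piece of each name; empty names give NA
first_names <- vapply(parts, function(p) p[1], character(1))

print(first_names)
```

Single-word names such as "Madonna" pass through unchanged, which is one reason the gender lookup returns a fair number of unknowns.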

Now we have to run the sentiment analysis just like we have done in a previous post. I often run into issues with the UTF-8 encoding of strange characters like &%你好中国人朋友, which break the score.sentiment() function. To bypass this I turn the tweets (df2$text) into a corpus so that I can run tm_map() from the tm package. I can then score the sentiment of the corpus’ documents and append the scores to the dataframe of tweets.

Text is intrinsically dirty and most of the cleaning happens at the corpus level. To do this properly we should also stem the documents, but this is quite time-consuming and tricky when dealing with text in various languages.
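
To see what this kind of cleaning does to a single tweet, here is a minimal base-R sketch that works on a plain character vector. It is an illustration only: clean_tweet() is a made-up helper, it uses gsub() rather than tm, and it simply drops non-ASCII characters instead of replacing them with hex codes.

```r
# Hypothetical helper: clean a character vector of tweets with base R only
clean_tweet <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)
  x <- gsub("http\\S+", "", x)        # remove URLs
  x <- gsub("@\\w+", "", x)           # remove @mentions
  x <- gsub("[[:punct:]]", "", x)     # remove punctuation
  x <- gsub("[[:digit:]]", "", x)     # remove numbers
  x <- gsub("\\s+", " ", x)           # collapse runs of whitespace
  trimws(x)
}

tweets <- c("@bob I LOVE lobster!!! http://t.co/abc123",
            "Heaven is 2 lobsters   and a beach",
            "lobster roll \u4f60\u597d yum")

cleaned <- clean_tweet(tweets)
print(cleaned)
```

URLs are removed before punctuation here, so the pattern can match the whole link in one go.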

#Load Lexicons & Algorithm
pos <- readLines ('positive-words.txt')
neg <- readLines ('negative-words.txt')

#Turn to corpus for cleaning
corpus <- Corpus(VectorSource(df2$text))

#Clean the text while we're at it
#content_transformer() wraps plain functions for tm_map() (needed in tm 0.6 and later)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(function(x) gsub('@[[:alnum:]]*', '', x)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(function(x) gsub('http[[:alnum:]]*', '', x)))
corpus <- tm_map(corpus, removeWords, c(stopwords('english')))

#replace non-convertible bytes in corpus with strings showing their hex codes
#the result has to be assigned back, otherwise the corpus is left unchanged
corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

#Analyse Sentiment
scores <- score.sentiment(corpus, pos, neg, .progress='text')

df2$score <- scores$score
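
For readers who have not seen the previous post, lexicon scoring boils down to counting matches. The sketch below is a simplified stand-in, not the score.sentiment() function used above: score_words() and the tiny word lists are made up for the example, and the score is the number of positive words minus the number of negative words.

```r
# Hypothetical stand-in for a lexicon scorer: positives minus negatives
score_words <- function(sentences, pos_words, neg_words) {
  vapply(sentences, function(s) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, numeric(1), USE.NAMES = FALSE)
}

# Toy lexicons; the real ones are read from the two text files
pos_toy <- c("love", "great", "heaven")
neg_toy <- c("hate", "awful", "bad")

toy_scores <- score_words(c("i love lobster",
                            "awful bad lobster",
                            "lobster roll"),
                          pos_toy, neg_toy)
print(toy_scores)
```

A tweet with no lexicon words scores 0, so a large share of zeros in df2$score is to be expected.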

Now we can plot the results.

#Boxplot sentiment
# colors
cols = c("#312FFF", "#E839DB")
names(cols) = c("male", "female")

# boxplot
ggplot(df2, aes(x=2, y=score, group=gender)) +
  geom_boxplot(aes(fill=gender)) +
  scale_fill_manual(values=cols) +
  #overlay the individual tweets as jittered points (geom_point assumed)
  geom_point(position=position_jitter(width=0.2), alpha=0.3)
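
To put a few numbers next to the boxplot, the scores can also be summarised per gender. This is a sketch on made-up data; with the real tweets, df2 would take the place of the toy data frame.

```r
# Made-up scores standing in for df2
toy <- data.frame(
  gender = c("male", "female", "male", "female", "female"),
  score  = c(1, 2, -1, 0, 1)
)

# Mean score per gender
by_gender <- aggregate(score ~ gender, data = toy, FUN = mean)
print(by_gender)

# Number of tweets per gender, to judge how much each mean is worth
print(table(toy$gender))
```

With only a handful of tweets per group the difference in means says very little, so the counts are worth checking before reading anything into the plot.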


There you go!

You can get the whole code from my git.

TIP: check kuler to find some great colors.


