
Topic modelling: tweets

In this post we’ll uncover the underlying topics in tweets using the Latent Dirichlet Allocation (LDA) model, via the LDA() function in the topicmodels package.

Featured packages: tm, twitteR, ggplot2, SnowballC and topicmodels.
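If any of these packages are missing, they can be installed from CRAN first:

#install the required packages (only needed once)
install.packages(c("twitteR", "tm", "topicmodels", "SnowballC", "ggplot2"))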

Let’s get some tweets and see what people are saying about Scotland two weeks after the referendum.

#load packages
libs <-  c("twitteR", "tm", "topicmodels", "SnowballC", "ggplot2")
lapply(libs, library, character.only=TRUE)

#get tweets in English
tweets <- searchTwitter("Scotland", n=2000, lang="en")

#convert the list of tweets into a data frame
df <- twListToDF(tweets)
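Note that searchTwitter() only works with an authenticated session, so before the calls above you would normally register your own Twitter app credentials. A minimal sketch, assuming a recent version of twitteR, with placeholder values for the keys:

#authenticate with the Twitter API (replace the placeholders with your app's credentials)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")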

Once we have the tweets we have to process the text: turn the tweets into a corpus, clean the corpus, stem the terms and then create a term-document matrix.

#Turn into Corpus
corpus <- Corpus(VectorSource(df$text))

#Clean Corpus
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, function(x) gsub('@[[:alnum:]]*', '', x))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, function(x) gsub('http[[:alnum:]]*', '', x))
corpus <- tm_map(corpus, removeWords, c(stopwords('english'), 'amp','via','and','for',
                                        'from','the', 'burger'))
corpus <- tm_map(corpus, PlainTextDocument)

#Snowball Stemmer
#copy corpus for stem completion
corpus_copy <- corpus

#Stem
corpus <- tm_map(corpus, stemDocument)

#Complete Stem
corpus <- tm_map(corpus, stemCompletion, dictionary=corpus_copy)
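Depending on your version of tm, running stemCompletion through tm_map can leave plain character vectors in the corpus, which later makes TermDocumentMatrix() fail with a "no applicable method for 'meta'" error (see the discussion below). If that happens, a commonly used workaround is to complete the stems document by document and rebuild the corpus. This is only a rough sketch, not part of the original workflow, and stemCompletion2 is just a helper name:

#alternative stem completion, used instead of the tm_map() call above
stemCompletion2 <- function(doc, dictionary) {
  #split the stemmed document into single words
  words <- unlist(strsplit(as.character(doc), " "))
  words <- words[words != ""]
  #complete each stem against the unstemmed dictionary corpus
  paste(stemCompletion(words, dictionary = dictionary), collapse = " ")
}
corpus <- Corpus(VectorSource(sapply(corpus, stemCompletion2, dictionary = corpus_copy)))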

#Create Term Document Matrix
tdm <- TermDocumentMatrix(corpus, control=list(minWordLength=1))
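Before modelling it’s worth a quick sanity check on the matrix, for example listing the terms that occur in a reasonable number of tweets (the frequency threshold below is arbitrary):

#quick look at the term-document matrix
dim(tdm)                          #number of terms x number of documents
findFreqTerms(tdm, lowfreq = 30)  #terms appearing at least 30 times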

Once we have the term-document matrix of tweets we can uncover the topics. First we’ll turn the term-document matrix into a document-term matrix and drop any documents that ended up empty after cleaning. In this example I have mined 2000 tweets on Scotland and looked for 6 topics.

dtm <- as.DocumentTermMatrix(tdm)

#remove empty documents
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm.new   <- dtm[rowTotals> 0, ]

# find X topics
lda <- LDA(dtm.new, k = 6)

#Show first 4 terms of topics
term <- terms(lda, 4)

term <- apply(term, MARGIN = 2, paste, collapse = ", ")

#Plot topics
topics <- topics(lda, 1)
#match each remaining (non-empty) document back to its tweet's creation date
topics <- data.frame(date=df$created[rowTotals > 0], topics)
qplot(date, ..count.., data = topics, geom = "density" , fill = term[topics], position ="stack")

[Plot: density of tweets over time, stacked by LDA topic]
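topics(lda, 1) only keeps the single most likely topic per tweet. If you want the full topic distribution per document, topicmodels also exposes the posterior; a short sketch:

#per-document topic probabilities (one row per tweet, rows sum to 1)
post <- posterior(lda)
head(round(post$topics, 2))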


Discussion

3 thoughts on “Topic modelling: tweets”

  1. Great tutorial!

    A few imprecisions in the code however.
    – The data frame “tb” is never created or referred to elsewhere.
    – Most of the tm functions (e.g., tolower) can’t be passed to tm_map directly anymore since the 0.6 update.
    – In the last chunk, the line “lda <- LDA(dtm.new, k = 6)” contains an encoding error in the assignment operator and should read lda <- LDA(dtm.new, k = 6)

    Yet, one of the most practical and easy-to-use R topic modelling scripts I have yet come across.

    Very well done and thanks!


    Posted by Stephane | March 7, 2015, 9:40 am
    • Hi Stephane,

      Thank you for your comment. I have amended the code.

      – Changed “tb” to “df”
      – After testing I only had to change the “tolower” function to “content_transformer(tolower)”
      – Also changed the last chunk’s encoding error.

      Thanks again for your feedback, much appreciated.


      Posted by SocialFunction() | March 7, 2015, 9:58 am
      • Thanks for the swift reply.
        The code works great now!
        Two minor things:
        – The “Snowball” package has been archived and replaced by “SnowballC”
        – Oddly, the TermDocumentMatrix() function gives the following error if applied after the “stemDocument” and the “stemCompletion” steps.

        `Error in UseMethod("meta", x) :
        no applicable method for 'meta' applied to an object of class "character"
        In addition: Warning message:
        In mclapply(unname(content(x)), termFreq, control) :
        all scheduled cores encountered errors in user code`

        Here is a sample dataset to replicate the issue: http://s000.tinyupload.com/index.php?file_id=29609825343301417352

        Omitting the stemming process works fine.

        Thanks again.


        Posted by Stephane | March 7, 2015, 8:20 pm
