Data, Instagram, R

Instagram API & R

Instagram is not the first place that comes to mind for gathering data, since we mostly think of it as a bank of images. However, it may well be the best social media API: it returns extensive information on users, posts and comments, and can be queried 5,000 times an hour.

The API doesn’t allow searching for arbitrary terms, but it does allow searching for hashtags (#tags) that appear in pictures’ captions.

EDIT: There is now an R package for Instagram, instaR, which is available on CRAN and on GitHub, though the package does not feature the streaming function shown here.

Authenticate

The authentication process detailed below is essentially taken from ThinkToStart, who already wrote a great post about getting Instagram data on users. In this example we search for posts by hashtag instead.

The OAuth 2.0 protocol requires authentication. Go to instagram.com, create an account if you don’t have one yet, then create an app here. Simply fill in your information, name your app and make sure you set the oauth_redirect_url to http://localhost:1410/. Take note of your client id, client secret and app name, as we’ll need them to authenticate. Run the snippet below to authenticate; your default browser should open and state “Authentication complete. Please close this page and return to R”.


libs <- c("httr", "rjson", "RCurl", "RODBC")
lapply(libs, library, character.only=TRUE)

#replace with your info
app_name <- "your_app_name"
client_id <- "your_client_id"
client_secret <- "your_client_secret"
scope <- "basic"

instagram <- oauth_endpoint(
  authorize = "https://api.instagram.com/oauth/authorize",
  access = "https://api.instagram.com/oauth/access_token")
myapp <- oauth_app(app_name, client_id, client_secret)

ig_oauth <- oauth2.0_token(instagram, myapp, scope = scope, type = "application/x-www-form-urlencoded", cache = FALSE)

Authentication is done. Now get the token.

tmp <- strsplit(toString(names(ig_oauth$credentials)), '"')
token <- tmp[[1]][4]
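
The two lines above are terse, so here is a minimal sketch of what they appear to rely on. The assumption (not confirmed by the post) is that the raw JSON response ends up as the name of the single element of ig_oauth$credentials, so that splitting this name on double quotes leaves the token in fourth position. The fake_credentials object and the token value are made up for illustration.

```r
# Hypothetical stand-in for ig_oauth$credentials: the JSON string is the element name
fake_credentials <- list(NA)
names(fake_credentials) <- '{"access_token":"1234.abcd","user":{"username":"someone"}}'

# Same extraction as in the post: split the name on double quotes
tmp <- strsplit(toString(names(fake_credentials)), '"')

# Pieces are "{", "access_token", ":", then the token itself
token_example <- tmp[[1]][4]
print(token_example)
```

If your credentials object is structured differently, inspect it with str(ig_oauth$credentials) before relying on the position of the token.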

For Windows users like myself: before you can query the API you will need to download the certificate file. We’ll need this file in the queries.

## set the directory
setwd("~/your/directory/here")

#download file
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

Get Data

Now we can query the API (wehey). Each query returns twenty media items (pictures/posts). Responses are in JSON and need to be parsed to a list, which is then split into five tables:

  • Data on posts (object named ‘med’)
  • Data on users who posted (object named ‘prof’)
  • Data on users tagged in media (object named ‘us_tagged’)
  • Data on users who liked media (object named ‘us_liked’)
  • Data on comments (object named ‘comments’)
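
To illustrate how a parsed response can be turned into one of these tables, here is a minimal sketch on made-up data. It is not the code used in this post: the helper or_na() is hypothetical, and media_data only mimics a small part of what the API returns. The idea is simply that fields missing from a post become NA rather than breaking the assignment.

```r
# Hypothetical helper: return NA when a field is missing (NULL or empty)
or_na <- function(x) if (length(x)) x else NA

# Made-up stand-in for the parsed "data" part of a response
media_data <- list(
  list(user = list(username = "alice"),
       caption = list(text = "hello #selfie"),
       likes = list(count = 3),
       location = list(latitude = 48.85, longitude = 2.35)),
  list(user = list(username = "bob"),
       caption = NULL,
       likes = list(count = 0))
)

# One single-row data frame per post, then bind them together
rows <- lapply(media_data, function(m) {
  data.frame(screenName = or_na(m$user$username),
             caption    = or_na(m$caption$text),
             nbr_likes  = or_na(m$likes$count),
             latitude   = or_na(m$location$latitude),
             stringsAsFactors = FALSE)
})
med_sketch <- do.call(rbind, rows)
print(med_sketch)
```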

The response includes, among other things, the “next url”, which is the parameter that points to the next set of data. So I thought it would be ideal to do an initial query to get the “next url” and then simply loop through each subsequent “next url” for a set period of time.

  1. Query 20 media items using the next url
  2. Get the batch of data
  3. Parse and unlist profile data
  4. Parse and unlist media data
  5. Parse and unlist tags data
  6. Parse and unlist likes data
  7. Parse and unlist comments data

Then back to step 1: query the next 20 media items using the next url.
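
The loop described in these steps can be sketched without touching the API. Everything below is illustrative: pages is a made-up set of responses and fetch_page() is a hypothetical stand-in for the fromJSON(getURL(...)) call, assuming the next url sits under pagination$next_url in the parsed response. Unlike the time-based loop used in this post, the sketch stops when a response no longer carries a next url.

```r
# Made-up responses keyed by "url"; the last one has no next_url
pages <- list(
  page1 = list(pagination = list(next_url = "page2"), data = list("a", "b")),
  page2 = list(pagination = list(next_url = "page3"), data = list("c")),
  page3 = list(pagination = list(), data = list("d"))
)

# Hypothetical stand-in for the real HTTP request plus JSON parsing
fetch_page <- function(url) pages[[url]]

collected <- character()
url <- "page1"
while (!is.null(url)) {
  page <- fetch_page(url)
  collected <- c(collected, unlist(page$data))  # process the batch
  url <- page$pagination$next_url               # NULL when there is no next page
}
print(collected)
```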


#load libraries
libs <- c("rjson", "httr", "RCurl") #RCurl provides getURL()
lapply(libs, library, character.only=TRUE)

#set tag to query
tag <- "selfie"

#initial query
media <- fromJSON(getURL(paste('https://api.instagram.com/v1/tags/',tag,'/media/recent/?access_token=',token,sep=""),
                         cainfo = path.expand("~/your/directory/here/cacert.pem")))

#set dataframes
comments <- data.frame()
prof <- data.frame()
med <- data.frame()
us_tagged <- data.frame()
us_liked <- data.frame()

#initiate loop
#set some later date (when the data collection is to be stopped)
while (Sys.Date() <= as.Date("2015-12-31")) {
  media <- fromJSON(getURL(media$pagination$next_url,
                           cainfo = path.expand("~/your/directory/here/cacert.pem")),
                    unexpected.escape = "keep")
  if(length(media$data)){
    
    profile <- data.frame(no = 1:length(media$data))
    medias <- data.frame(no = 1:length(media$data))

    
    for (i in 1:length(media$data)) {
      
      print(paste("getting meta of media", i, "of", length(media$data), sep=" "))
      
      ## PROFILE
      
      profile$screenName[i] <- media$data[[i]]$user$username
      profile$name[i] <- media$data[[i]]$user$full_name
      if (length(media$data[[i]]$location$latitude)) {
        profile$latitude[i] <- media$data[[i]]$location$latitude
        profile$longitude[i] <- media$data[[i]]$location$longitude
      } else {
        profile$latitude[i] <- NA
        profile$longitude[i] <- NA
      }
      if (length(media$data[[i]]$location$name)) {
        profile$location[i] <- as.character(media$data[[i]]$location$name)
      } else {
        profile$location[i] <- NA
      }
      profile$profile_picture[i] <- media$data[[i]]$user$profile_picture
      profile$website[i] <- media$data[[i]]$user$website
      profile$bio[i] <- media$data[[i]]$user$bio
      profile$id[i] <- media$data[[i]]$user$id
      
      ## MEDIAS
      
      if (length(media$data[[i]]$caption$text)) {
        medias$caption[i] <- media$data[[i]]$caption$text
      } else {
        medias$caption[i] <- NA
      }
      medias$nbr_likes[i] <- media$data[[i]]$likes$count
      medias$nbr_comments[i] <- media$data[[i]]$comments$count
      medias$type[i] <- media$data[[i]]$type
      medias$filter[i] <- media$data[[i]]$filter
      medias$created[i] <- toString(as.POSIXct(as.numeric(media$data[[i]]$created_time), origin="1970-01-01"))
      medias$link[i] <- media$data[[i]]$link
      medias$id[i] <- media$data[[i]]$id
      medias$screenName[i] <- media$data[[i]]$user$username
      medias$tags[i] <- paste(media$data[[i]]$tags, collapse=", ")
      
      ## COMMENTS
      
      if (length(media$data[[i]]$comments$data)) {
        for (x in 1:length(media$data[[i]]$comments$data)) {
          screenName <- media$data[[i]]$comments$data[[x]]$from$username
          name <- media$data[[i]]$comments$data[[x]]$from$full_name
          text <- media$data[[i]]$comments$data[[x]]$text
          reply_to <- media$data[[i]]$user$full_name
          reply_to_id <- media$data[[i]]$id
          created <- toString(as.POSIXct(as.numeric(media$data[[i]]$comments$data[[x]]$created_time), origin="1970-01-01"))
          id <- media$data[[i]]$comments$data[[x]]$id
          comm <- as.data.frame(cbind(screenName, name, text, created, reply_to,
                                          reply_to_id))
          comments <- as.data.frame(rbind(comments, comm))
        }
      } else {      
      }
      
      ## USER TAGGED  
      
      if (length(media$data[[i]]$users_in_photo)) {
        for (j in 1:length(media$data[[i]]$users_in_photo)) {
          #j is the loop index so that it is not overwritten by the y position below
          y <- media$data[[i]]$users_in_photo[[j]]$position$y
          x <- media$data[[i]]$users_in_photo[[j]]$position$x
          username <- media$data[[i]]$users_in_photo[[j]]$user$username
          name <- media$data[[i]]$users_in_photo[[j]]$user$full_name
          profile_picture <- media$data[[i]]$users_in_photo[[j]]$user$profile_picture
          id <- media$data[[i]]$users_in_photo[[j]]$user$id
          tagged_in <- media$data[[i]]$user$username
          tagged_in_id <- media$data[[i]]$user$id
          us_tag <- as.data.frame(cbind(username, name, id, tagged_in, tagged_in_id,
                                           profile_picture, x, y))
          us_tagged <- as.data.frame(rbind(us_tagged, us_tag))
        }
      } else {
      }
      
      ## USER LIKED
      
      if(length(media$data[[i]]$likes$data)) {
        for (z in 1:length(media$data[[i]]$likes$data)) {
          media_liked_id <- media$data[[i]]$id
          username <- media$data[[i]]$likes$data[[z]]$username
          profile_picture <- media$data[[i]]$likes$data[[z]]$profile_picture
          id <- media$data[[i]]$likes$data[[z]]$id
          name <- media$data[[i]]$likes$data[[z]]$full_name
          us_like <- as.data.frame(cbind(username, profile_picture, name, id,
                                          media_liked_id))
          us_liked <- as.data.frame(rbind(us_liked, us_like))
        }
      } else {
      }
      
    }
    
    print("Storing profiles and medias")
    
    prof <- as.data.frame(rbind(prof, profile))
    med <- as.data.frame(rbind(med, medias))
    
  } else {
  }
  Sys.sleep(30)
}
#the loop only exits normally once the end date has passed
if (Sys.Date() > as.Date("2015-12-31")) {
  print("Crawl ended")
} else {
  print(paste("Crawl stopped, an error occurred at", Sys.time()))
}
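
As a rough check on the pace of this crawl, the arithmetic below combines three figures from this post: the limit of 5,000 queries an hour, twenty media items per query and the 30 second pause. It ignores the time each request takes, so the real throughput will be somewhat lower; the variable names are only for illustration.

```r
rate_limit_per_hour <- 5000  # limit quoted in the introduction
media_per_query     <- 20    # items returned by each query
pause_seconds       <- 30    # value passed to Sys.sleep() in the loop

# Shortest pause that would still respect the hourly limit
min_pause_seconds <- 3600 / rate_limit_per_hour

# Upper bound on what the loop collects with a 30 second pause
queries_per_hour <- 3600 / pause_seconds
media_per_hour   <- queries_per_hour * media_per_query

print(c(min_pause_seconds, queries_per_hour, media_per_hour))
```

In other words, the 30 second pause is far more conservative than the limit requires, which leaves room to shorten it if you need data faster.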

Data

Let’s look at the variables we get for each table


#medias
names(med)
[1] "no"           "caption"      "nbr_likes"    "nbr_comments" "type"         "filter"
[7] "created"      "link"         "id"           "screenName"   "tags"

#profiles (bio = user description)
names(prof)
[1] "no"              "screenName"      "name"            "latitude"        "longitude"
[6] "location"        "profile_picture" "website"         "bio"             "id" 

#users tagged in medias
names(us_tagged)
[1] "username"        "name"            "id"              "tagged_in"       "tagged_in_id"
[6] "profile_picture" "x"               "y"  

#comments on media
names(comments)
[1] "screenName"  "name"        "text"        "created"     "reply_to"    "reply_to_id"

#users who liked media
names(us_liked)
[1] "username"        "profile_picture" "name"            "id"              "media_liked_id"
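
As one possible first look at the data, the sketch below counts how often each hashtag appears. It relies on the fact that the tags column of med is built with paste(..., collapse=", "), so each cell holds a comma-separated string. The med_example data frame is made up so the snippet runs on its own; swap in med to use your crawl.

```r
# Made-up stand-in for the 'tags' column of med
med_example <- data.frame(
  tags = c("selfie, me, love", "selfie, sun", "selfie, love"),
  stringsAsFactors = FALSE
)

# Split each cell back into individual tags, then count them
all_tags <- unlist(strsplit(med_example$tags, ", ", fixed = TRUE))
tag_counts <- sort(table(all_tags), decreasing = TRUE)
print(tag_counts)
```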

There you go!

Discussion

3 thoughts on “Instagram API & R”

  1. The number of active users of Instagram worldwide just exceeded the number of Twitter’s. It should mean quite a lot to analyze how people follow and are connected.

    Posted by cafepai | January 21, 2015, 11:08 am
  2. Thanks again for this great tutorial!

    Does anyone know of a way to interact with the real-time api of Instagram through R?
    I can’t seem to be able to create a subscription as explained here: https://instagram.com/developer/realtime/

    Using the POST() function from the httr package, and the following code, I am able to send the request, but unable to get the answer as the callback url is not accessible to Instagram.

    `r = POST(url = 'https://api.instagram.com/v1/subscriptions',
    body = "client_id=XXXXX;client_secret=XXXX;aspect=media;access_token=XXXX;callback_url=http://localhost:1410/;object=tag;object_id=selfie",
    encode = "form",
    verbose()
    )
    str(content(r))
    `

    Any idea on how to handle the callback? Or any alternative suggestion?

    Many thanks!

    Posted by Stephane | March 7, 2015, 8:07 pm
  3. Reblogged this on Dinesh Ram Kali..

    Posted by dineshramitc | April 8, 2015, 5:20 pm
