
Adding Country Names to Tweets with R

The Twitter API gives you longitude and latitude information but does not categorize tweets by country. I know how to match coordinates to country polygons using an R package. I looked for an easy way to do this in Python, but it was taking a lot of time, so I decided to stick with R!

Using the rworldmap package to attach country names to tweet coordinates:

library(sp)
library(rworldmap)
data(countriesCoarseLessIslands) 

hash <- read.csv('final_tweet.csv')
names(hash) <- c('id', 'lang', 'text', 'long', 'lat')
no_na <- na.omit(hash[, c("id", "long", "lat")]) # drop rows with a missing id or missing coordinates
hash_coords <- SpatialPoints(no_na[, c("long", "lat")]) # build a SpatialPoints object (x = longitude, y = latitude)

# assign coordinate/projection system to hash_coords (we need it to be the same as the polygons for the spatial join)
proj4string(hash_coords) <- proj4string(countriesCoarseLessIslands) 

# overlay points and assign to polygon
spatial_join <- over(hash_coords, countriesCoarseLessIslands) 

no_na$sov <- spatial_join$SOVEREIGNT # attaching country to long-lat-hashtag
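# inner join on id: tweets without coordinates are dropped; only the country column is merged in, so long/lat are not duplicated
country_tweet <- merge(hash, no_na[, c("id", "sov")], by = "id")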

# write.csv() always uses a comma separator and writes column names; it ignores sep and col.names with a warning
write.csv(country_tweet, file = "final_tweet_country_code.csv", row.names = FALSE)
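
For anyone curious what the spatial join is doing conceptually, here is a rough Python sketch of a point-in-polygon lookup. It is only an illustration under stated assumptions: the two rectangular "countries" and the tweet tuples are made up, the ray-casting test stands in for whatever over() does internally, and real country borders would need actual polygon data.

```python
# Toy illustration of a point-in-polygon "spatial join" (not the rworldmap data).
# The two rectangles below are made-up stand-ins for country polygons.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def country_for(lon, lat, polygons):
    """Return the name of the first polygon containing the point, else None."""
    for name, polygon in polygons.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical "countries" as lists of (longitude, latitude) vertices
toy_countries = {
    "Westland": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Eastland": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

# Hypothetical tweets: (id, longitude, latitude); None marks missing coordinates
tweets = [(1, 5.0, 5.0), (2, 15.0, 2.5), (3, None, None), (4, 30.0, 30.0)]

tweet_countries = {}
for tweet_id, lon, lat in tweets:
    if lon is None or lat is None:
        continue  # same idea as na.omit(): skip tweets without coordinates
    tweet_countries[tweet_id] = country_for(lon, lat, toy_countries)

print(tweet_countries)
```

Tweet 4 falls outside both rectangles and gets None, much like over() returns NA for points that land in no polygon (open ocean, for example).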

Bonus: cool commands to help with memory management!

In R:

ls(all.names = TRUE) # list all objects in the current environment, including names that start with a dot
rm("var1", "var2") # to remove variables from the environment

In Python:

del var1 # delete a variable (in IPython, %xdel var1 also clears IPython's own references to it)
globals() #see global variables
locals() #see local variables
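
A small sketch of how these fit together, assuming a made-up variable called big_list. It checks the object's size, deletes the name, and uses locals() to confirm the name is gone. Note that sys.getsizeof reports only the list object itself, not the items it holds.

```python
import sys


def demo():
    """Create a large object, inspect it, then delete the name."""
    big_list = list(range(100_000))  # hypothetical large object

    # Size of the list object in bytes (excludes the integers it refers to)
    size_bytes = sys.getsizeof(big_list)

    # locals() maps local names to objects; copy the names before deleting
    names_before = set(locals())

    del big_list  # remove the name; memory is freed once nothing else refers to the object

    names_after = set(locals())
    return size_bytes, names_before, names_after


size_bytes, names_before, names_after = demo()
print(size_bytes > 0, "big_list" in names_before, "big_list" in names_after)
```

Deleting a name only frees memory if no other variable or container still points at the same object.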

In Linux:

top # shows the percentage of memory each process is using; press Shift+M to sort by memory usage
df -h # shows used and free disk space per mounted filesystem, in human-readable units
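
If you would rather check disk space without leaving Python, the standard library's shutil.disk_usage gives roughly the same numbers as df for a single path. This is a minimal sketch: the human_readable helper is my own naming, and the "/" path assumes a Unix-like system.

```python
import shutil


def human_readable(num_bytes):
    """Format a byte count with binary units, similar in spirit to df -h."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# disk_usage returns a named tuple with total, used, and free, all in bytes
usage = shutil.disk_usage("/")
print("total:", human_readable(usage.total))
print("used: ", human_readable(usage.used))
print("free: ", human_readable(usage.free))
```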
