
Get a Week of Geotagged Tweets

To gather data for my tweet exploration project, I ran a Python script connected to the Twitter Streaming API continuously for a week on an AWS EC2 instance. I used a cron job to keep it running, because the script would occasionally stop or disconnect from the stream due to unexpected errors.

Below is an example crontab file. The five asterisks tell cron to run the job every minute. Each minute, cron runs my shell script, which starts the Python script if it is not already running, and appends any output (including errors) to a log file. To install the crontab, or to apply changes after editing it, run the command crontab <crontab_script_name> on the command line.

* * * * * /bin/sh /home/ubuntu/run_always.sh >> /home/ubuntu/log_output.txt 2>&1

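Below is my run_always.sh shell script. It checks whether the Python script is already running. The pattern '[g]et_tweet.py' is a regex that matches the text get_tweet.py but not the grep command's own entry in the process list. If the script is not running, it starts my Twitter API stream connection Python script.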


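#!/bin/sh

cd /home/ubuntu/

# Load Twitter credentials. POSIX sh uses "." rather than "source";
# config.sh must export the variables so the Python script can read them.
. ./config.sh

# grep exits non-zero when no matching process is found
ps aux | grep '[g]et_tweet.py'
if [ $? -ne 0 ]
then
    python get_tweet.py
fi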
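
As an alternative to checking the process list, the Python script itself can refuse to start when another copy is running. The sketch below swaps in a different technique from the one I used: an exclusive file lock taken with fcntl.flock. The function name and the lock path are hypothetical, and it assumes a Linux host such as the EC2 instance above.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file object on success (keep a reference to it for
    the lifetime of the process), or None if the lock is already held.
    """
    lock_file = open(lock_path, 'w')
    try:
        # LOCK_NB makes flock fail immediately instead of waiting
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        lock_file.close()
        return None
    return lock_file


# Hypothetical usage at the top of get_tweet.py:
#     lock = acquire_single_instance_lock('/tmp/get_tweet.lock')
#     if lock is None:
#         raise SystemExit(0)  # another copy is already streaming
```

The operating system releases the lock when the process exits, even after a crash, so a stale lock file does not block the next cron run.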

The run_always.sh script runs my Python script, which connects to the Twitter API. To collect geotagged tweets from the Twitter API, you have to filter by a coordinate polygon (a bounding box). To gather all geotagged tweets, filter by a box that covers all coordinates.

import sys
import tweepy
import os
import json

# Read Twitter authentication keys from the environment

CONSUMER_KEY = os.environ.get('CONSUMER_KEY')
CONSUMER_SECRET = os.environ.get('CONSUMER_SECRET')
ACCESS_KEY = os.environ.get('ACCESS_KEY')
ACCESS_SECRET = os.environ.get('ACCESS_SECRET')

class CustomStreamListener(tweepy.streaming.StreamListener):

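    def on_data(self, raw_data):
        # Parse the raw message and append it to the output file
        line = json.loads(raw_data)
        with open('my_tweets.txt','a') as f:
            json.dump(line, f)
            f.write('\n')  # one JSON object per line keeps tweets separable
        return True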

    def on_error(self, status_code):
        print(status_code)
        return True # Don't kill the stream

    def on_timeout(self):
        print('Timeout...')
        return True # Don't kill the stream


if __name__ == '__main__':

    my_listener = CustomStreamListener()
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
    sapi = tweepy.streaming.Stream(auth, my_listener)  

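    # locations is [west longitude, south latitude, east longitude, north latitude]:
    # the lower-left (southwest) corner followed by the upper-right (northeast) corner.
    # Filtering by location returns only geotagged tweets; this box covers the whole globe.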
    sapi.filter(locations=[-180,-90,180,90])
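
Once the week is over, the collected file has to be read back. Here is a minimal sketch of one way to do that, not part of my original script. It assumes the file holds JSON objects written one after another, with or without newlines between them, and that a tweet with an exact location carries a GeoJSON-style "coordinates" field holding [longitude, latitude]. The function names are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def point_coordinates(tweet):
    """Return (longitude, latitude) if the tweet has an exact point, else None."""
    coords = tweet.get('coordinates')
    if coords and coords.get('type') == 'Point':
        lon, lat = coords['coordinates']
        return lon, lat
    return None


def in_bounding_box(lon, lat, box):
    """box uses the same order as locations: [west, south, east, north]."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north
```

To use it, read my_tweets.txt into a string, pass it to iter_json_objects, and keep the tweets for which point_coordinates returns a value. Tweets tagged only with a place, and non-tweet stream messages, come back as None. in_bounding_box can then narrow the worldwide collection down to a region offline.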
