Big Data, Python, Twitter

Mermaid Tweets?

A while ago, I scraped a week of geotagged tweets from the Twitter API. I was surprised to see a large number of tweets sent from what seemed to be the ocean. Twitter allows you to select either a city-level tag for where you are or, with your permission, to add your specific geographic coordinates. It does not easily allow you to add a geotag other than your real location. I wanted to know: who are these (mer)people tweeting from the ocean!?

First, I thought I would check out the top hashtags and their occurrences:

Top Ocean Hashtags and Occurrences

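A tally like this could be produced with a counter over hashtags. The following is a minimal sketch, not the code used for the chart: the `tweets` list is made up, and the regex extraction is an assumption (the hashtags listed in each tweet's `entities` would likely be more reliable).

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(texts, n=10):
    """Count hashtag occurrences (case-insensitive) and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)

# Hypothetical sample, not real data from the scrape.
tweets = ["Now #hiring in Tacoma #Jobs", "Sunset at the #beach", "#jobs #Hiring"]
print(top_hashtags(tweets, 2))
```

Lowercasing the tags merges variants like #Jobs and #jobs, which matters when a few hashtags dominate the counts.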

The Job / Jobs / Hiring hashtags were unsurprising. These hashtags are some of the most popular in the world, especially in the United States. (And of course, because the US tweets the most, our tweets dominate the overall landscape.) The “TKFanKshopGA” (apparently a Korean pop fan shop), “beach” and “istanbul” hashtags, along with increased map densities near land, hinted that a majority of these tweets were simply coming from coastal regions whose longitudes and latitudes did not fall inside the country polygon shapes I used to determine the country of each geotagged tweet.
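To illustrate how a coastal tweet can end up country-less, here is a minimal ray-casting point-in-polygon sketch. The square "country" and the coordinates are made up for illustration; the actual polygons and matching code behind this analysis are not shown in the post.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical, heavily simplified country outline (a coarse square).
country = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

print(point_in_polygon(5.0, 5.0, country))   # inland point
print(point_in_polygon(10.2, 5.0, country))  # just past the simplified coastline
```

A point on a pier or beach that sits slightly outside a simplified outline fails the test for every country, so it lands in the country-less bucket even though a human sent it from land.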

I used topic modeling on the tweet text to try to distinguish between mertweets and landtweets. I ran a Latent Dirichlet Allocation (LDA) model on the text of the tweets and came up with an uninformative, normal-seeming group of topics:

LDA of Country-Less Tweets:
  • “beach want rest late nature godisgreat beautiuful”
  • “km japan 45 earthquake ofunato magnitude 23”
  • “recife stanbul sisters autumn ukraine watermelon odessa”
  • “hiring jobs job tweetmyjobs wa tacomaft”
  • “new muyordenados purabelleza listeilor formaleseverywere look aiep”
  • “youre greece love world mykonos greekislands breeze”

It seemed like the LDA was mostly giving me coastal topics.
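The post does not say which LDA implementation produced these topics; in practice one would normally reach for a library. To show the idea without dependencies, here is a toy collapsed Gibbs sampler for LDA. It is a swapped-in illustration under stated assumptions: the function name, hyperparameters, and mini-corpus are all hypothetical, and it is far too slow for a real week of tweets.

```python
import random
from collections import Counter

def toy_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, top_n=3, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists.

    Returns up to top_n of the most frequent words assigned to each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional weight of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return [[w for w, c in tw.most_common() if c > 0][:top_n] for tw in topic_word]

# Hypothetical mini-corpus, already tokenized.
docs = [
    "beach sunset beach waves".split(),
    "hiring jobs job hiring".split(),
    "waves beach sunset".split(),
    "jobs hiring job".split(),
]
for k, words in enumerate(toy_lda(docs)):
    print("Topic %d: %s" % (k, " ".join(words)))
```

Each topic is just a ranked list of words, which is why the results above read as strings of loosely related terms.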

The ideal solution would be to exclude all tweets within a certain distance of a land border. Implementing that would have taken more time than I had allocated for this project, so I tried a different approach. I chose a semi-random rectangular section of ocean far from the continents to see what these merpeople were tweeting about. Oddly enough, most of them seemed to be talking like computers…

# Longitudes only go down to -180, so the -200 bound places no real western limit.
ocean_spot = data[(data.long > -200) & (data.long < -100) & (data.lat > -50) & (data.lat < 0)]

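For comparison, the distance-from-land filter that was not implemented here might look roughly like the sketch below. It assumes a list of coastline sample points (`coast_points` is hypothetical, as is the 50 km threshold) and uses the haversine great-circle distance; a real version would need a proper coastline dataset and a spatial index.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def far_from_land(lon, lat, coast_points, min_km=50.0):
    """True if the point is at least min_km from every coastline sample point."""
    return all(
        haversine_km(lon, lat, clon, clat) >= min_km
        for clon, clat in coast_points
    )

# Hypothetical coastline samples, not real geography.
coast_points = [(0.0, 0.0), (0.0, 1.0)]

print(far_from_land(0.1, 0.0, coast_points))   # about 11 km from a sample point
print(far_from_land(10.0, 0.0, coast_points))  # over 1000 km from both
```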

I filtered out tweets whose text contained spaces or started with “http” or an emoji, which left me with mostly robot tweets:

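import re

def space(texts):
    """Return a boolean mask: True where a text looks like a robot tweet."""
    boo = []
    for text in texts:
        try:
            # Human-looking: contains whitespace, or starts with "http" or a
            # backslash (presumably how escaped emoji show up in this data).
            if re.search(r'\s', text) or text.startswith(('http', '\\')):
                boo.append(False)
            else:
                boo.append(True)
        except (TypeError, AttributeError):
            # Non-string values (e.g. missing text) are not counted as robots.
            boo.append(False)

    return boo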

no_space_ocean = ocean_spot[space(ocean_spot['text'])]
no_space_all_ocean = unk_tweets[space(unk_tweets['text'])]

LDA of robot-only tweets from the ocean spot:

Topic 0:
  be61d481eaf8fd7dde8d5c444bb213171d3710188ca1b1477aaba54365268c1edadb50d483d73436ea0e3f1cff4303a4ad6852458d98e2cf646c46b7a4a201a226f8de000000
  62e466a8dc818c57ce2b361843f57df33e441ec7f89362265fe3117d8803e79be6617361fbf0226eb5bd99db66b78362ca0ff36de56210480453cf7d2fad0144d7ef80000000
  17c67080d1fb0c459425d7ae89b7c6325563ed01ac38e8473fdce76b61f843986e443e8975ae3160ccc77b3afaf4221911abe5c1888181bd04b249f1d53301ac25f73e000000
  73700a5ff2145e47fe4362f4fb8656e4fa2f24dc3e0e09681e6611e903e2b514474a6675d34426c214e36a14b42941f90dea457fbe414b49dd7696020444019b2a885b000000
  301e69d14b6c0d0ee28011fb77aed5e6b0eaa1f6f24873081b72e06582ba1fe9fabb6b72741714a43b4af7056ebcd900864630520b51239d2e4899e0913c01a5f6135c000000
  50506511cf0f0bf4e9cebf033c84eb3b219afc6d5fc0b6b5b88a2fbf4717adec892c351007f3daa2b574a79bc6790cad15c5492f3b7d3b6085e6355008bf014a78a9b8000000
  071adf507636e45fb47d0aa6ef56afd34da7a2c5975938a088f8ac74254d088a27e89854b5b03dd50ebc9bd03530db62b75f9bd824571f340727393e0ee0014ab4b29a000000

Topic 1:
  350a1e13a090486b4d7a78385b713b6165dd92e25c8861393dda028ecbf567bef9c6b7e52d5f814a8698a3dec27a643c59ff8e24fa7765be0f6ec99d7e6501e2802bfe000000
  1fab9fa9b8345b1fff33353182f4cdfe45c0d9d4c85e90276a103bd3fad7e1e1bd2c75703e2710d49220fbce371c6d84433747d67b2050259cfa7da2f843016a6ac9d1000000
  f433792b98583febf174f4148ed8f6e77cff5731a44ab78f6b35a144548b3ba9b1fae0558e001740d7256b14d95debecf9ca515896fff7c4579f06cdd9560165fddcdb000000
  d3f61f075d2a10191ca53859913dcc23eb4bd1d852b5f8b92b31418fb59a5d0f0aa1effb9c58240db918dc8ad28b4317de60468a4f2f5141fb651c33d9c5012394c143000000
  d5b74e28ad0dd3b647ad20aedd6555ae109a715e066be7455c90ee249a7e4b834634e0c76eccf01113f213da000de00cf6a5a9c2d67809b1cc90bccb5f680180e5ed73000000
  b29f6736c18760a2e33d9693439f12125e3ae97980062cddfb9db21802bb35cae5e78ea0cf5606c98896390a4607825147fdfa96172c1b43ebf08453693401dccadfd9000000
  4565c7c04b18a3ccaf1f358463765679c1482d96b1124419c7e37c019b825f909880d4cdd9a99b9774a0b9514a6b259ebe70a03f3353d98396a61bf61fbe01751ad76e000000
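The top "words" in the robot-tweet topics are long lowercase hexadecimal strings rather than language. A stricter robot test than "no spaces" could check for that directly. This is only a sketch: the function name is made up, and the 32-character minimum is an arbitrary assumption, not something derived from the data.

```python
import re

# Assumed threshold: at least 32 lowercase hex characters and nothing else.
HEX_BLOB = re.compile(r"[0-9a-f]{32,}")

def looks_like_hex_blob(text):
    """True if the whole tweet text is one long lowercase hex string."""
    return HEX_BLOB.fullmatch(text.strip()) is not None

# Prefix of the first robot string listed above.
sample = "be61d481eaf8fd7dde8d5c444bb213171d3710188ca1b147"

print(looks_like_hex_blob(sample))
print(looks_like_hex_blob("Sunset at the #beach"))
```

Unlike the no-space rule, this would not count a single-word human tweet as a robot.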

LDA of non-robot tweets from ocean spot:
  • “life time lol water hair thought tb”
  • “snatch working weaknesses crossfit fitness good church”
  • “bora borabora resort seasons honeymoon diving bestplace”
  • “sofitel borabora sofitelambassadors frenchpolynesia captain shuttle team”
  • “polynesia whos new think national franaise solidarity”

Many of the robot tweets had been tagged as having a specific human language. Twitter determines the language in a tweet’s metadata using a language detection algorithm, so it is unclear how it assigned languages to strings of letters and numbers.

[Figure: ocean_lang_pi]

Only 1.95% of the country-less tweets contained no spaces (and were therefore mostly these robot strings), which indicates that the majority of country-less tweets are human-scribed. Because my LDA gave mostly coastal topics, it seems that most of these human-scribed tweets come from coastal regions excluded by the country polygon fitting. In my random ocean spot, by contrast, 57.6% of the tweets contained robot messages. The other 42.4% appeared to be related to islands: Tonga and Tahiti. This supported my theory that, outside coastal and island regions, these mermaid tweets were being sent by robots.
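These percentages are simply the share of True values in the no-space mask. A minimal sketch, using a made-up mask rather than the real data (`robot_share` is a hypothetical helper):

```python
def robot_share(mask):
    """Percentage of entries in a boolean mask that are True."""
    mask = list(mask)
    if not mask:
        return 0.0
    return 100.0 * sum(mask) / len(mask)

# Hypothetical mask: True marks a probable robot tweet.
mask = [True, False, True, True, False]
print(round(robot_share(mask), 1))
```

With the real data, the same helper could be applied to the output of `space(ocean_spot['text'])`.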

To further validate my theory, I plotted all non-robot country-less tweets.

[Figure: nonrobot_ocean]

Compared to the robot country-less tweets, it seems clear that most of the non-coastal tweets are robots!

[Figure: robot_ocean]

I was curious where else these robots were tweeting from, so I plotted robot tweets from my entire data set (keeping in mind that I’m defining a probable robot tweet as a tweet with no spaces that does not start with “http” or an emoji).

[Figure: robot_everywhere]

Finally, compare the robot tweet density to the plot of all geotagged tweets in the world:

[Figure: all_tweets]

What are they communicating? By investigating some of the accounts that sent these messages, I discovered that many of them were tweeting constantly, some every minute. One had been around since 2011; another, GooGuns, since 2009.
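One way to quantify "tweeting constantly" is the median gap between an account's consecutive tweets. A sketch with made-up timestamps (real data would use each tweet's `created_at` field; `median_gap_seconds` is a hypothetical helper):

```python
from datetime import datetime
from statistics import median

def median_gap_seconds(timestamps):
    """Median number of seconds between consecutive timestamps, or None."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None

# Hypothetical account tweeting at 12:00, 12:01, 12:02 and 12:04.
times = [datetime(2015, 6, 1, 12, m) for m in (0, 1, 2, 4)]
print(median_gap_seconds(times))
```

A median near 60 seconds sustained over days would be hard to explain with a human author.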

I did some investigating of GooGuns and found many conspiratorial articles about it. I couldn’t find any reliable information on what it is or what it’s doing; the best I found was this Y Combinator thread hypothesizing about the information emitted by the bot, which apparently tweets from all over the globe at regular intervals. Perhaps some of my crypto friends will be interested in playing with the data. 🙂
