This blog post was originally the author’s writeup for a project she did with her classmate Jason Katz for ECE411: Statistical Learning during the Fall 2016 semester at Cooper Union.
Lots of surreal things happened in 2016…
Harambe, Prince, Muhammad Ali, David Bowie, Harper Lee, Leonard Cohen, Gene Wilder, Alan Rickman, and Zsa Zsa Gabor left us. Brexit happened. Donald Trump got elected president. Flint, Michigan STILL doesn’t have clean water. You get my point.
Fueled by Donald Trump’s misogynistic statements, the many women who have accused him of sexual assault, and his own admission of sexual assault that went viral during the 2016 presidential race, the topic of sexual violence has situated itself at the forefront of national conversation among politicians, pundits, and prominent feminist writers alike. This is just a small segment of the much larger discussion about affirmative consent, Title IX, the rape crisis on college campuses, victim blaming, and slut shaming that has been going on for years.
OK, but what does Twitter have to do with any of this?
On one hand, it’s great – albeit REALLY overdue – that the news media finally recognizes sexual violence as an issue that warrants substantial airtime. But basically everything on TV that deals with rape culture or misogyny has been neatly packaged into an emotionally flat or intentionally misconstrued block of information to fit into the two-second slot allotted to it. Because profit.
Case in point.
Twitter, on the other hand, is a gold mine of publicly available, unfiltered visceral reactions. For many survivors of sexual violence, it’s a great outlet for self-therapy in the form of all-caps vent rants, and also a source of empowerment via an informal support network of people who tweet about their own experiences with similar trauma.
Coming at it from the perspective of a DSWFSAGJFVOSV (data scientist who feels strongly about getting justice for victims of sexual violence), I had a hunch that “survivor Twitter” might have a lot to tell us.
the techy stuff
Inspired by the hashtags #RapedAtSpelman and #RapedAtMorehouse, which went viral in response to an anonymous account set up by a rape survivor at Spelman College, we used Twitter’s REST API to scrape trending hashtags related to sexual violence and promoted by activists who are sympathetic towards survivors and survivor justice, including:
You get it?
Twitter does this annoying thing where it only lets you scrape a certain number of tweets from the past week, so we had to run our scraping script every couple of days to amass a sufficient collection of tweets.
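Running the script every couple of days means consecutive runs overlap, so the same tweet can show up more than once. Here is a minimal sketch of how overlapping batches could be merged, assuming each scraped tweet is stored as a dict with a unique `id` field. The `merge_scrapes` helper and the field names are hypothetical, for illustration only, and are not taken from our actual script.

```python
def merge_scrapes(batches):
    """Combine several scraping runs, keeping one copy of each tweet."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            # Overlapping runs return the same tweet more than once;
            # keep only the first copy we saw (assumes a unique "id" field).
            seen.setdefault(tweet["id"], tweet)
    # Sort by id so the merged corpus has a stable order.
    return [seen[tweet_id] for tweet_id in sorted(seen)]


# Two hypothetical runs that overlap on tweet 2.
monday = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
wednesday = [{"id": 2, "text": "second"}, {"id": 3, "text": "third"}]

corpus = merge_scrapes([monday, wednesday])
print(len(corpus))
```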
My fabulous partner did a ton (read: 25 hours) of pre-processing to turn the tweets into parseable text, which we then fed into a latent Dirichlet allocation (LDA) code base.
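To give a flavor of what "parseable text" means, here is a small sketch of the kind of cleanup a tweet might go through: lowercasing, dropping links and the retweet marker, and stripping punctuation while keeping hashtags and handles. The rules and the `clean_tweet` name are assumptions made for illustration, and the sample tweet is made up. The real pre-processing was considerably more involved.

```python
import re

URL = re.compile(r"https?://\S+")
# Keep letters, digits, underscores, '#' and '@' so hashtags and handles survive.
NOT_TOKEN_CHARS = re.compile(r"[^a-z0-9_#@\s]")


def clean_tweet(text):
    """Turn a raw tweet into a list of lowercase tokens."""
    text = text.lower()
    text = URL.sub(" ", text)              # drop links
    text = re.sub(r"^rt\s+", "", text)     # drop a leading retweet marker
    text = NOT_TOKEN_CHARS.sub(" ", text)  # drop punctuation and emoji
    return text.split()


# A made-up tweet from a hypothetical account.
tokens = clean_tweet("RT @example_user: Schools should be SAFE! https://t.co/xyz #NotOkay")
print(tokens)
```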
LDA for the Layperson (CUTE HAMSTERS AHEAD!)
LDA is a well-known method of topic modeling, which in machine learning (ML) and natural language processing (NLP) refers to a statistical model that discovers ‘topics’ that characterize a collection of documents, or in this case, tweets. According to LDA, a document is merely a mixture of topics, and each topic gives every word some probability of being generated.
Suppose you have a bunch of sentences (or tweets).
- I am tired because I have a lot of finals this week.
- Is my finals-related sleep deprivation apparent yet?
- Sometimes I go to animal shelters and pet kittens to avoid my responsibilities.
- Baby animal videos are best when they include kittens.
- Look at this cute hamster munching on a piece of bok choy that I found while procrastinating my stat learning blog post assignment.
If we give our LDA these sentences and ask it to determine 2 topics, LDA might produce something like this:
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% sleep, 15% tired, 10% finals, 10% deprivation, … (at which point you could call topic A finals week)
- Topic B: 20% animal, 20% kittens, 20% cute, 15% hamster, … (at which point you could call topic B cute animals)
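To make those percentages concrete: the chance that a word pulled from sentence 5 is, say, "hamster" is the topic-weighted average of that word's probability under each topic. Here is a small sketch using the toy numbers above, with one assumption that is not part of the example output: a word that isn't listed under a topic gets probability 0 there.

```python
# Topic-word probabilities from the toy output above (listed words only).
topic_a = {"sleep": 0.30, "tired": 0.15, "finals": 0.10, "deprivation": 0.10}
topic_b = {"animal": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15}

# Sentence 5 is 60% Topic A and 40% Topic B.
sentence_5 = {"A": 0.6, "B": 0.4}


def word_probability(word, mixture):
    """P(word | tweet) = sum over topics of P(topic | tweet) * P(word | topic)."""
    # Assumption: words not listed under a topic have probability 0 in it.
    return (mixture["A"] * topic_a.get(word, 0.0)
            + mixture["B"] * topic_b.get(word, 0.0))


for word in ("finals", "hamster"):
    print(word, round(word_probability(word, sentence_5), 3))
```

Both words land at about 0.06: "finals" is less likely within its topic, but sentence 5 leans more heavily on Topic A.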
LDA essentially represents tweets (its "documents") as mixtures of topics that spit out words with certain probabilities. It assumes that the corpus of tweets is produced as follows:
- decide on the number of words N the tweet will have
- choose a topic mixture for the tweet depending on the number of topics the user (us!) asks for
- generate each word in the tweet by
  - picking a topic
  - then using the topic to generate the word from the topic’s set of words
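The recipe above can be sketched in a few lines of Python. This is a toy illustration, not the code we ran: the two topics reuse the made-up numbers from the finals/kittens example, `generate_tweet` and `sample_dirichlet` are hypothetical helper names, and the `alpha` value is an arbitrary assumption. Standard LDA draws the topic mixture from a Dirichlet distribution, which is what the helper imitates.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw a random topic mixture (proportions that sum to 1)."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_tweet(topics, n_words, rng, alpha=0.5):
    """Generate one tweet the way LDA imagines it happens."""
    # Choose a topic mixture for this tweet.
    mixture = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the mixture...
        topic = rng.choices(topics, weights=mixture)[0]
        # ...then pick a word according to that topic's word probabilities.
        vocab = list(topic)
        word = rng.choices(vocab, weights=[topic[w] for w in vocab])[0]
        words.append(word)
    return mixture, words


# Toy topics; the weights need not sum to 1 because choices() normalizes them.
finals_week = {"sleep": 0.30, "tired": 0.15, "finals": 0.10, "deprivation": 0.10}
cute_animals = {"animal": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15}

rng = random.Random(411)
mixture, words = generate_tweet([finals_week, cute_animals], n_words=8, rng=rng)
print([round(p, 2) for p in mixture], words)
```

The output is word salad, which is the point: LDA ignores word order entirely and only cares about which words tend to show up together.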
From here, LDA does fancy math to determine a set of latent topics that is likely to have generated the corpus of tweets. Our code in particular outputs the top n words from each topic that have the highest probability of being “chosen.”
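The "fancy math" comes in several flavors. One of the easier ones to read is collapsed Gibbs sampling: start by assigning every word to a random topic, then repeatedly revisit each word and reassign it based on how much its tweet currently favors each topic and how much each topic currently favors that word. The sketch below is written from scratch for illustration and is not the code base we used. The function names, the `alpha` and `beta` smoothing values, and the toy documents are all assumptions.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # words per topic in each doc
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # total words per topic
    assignments = []
    # Start by giving every word a random topic.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take this word's current assignment out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Score each topic: how much this doc likes the topic
                # times how much the topic likes this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n):
    """Return up to n of the most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Toy, already-tokenized "tweets" loosely based on the example sentences.
docs = [
    ["tired", "finals", "week"],
    ["finals", "sleep", "deprivation"],
    ["animal", "shelters", "kittens"],
    ["animal", "videos", "kittens"],
    ["cute", "hamster", "finals"],
]
topics = fit_lda(docs, n_topics=2)
print(top_words(topics, 3))
```

Ranking words by how often they were assigned to a topic gives the same order as ranking them by their probability under that topic, so `top_words` plays the role of the "top n words" output described above.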
LDA is cool. Be like LDA.
This is confusing so maybe this picture will help.
We played around with the number of topics and the number of top words per topic. When our code surfaced uninformative words like ‘day’ and ‘introduced,’ we added them to our list of NLTK stop words. By stop words, we mean common words that are filtered out before NLP happens because they are meaningless in the context of topic modeling.
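In code, extending the stop word list is just a set union followed by a filter. In the sketch below, `base_stop_words` is a tiny hand-written stand-in so the example runs without any downloads. With NLTK installed, the base list would typically come from `nltk.corpus.stopwords.words("english")`. The helper name and the sample tokens are made up for illustration.

```python
# Tiny stand-in for NLTK's English stop word list (assumption: the real
# list is much longer and would be loaded from NLTK instead).
base_stop_words = {"a", "the", "is", "of", "to", "and", "was", "on"}

# Words that kept showing up in our topics without telling us anything.
custom_stop_words = {"day", "introduced"}

stop_words = base_stop_words | custom_stop_words


def remove_stop_words(tokens):
    """Drop tokens that carry no topical meaning."""
    return [t for t in tokens if t not in stop_words]


filtered = remove_stop_words(
    ["the", "safe", "transfer", "act", "was", "introduced", "today"]
)
print(filtered)
```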
We got the best results (read: most distinct categories) when we fit LDA models with 3 topics and asked for the top 10 words in each topic:
I’d like to direct your attention to a couple of interesting things that popped up.
Topic #1: Self-Proclaimed Deplorables
America, MAGA, norefugees, obama, nightmares, deport, and DREAMers all point to a sphere of Twitter inhabited by Donald Trump supporters who don’t like the current political landscape of the United States and dislike immigrants and refugees. @lrihendry, in particular, refers to popular alt-right Twitter user Lori Hendry (check out the Pepe the Frog emoji in her profile), and qgkmkkhrxu refers to the image name of a heavily retweeted xenophobic, anti-Muslim meme that she tagged with #rapeCulture.
Lori kind of sucks. But she did retweet this vine so at least she’s got that going for her.
It’s a bit disconcerting that our scraping picked up enough tweets from ironic co-opters of the hashtags #rapeculture and #notokay for LDA to give them their own separate category of tweets.
Many of these tweets were retweets of @lrihendry’s meme, hence the top words in the category, but there were a few other tweets that referenced the “double standard of rape culture” and “toxic femininity,” a riff on the popular hashtag #toxicmasculinity.
Topic #2: Sexual Violence and Twiplomacy
Schools, repspeier, transfer, and safe refer to The Safe Transfer Act, proposed two weeks ago by California Representative Jackie Speier, which would require a notation on the academic transcript of any student found by their college or university to have violated the school’s rules or policies with regard to sexual violence. Despite following many prominent feminist Twitter accounts, I had no knowledge of this piece of legislation or of Representative Speier prior to this project.
Heartbeatbill, man, domestic, and violence come from Twitter user @femtheologian Dr. Gina Messina, a professor and Huffington Post blogger who expressed her anger with a proposed fetal heartbeat bill in Ohio. Domestic violence, in particular, relates to her tweet about Ohio State Senator Kris Jordan, who Messina claims has a history of domestic violence.
In these cases, LDA teased out tweets specifically related to publicly contentious pieces of legislation related to Title IX and sexual violence.
Topic #3: “@kellyoxford: tweet me your first assaults. they aren’t just stats.”
In this category, @kellyoxford, sexual, assault, rape, consent, and survivors clearly relate to Twitter user Kelly Oxford, the New York Times bestselling author whose #notokay tweet that encouraged women to tweet their experiences with sexual violence went viral.
@takedownmras refers to a Twitter account run by a group of men who actively disagree with men’s rights activists (MRAs) and work to support survivors of sexual violence. Similar to the accounts that popped up in the last topic, I had no prior knowledge of @takedownmras, but am very glad to have found them.
This category affirms our initial claim that people who have experienced sexual assault use Twitter as a way to form a mutual support network with other survivors. The results are promising; machine learning algorithms may very well be a good way to facilitate more of these connections.
This project was only a brief foray into the world of scraping and analyzing Twitter data.
We were under the impression that we understood “survivor Twitter.” Machine learning in the form of LDA helped us take viral hashtags we were familiar with and derive totally new knowledge from them, namely a parallel use of the viral hashtag #rapeCulture and our discovery of politicians for whom survivor justice is a priority.
Given our experiences (and interesting results!), we think there is definitely more to explore at the intersection of Twitter and sexual violence activism. One potential application of LDA is to make it easier for survivors of sexual violence to find each other and the representatives who advocate for them, while avoiding people whose use of these hashtags might do more harm than good for survivors.