Data Science, Project Description, Social Justice, Techy Babble

LinkNYC – a project and also why aren’t laypeople given the chance to understand the Internet?

I’ve become keenly aware of New York City’s aggressive, aptly-named “LinkNYC” campaign to revamp old, pee-soaked telephone booths by making them into free public Wi-Fi hotspots featuring advertised speeds of up to 1 Gbps.

Prompted by his response to my audible “holy shit,” I explained the situation (in hindsight, using unnecessarily technical electrical engineering terms like data rate and fiber optics) to my boyfriend (civil engineer turned brilliant artist/designer/philosopher).

His response: “huh?” A quick text to my family confirmed that, no, this type of thing is not, in fact, common knowledge.

Then literally the next day, I realized that I’d been seeing gigantic banners on the subway every morning that spelled out “1 GIG, THE INTERNET SPEED YOUR MOTHER WARNED YOU ABOUT.”

Whoa now copywriters (are those still the people who do ads? this is the one thing I remember from Mad Men), WATERYEWDEWING. I know the implications of that internet speed, and I DIDN’T EVEN NOTICE THE AD. How on earth is a person who hasn’t experienced the pleasures of communication theory, digital signal processing, and wireless communications courses supposed to have any idea of what that means or why they should care?

Isn’t it a bit strange that most of us spend a good portion of our time online, yet have nearly no understanding of what goes on beyond the router?

enter new project!

So thanks to that little revelation, aforementioned boyfriend and I are collaborating on the creation of an infographic that will bring some perspective to the initiative for the averagely techy New Yorker.


  1. Decide to communicate the meanings of Internet speeds
  2. Do this by researching the Internet speed requirements of comparative use cases (e.g. Netflix vs. web browsing)
  3. Decide to also communicate the impact of LinkNYC on the prevalence of public wifi hotspots
  4. Do this by locating* a map detailing where non-LinkNYC and LinkNYC public wifi hotspots are located
    • *Make a map that details where non-LinkNYC and LinkNYC public wifi hotspots are located using NYC open data
  5. Realize that the spread of LinkNYCs on a map looks eerily similar to a gentrification map you saw one time
  6. Decide to also communicate the potential implications of LinkNYCs and other public wifi locations with respect to class and race disparity
  7. Do this by planning for lots of approximations and printing and drawing to determine the number of non-LinkNYC and LinkNYC public hotspots per neighborhood divided into categories based on average income bracket
  8. Drink some wine
  9. Decide to go home, write a blog post, and go to bed.
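To put a number on step 2, here is a back-of-the-envelope sketch. The only figure taken from the campaign is the advertised 1 Gbps; the file sizes and the slower connection speeds are illustrative assumptions, not measurements.

```python
# Rough transfer times at different connection speeds.
# File sizes and the two slower speeds are illustrative assumptions.

def transfer_seconds(size_megabytes, speed_mbps):
    """Seconds to move a file, ignoring overhead and congestion."""
    size_megabits = size_megabytes * 8  # 1 byte = 8 bits
    return size_megabits / speed_mbps

SPEEDS_MBPS = {
    "assumed slow public wifi": 5,
    "assumed home broadband": 25,
    "advertised LinkNYC maximum": 1000,  # 1 Gbps = 1000 Mbps
}

FILES_MB = {
    "assumed web page": 2,
    "assumed HD movie": 4000,
}

for file_name, size in FILES_MB.items():
    for link_name, speed in SPEEDS_MBPS.items():
        secs = transfer_seconds(size, speed)
        print(f"{file_name} over {link_name}: {secs:.2f} s")
```

Under these assumptions the movie drops from well over an hour on the slow connection to about half a minute at the advertised maximum, which is the kind of comparison the infographic is after.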

Stay tuned for our infographic!

Cooper Union, Data Science, Project Description, Social Justice, Techy Babble

import #SexualViolence from MachineLearning.LDA

This blog post was originally the author’s writeup for a project she did with her classmate Jason Katz for ECE411: Statistical Learning during the Fall 2016 semester at Cooper Union.

Lots of surreal things happened in 2016…

Harambe, Prince, Muhammad Ali, David Bowie, Harper Lee, Leonard Cohen, Gene Wilder, Alan Rickman, and Zsa Zsa Gabor left us. Brexit happened. Donald Trump got elected president. Flint, Michigan STILL doesn’t have clean water. You get my point.

Fueled by Donald Trump’s misogynistic statements, the many women who have accused him of sexual assault, and his own admission of sexual assault that went viral during the 2016 presidential race, the topic of sexual violence has situated itself at the forefront of national conversation among politicians, pundits, and prominent feminist writers alike. This is just a small segment of the much larger discussion about affirmative consent, Title IX, the rape crisis on college campuses, victim blaming, and slut shaming that has been going on for years.

Ok, but what does Twitter have to do with this?

On one hand, it’s great – albeit REALLY overdue – that the news media finally recognizes sexual violence is an issue that warrants substantial airtime. But basically everything on TV that deals with rape culture or misogyny has been neatly packaged into an emotionally flat or intentionally misconstrued block of information to fit into the two-second space allotted to it. Because profit.

Screen Shot 2016-12-20 at 3.13.05 PM.png

Case in point.

Twitter, on the other hand, is a gold mine of publicly available, unfiltered visceral reactions. For many survivors of sexual violence, it’s a great outlet not only for self-therapy in the form of all-caps vent rants but also for empowerment via an informal support network of people who tweet about their experiences with similar trauma.

Coming at it from the perspective of a DSWFSAGJFVOSV (data scientist who feels strongly about getting justice for victims of sexual violence), I had a hunch that “survivor Twitter” might have a lot to tell us. 

the techy stuff

Inspired by the hashtags #RapedAtSpelman and #RapedAtMorehouse, which went viral in response to an anonymous account set up by a rape survivor at Spelman College, we used Twitter’s REST API to scrape trending hashtags related to sexual violence and promoted by activists who are sympathetic towards survivors and survivor justice, including:
















You get it?

Twitter does this annoying thing where it only lets you scrape a certain number of tweets from the past week, so we had to run our scraping script every couple of days to amass a sufficient collection of tweets.
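Running the scraper every couple of days means the batches overlap, so the same tweet can show up more than once. A minimal sketch of merging batches while keeping each tweet once is below. The dict layout (`id`, `text`) and the sample tweets are hypothetical stand-ins, not the real API payload or our data.

```python
# Merge repeated scrapes into one collection, keeping each tweet once.
# The {"id": ..., "text": ...} layout is an assumed stand-in for the API payload.

def merge_batches(batches):
    """Combine batches of tweets, dropping repeats by tweet id."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            # setdefault keeps the first copy seen for each id
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())

# Two hypothetical scraping runs with one tweet in common.
monday = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
wednesday = [{"id": 2, "text": "second"}, {"id": 3, "text": "third"}]

collection = merge_batches([monday, wednesday])
print(len(collection))  # 3
```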

My fabulous partner did a ton (read: 25 hours) of pre-processing to turn the tweets into parseable text, which we then fed into a latent Dirichlet allocation (LDA) code base.
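For a flavor of what that pre-processing involves, here is a small sketch using only Python's `re` module. The specific rules (lowercase everything, drop links and the "RT" marker, keep hashtag words without the "#", keep @handles) are assumptions chosen for illustration, not a record of the actual 25 hours of work, and the sample tweet text and URL are made up.

```python
import re

def clean_tweet(text):
    """Turn a raw tweet into a list of lowercase tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"\brt\b", " ", text)        # drop the retweet marker
    text = text.replace("#", "")               # keep hashtag words, lose the '#'
    # Keep @handles and plain words; everything else acts as a separator.
    return re.findall(r"@?\w+", text)

# Made-up example tweet.
raw = "RT @someuser: This is #NotOkay https://example.com/x"
print(clean_tweet(raw))  # ['@someuser', 'this', 'is', 'notokay']
```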

LDA for the Layperson (CUTE HAMSTERS AHEAD!)

LDA is a well-known method of topic modeling, which in machine learning (ML) and natural language processing (NLP) refers to a statistical model that discovers ‘topics’ that characterize a collection of documents, or in this case, tweets. According to LDA, a document is a mixture of topics, and each topic has some probability of generating any given word.

Suppose you have a bunch of sentences (or tweets).

  • I am tired because I have a lot of finals this week.
  • Is my finals-related sleep deprivation apparent yet?
  • Sometimes I go to animal shelters and pet kittens to avoid my responsibilities.
  • Baby animal videos are best when they include kittens.
  • Look at this cute hamster munching on a piece of bok choy that I found while procrastinating my stat learning blog post assignment.

If we give our LDA these sentences and ask it to determine 2 topics, LDA might produce something like this:

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% sleep, 15% tired, 10% finals, 10% deprivation, … (at which point you could call topic A finals week)
  • Topic B: 20% animal, 20% kittens, 20% cute, 15% hamster, … (at which point you could call topic B cute animals)
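To see how those two layers combine, the sketch below computes the chance that sentence 5 produces a given word: weight each topic's word probability by the sentence's share of that topic, then add them up. The numbers are copied from the toy example above; treating any word not listed for a topic as probability 0 is a simplifying assumption.

```python
# Topic-word probabilities from the toy example. Words not listed
# for a topic are treated as probability 0 (a simplifying assumption).
topics = {
    "A": {"sleep": 0.30, "tired": 0.15, "finals": 0.10, "deprivation": 0.10},
    "B": {"animal": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15},
}
sentence_5_mixture = {"A": 0.6, "B": 0.4}

def word_probability(word, mixture, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in mixture.items())

print(round(word_probability("hamster", sentence_5_mixture, topics), 3))  # 0.06
```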

LDA essentially represents documents (tweets) as mixtures of topics that spit out words with certain probabilities. It assumes that the corpus of documents (tweets) is produced as follows:

  • decide on the number of words N the document will have
  • choose a topic mixture for the document (tweet) depending on the number of topics the user (us!) asks for
  • generate each word in the document (tweet) by
    • picking a topic
    • then using the topic to generate the word from the topic’s set of words
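The recipe above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the topic mixture is drawn from a symmetric Dirichlet (the "D" in LDA), the number of words is simply passed in, and the topics reuse the toy probabilities from the finals/kittens example. Those toy probabilities don't sum to 1, which is fine here because `random.choices` normalizes its weights.

```python
import random

# Toy topics from the example above (hypothetical, truncated word lists).
topics = {
    "A": {"sleep": 0.30, "tired": 0.15, "finals": 0.10, "deprivation": 0.10},
    "B": {"animal": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-part mixture from a symmetric Dirichlet via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, n_words, alpha=0.5, seed=0):
    """Follow the recipe: pick a mixture, then a topic and a word per slot."""
    rng = random.Random(seed)
    names = list(topics)
    mixture = sample_dirichlet(alpha, len(names), rng)   # choose a topic mixture
    words = []
    for _ in range(n_words):
        name = rng.choices(names, weights=mixture)[0]    # pick a topic
        vocab = list(topics[name])
        probs = list(topics[name].values())
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word from it
    return mixture, words

mixture, words = generate_document(topics, n_words=8)
print(mixture)
print(words)
```

The output is word salad, of course. The point is that LDA pretends every tweet was written this way and then works backwards.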

From here, LDA does fancy math to determine a set of latent topics that is likely to have generated the corpus of documents (tweets). Our code in particular outputs the top n words from each topic that have the highest probability of being “chosen.”
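Pulling the top n words out of a topic is just a sort by probability. A minimal sketch, using the Topic A numbers from the toy example (the function name is made up for illustration):

```python
def top_words(topic, n):
    """Return the n most probable words in a topic, highest first."""
    ranked = sorted(topic.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]

topic_a = {"tired": 0.15, "sleep": 0.30, "finals": 0.10, "deprivation": 0.10}
print(top_words(topic_a, 2))  # ['sleep', 'tired']
```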

LDA is cool. Be like LDA.

Screen Shot 2016-12-20 at 3.14.48 PM.png

This is confusing so maybe this picture will help.

We played around with the number of topics and the number of top words per topic. When our code surfaced uninformative words like ‘day’ and ‘introduced,’ we added them to our list of NLTK stop words. By stop words, we mean common words that are filtered out before NLP happens because they carry no meaning in the context of topic modeling.
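Stop word filtering itself is a one-liner. In the sketch below, the small base set is a hand-written stand-in for NLTK's English stop word list (so the example runs without downloading anything), and the sample tokens are made up. The two custom additions are the ones named above.

```python
# Hand-written stand-in for NLTK's English stop word list (an assumption,
# used so the example runs without any downloads).
BASE_STOP_WORDS = {"i", "a", "the", "of", "to", "and", "is", "this", "at"}
CUSTOM_STOP_WORDS = {"day", "introduced"}  # the two examples named above
STOP_WORDS = BASE_STOP_WORDS | CUSTOM_STOP_WORDS

def remove_stop_words(tokens):
    """Drop any token that appears in the stop word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Made-up tokens for illustration.
tokens = ["the", "bill", "introduced", "this", "day", "protects", "survivors"]
print(remove_stop_words(tokens))  # ['bill', 'protects', 'survivors']
```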

So… what happened?

We got the best results (read: most distinct categories) when we fit LDA models with 3 topics and asked for the top 10 words in each topic:
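If you want to poke at "fit a model with k topics, print the top n words" yourself, here is a tiny pure-Python sketch. It uses collapsed Gibbs sampling, one common way of fitting LDA; it is an illustration only, not the code base used for this project, and the hyperparameters (`alpha`, `beta`, `n_iters`) and the pre-cleaned toy documents are assumptions.

```python
import random
from collections import Counter

def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take this token's current assignment out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: (topic's share of this doc) x (word's share of topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

def top_words(topic_counts, n):
    """The n words most often assigned to a topic."""
    return [w for w, c in topic_counts.most_common(n) if c > 0]

# The five toy sentences from earlier, hypothetically already cleaned.
docs = [
    ["tired", "finals", "week"],
    ["finals", "sleep", "deprivation"],
    ["animal", "shelters", "pet", "kittens"],
    ["baby", "animal", "videos", "kittens"],
    ["cute", "hamster", "bok", "choy", "procrastinating", "stat", "learning"],
]

topic_counts = fit_lda(docs, n_topics=2)
for k, counts in enumerate(topic_counts):
    print(f"Topic {k}: {top_words(counts, 3)}")
```

With a corpus this small the topics will wobble from seed to seed, which is a decent reminder of why we needed a lot of tweets and several tries at the number of topics.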

Screen Shot 2016-12-20 at 3.16.01 PM.png

I’d like to direct your attention to a couple of interesting things that popped up.

Topic #1: Self-Proclaimed Deplorables


America, MAGA, norefugees, obama, nightmares, deport, and DREAMers all point to a sphere of Twitter inhabited by Donald Trump supporters who don’t like the current political landscape of the United States and dislike immigrants and refugees. @lrihendry, in particular, refers to popular alt-right Twitter user Lori Hendry (check out the Pepe the Frog emoji in her profile), and qgkmkkhrxu refers to the image name of a heavily retweeted xenophobic, anti-Muslim meme that she tagged with #rapeCulture.


Lori kind of sucks. But she did retweet this vine so at least she’s got that going for her.

It’s a bit disconcerting that our scraping picked up enough tweets from ironic co-opters of the hashtags #rapeculture and #notokay for LDA to give them their own separate category of tweets.

Many of these tweets were retweets of @lrihendry’s meme, hence the top words in the category, but there were a few other tweets that referenced the “double standard of rape culture” and “toxic femininity,” a riff on the popular hashtag #toxicmasculinity.

Topic #2: Sexual Violence and Twiplomacy


Schools, repspeier, transfer, and safe refer to The Safe Transfer Act, proposed two weeks ago by California Representative Jackie Speier, which requires notation on the academic transcript of any student found by their college or university to have violated the school’s rules or policies with regard to sexual violence. Despite being a follower of many prominent feminist Twitter accounts, I had no knowledge of this piece of legislation nor of Representative Speier prior to this project.


Heartbeatbill, man, domestic, and violence come from Twitter user @femtheologian Dr. Gina Messina, a professor and Huffington Post blogger who expressed her anger with a proposed fetal heartbeat bill in Ohio. Domestic violence, in particular, relates to her tweet about Ohio State Senator Kris Jordan, who Messina claims has a history of domestic violence.

In these cases, LDA teased out tweets specifically related to publicly contentious pieces of legislation related to Title IX and sexual violence.

Topic #3: “@kellyoxford: tweet me your first assaults. they aren’t just stats.”


In this category, @kellyoxford, sexualassault, rape, consent, and survivors clearly relate to Twitter user Kelly Oxford, the New York Times bestselling author whose #notokay tweet that encouraged women to tweet their experiences with sexual violence went viral.


Screen Shot 2016-12-20 at 3.04.53 PM.png

Screen Shot 2016-12-20 at 3.58.06 PM.png

@takedownmras refers to a Twitter account run by a group of men who actively disagree with men’s rights activists (MRAs) and work to support survivors of sexual violence. Similar to the accounts that popped up in the last topic, I had no prior knowledge of @takedownmras, but am very glad to have found them.

Screen Shot 2016-12-20 at 4.52.39 PM.png

This category affirms our initial claim that people who have experienced sexual assault use Twitter as a way to form a mutual support network with other survivors. The results are promising; machine learning algorithms may very well be a good way to facilitate more of these connections.

Now what?

This project was only a brief foray into the world of scraping and analyzing Twitter data.

We were under the impression that we understood “survivor Twitter.” Machine learning in the form of LDA helped us take viral hashtags we were familiar with and derive totally new knowledge from them, namely a parallel use of the viral hashtag #rapeCulture and our discovery of politicians for whom survivor justice is a priority.

Given our experiences (and interesting results!), we think there is definitely more to explore at the intersection of Twitter and sexual violence activism. One potential application of LDA is to make it easier for survivors of sexual violence to find each other and the representatives who advocate for them, while avoiding people whose use of these hashtags might do more harm than good for survivors.
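As a very rough sketch of that idea: once each tweet has a topic mixture, you can bucket tweets by their dominant topic and surface only the buckets you want. Everything below (the tweet ids, the mixtures, the topic labels) is hypothetical and only shows the shape of the approach, not something we built.

```python
def dominant_topic(mixture):
    """Name of the topic with the largest share in a tweet's mixture."""
    return max(mixture, key=mixture.get)

def group_by_topic(tweet_mixtures):
    """Bucket tweet ids by their dominant topic."""
    groups = {}
    for tweet_id, mixture in tweet_mixtures.items():
        groups.setdefault(dominant_topic(mixture), []).append(tweet_id)
    return groups

# Hypothetical per-tweet topic mixtures with made-up labels.
tweet_mixtures = {
    "tweet_1": {"co-opters": 0.8, "legislation": 0.1, "survivors": 0.1},
    "tweet_2": {"co-opters": 0.1, "legislation": 0.2, "survivors": 0.7},
    "tweet_3": {"co-opters": 0.0, "legislation": 0.3, "survivors": 0.7},
}

groups = group_by_topic(tweet_mixtures)
print(groups.get("survivors", []))  # ['tweet_2', 'tweet_3']
```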