Brain Barf, Cooper Union, Data Science

On data science and how I see its relationship to art, philosophy, and my experience as an engineering student at Cooper Union, Part 1

As I hurtle towards the inevitable – my college graduation in May 2017 – I’ve been thinking a lot about the purpose of my education, its place in the greater context of my life, and the way it intersects with my place in the world.

I came to Cooper Union essentially by luck, driven by the same things as most high-achieving students proficient in math and science my age – grades, class rank, the perception that those above me in the hierarchies of family and scholarship knew what was best, that the ultimate purpose was to be the Best, do the Best, whatever the hell that even meant. So I ended up applying to engineering school because it fit the persona I felt I was “supposed” to be (smart, useful, wealthy) and thought I wanted to be. Cooper was free, and falling in line with my tendency towards risk-aversion (failure means weakness, and weakness is unacceptable!), that too seemed like the Right thing to do. So I went to Cooper Union.

Like many of my classmates, I was intoxicated by the idea that STEM was a superior set of fields – we were “smart”, we got good grades (the ultimate validation), we enlightened beings understood that the only correct way to look at the world was through the eyes of logic, reason, and rationality. We were – ironically enough – objective zealots.

Except it isn’t in the slightest. Engineering in the way that it’s been presented to me by many of my professors and peers – an overwhelming series of theory-dense courses that reward rote memorization and the ability to perform well under arbitrary pressure – is anything but superior. Like the education of my early and formative years did, it shapes directionless students into 4.0-hungry followers, suppresses recalcitrance, and stifles original thought. Of course it does – most of them (us?) have been raised with similar value systems that we swallowed without question – most of them don’t seem to have thoughts outside their field of study or quest for some nebulous sense of ‘success.’

The only time I’ve found myself to be truly happy/thoughtful in my time at Cooper – and I’m not talking about the spikes of adrenaline that accompany the feeling of checking my semester grades – is when I struggle to make sense of something, only to come to the realization that my original perception of the concept or idea in question is missing something. Some examples I can pinpoint: figuring out how cell towers work or experiencing critiques in the class that utterly upended my life.

Last semester I took an art class that challenged the way I saw myself in the classroom setting and totally altered my perceptions of what it means to have “a successful education.” For the first time, I was surrounded by people (artists) who all seemed to have passions and practices that drove their educations, instead of vice versa. There was no ‘right’ answer to find in the solutions manual; the point was not to smile and speak up in class and do the assignments so that the professor would like me and give me an A or a glowing recommendation so that I could get a job and make lots of money and retire in a house with a garage and some dogs. As someone incredibly comforted by following the rules and the paths of other people to avoid discomfort and failure, this class was a shock to my system.

For the first time in my life, I was forced to think for myself. “Bullshitting” a project, as my engineering peers call executing an assignment with the minimal amount of work and still receiving a stellar final grade, was not a badge of honor anymore. Because making art isn’t about the grade you get when you present a finished work at critique. It’s about how a thought process is explored and questioned and expressed and critiqued, but it’s also about the fact that nothing is ever finished or answered. I assert these things about art, but to be completely honest, my experiences and perceptions of it are constantly changing and will never reach a deterministic truth. It’s exhilarating.

To be continued…

Brain Barf, Cooper Union, Data Science, Project Description

2017 is the year of realizing things: some half-baked project ideas…

kylie_jenner = pandas.DataFrame(['KUWTK', 'Kylie Kosmetics', 'Tyga4ever', 'why is there a python in this picture?'])

Kylie Jenner said that 2016 was the year of realizing things, but I’d bet Cooper Union’s sticker price (TOO SOON) that she wasn’t referring to the illuminating experience of learning Python for fun over winter break. Yes, I realize that Kylie has lip kits and white Ferraris to focus on, but girl should check out pandas dataframes if she really wants to live.

In an attempt to get over myself and the resulting self-doubt and stubbornness that made me think I wasn’t capable of programming and left me terrified of failing at it, I spent the last three weeks crash-coursing myself in Python and all of its very awesomely intuitive data science packages.

EdX is great – check out some of their ‘Python for Data Science’ courses if you’re trying to teach yourself to code and have some solid self-discipline to keep you going.

Now that I’m proficient in numpy, pandas, matplotlib, and scikit-learn, I’ve seen the light that is data manipulation/machine learning with Python and have all the regrets that I tried to do all my Statistical Learning assignments last semester in MATLAB. *shudders*

This is cool. Now I should make some cool things that attempt to answer some cool questions.

So if you read my aptly-titled ‘Brain Barf’ post, you know that I have all the feels about doing projects that fulfill my arbitrary standard of what is valuable and useful. Are those feels (and that post) just my thinly-veiled insecurities about never being good enough? Probably. Like I said, working on getting over myself.

I’ve come to terms with the fact that right now it is most valuable for me to practice my skills on projects that challenge the way that I think; doing significant things that change the world will come later when my skill level and mental elasticity get there.

So right now, I’m planning on doing projects related to some questions that I’ve jotted down in my notes recently:

What will happen if Congress defunds Planned Parenthood?

Yes, I realize that this is a massive question, but I’m curious about the relationships between maternal death rates, infant and fetal mortality, and crime rates, among other things, and how they’ve changed since Planned Parenthood started offering abortion services in 1970. I also wonder if there’s a significant difference in the trends of graduation rates, suicide rates, and quality of life over that period of time between the biological sexes (namely the male and female sexes, as intersex data is largely unavailable).

The hardest part of this project will probably be the data collection. Some of the features that I’m interested in analyzing are readily available in nice clean datasets, but many (including some of the features I have yet to think of) are not.

What type of brown ale should we brew next?

For those of you following along at home, some important context:

  • I’m a senior studying electrical engineering at Cooper Union (More About Me!)
  • I helped start an interdisciplinary independent study in beer brewing last semester.
  • We brewed some delicious stouts (milk and imperial), an IPA (session), a blonde ale, and a brown ale.

This semester, we’re continuing to brew for fun even though there aren’t credits involved, and we’re trying to refine our process and clone our favorite beers.

BeerAdvocate, a noted beer review website that we use for reference, has reviews for 2677 different brown ales alone. As tempting as it may be, it’s not feasible for our class of 5 people to try 2677 different brown ales before deciding which one to clone.


Enter data science! My brewing professor found a gigabyte worth of scraped beer reviews (YASSSSS I don’t have to deal with scraping!!!) that I can do some text analysis on. The preliminary plan is to look for themes in the reviews and determine the ones that match with our class’s verbal description of the type of brown ale we’d like to brew.
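I haven’t settled on a method yet, so take this as a rough sketch of one possible way to “match” reviews against our wishlist: turn each text into word counts and rank the beers by cosine similarity. The beer names, review text, and wishlist below are all made up for illustration, and the helper names `bag_of_words` and `cosine_similarity` are mine, not part of any dataset or library.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical class description and hypothetical reviews (not real data).
wishlist = "nutty caramel malt with a smooth finish"
reviews = {
    "Hypothetical Brown Ale A": "smooth caramel malt, slightly nutty, clean finish",
    "Hypothetical Brown Ale B": "very hoppy and bitter with citrus notes",
}

target = bag_of_words(wishlist)
ranked = sorted(
    reviews,
    key=lambda name: cosine_similarity(target, bag_of_words(reviews[name])),
    reverse=True,
)
print(ranked[0])
```

Plain word counts are about the crudest notion of a “theme” there is, so something like topic modeling would probably be the next step once the real reviews are loaded.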

Anyhoo, that’s all I got for now. Will keep y’all posted with project updates and intermittent existential crises!
Cooper Union, Data Science, Project Description, Social Justice, Techy Babble

import #SexualViolence from MachineLearning.LDA

This blog post was originally the author’s writeup for a project she did with her classmate Jason Katz for ECE411: Statistical Learning during the Fall 2016 semester at Cooper Union.

Lots of surreal things happened in 2016…

Harambe, Prince, Muhammad Ali, David Bowie, Harper Lee, Leonard Cohen, Gene Wilder, Alan Rickman, and Zsa Zsa Gabor left us. Brexit happened. Donald Trump got elected president. Flint, Michigan STILL doesn’t have clean water. You get my point.

Fueled by Donald Trump’s misogynistic statements, the many women who have accused him of sexual assault, and his own admission of sexual assault that went viral during the 2016 presidential race, the topic of sexual violence has situated itself at the forefront of national conversation among politicians, pundits, and prominent feminist writers alike. This is just a small segment of the much larger discussion about affirmative consent, Title IX, the rape crisis on college campuses, victim blaming, and slut shaming that has been going on for years.

Ok, but what does Twitter have to do with this?

On one hand, it’s great – albeit REALLY overdue – that the news media finally recognizes sexual violence is an issue that warrants substantial airtime. But basically everything on TV that deals with rape culture or misogyny has been neatly packaged into an emotionally flat or intentionally misconstrued block of information to fit into the two-second slot allotted to it. Because profit.

Screen Shot 2016-12-20 at 3.13.05 PM.png

Case in point.

Twitter, on the other hand, is a gold mine of publicly available, unfiltered visceral reactions. For many survivors of sexual violence, it’s a great outlet for self-therapy in the form of all caps vent rants but also empowerment via an informal support network of people who also tweet about their experiences with similar trauma.

Coming at it from the perspective of a DSWFSAGJFVOSV (data scientist who feels strongly about getting justice for victims of sexual violence), I had a hunch that “survivor Twitter” might have a lot to tell us. 

the techy stuff

Inspired by the hashtags #RapedAtSpelman and #RapedAtMorehouse, which went viral in response to an anonymous account set up by a rape survivor at Spelman College, we used Twitter’s REST API to scrape tweets tagged with trending hashtags related to sexual violence and promoted by activists who are sympathetic towards survivors and survivor justice, including:

#whywomendontreport

#notokay

#whyistayed

#rapecultureiswhen

#webelieveyou

#tilithappenstoyou

#askingforit

#thehuntingground

#rapeculture

#listentosurvivors

#thingslongerthanbrockturnersrapesentence

#endrapeculture

#stanfordrapist

#sexualviolence

#EmilyDoe

You get it?

Twitter does this annoying thing where it only lets you scrape a certain number of tweets from the past week, so we had to run our scraping script every couple of days to amass a sufficient collection of tweets.
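Running a scraper every couple of days over a one-week window means consecutive runs overlap, so the same tweet gets collected more than once. Here is a minimal sketch of the merge step, assuming each tweet is a dict carrying the `id_str` field that Twitter’s API returns. The function name `merge_runs` and the sample tweets are made up for illustration; this is not our actual script.

```python
def merge_runs(*runs):
    """Merge several scraping runs, keeping the first copy of each tweet."""
    seen = {}
    for run in runs:
        for tweet in run:
            # setdefault keeps the earliest copy and ignores later repeats
            seen.setdefault(tweet["id_str"], tweet)
    return list(seen.values())


# Hypothetical runs with one overlapping tweet (id "2").
monday = [{"id_str": "1", "text": "first"}, {"id_str": "2", "text": "second"}]
wednesday = [{"id_str": "2", "text": "second"}, {"id_str": "3", "text": "third"}]

corpus = merge_runs(monday, wednesday)
print(len(corpus))  # 3
```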

My fabulous partner did a ton (read: 25 hours) of pre-processing to turn the tweets into parseable text, which we then fed into a latent Dirichlet allocation (LDA) code base.
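For a flavor of what “parseable text” can mean, here is a small sketch of the kind of cleanup a tweet might go through. The specific steps (lowercasing, dropping the retweet prefix and URLs, keeping hashtags and mentions as tokens) are assumptions for illustration, not a record of what our pre-processing actually did, and the example tweet is invented.

```python
import re

URL = re.compile(r"https?://\S+")
RETWEET_PREFIX = re.compile(r"^rt\s+@\w+:\s*")
TOKEN = re.compile(r"[@#]?\w+")  # words, plus #hashtags and @mentions kept whole


def tweet_to_tokens(text):
    """Lowercase a tweet, strip the retweet prefix and URLs, and split into tokens."""
    text = text.lower()
    text = RETWEET_PREFIX.sub("", text)
    text = URL.sub("", text)
    return TOKEN.findall(text)


# Hypothetical tweet, not taken from our data.
example = "RT @someuser: We believe you. #WeBelieveYou https://example.com/x"
print(tweet_to_tokens(example))  # ['we', 'believe', 'you', '#webelieveyou']
```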

LDA for the Layperson (CUTE HAMSTERS AHEAD!)

LDA is a well-known method of topic modeling, which in machine learning (ML) and natural language processing (NLP) refers to fitting a statistical model that discovers the ‘topics’ that characterize a collection of documents, or in this case, tweets. According to LDA, a document is merely a mixture of topics, where each topic has some probability of generating a specific word.

Suppose you have a bunch of sentences (or tweets).

  • I am tired because I have a lot of finals this week.
  • Is my finals-related sleep deprivation apparent yet?
  • Sometimes I go to animal shelters and pet kittens to avoid my responsibilities.
  • Baby animal videos are best when they include kittens.
  • Look at this cute hamster munching on a piece of bok choy that I found while procrastinating my stat learning blog post assignment.

If we give our LDA these sentences and ask it to determine 2 topics, LDA might produce something like this:

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% sleep, 15% tired, 10% finals, 10% deprivation, … (at which point you could call topic A finals week)
  • Topic B: 20% animal, 20% kittens, 20% cute, 15% hamster, … (at which point you could call topic B cute animals)

LDA essentially represents tweets (its “documents”) as mixtures of topics that spit out words with certain probabilities. It assumes that the corpus of tweets is produced as follows:

  • decide on the number of words N the tweet will have
  • choose a topic mixture for the tweet depending on the number of topics the user (us!) asks for
  • generate each word in the tweet by
    • picking a topic from that mixture
    • then using the topic to generate the word from the topic’s set of words
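The recipe above can be sketched in a few lines of plain Python. This is only an illustration of the generative story, not our project code: the two topics reuse the word weights from the hamster example (treated as relative weights over just the listed words), and the Dirichlet parameter `alpha` and the helper names are assumptions.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_tweet(rng, topics, num_words, alpha=0.5):
    """Follow the LDA story: pick a topic mixture, then a topic and a word per slot."""
    names = list(topics)
    mixture = sample_dirichlet(rng, [alpha] * len(names))
    words = []
    for _ in range(num_words):
        topic = rng.choices(names, weights=mixture)[0]  # pick a topic
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # pick a word
        words.append(word)
    return mixture, words


topics = {
    "finals week": {"sleep": 0.30, "tired": 0.15, "finals": 0.10, "deprivation": 0.10},
    "cute animals": {"animal": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15},
}

rng = random.Random(0)  # seeded so reruns give the same fake tweet
mixture, words = generate_tweet(rng, topics, num_words=8)
print(mixture, words)
```

The output is word salad, which is the point: LDA only cares about which words show up together, not about their order.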

From here, LDA does fancy math to determine a set of latent topics that is likely to have generated the corpus of tweets. Our code in particular outputs the top n words from each topic that have the highest probability of being “chosen.”
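For anyone who wants to peek at the “fancy math,” below is a toy version using collapsed Gibbs sampling, one common way to fit LDA. It is a sketch under assumptions and not the code base we used: the hyperparameters `alpha` and `beta`, the iteration count, the function names, and the tiny hand-tokenized corpus are all chosen for illustration.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of word tokens."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]  # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics  # total words assigned to topic k
    assignments = []

    def update(d, word, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][word] += delta
        topic_total[k] += delta

    # Start by assigning every word to a random topic.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignments[d]):
            update(d, word, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove this word's assignment, then resample its topic given all others.
                update(d, word, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(num_topics)
                ]
                assignments[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, word, assignments[d][i], +1)

    # Smoothed probability of each word under each topic.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + beta * len(vocab)) for w in vocab}
        for k in range(num_topics)
    ]


def top_words(topic, n):
    """Return the n most probable words in one topic."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hand-tokenized stand-ins for the five example sentences.
docs = [
    ["tired", "finals", "week"],
    ["finals", "sleep", "deprivation"],
    ["animal", "shelters", "kittens"],
    ["baby", "animal", "videos", "kittens"],
    ["cute", "hamster", "finals", "procrastinating"],
]
topics = fit_lda(docs, num_topics=2)
for k, topic in enumerate(topics):
    print(k, top_words(topic, 3))
```

With a corpus this tiny the topics can come out muddled, and they change with the seed; the real thing needs far more text.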

LDA is cool. Be like LDA.

Screen Shot 2016-12-20 at 3.14.48 PM.png

This is confusing so maybe this picture will help.

We played around with the number of topics and number of top words per topic. When our code resulted in useless words like ‘day’ and ‘introduced,’ we added them to our list of NLTK stop words. By stop words, we mean common words that are filtered out before NLP happens because they are meaningless in the context of topic modeling.
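Stop-word filtering itself is simple enough to sketch. The base set below is a tiny stand-in so the snippet runs without any downloads; in practice it would come from NLTK’s `stopwords.words("english")`. The two custom words are the ones mentioned above, and the token list and function name are made up for illustration.

```python
# Tiny stand-in for NLTK's English stop-word list (an assumption for this sketch).
BASE_STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in"}
# Extra words we found useless in our topics.
CUSTOM_STOP_WORDS = {"day", "introduced"}


def remove_stop_words(tokens, stop_words):
    """Drop any token whose lowercase form is a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


stop_words = BASE_STOP_WORDS | CUSTOM_STOP_WORDS
tokens = ["the", "bill", "introduced", "to", "protect", "survivors"]
print(remove_stop_words(tokens, stop_words))  # ['bill', 'protect', 'survivors']
```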

So….what happened?

We got the best results (read: most distinct categories) when we fit LDA models with 3 topics and asked for the top 10 words in each topic:

Screen Shot 2016-12-20 at 3.16.01 PM.png

I’d like to direct your attention to a couple of interesting things that popped up.


Topic #1: Self-Proclaimed Deplorables

screen-shot-2016-12-20-at-3-00-35-pm

America, MAGA, norefugees, obama, nightmares, deport, and DREAMers all point to a sphere of Twitter inhabited by Donald Trump supporters who don’t like the current political landscape of the United States and dislike immigrants and refugees. @lrihendry, in particular, refers to popular alt-right Twitter user Lori Hendry (check out the Pepe the Frog emoji in her profile), and qgkmkkhrxu refers to the image name of a heavily retweeted xenophobic, anti-Muslim meme that she tagged with #rapeCulture.

screen-shot-2016-12-19-at-11-20-08-pm

screen-shot-2016-12-19-at-11-19-58-pm

Lori kind of sucks. But she did retweet this vine so at least she’s got that going for her.

It’s a bit disconcerting that our scraping picked up enough tweets from ironic co-opters of the hashtags #rapeculture and #notokay for LDA to give them their own separate category of tweets.

Many of these tweets were retweets of @lrihendry’s meme, hence the top words in the category, but there were a few other tweets that referenced the “double standard of rape culture” and “toxic femininity,” a riff on the popular hashtag #toxicmasculinity.


Topic #2: Sexual Violence and Twiplomacy

screen-shot-2016-12-20-at-3-00-42-pm

Schools, repspeier, transfer, and safe refer to The Safe Transfer Act, proposed two weeks ago by California Representative Jackie Speier, which would require a notation on the academic transcript of any student found by their college or university to have violated the school’s rules or policies with regard to sexual violence. Despite being a follower of many prominent feminist Twitter accounts, I had no knowledge of this piece of legislation or of Representative Speier prior to this project.

screen-shot-2016-12-20-at-12-03-57-am

screen-shot-2016-12-20-at-12-03-43-am

Heartbeatbill, man, domestic, and violence come from Twitter user @femtheologian Dr. Gina Messina, a professor and Huffington Post blogger who expressed her anger with a proposed fetal heartbeat bill in Ohio. Domestic violence, in particular, relates to her tweet about Ohio State Senator Kris Jordan, who Messina claims has a history of domestic violence.

screen-shot-2016-12-19-at-11-54-51-pm

screen-shot-2016-12-19-at-11-54-17-pm

In these cases, LDA teased out tweets specifically related to publicly contentious pieces of legislation related to Title IX and sexual violence.


Topic #3: “@kellyoxford: tweet me your first assaults. they aren’t just stats.”

screen-shot-2016-12-20-at-3-00-50-pm

In this category, @kellyoxford, sexualassault, rape, consent, and survivors clearly relate to Twitter user Kelly Oxford, the New York Times bestselling author whose #notokay tweet encouraging women to share their experiences with sexual violence went viral.

screen-shot-2016-12-20-at-2-59-45-pm

Screen Shot 2016-12-20 at 3.04.53 PM.png

Screen Shot 2016-12-20 at 3.58.06 PM.png


@takedownmras refers to a Twitter account run by a group of men who actively disagree with men’s rights activists (MRAs) and work to support survivors of sexual violence. Similar to the accounts that popped up in the last topic, I had no prior knowledge of @takedownmras, but am very glad to have found them.

Screen Shot 2016-12-20 at 4.52.39 PM.png

This category affirms our initial claim that people who have experienced sexual assault use Twitter as a way to form a mutual support network with other survivors. The results are promising; machine learning algorithms may very well be a good way to facilitate more of these connections.

Now what?

This project was only a brief foray into the world of scraping and analyzing Twitter data.

We were under the impression that we understood “survivor Twitter.” Machine learning in the form of LDA helped us take viral hashtags we were familiar with and derive totally new knowledge from them, namely a parallel use of the viral hashtag #rapeCulture and our discovery of politicians for whom survivor justice is a priority.

Given our experiences (and interesting results!), we think there is definitely more to explore in the intersection of Twitter and sexual violence activism. One potential application of LDA is making it easier for survivors of sexual violence to find each other and the representatives who advocate for them, while avoiding people whose use of these hashtags might do more harm than good for survivors.