Web APIs and NLP

Background

Is there a fundamental difference between fans of rival teams?

If you're reading this and you follow sports, odds are you have a favourite team.

That team most likely also has a rival. Depending on how passionately you feel about your favourite team, there's a good chance you have quite a bit of disdain for that rival.

After all, that's part of the deal, right?

For some, this might seem arbitrary. For others, not so much.

Quote image: Bill Shankly, Liverpool FC manager, 1959–1974 (via The Guardian)

Underlying all of this is the belief that there are irreconcilable differences between each side and that those differences are fundamental.

Sports rivalries present us with the classic us versus them mentality:

It's a well-known principle in social psychology that people define themselves in terms of social groupings and are quick to denigrate others who don't fit into those groups. Others who share our particular qualities are our "ingroup," and those who do not are the "outgroup."

If that's true, that there is a fundamental difference between rival fans, then it should be easy enough for a third party, in this instance an NLP model, to determine which Reddit submissions come from each set of fans or supporters.


Datasets:

Custom Reddit submissions dataset created using Pushshift’s API totaling 57,908 posts.

Libraries:

bs4, matplotlib, nltk, pandas, random, re, requests, seaborn, sklearn, time, and wordcloud.

Challenge

Is it possible to create a binary classification model that can distinguish between posts in rival subreddits with greater than 90% accuracy?

To find out, I have chosen two rival English Premier League clubs.

I didn't, however, want to make this easy.

If I had taken rivals from different parts of the country, differences could be explained away by other things, like regional dialects.

Instead, I have chosen Liverpool...

Jordan Henderson, Mo Salah, and Roberto Firmino celebrating yet another Liverpool goal.

...and Everton

Richarlison contemplating the fact that there's a very real chance Everton will be relegated.

Two clubs from the same city whose stadiums are one mile apart.

While a single experiment can't yield any definitive conclusions, I have set a 90% accuracy rate as my target to see whether this theory has any legs.

Why 90%?

Why not?

Oh, and in the interest of full disclosure, I'm a Liverpool supporter.

Here’s a photo of me at Anfield (Liverpool’s stadium) back in 2005 touching the This is Anfield sign:

Initial Considerations

Challenges:

  • The obvious: r/LiverpoolFC and r/Everton are football (soccer) subreddits, so naturally there will be overlap in how they speak about certain things.
  • The annoying (for my model): Liverpool and Everton are both based in Liverpool, meaning there are not any easily detectable regional dialect differences.
  • The hilarious (to me as a Liverpool supporter): There is a night and day difference between the total number of Liverpool and Everton supporters (Liverpool's subreddit has 337k members while Everton's has 30k), creating the potential for data imbalance.

The Dataset:

  1. Reddit submissions data acquired using Pushshift's API.
  2. 57,908 posts in total.
  3. 29,010 from r/LiverpoolFC.
  4. 28,889 from r/Everton.
  5. Initial features were subreddit, selftext, and title. Selftext is the body text of a Reddit post that doesn't link to anywhere outside of Reddit.
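For context, a paging loop against Pushshift's submission-search endpoint might look like the sketch below. The endpoint URL, field names, and `created_utc` cursoring are assumptions based on Pushshift's public API (which has changed over the years, and has had availability issues), not the exact script used for this project.

```python
import time

import requests

# Assumed Pushshift endpoint; the service's URL and parameters have changed over time.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_params(subreddit, before=None, size=100):
    """Query parameters for one page of submissions."""
    params = {
        "subreddit": subreddit,
        "size": size,
        # Only the fields the project needs.
        "fields": "subreddit,selftext,title,created_utc",
    }
    if before is not None:
        params["before"] = before  # paginate backwards in time
    return params

def fetch_submissions(subreddit, pages=10, pause=1.0):
    """Page through a subreddit's submissions, cursoring on the oldest
    created_utc seen so far."""
    posts, before = [], None
    for _ in range(pages):
        resp = requests.get(PUSHSHIFT_URL, params=build_params(subreddit, before))
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # oldest post in this batch
        time.sleep(pause)  # rate-limit courtesy
    return posts
```

Repeating `fetch_submissions` for "LiverpoolFC" and "Everton" and concatenating the results would yield a dataset shaped like the one above.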

Data Cleaning & EDA

Outside of removing duplicate rows and imputing some values, initial data cleaning was minimal, because I wanted to get a feel for out-of-the-box model performance.
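As a rough illustration of that minimal pass (not the actual notebook code), assuming the posts sit in a pandas DataFrame with subreddit, title, and selftext columns:

```python
import pandas as pd

def clean_posts(df):
    """Minimal cleaning: drop exact duplicate posts and impute an empty
    string for missing selftext (link-only posts have no body)."""
    df = df.drop_duplicates(subset=["subreddit", "title", "selftext"]).copy()
    df["selftext"] = df["selftext"].fillna("")
    return df
```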

After initial subpar results, I took a root and branch approach and engineered several features in an attempt to improve the accuracy of my models.

Those features were:

  • tokens (tokenized forms of sentences)
  • lems (lemmatized versions of tokens)
  • porter_stems and snowball_stems (stemmed versions of tokens)
  • word_count (count of total words)
  • title_char_count (count of all characters in a title)
  • sentence_count (count of total sentences)
  • avg_word_length (average word length)
  • avg_sentence_length (average sentence length)
  • sentiment (a score given to a piece of writing indicating whether it's positive, negative, or neutral)
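The count-based features in that list are straightforward to derive. Here's a stdlib-only sketch of how they might be computed per post; the real pipeline used nltk for tokens, lemmas, stems, and sentiment, and the regex-based splitting below is an illustrative simplification.

```python
import re

def text_features(text):
    """Compute the simple count-based features: word_count, char_count
    (title_char_count in the project when applied to titles),
    sentence_count, avg_word_length, and avg_sentence_length."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),
        "sentence_count": len(sentences),
        "avg_word_length": (
            sum(len(w) for w in words) / word_count if word_count else 0.0
        ),
        "avg_sentence_length": (
            word_count / len(sentences) if sentences else 0.0
        ),
    }
```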

Here’s a look at some of the most common words in each subreddit:

Nothing unusual here.
Interestingly, Liverpool is one of the most common words in the Everton subreddit.
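A word-frequency table like the one behind those plots can be built with a simple counter. The sketch below uses a tiny illustrative stopword list; the project itself drew on nltk's full stopword list and the wordcloud library for the visuals.

```python
import re
from collections import Counter

# Illustrative stopword list only; nltk ships a much fuller one.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "it", "for", "on", "at"}

def most_common_words(titles, n=10):
    """Count lowercased words across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)
```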

Model Evaluation

All my models were overfit, but some were less overfit than others.
All the people here (Jürgen Klopp, Mo Salah, Trent Alexander-Arnold, Diogo Jota, Thiago, and Alisson) are still employed by Liverpool.
Ross Barkley, Carlo Ancelotti, Wayne Rooney, Marco Silva, Ronald Koeman, and Romelu Lukaku are no longer employed by Everton.

I ended up selecting my logistic regression model as my model of choice. It not only had the highest accuracy score (87.96% to the 87.49% of my random forest model), but it also had the lowest variance.
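As a sketch of what a TF-IDF plus logistic regression setup looks like in sklearn (the hyperparameters here are illustrative defaults, not the tuned values from the project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_model():
    """TF-IDF bag-of-words features feeding a logistic regression classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Usage: fit on post text with the subreddit as the label, e.g.
#   model = build_model()
#   model.fit(train_text, train_subreddit)
#   accuracy = model.score(test_text, test_subreddit)
```

Swapping the `clf` step for a `RandomForestClassifier` gives the competing model from the comparison above.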

Conclusions

Due to my hunger for data, like Icarus, I flew too close to the sun. (Jacob Peter Gowy's The Flight of Icarus)

This was also me:

While it's obvious from the table above that overfitting was an issue with all of my models, what isn't so obvious is the additional constraint I put myself under: too much data.

On the surface, 57,000 observations may not seem like a lot of data, but, given the processing power I had available, it proved rather difficult to fine-tune the parameters of my models. They simply took too long to run.

Overall takeaways:

  1. If there is something inherently and fundamentally different about rival supporters, it isn't easily found with a classification model (remember I decided that there would need to be a 90% accuracy rate to give this theory some legs).
  2. Listen to Sumit, your instructor. While he didn’t initially come out and say it, he hinted that the amount of data I was collecting might prove difficult to process at speed given my hardware. Like Icarus, I flew too close to the sun and was left sweating down to the wire.
  3. If you don’t listen to Sumit, make sure you have a lot of processing power.

Oh, how I wish I knew of the existence of Google Colab before this project. 😂