Is there a fundamental difference between fans of rival teams?
If you're reading this and you follow sports, odds are you have a favourite team.
That team most likely also has a rival. Depending on how passionately you feel about your favourite team, there's a good chance you have quite a bit of disdain for that rival.
After all, that's part of the deal, right?
For some, this might seem arbitrary. For others, not so much.
Underlying all of this is the belief that there are irreconcilable differences between each side and that those differences are fundamental.
Sports rivalries present us with the classic us-versus-them mentality.
If that's true, and there really is a fundamental difference between rival fans, then it should be easy enough for a third party, in this case an NLP model, to determine which set of fans or supporters a given Reddit submission comes from.
A custom Reddit submissions dataset, created using Pushshift’s API and totalling 57,908 posts.
bs4, matplotlib, nltk, pandas, random, re, requests, seaborn, sklearn, time, and wordcloud.
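For anyone curious about the collection step, here’s a rough sketch of how submissions can be pulled from Pushshift with requests, paging backwards through a subreddit in batches. The endpoint, field names, and subreddit names below are assumptions for illustration, not necessarily my exact collection script.

```python
# A minimal sketch of pulling submissions from Pushshift, assuming the public
# /reddit/search/submission endpoint; subreddit names and fields are illustrative.
import time
import requests
import pandas as pd

BASE_URL = "https://api.pushshift.io/reddit/search/submission"

def fetch_submissions(subreddit, n_batches=5, batch_size=100):
    """Page backwards through a subreddit's submissions in fixed-size batches."""
    rows, before = [], None
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": batch_size}
        if before is not None:
            params["before"] = before
        resp = requests.get(BASE_URL, params=params)
        data = resp.json().get("data", [])
        if not data:
            break
        rows.extend(data)
        before = data[-1]["created_utc"]  # oldest post in this batch (default sort is newest first)
        time.sleep(1)  # be polite to the API
    return pd.DataFrame(rows)

# Hypothetical subreddit names for the two clubs
df = pd.concat(
    [fetch_submissions("LiverpoolFC"), fetch_submissions("Everton")],
    ignore_index=True,
)
```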
To see if this is true, I have chosen two rival English Premier League clubs.
I didn't, however, want to make this easy.
If I had taken rivals from different parts of the country, differences could be explained away by other things, like regional dialects.
Instead, I have chosen Liverpool...
...and Everton
Two clubs from the same city whose stadiums are one mile apart.
While I can't come to any definitive conclusions, I've set a 90% accuracy rate as my target to see whether this theory has any legs.
Why 90%?
Why not?
Oh, and in the interest of full disclosure, I'm a Liverpool supporter.
Here’s a photo of me at Anfield (Liverpool’s stadium) back in 2005, touching the “This is Anfield” sign:
Outside of removing duplicate rows and imputing some values, initial data cleaning was minimal, because I wanted to get a feel for out-of-the-box model performance.
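As a rough illustration, that cleaning step amounted to something like the sketch below. The column names ('title', 'selftext', 'subreddit') are assumptions based on standard Pushshift submission fields, not necessarily my exact schema.

```python
# A rough sketch of the light cleaning described above, building on the
# DataFrame from the collection sketch; column names are assumed.
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["title", "selftext"])  # remove duplicate rows
    df["selftext"] = df["selftext"].fillna("")              # impute missing body text
    # Combine title and body into a single text field for modelling
    df["text"] = (df["title"] + " " + df["selftext"]).str.strip()
    return df[["text", "subreddit"]]

clean_df = basic_clean(df)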
After initial subpar results, I took a root-and-branch approach and engineered several variables in an attempt to improve the accuracy of my models.
Those features were:
Here’s a look at some of the most common words in each subreddit:
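For reference, word frequencies like these can be surfaced with a simple CountVectorizer pass per subreddit. The snippet below is one way to do it and isn't necessarily the exact code behind my word clouds.

```python
# One way to list the most common words per subreddit, using CountVectorizer
# with English stop words removed; parameters here are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

def top_words(texts, n=15):
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1           # total count of each term across posts
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, totals), key=lambda t: -t[1])
    return ranked[:n]

for club in clean_df["subreddit"].unique():
    club_texts = clean_df.loc[clean_df["subreddit"] == club, "text"]
    print(club, top_words(club_texts))
```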
I ended up selecting logistic regression as my model of choice. It not only had the highest accuracy score (87.96% versus 87.49% for my random forest model), but it also had the lowest variance.
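To give a sense of how that comparison can be run, the sketch below pits the two models against each other on TF-IDF features with cross-validation. The TfidfVectorizer settings, hyperparameters, and the choice of positive class are stand-ins rather than my exact setup.

```python
# A simplified comparison of logistic regression and random forest on TF-IDF
# features, reusing clean_df from the cleaning sketch; settings are illustrative.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X = clean_df["text"]
y = (clean_df["subreddit"] == "LiverpoolFC").astype(int)  # hypothetical positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, clf in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")), ("clf", clf)])
    cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)  # accuracy by default
    pipe.fit(X_train, y_train)
    print(f"{name}: CV {cv_scores.mean():.4f}, test {pipe.score(X_test, y_test):.4f}")
```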
This was also me:
While it's obvious from the table above that overfitting was an issue with all of my models, what isn't so obvious is the additional constraint I put myself under: too much data.
On the surface, 57,000 observations may not seem like a lot of data, but, with the processing power I had available, it proved rather difficult to fine-tune the parameters of my models. They simply took too long to run.
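In hindsight, one workaround would have been a randomised search over a smaller grid, run in parallel across cores. The sketch below shows the idea, with purely illustrative parameter ranges rather than the grids I actually tried.

```python
# A sketch of one way to speed up tuning: randomised search over a small grid,
# parallelised with n_jobs. Reuses X_train / y_train from the sketch above.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_dist = {
    "tfidf__max_features": [5_000, 10_000, 20_000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=10, cv=3, n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```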
Overall takeaways:
Oh, how I wish I knew of the existence of Google Colab before this project. 😂