Implications for Political Classification Models and Behavioral Studies
Kenan Alkiek · Aug 31, 2022
Online communities are active spaces for political discussions and cross-community engagement. Researchers study these spaces to understand how they, and their users, influence real-life politics, forecast future political outcomes, increase political engagement offline, and even polarize opinions. A core assumption behind such studies is that users’ political affiliations can be reliably identified. However, gauging political leanings is a complex task, particularly for centrist users who infrequently express political beliefs. It’s easy to know who ILoveTrump is voting for, but most users are not so explicit. This data gap makes it hard to conduct large, comprehensive studies on political behavior. As a result, substantial work has focused on inferring affiliation.
Inferring political affiliation, or conducting any behavioral study for that matter, requires selecting a sample of political users from Reddit. However, these samples frequently come from a single, often narrow, source, such as users who display political flairs. This practice raises important questions: Do these flair-displaying users accurately represent Reddit’s diverse political landscape? Would behavioral outcomes differ if we sampled from a broader set of users, like those who discuss politics in comments but don’t use flairs? How well do political classifiers work on different political groups?
- Political classifiers are imprecise at best. While some users, like “Hillary4Prez”, clearly signal their politics, most don’t. Yet these models often make overly confident predictions.
- The way you sample “political users” greatly impacts the results of behavioral studies. Studies should keep this in mind moving forward and sample more broadly.
- There are bad actors who pretend to be a part of both political parties and act as provocateurs. They are often the most active and controversial users. If not accounted for, they can throw a wrench into both behavioral studies and classification models.
- Political users are more toxic than others, and discussions in political subreddits are more toxic overall.
The dataset was collected from Reddit and consists of all English comments from December 2005 until December 2019. You can read the full research paper here. Undoubtedly, there are ethical concerns with predicting political affiliations. For example, a user may be mislabeled and treated poorly for their supposed beliefs. A crucial takeaway of this study is that inference models offer moderate performance at best and are likely to be unreliable in practice. We hope that the lack of a generalizable model deters the future use of inference models on Reddit. Our work aims to highlight the issue of inaccurately labeled and biased datasets in computational social science research, which is often inequitably felt in downstream harms.
- Defining Political Affiliation
- Classifying Political Affiliation
– The models
– Classification results
- Characterizing Political Behavior
– Are political users more toxic?
– What drives toxicity?
– Finding the trolls
- Conclusion and Future Work
Political affiliation is a complex description based on a person’s values and special interests. While some studies have attempted to infer continuous values along a spectrum, most have used binary labels, i.e., conservative vs. liberal. We adopted the binary conservative and liberal labels because this study focuses on U.S. politics and because we wanted to test the assumptions of prior work.
We drew political users from the three sources below because each has been used in previous behavioral studies, and each of those studies relied on only one. We wanted to know whether their findings generalize beyond the users they ran their experiments on.
- Flaired Users — Some political subreddits allow users to display a flair next to their username. For example, a user commenting in the r/Conservative subreddit may select a “Reagan Republican” or “Trump Supporter” flair, both indicating a conservative political leaning
- Self-Declarations — Users who declare their politics in comments. E.g. “I only vote Democrat”. We used a select number of regex patterns and validated the results post collection
- Community Membership — Participating in political subreddits can serve as an implicit signal of affiliation. Reddit has multiple communities associated with political ideologies. For example, if a user frequently comments in r/Conservative, they can be assigned a conservative label. We removed users who posted in multiple communities across the political spectrum and excluded quasi-political subreddits like r/The_Donald
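To make the self-declaration source more concrete, here is a minimal sketch of regex-based matching. The patterns and the `self_declared_label` helper are hypothetical illustrations, not the patterns used in the study, and any real pattern set would still need the manual validation described above.

```python
import re

# Hypothetical patterns for illustration only; the study's actual regexes are not reproduced here.
PATTERNS = {
    "liberal": re.compile(
        r"\bI(?: am|['’]m) a (?:democrat|liberal)\b|\bI only vote democrat\b",
        re.IGNORECASE,
    ),
    "conservative": re.compile(
        r"\bI(?: am|['’]m) a (?:republican|conservative)\b|\bI only vote republican\b",
        re.IGNORECASE,
    ),
}


def self_declared_label(comment):
    """Return 'liberal' or 'conservative', or None if zero or both sides match."""
    matches = [label for label, pattern in PATTERNS.items() if pattern.search(comment)]
    return matches[0] if len(matches) == 1 else None
```

Returning `None` when both sides match is a deliberately cautious choice: ambiguous comments are dropped rather than guessed at. Patterns this simple will still pick up sarcasm and quoted speech, which is why post-collection validation matters.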
In total, we identified 573,829 political users. Here’s the breakdown by source.
Community membership is the largest source of political users and is primarily composed of conservatives. Given Reddit’s reputation for liberal bias, this skew has an important implication for downstream studies of these users alone. We also noticed that nearly half the users signal their politics in a single manner, which suggests that these sources of ground truth are distinct. In other words, these different sources of information offer complementary ways of recognizing beliefs.
Given these three sources of ground truth, can we predict a user’s political affiliation?
There have been many attempts to classify political behavior online and on Reddit in particular. The problem is that they rely on a single source of information as the ground truth (e.g. flaired users) which may not represent the political parties being modeled. Our goal was to test a variety of classifiers and see how well each method works on different user groups.
The models
We simplified the problem into a binary classification task (i.e. conservative or liberal) and selected three classification models to test.
Username Classifier
Usernames can reveal aspects of identity, e.g., Hillary4Prez reveals a liberal leaning. To predict affiliation from names, we trained a bidirectional character-based LSTM.
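A character-based model reads a username as a sequence of character ids rather than words. The sketch below shows only that input-encoding step, not the LSTM itself. The alphabet, the reserved ids, and the `max_len` of 20 are assumptions made for illustration and are not taken from the study.

```python
import string

# Assumed vocabulary: ASCII letters, digits, underscore, and hyphen.
ALPHABET = string.ascii_letters + string.digits + "_-"
PAD, UNK = 0, 1  # reserved ids for padding and for characters outside the alphabet
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_username(name, max_len=20):
    """Map a username to a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in name[:max_len]]
    # Right-pad with PAD so every username has the same length.
    return ids + [PAD] * (max_len - len(ids))
```

Sequences like these would typically be fed to an embedding layer followed by the recurrent model. Keeping upper and lower case distinct is a design choice here, since capitalization such as "Hillary4Prez" may itself carry signal.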
Text Classifier
Certain topics, like gun rights or environmental issues, can give hints about a user’s political stance. To figure this out from what users say, we trained a RoBERTa model on their comments. We left out any comments where they directly state their political views.
Behavioral Classifier
User actions can give us good clues about their political leanings, especially when they take part in political or related communities, like those focused on environmental issues or gun rights. We created a model that looks at how users interact across different subreddits. It picks up on how users engage with politically oriented communities, even if they never openly state their views in comments. In short, the behavioral model looks at where users participate, while the text model looks at what they say.
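One simple way to represent where a user participates is the share of their comments that falls in each subreddit. The sketch below builds that representation from `(user, subreddit)` records. It is an illustrative featurization under our own assumptions, not the behavioral model from the study, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def subreddit_profiles(comments):
    """Turn (user, subreddit) comment records into per-user activity proportions."""
    counts = defaultdict(Counter)
    for user, subreddit in comments:
        counts[user][subreddit] += 1

    profiles = {}
    for user, subs in counts.items():
        total = sum(subs.values())
        # Normalize so heavy and light commenters are comparable.
        profiles[user] = {sub: n / total for sub, n in subs.items()}
    return profiles
```

Profiles like these could serve as input features to any standard classifier. Normalizing by total activity keeps very active accounts from dominating purely through volume.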
Classification results
Overall, the text classifier performed the best with an AUC score of 60.63 on all of the data.
In short, none of the models generalized. Scores drop off sharply when a model trained on one set of users is tested on another, which indicates that these users simply behave differently.
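The generalization check can be pictured as a grid: score every source's model on every source's test set and compare the AUCs. The sketch below is a dependency-free illustration of that idea. The `cross_source_auc` helper and its input shapes are assumptions, and `auc` returns values between 0 and 1, so multiply by 100 to compare with scores reported on a 0 to 100 scale.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_source_auc(scorers, test_sets):
    """Evaluate each source's scoring function on every source's test set.

    scorers:   {train_source: function mapping an example to a score}
    test_sets: {test_source: (examples, labels)}
    """
    return {
        (train, test): auc(labels, [score(x) for x in examples])
        for train, score in scorers.items()
        for test, (examples, labels) in test_sets.items()
    }
```

In the resulting grid, the entries where the training and test source match show in-domain performance, and the remaining entries show how much is lost when a model is moved to a different group of users. The pairwise loop is quadratic, which is fine for a sketch but too slow for large test sets.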
There are also other reasons why predicting affiliation broadly is so difficult:
- Users in the center who do not strongly align themselves with a single party
- Apolitical users who mention their politics in passing, and primarily use Reddit for other purposes like looking at pictures of cats
- The binary classification mold does not fit many users. For example, a user may be socially conservative but economically liberal
So if none of the models generalize, what makes each set of political users so different?
Do users who declare their political beliefs in different ways also behave differently?
Political users have not been around long
We checked the age of users’ accounts by looking at the time between their first and last comments. We found that non-political users generally had accounts that were almost twice as old as those of political users. Among political users, conservatives were the most short-lived, with a median account lifespan a full year shorter than liberals.
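The lifespan measure described above, the time between a user's first and last comment, can be sketched as follows. The record format and function names are assumptions for illustration, and the example data is made up.

```python
from datetime import datetime
from statistics import median


def account_lifespans(comments):
    """Days between each user's first and last comment, from (user, timestamp) records."""
    first, last = {}, {}
    for user, ts in comments:
        first[user] = min(ts, first.get(user, ts))
        last[user] = max(ts, last.get(user, ts))
    return {user: (last[user] - first[user]).days for user in first}


def median_lifespan(lifespans, users):
    """Median lifespan in days over a group of users."""
    return median(lifespans[u] for u in users)


# Made-up records, purely to show the input shape.
example = [
    ("a", datetime(2015, 1, 1)), ("a", datetime(2016, 1, 1)),
    ("b", datetime(2019, 1, 1)), ("b", datetime(2019, 1, 11)),
]
lifespans = account_lifespans(example)
print(lifespans)  # {'a': 365, 'b': 10}
```

Note that this measures observed activity rather than true account age: a user who registered early but commented only briefly will look short-lived.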
There are bubbles but also cross-affiliated communities
We know that conservatives and liberals often stick to their own separate online circles. So, we decided to test whether smaller sub-groups within these larger political categories also tend to form their own isolated bubbles.
- As you might expect, we found that some conservative and liberal users stick mostly to their own political groups on Reddit, creating echo chambers. But we also found some mixed groups where conservatives and liberals interact regularly, showing that Reddit isn’t completely divided along political lines.
- We also discovered smaller, more isolated communities where people don’t often interact with others who share their political views. How information spreads seems to vary depending on what type of political user you’re looking at.
Are political users more toxic?
People often get heated when talking about politics online, and these discussions tend to be more confrontational and aggressive than conversations on non-political topics. This might be because politics have become closely tied to personal identity, so an attack on someone’s political views feels like a personal attack.
Reddit offers both specialized communities for political discussions and general spaces where anyone can talk about anything. We wanted to find out if conversations get more toxic because of the people involved or the topic itself. We defined “toxicity” as messages that include insults, threats, or offensive language. We found three key takeaways.
- Conversations in political forums on Reddit are generally more toxic, indicating that the subject matter itself often leads to increased hostility.
- Interestingly, users tend to be less toxic in communities where political flair is visible, even if the community has a mix of political affiliations.
- We saw distinct behavior patterns among three different types of Reddit users: those active in one-party political subreddits, those in flair-based communities, and those who openly declare their political stance in comments. The first group is notably more toxic, while the latter two are less so.
What drives toxicity?
We know that discussions in political communities are much more toxic, but what drives that toxicity? To test for affiliation-based hostility, we built a mixed-effects linear regression model that estimates the toxicity of a reply to a comment.
The biggest factor influencing how toxic a comment will be is the comment it’s responding to. If the “parent” comment is toxic, the reply is much more likely to be toxic as well. The second biggest factor is whether the discussion is happening in a political subreddit. On the flip side, the most effective way to reduce toxicity was the visibility of political flair. When users have their political affiliation clearly displayed, they’re less likely to make aggressive comments.
Finding the trolls
Given the rise of trolls and other malicious actors on social media, we wondered: are there users who are in both political camps? As it turns out, there are thousands of users who claim to be both Democrats and Republicans. Whether they are Russian trolls or bored teenagers, we wanted to learn more.
The first thing we wanted to know was whether these shifts in political affiliation were genuine. While there’s no definitive way of knowing, we looked at the timing between changes in declared political parties. For example, if a user switches from a Republican to a Democrat flair within a week, we assume they’re not being sincere. We set a 90-day minimum time frame for considering a change in party affiliation to be genuine. Surprisingly, we found over 5,500 users who switched parties in less time than that.
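The 90-day rule can be expressed as a check over each user's declarations in time order. The sketch below flags any user whose declared party changes in under the threshold. The record format, the party codes, and the function name are assumptions for illustration, and the example data is made up.

```python
from datetime import datetime, timedelta


def rapid_switchers(declarations, min_gap=timedelta(days=90)):
    """Return users whose declared party changed in less than `min_gap`.

    `declarations` is an iterable of (user, timestamp, party) records.
    """
    by_user = {}
    for user, ts, party in declarations:
        by_user.setdefault(user, []).append((ts, party))

    flagged = set()
    for user, events in by_user.items():
        events.sort()  # chronological order
        # Compare each declaration with the one immediately before it.
        for (t0, p0), (t1, p1) in zip(events, events[1:]):
            if p0 != p1 and t1 - t0 < min_gap:
                flagged.add(user)
                break
    return flagged


# Made-up records, purely to show the input shape.
example = [
    ("troll", datetime(2019, 1, 1), "R"), ("troll", datetime(2019, 1, 5), "D"),
    ("sincere", datetime(2017, 1, 1), "R"), ("sincere", datetime(2018, 6, 1), "D"),
]
print(rapid_switchers(example))  # {'troll'}
```

A switch that takes longer than the threshold, like the one by the user labeled "sincere", is treated as a plausible genuine change and is not flagged.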
On average, these users comment 266 times per month, compared to 82 for all other political users, and are particularly active in political subreddits.
Overall, these users post more often, are more toxic, participate primarily in political subreddits, and have a 29% chance of their accounts being suspended or deleted.
Social media is rife with political activity, and studying these spaces requires accurately identifying who the political users are. The way we define what makes a user “political” has big implications for any subsequent analysis or models. Specifically, different types of “political users” behave differently, and a model that works for one group may not work for another.
Models, data, and code for this study can be found at https://github.com/davidjurgens/reddit-political-affiliation. We have more cool results in the full paper.