Exploration: Ao3 Tag Analysis

Check out my GitHub for the whole project, code, and video presentation!

Note: This is an exploration of preliminary data! It is not intended to be read as definitive; more research and data work need to be done to support these findings.

Abstract

In this project, I investigate what makes fanfiction unique as a genre; more specifically, I compare how different fandoms (communities surrounding specific media) write and tag their works. I used data from the website Archive of Our Own to do so. The most notable findings concern the RPF (real person fic) fandom for BTS. In my comparisons, the BTS fandom wrote and read explicit sexual works in higher volume than any other fandom I sampled, by a wide margin. Overall, the sorting of tags showed that fanfiction as a genre places an emphasis on emotion-based tagging and sorting. Tags like “Fluff,” “Angst,” and “Hurt/Comfort” topped the frequency charts for tags in both popular works and specific fandoms. In fanfiction, people don’t search for specific genres but rather for feelings. The Archive of Our Own categorization system also includes a section to select the genders of the characters in main relationships. While there were tons of works under the M/M (male/male relationship) tag, very few were under the F/F (female/female relationship) tag. The discrepancies and variation in the categories section require further humanities-based research to explain. My focus moving forward will be on RPF-based fandoms and modern parasocial relationships.

Introduction

Reading fanfiction feels different from reading traditionally published literature. While fanfiction might come with popular connotations of being romantic or sexual in nature, fanfiction is more of a mode of writing than it is one specific genre. Using the site’s user-generated tagging system and topic modelling, I want to figure out what makes fanfiction what it is. What topics are most popular, and how do people categorize them? In a future project, I intend to do a similar analysis of popular published literature, but this serves as a starting point for comparing fanfiction to standardized literary forms. Fanfiction has often been studied as a cultural and psychological object but rarely as a form of literature, and I want to investigate why it is one, or why it might not be.

Most of the code here was written referencing the notebooks in this GitHub repo from the Berkeley D-Lab, written by Evan Muzzall and edited by Brooks Jessup.

Research Questions

Primary Research Question. How is fanfiction tagged, and what does this categorization say about fanfiction as a whole?
1. Are there any specific names or relationships that show up disproportionately?
2. What are the percentages of M/M stories, F/F stories, F/M stories, and other gender combinations? What might that say about the genre?
3. How popular is fanfiction? How popular are specific fandoms? Does the number of clicks line up with the number of likes?

Dataset

Using a webscraper by UC Berkeley graduate student Sarah Sterman and Stanford student Jingyi Li, I collected the data and full text of the top 3.5k works of fanfiction (aka “fics”), sorted by likes (or, as Ao3 calls them, kudos), on the popular fanfiction website Archive of Our Own.

I scraped much smaller datasets as comparison points. Archive of Our Own’s default sorting is by “Most Recently Updated,” so to get a more generally representative dataset, popularity irrelevant, I scraped the most recent 500 fics of individual fandoms. I picked large fandoms that I have some tangential knowledge of and that are based on different forms of media: Anime, RPF (real person fic, think: celebrities), Movies, and Books. I also grabbed the data for the most recent 500 works, fandom and popularity irrelevant, as a sort of control.

These are rather small datasets compared with the number of works in these fandoms as a whole, but they serve as a proof of concept until I can access larger datasets.

Core Data Set – Sorted by Kudos, Top 3500
Comparison Data Sets
- Recent – All Fandoms, Most Recent 500
- Anime Sample – My Hero Academia, Most Recent 500
- RPF Sample – BTS (K-pop Band), Most Recent 500
- Movie Sample – Marvel Cinematic Universe, Most Recent 500
- Book Sample – Harry Potter, Most Recent 500

Frequency Distributions

Rating Distribution

Ratings on Archive of Our Own are similar to movie ratings in that they are used to show how “appropriate” a work is. “Explicit” is for explicitly sexual or perhaps highly violent content; “Mature” is for perhaps implications of sexual content, strong language, or anything else that puts a work above “Teen and Up Audiences.” “General Audiences” works contain nothing that requires a warning, and “Not Rated” could be anything.

Remember, these are all tagged by the creators, so there may be wide variance in these definitions and the content of these works.
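A rating count like this can be sketched in a few lines of Python. This is an illustration under assumptions rather than the notebook code from the project: the `works` list, its `rating` field, and the `rating_distribution` helper are all made-up stand-ins for rows of scraped metadata.

```python
from collections import Counter

# Hypothetical stand-in for scraped metadata: one dict per work.
# The field name "rating" is an assumption about the scraper's output.
works = [
    {"title": "A", "rating": "Teen And Up Audiences"},
    {"title": "B", "rating": "Explicit"},
    {"title": "C", "rating": "Teen And Up Audiences"},
    {"title": "D", "rating": "Mature"},
    {"title": "E", "rating": "Not Rated"},
]

def rating_distribution(works):
    """Return {rating: (count, share of all works)}, most common first."""
    counts = Counter(work["rating"] for work in works)
    total = sum(counts.values())
    return {rating: (n, n / total) for rating, n in counts.most_common()}

distribution = rating_distribution(works)
for rating, (n, share) in distribution.items():
    print(f"{rating:<25} {n:>3}  {share:.0%}")
```

Reporting the share alongside the raw count matters later on, because the core dataset and the comparison datasets are different sizes.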

Awesome, it worked! Here we can see that “Teen and Up” is the most popular rating in our corpus of popular works, but “Explicit” follows closely behind.

Now let’s see how that compares to our other fandoms and datasets! I’m just doing the same thing to this data that I did above.

The immediate outliers here are in the Explicit rating. It looks like, outside of the BTS fandom, explicit works are more popular with readers than they are with writers. Popular fics are also very rarely unrated, whereas it’s not all that uncommon for unrated works to be written and posted. Heads up, writers: if you want your work to be read, make sure to rate it, and know that Explicit-rated works are a reader favorite.

Tag Distribution

Tags are the user-generated subgenre markers on works in Archive of Our Own.
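Counting tags takes one extra step compared with ratings, because a single work carries many tags. The sketch below assumes the scraper stores each work's tags as one comma-separated string; that storage format, the sample rows, and the `count_tags` helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows: each work's tags are assumed to be stored as one
# comma-separated string. These sample strings are invented.
tag_strings = [
    "Fluff, Angst, Alternate Universe",
    "Angst, Hurt/Comfort",
    "fluff, Hurt/Comfort, Fluff",
]

def count_tags(tag_strings):
    """Count how many works carry each tag, ignoring case."""
    counts = Counter()
    for raw in tag_strings:
        # A set keeps a tag repeated on one work from being counted twice.
        tags = {tag.strip().lower() for tag in raw.split(",") if tag.strip()}
        counts.update(tags)
    return counts

tag_counts = count_tags(tag_strings)
print(tag_counts.most_common(3))
```

Counting works rather than raw tag occurrences keeps one heavily tagged fic from inflating a tag's frequency.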

We can see how popular emotion-based tagging is in fanfiction. People are searching for emotional genre markers more than they are for theme or setting markers.

I’m going to start processing the rest of the datasets in the exact same way.
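One wrinkle in comparing datasets: the core dataset has 3,500 works and each comparison sample has 500, so raw tag counts can't be set side by side. A rough sketch of one way around that is to compare the share of works carrying each tag, ordered by the core dataset's most frequent tags. The miniature `datasets` dict and the `tag_shares` helper are invented for illustration.

```python
from collections import Counter

# Made-up miniature samples: each dataset is a list of works and each work
# is a set of tags. The real samples hold 3,500 and 500 works.
datasets = {
    "Popular": [{"Fluff", "Angst"}, {"Fluff"}, {"Angst", "Smut"}, {"Fluff", "Smut"}],
    "BTS": [{"Smut"}, {"Smut", "Fluff"}],
}

def tag_shares(datasets, reference="Popular", top_n=3):
    """For the reference dataset's top tags, give each dataset's share of works with that tag."""
    ref_counts = Counter(tag for work in datasets[reference] for tag in work)
    top_tags = [tag for tag, _ in ref_counts.most_common(top_n)]
    shares = {}
    for tag in top_tags:
        shares[tag] = {
            # Guard against an empty dataset to avoid dividing by zero.
            name: (sum(tag in work for work in works) / len(works)) if works else 0.0
            for name, works in datasets.items()
        }
    return shares

shares = tag_shares(datasets)
for tag, by_dataset in shares.items():
    print(tag, {name: f"{share:.0%}" for name, share in by_dataset.items()})
```

Keeping the tags in the reference dataset's order means every fandom is read against the same baseline.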

This is a lot of information in one graph, I know, but there’s a lot of super interesting information here. I think the most can be learned by looking at the outliers. The one that stands out the most, just looking at it, is that spike for RPF/BTS under “Smut.” That’s the tag for basically written porn. Our representation for “real person fic” is the most explicit, as shown in the previous rating distribution as well. In modern fan scholarship, fanfiction is often seen as the response to repression and expression of desire, so why is it so stark when the characters are real people?

Category (Romantic Pairing Type) Distribution

Archive of Our Own has a very odd use of the term “category.” If the above data has proven anything, it should be that romance is central to fanfiction as a genre. Ao3’s use of “category” reflects that. It is used to reference the genders of the characters in romantic relationships. The categories are: F/M (female/male, heterosexual), M/M (male/male), F/F (female/female), Multi (poly relationships, genders unspecified), Gen (no romantic relationships), and Other.

Note the prevalence, just in these categories, of non-heterosexual relationships. Fanfiction is known for being queer reimaginings of popular media, but how much does that hold up?
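The percentages for each category can be worked out as below. A work on Ao3 can list more than one category, so this sketch counts category labels rather than works; the sample rows and the `category_percentages` helper are hypothetical.

```python
from collections import Counter

# Hypothetical rows: each work is assumed to carry a list of category labels,
# since one work can be filed under several categories at once.
work_categories = [
    ["M/M"],
    ["M/M", "F/M"],
    ["Gen"],
    ["F/F"],
    ["M/M"],
]

def category_percentages(work_categories):
    """Percentage of all category labels that each category accounts for."""
    counts = Counter(cat for cats in work_categories for cat in cats)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}

percentages = category_percentages(work_categories)
for category, pct in percentages.items():
    print(f"{category:<5} {pct:5.1f}%")
```

Dividing by the number of labels, not the number of works, makes the percentages add up to 100 even when works are filed under several categories.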

Whoa! Each fandom has a very different distribution here. Some of them make sense; BTS is only men, so of course the romantic relationships are going to be mostly M/M. What’s interesting, though, is that in popular fic, M/M is even more predominant than it is in BTS. In the “Recent” and “MCU” samples, F/M is even more popular than M/M. That suggests that even though people are writing F/M works, people are reading and liking M/M works in greater volume. I’m eager to look into each of these fandoms more to figure out why they are all so different.

Word2Vec

I had been hoping that these models would help me show the importance of emotion and relationships within the genre of fanfiction, and they only partially did that. While not central to any of my arguments or research questions, the models are still fun to play with and show, at the very least, how much “ships” (popular relationships in fandoms) affect how this data is viewed.

Models were created with the help of this tutorial by Kavita Ganesan.
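The intuition behind word embeddings is that words used in similar contexts end up with similar vectors. Word2Vec learns dense vectors with a small neural network; the sketch below is not Word2Vec and not the model used here, but a much simpler count-based stand-in that uses only the standard library. The toy sentences are invented and are not taken from the scraped corpus.

```python
import math
from collections import Counter, defaultdict

# Invented toy sentences, NOT taken from the scraped corpus.
sentences = [
    "stiles smiled at derek".split(),
    "derek smiled at stiles".split(),
    "stiles kissed derek".split(),
    "clarke trusted lexa".split(),
    "lexa trusted clarke".split(),
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(sentences)
print(cosine(vectors["stiles"], vectors["derek"]))  # same ship, shared contexts
print(cosine(vectors["stiles"], vectors["lexa"]))   # different fandoms, no shared contexts
```

Two names that keep appearing around the same words score close to 1, while names from unrelated stories score 0, which is the same effect that pulls ship names together in an embedding plot.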

This honestly doesn’t help me all that much, but look at the bottom left corner. Look at those two dots, right on top of each other: “stiles” and “derek.” Those are the two names in an incredibly popular ship out of the Teen Wolf fandom; see also the proximity between “clarke” and “lexa,” a ship from the show The 100. In this corpus, imagined romantic relationships define the word relationships as well.

Conclusions

This is part of a much longer period of study. I have been researching and writing about non-normative literature for years, and based on extant scholarship on the subject, I expected the writings on Archive of Our Own to show very specific tropes, writing styles, ideas, and concepts. For the most part, that was true. All barplots are sorted in the order of most to least frequent tag within my core dataset of popular works.

All the most popular tags are emotionally driven, with a focus on how a work may make the reader feel. “Fluff,” the most popular tag, is a term fairly unique to fanfiction, and means that the writer views the work as something light-hearted and intends it to bring no negative feelings. “Angst,” the second most popular, similarly describes a feeling rather than a theme. While many of my individual fandom samples followed the same trends as the “Popular” works, in some cases, individual fandoms or my small sample of “Recent” works had stark differences. For example, in the Marvel Cinematic Universe fandom there are more works with F/M relationships than M/M, whereas in popular works, over 66% of the relationship tags are M/M.

My answer to my initial research question, what tagging says about fanfiction, is that the genre is driven by emotions. There is no “science fiction” or “fantasy” tag at the top of the charts, but instead “Angst,” “Fluff,” and “Hurt/Comfort.” Fanfiction is not about themes; it is about making the reader feel a specific way.

Further Research

That outlier spike in the lower half of Graph 1 indicates that the BTS fandom has almost twice as many works tagged “Smut” as any other fandom. In further investigation, I want to know if that trend continues across other RPF, and why this fandom is such an outlier. I also want to give the scraper more time and get bigger datasets from individual fandoms. 500 is a relatively small sample, and I could get so much more information from these and other fandoms. I would also like to analyze topic models in more depth; my Word2Vec model didn’t get me anything useful other than ship-based character associations, but perhaps TF-IDF or Doc2Vec might.

I want to look specifically at “RPF” fandoms; my work outside of data is focused on parasocial relationships, and fandom is an incredible expression of that type of connection. The BTS fandom gave me the most interesting data, I think, and I want to use that as a jumping-off point for using psychoanalytic criticism to look at how desire is shaped in a digital age.