LessWrong, May 19, 23:52
Semen and Semantics: Understanding Porn with Language Embeddings

 

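The article analyzes Pornhub titles from 2008 to 2023 to show how porn titles have evolved. Using language embeddings, the researcher converts titles into vectors and finds a shift from emphasizing physical appearance and sex acts to a later focus on incest and sexual violence. The article also examines keyword trends, how different content categories relate to one another, and title accuracy, and concludes that porn titles are close to the limit in how they describe sexual violence.

🧐 The analysis divides the evolution of Pornhub titles into three periods: 2008-2009, 2010-2016, and 2017 to the present. Titles from 2017 onward are characterized mainly by an emphasis on incest and other forms of sexual violence.

💡 Early titles are usually short and focus on physical traits and sex acts, such as "blonde" and "anal".

😮 Later titles become longer and trend toward incest and violence, for example "stepsister" and "rape".

📈 Keyword analysis reveals how different themes change over time: use of the word "latina" declines, while use of keywords such as "incest" and "rape" rises markedly.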

Published on May 19, 2025 3:39 PM GMT

Summary

Porn content has gotten more extreme over time. Here's the average title for the first full year of Pornhub's existence, 2008:

and here's the average title for 2023:

Why did this change happen? We can understand porn's progression by converting titles to language embeddings. I downloaded Internet Archive snapshots of "pornhub.com" from 2008 - 2023 and analyzed the embeddings of the titles on the main page.

I found three distinct eras of titling: 2008-2009, 2010-2016, 2017-present. The current trend, since 2017, is characterized mainly by an emphasis on incest and other sexual violence.

Titles are generally representative of actual video content, and provide a reasonable heuristic for measuring actual content change, though some SEO effects exist.

The conclusion is a slightly ominous one: we are close to semantic bedrock with respect to sexual violence. Porn titles cannot become more sexually violent in their descriptions, because we lack the vocabulary.

Data and Methods

Download the repo and run "pip install" to install dependencies.

Pre-downloaded data is located in the "snapshots" folder. Pornhub data goes back to 2007, but analysis begins in 2008, when the format became more consistent. There is a folder for each month of the year, with snapshots at a roughly weekly cadence. For each date, there are two files, e.g. "20080606.html", the raw HTML file, and "20080606.json", which contains the parsed video titles. Each JSON file is an array of dictionaries like so:

{ "title": "Quickie on the car?", "url": "/view_video.php?viewkey=9aeff09be64077906196", "views": "39183", "duration": "7:39\n \t7 hours ago", "embedding": ... }

where the "embedding" field is the "title" value converted by OpenAI's "text-embedding-3-large". The URL format changes slightly over time.
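As a rough illustration of working with this format, the sketch below parses one record shaped like the example above. The function name `parse_snapshot` is hypothetical, and the three-number embedding is a shortened stand-in for a real `text-embedding-3-large` vector.

```python
import json

# One record copied from the example above. The "embedding" value is a
# made-up, shortened stand-in; real vectors are far longer.
raw = '''[{"title": "Quickie on the car?",
  "url": "/view_video.php?viewkey=9aeff09be64077906196",
  "views": "39183",
  "duration": "7:39\\n \\t7 hours ago",
  "embedding": [0.1, -0.2, 0.3]}]'''


def parse_snapshot(text):
    """Return (title, views, duration, embedding) tuples from snapshot JSON text."""
    rows = []
    for record in json.loads(text):
        # "views" is stored as a string in the snapshot files.
        views = int(record["views"])
        # "duration" carries scrape residue after the running time; keep the first token.
        duration = record["duration"].split()[0]
        rows.append((record["title"], views, duration, record["embedding"]))
    return rows


rows = parse_snapshot(raw)
print(rows[0][:3])
```

To read a file from the "snapshots" folder instead, pass the file's contents to the same function.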

From 4416 available snapshots, we end up with 772 weekly snapshots. Typically, we'll segregate these by year in order to form legible boundaries.

To download more data, run "fetch_snapshots.py" in the "data_retrieval" directory. You can change the website by editing the Python file.

To work with embeddings, you will need an OpenAI API key. Set it with export OPENAI_API_KEY={...}.

Title Accuracy

Do the titles reflect the actual contents of the videos? If the answer is no, analyzing video titles may not tell us much.

In order to construct an estimator of title accuracy, I provided tools for human reviewers to use in analysis. See "run_title_accuracy.py" and "analysis_results/title_accuracy_logs/readme" for more details. The gist of it is that you can generate a sample of videos by category, or year, or overall, then navigate to the link, and rate the accuracy on a scale of 1-5. You can then analyze your review and get results like so:

File: ../analysis_results/title_accuracy_logs/title_accuracy_2014.json
Total samples: 10
Video Not Available (Null): 7/10 Samples
Average Score (for available videos): 5.00

Many older titles have dead URLs, but of those that remain (typically more recent videos), I find that "pure SEO effects" are not very common, and that the title is a reasonable descriptor of the video contents.
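The summary numbers in the sample output can be computed along the following lines. This is only a sketch: the structure of the real log files is not shown here, so the plain list of ratings (with `None` for unavailable videos) and the name `summarize_ratings` are assumptions.

```python
def summarize_ratings(ratings):
    """Summarize 1-5 accuracy ratings; None marks a video that is no longer available."""
    available = [r for r in ratings if r is not None]
    unavailable = len(ratings) - len(available)
    # The average is only defined over videos that could still be reviewed.
    average = sum(available) / len(available) if available else None
    return {"total": len(ratings), "unavailable": unavailable, "average": average}


# Ten samples, seven dead links, three reviewed videos, as in the 2014 output above.
summary = summarize_ratings([5, None, None, 5, None, None, None, 5, None, None])
print(summary)
```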

Calculating Yearly Centroids

We calculate the representative porn for a year like so:

This gives us the "centroid" which is our representative embedding for the year. We calculate the daily average first to moderate the impact of changes within the year.
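A minimal sketch of that two-stage average, using only the standard library. The function names and the toy two-dimensional vectors are made up for illustration; the input layout (a mapping from snapshot date to that day's title embeddings) is an assumption.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def yearly_centroids(snapshots):
    """snapshots maps 'YYYYMMDD' to a list of title embeddings for that date.

    Each date is averaged first, then the daily means are averaged per year.
    """
    daily_by_year = defaultdict(list)
    for date, embeddings in snapshots.items():
        daily_by_year[date[:4]].append(mean_vector(embeddings))
    return {year: mean_vector(days) for year, days in sorted(daily_by_year.items())}


# Toy embeddings, not real data.
snapshots = {
    "20080606": [[1.0, 0.0], [3.0, 0.0]],
    "20080613": [[0.0, 4.0]],
    "20090102": [[2.0, 2.0]],
}
centroids = yearly_centroids(snapshots)
print(centroids)
```

With real embeddings, NumPy would be the more usual tool for the averaging; plain lists keep the sketch dependency-free.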

Centroid Similarity

We'll start by looking at how different each centroid is to every other centroid, as seen below:

We see 3 periods emerging: 2008-2009, 2010-2016, and 2017-2023.

Run "run_centroids" to reproduce.
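For readers who want to see the shape of the computation, here is a sketch of a pairwise similarity matrix over yearly centroids. Cosine similarity is assumed as the measure (it is the one used for keywords later on); the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(centroids):
    """Return the sorted years and a square matrix of centroid-to-centroid similarities."""
    years = sorted(centroids)
    matrix = [
        [cosine_similarity(centroids[row], centroids[col]) for col in years]
        for row in years
    ]
    return years, matrix


# Toy centroids, not real data.
centroids = {"2008": [1.0, 0.0], "2009": [0.8, 0.6], "2017": [0.0, 1.0]}
years, matrix = similarity_matrix(centroids)
for year, row in zip(years, matrix):
    print(year, [round(value, 2) for value in row])
```

Each row of the matrix corresponds to one row of a heatmap like the one described above.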

Centroid Clusters

We can do the same thing with t-SNE:

The trends are similar to what we see in the heatmap: 2008 and 2009 are close to, but not quite part of, the 2010-2016 cluster, and we see 2016 starting to edge away from its cluster mates. There have been at least two distinct epochs of video titling conventions in Pornhub's history.

Centroid Titles

To find the representative video title for a year, we can take the centroid and find its nearest neighbor among that year's titles: out of the titles for, say, 2010, which is the closest to "average"? They are as follows:

This sheds some light on our previous findings:

Run "run_nearest_neighbors" to reproduce; increase the value for K (the number of neighbors) to see more titles.
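The nearest-neighbor lookup can be sketched as follows. Cosine similarity is assumed as the ranking measure, and the name `nearest_titles`, the placeholder titles, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_titles(centroid, titled_embeddings, k=1):
    """Return the k titles whose embeddings are most similar to the centroid."""
    ranked = sorted(
        titled_embeddings,
        key=lambda pair: cosine_similarity(centroid, pair[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Placeholder titles and toy embeddings for a single year.
titles = [("title A", [1.0, 0.0]), ("title B", [0.6, 0.8]), ("title C", [0.0, 1.0])]
centroid = [0.5, 0.9]
print(nearest_titles(centroid, titles, k=2))
```

Raising `k` returns more of the titles closest to the yearly average.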

These results are informative but not conclusive. Let's observe trends.

Keyword Trends

We can observe keyword trends like so:

    1. We create a reference embedding, like "latina".
    2. We get the cosine similarity of the reference against every title in our dataset.
    3. We convert the raw similarity into a normalized z-score.
    4. We take the top 10% most similar scores from the whole set.
    5. We count how many of the top 10% scores are in each year.
    6. We adjust for the number of titles in each year: if 2010 only has 100 titles and 2020 has 200, as a baseline we'd expect 2010 to have 10 relevant examples and 2020 to have 20.
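The steps above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the repository's code: `keyword_trend` is a hypothetical name, the input is assumed to be a list of (year, embedding) pairs, and the vectors are toy data.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_trend(reference, titles, top_fraction=0.10):
    """titles is a list of (year, embedding). Returns {year: (matches, total, rate, normalized)}."""
    sims = [cosine_similarity(reference, emb) for _, emb in titles]
    # Standardizing does not change the ranking; it is kept to mirror the listed steps.
    mu, sigma = mean(sims), pstdev(sims)
    z_scores = [(s - mu) / sigma for s in sims]
    # Indices of the most similar titles across the whole set.
    n_top = max(1, round(len(titles) * top_fraction))
    top = sorted(range(len(titles)), key=lambda i: z_scores[i], reverse=True)[:n_top]
    matches = Counter(titles[i][0] for i in top)
    totals = Counter(year for year, _ in titles)
    # Baseline share a year would get if matches were spread evenly.
    overall = n_top / len(titles)
    result = {}
    for year in sorted(totals):
        rate = matches[year] / totals[year]
        result[year] = (matches[year], totals[year], rate, rate / overall)
    return result


# Toy data: 10 titles per year; only two 2010 titles point toward the reference.
reference = [1.0, 0.0]
titles = [(2010, [1.0, 0.1]), (2010, [1.0, 0.2])]
titles += [(2010, [0.0, 1.0])] * 8 + [(2020, [0.1, 1.0])] * 10
trend = keyword_trend(reference, titles)
print(trend)
```

A normalized value above 1 means the year holds more than its proportional share of the top matches; below 1 means less.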

If we do this for e.g. "latina" we get:

Year  Matches  Total  Rate   Normalized
2008  18       114    0.158  1.58x
2009  18       126    0.143  1.43x
2010  12       126    0.095  0.95x
2011  33       258    0.128  1.28x
2012  36       312    0.115  1.16x
2013  40       306    0.131  1.31x
2014  29       306    0.095  0.95x
2015  43       306    0.141  1.41x
2016  15       282    0.053  0.53x
2017  41       294    0.139  1.40x
2018  27       264    0.102  1.02x
2019  14       264    0.053  0.53x
2020  18       288    0.062  0.63x
2021  27       306    0.088  0.88x
2022  18       312    0.058  0.58x
2023  26       294    0.088  0.89x

which looks like this:

"latina" as a descriptor has lost market share over time.

As a mild control, let's look at the word "orthogonal", which should probably be unrelated.

The 2016 jump might indicate the general increase in complexity of titles around that time. This mirrors what we see with the clusters, where 2016 was a transitional year.

Finally, let's take a look at the sexual violence trends, with incest and rape:

Both show an obvious jump and a sustained increase. Incest is outperforming rape, as we can observe from the "step-" titles and their variants.

For better visibility and a smoother trend, we can also observe the animated moving average.
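A moving average of this kind can be sketched as below. The trailing three-year window is an assumption (the window used for the animation is not stated), and the input is the Normalized column from the "latina" table rather than the incest or rape series.

```python
def moving_average(values, window=3):
    """Trailing moving average; early entries average over the values available so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Normalized "latina" values for 2008-2023, copied from the table above.
normalized = [1.58, 1.43, 0.95, 1.28, 1.16, 1.31, 0.95, 1.41,
              0.53, 1.40, 1.02, 0.53, 0.63, 0.88, 0.58, 0.89]
smoothed = moving_average(normalized)
print([round(value, 2) for value in smoothed])
```

A wider window gives a smoother line at the cost of reacting more slowly to jumps such as the one around 2016-2017.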

Run "run_trend" with an array of words of your choice to run your own analysis.

t-SNE Clusters

We'll return to t-SNE to take a closer look at some new clusters. As with the keywords, we create reference embeddings. This time, I made category groups of three, intended to cluster together, in order to see how categories relate to our early and late time periods. We can read the distance between clusters as a rough measure of similarity.

Hair Color

"brunette", "blonde", "redhead"

Observing that hair color comes up frequently in early period titles, we include some here, but we see that they are not particularly close to either cluster of centroids.

Pornstar Names

"Maximus Thrust", "Ivana Delight", "Johnny Deep" (fictional names courtesy of ChatGPT)

Porn star names are more similar to the early years, but we observe proximity to the late period as well.

Violence

"murder", "suicide", "death"

Violence forms its own cluster. Possibly, titles are trending towards violence over time.

Women

"woman dancing", "woman cooking", "woman eating breakfast"

"Women doing activity" is a common format for titles and we observe some proximity here.

Men

"men digging ditches", "men lighting laterns", "men hiking the hills"

The men cluster is much farther away; we may infer that the subject performing the action is less relevant than the subject receiving it.

Racial

"african american", "latino", "asian"

Racial categories are a bit closer than men, since they are commonly included.

Manufacturing

"airplane factory", "blue collar", "manufacturing"

"Manufacturing" is meant as a pure control, unrelated to sex in general. But it's actually somewhat closer than men or racial groups.

Benign

"people in love", "healthy relationships", "moral behavior"

The benign terms are meant to offer a contrast to the sexual violence. They actually are relatively close, and along the same chronological trend as violence.

Sexual Violence

"woman being raped", "incest", "torture porn"

 

We observe a direct hit. Our sexually violent terms almost completely overlap our late-period titles: the two have become synonymous.

Here they are all at once:

Run "run_tsne" to visualize your own reference groups. By default, the script will first generate the mappings, and then show:

    1. The mapped years
    2. The mapped years with each concept cluster individually
    3. Every cluster and the mapped years

For a simpler animated analysis, show or hide different clusters to observe how the "average" moves over time:

Conclusions

The trends reflect the increasingly intense tastes of the highest spending, most engaged consumers.

Broadly, this is because of professionalization: a shift from amateur, YouTube-style porn to professional studios with an interest in the bottom line. Interestingly, this mimics the evolution of YouTube itself. A broad, internet-wide shift towards monetization might be benign elsewhere, but in the porn domain it becomes a race to the bottom of sexual violence.

For a longer editorial, see here.


