A Language for Every Reddit

T-SNE, LIWC, and Subreddits

The work below was part of the final project for my MS in CS. It visualizes the relationships among subreddits, and among groups of subreddits, based on their word use. The visualizations show, or at least strongly suggest, that each of these online social groups uses a different vocabulary. This project touches on many of my favorite topics, including dimensionality reduction, data visualization, linguistic style, big data, and NLP for mental health.

Research related to this project spanned most of 2018 and some of 2019. Initial versions of this document were completed in March of 2019. Related PySpark code, presentations, etc. can be found on GitHub.

T-SNE and Reddit

Online forums provide a wealth of language data and, because the forums are organized by topic, even come with some labeling. One can assume, with some level of confidence, that the language of a forum titled 'teenagers' has a strong relationship with the language of teenagers themselves, and the occasional misleading title is manageable. Researchers have already used loosely labeled data of this kind to study language patterns in the mental health community. The work below was inspired by Al-Mosaiwi and Johnstone's 'In an Absolute State', which showed increased use of absolutist words ('always', 'never', 'completely', etc.) in forums concerning depression, anxiety, and suicidal ideation. This analysis uses Al-Mosaiwi and Johnstone's list of absolutist words alongside several other sentiment dictionaries (see also the Linguistic Inquiry and Word Count (LIWC) software) representing categories such as function words, negation, positive emotion, negative emotion, body, work, leisure, and money.
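
To make "dictionary frequencies" concrete, here is a minimal sketch of how a per-category frequency might be computed for one piece of text. It is not the project's PySpark code. The three tiny dictionaries are hypothetical stand-ins (only 'always', 'never', and 'completely' come from the absolutist examples above), and the tokenizer, the exact-match lookup, and the name category_frequencies are simplifying assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical miniature dictionaries, for illustration only. The real
# analysis uses the full absolutist word list and LIWC-style categories.
DICTIONARIES = {
    "absolutist": {"always", "never", "completely"},
    "negation": {"no", "not", "never"},
    "money": {"cash", "rent", "pay"},
}

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def category_frequencies(text, dictionaries=DICTIONARIES):
    """Return, per category, the share of tokens found in that dictionary."""
    tokens = TOKEN_RE.findall(text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return {
        name: (sum(counts[word] for word in words) / total if total else 0.0)
        for name, words in dictionaries.items()
    }


sample = "I never have cash and I always pay rent late"
freqs = category_frequencies(sample)
print(freqs)
```

Applied to every comment in a subreddit and averaged, a function like this would yield one short numeric vector per subreddit (or per sample of comments), which is the kind of input the visualizations below start from. A word may belong to more than one dictionary, as 'never' does here.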

The remarkable result is that the language use of each subreddit, even when represented by these dictionary frequencies alone, is completely distinguishable from that of other subreddits. Subreddits also tend to form clusters within the categories they were selected to represent (video gaming, mental health, computers, relationships, recreational drugs, sports, and media). The easiest way to see this result is via t-SNE dimensionality reduction, which takes as input the dictionary frequencies (or a normalized, PCA-reduced representation thereof) and outputs two-dimensional points, which are graphed below. For more information on the t-SNE algorithm, see the original paper on it or these slides introducing the concept (additional presentation notes to come).
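
The normalize, PCA, then t-SNE pipeline described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook code: the data is synthetic (three made-up groups standing in for subreddits), and the function name embed_2d, the number of PCA components, and the perplexity are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def embed_2d(freqs, n_pca=5, perplexity=5.0, seed=0):
    """Standardize a (samples x categories) matrix, PCA-reduce it, then run t-SNE."""
    scaled = StandardScaler().fit_transform(freqs)
    # PCA cannot keep more components than there are samples or features.
    n_pca = min(n_pca, scaled.shape[0], scaled.shape[1])
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(scaled)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)


# Synthetic stand-in data (an assumption, not Reddit data): 3 made-up
# groups, 20 samples each, 9 dictionary categories per sample.
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 0.1, size=(3, 9))
freqs = np.vstack([c + rng.normal(0.0, 0.005, size=(20, 9)) for c in centers])
labels = np.repeat(np.arange(3), 20)

points = embed_2d(freqs)
print(points.shape)
```

Each row of points is one sample's position in the plane. Coloring the points by labels in a scatter plot gives the kind of picture shown in the graphs below. Note that perplexity must be smaller than the number of samples, and that t-SNE layouts vary with the random seed, so distances between far-apart clusters should be read with caution.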

The notebook code and graphs below present the data and resulting visualizations. I've removed the least interesting code blocks; for the curious, that code, as well as a few other examples from other projects, is available in this folder on my GitHub. If you're interested in these results, my work on forum language, or how else I think t-SNE visualizations could help make NLP and DL more understandable, please email me!
