Mapping the Underlying Social Structure of Reddit

R-bloggers 2019-09-28

Summary:

Reddit is a popular website for opinion sharing and news aggregation. The site consists of thousands of user-made forums, called subreddits, which cover a broad range of subjects, including politics, sports, technology, personal hobbies, and self-improvement. Given that most Reddit users contribute to multiple subreddits, one might think of Reddit as being organized into many overlapping communities. Moreover, one might understand the connections among these communities as making up a kind of social structure. Uncovering a population's social structure is useful because it tells us something about that population's identity. In the case of Reddit, this identity could be uncovered by figuring out which subreddits are most central to Reddit's network of subreddits. We could also study this network at multiple points in time to learn how this identity has evolved, and maybe even predict what it will look like in the future.

My goal in this post is to map the social structure of Reddit by measuring the proximity of Reddit communities (subreddits) to each other. I'm operationalizing community proximity as the number of posts to different communities that come from the same user. For example, if a user posts something to subreddit A and posts something else to subreddit B, subreddits A and B are linked by this user. Subreddits connected in this way by many users are closer together than subreddits connected by fewer users. The idea that group networks can be uncovered by studying shared associations among the people who make up those groups goes way back in the field of sociology (Breiger 1974). Hopefully this post will demonstrate the utility of this concept for making sense of data from social media platforms like Reddit.

Data

The data for this post come from an online repository of subreddit submissions and comments that is generously hosted by data scientist Jason Baumgartner. If you plan to download a lot of data from this repository, I implore you to donate a bit of money to keep Baumgartner's database up and running (pushshift.io/donations/). Here's the link to the Reddit submissions data - files.pushshift.io/reddit/submissions/. Each of these files contains all Reddit submissions for a given month between June 2005 and May 2019, stored as one JSON object per line in various compression formats; file sizes range from 0.017 MB to 5.77 GB. Let's download something in the middle of this range - a 710 MB file with all Reddit submissions from May 2013. The file is called RS_2013-05.bz2. You can double-click this file to unzip it, or you can use the following command in the Terminal: bzip2 -d RS_2013-05.bz2. The file will take a couple of minutes to unzip. Make sure you have enough room to store the unzipped file on your computer - it's 4.51 GB. Once I have unzipped this file, I load the relevant packages, read the first line of data from the unzipped file, and look at the variable names.
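The relevant packages aren't listed at this point, so the following setup is an assumption based on the functions used in the next chunk (read_lines() and the %>% pipe from the tidyverse, fromJSON() from jsonlite), not necessarily the post's exact choices:

library(tidyverse)  # assumed: supplies read_lines() and the %>% pipe
library(jsonlite)   # assumed: supplies fromJSON()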
read_lines("RS_2013-05", n_max = 1) %__% fromJSON() %__% names ## [1] "edited" "title" ## [3] "thumbnail" "retrieved_on" ## [5] "mod_reports" "selftext_html" ## [7] "link_flair_css_class" "downs" ## [9] "over_18" "secure_media" ## [11] "url" "author_flair_css_class" ## [13] "media" "subreddit" ## [15] "author" "user_reports" ## [17] "domain" "created_utc" ## [19] "stickied" "secure_media_embed" ## [21] "media_embed" "ups" ## [23] "distinguished" "selftext" ## [25] "num_comments" "banned_by" ## [27] "score" "report_reasons" ## [29] "id" "gilded" ## [31] "is_self" "subreddit_id" ## [33] "link_flair_text" "permalink" ## [35] "author_flair_text" For this project, I’m only interested in three of these variables: the user name associated with each submission (author), the subreddit to which a submission has been posted (subreddit), and the time of submission (created_utc). If we could figure out a way to extract these three pieces of information from each line of JSON we could greatly reduce the size of our data, which would allow us to store multiple months worth of information on our local machine. Jq is a command-line JSON processor that makes this possible. To install jq on a Mac, you need to make sure you have Homebrew (brew.sh/), a package manager that works in the Terminal. Once you have Homebrew, in the Terminal type brew install jq. I’m going to use jq to extract the variables I want from RS_2015-03 and save the result as a .csv file. To select variables with jq, list the JSON field names that you want like this: [.author, .created_utc, .subreddit]. I return these as raw output (-r) and render this as a csv file (@csv). Here’s the command that does all this: jq -r '[.author, .created_utc, .subreddit] | @csv' RS_2013-05 __ parsed_json_to_csv_2013_05

Link:

http://feedproxy.google.com/~r/RBloggers/~3/S-5J3Vr18T8/

From feeds:

Statistics and Visualization » R-bloggers

Tags:

bloggers

Authors:

Posts on Data Science Diarist

Date tagged:

09/28/2019, 08:51

Date published:

09/27/2019, 14:27