Rick and Morty and Tidy Data Principles

R-bloggers 2017-10-14

Summary:

Motivation

After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty, I thought about doing something similar, but reproducible and focused on Rick and Morty. In this post I'll focus on the Tidy Data principles. The GitHub repo with the scripts to scrape the transcripts and subtitles of Rick and Morty is available at https://github.com/pachamaltese/rick_and_morty_tidy_text. Here I'm using the subtitles of the TV show, as some of the transcripts I could scrape were incomplete.

Note: if some images appear too small on your screen, you can open them in a new tab to show them in their original size.

Let's scrape

The subtools package returns a data frame after reading srt files. In addition to that resulting data frame, I wanted to explicitly record the season and chapter of each line of the subtitles. To do that I had to scrape the subtitles and then use str_replace_all. To follow the steps, clone the repo from GitHub:

```
git clone https://github.com/pachamaltese/rick_and_morty_tidy_text
```

Rick and Morty Can Be So Tidy

After reading the tidy file I created after scraping the subtitles, I use unnest_tokens to divide the subtitles into words. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.

```r
if (!require("pacman")) install.packages("pacman")
p_load(data.table, tidyr, stringr, tidytext, dplyr, janitor,
       ggplot2, viridis, ggstance, igraph)
p_load_gh("thomasp85/ggraph", "dgrtwo/widyr")

rick_and_morty_subs <- as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/rick_and_morty_subs.csv"))

rick_and_morty_subs_tidy <- rick_and_morty_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
```

The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr.
For example, in the last chunk I used an anti_join to remove words such as "a", "an" or "the". Then we can use count to find the most common words in all of the Rick and Morty episodes as a whole:

```r
rick_and_morty_subs_tidy %>%
  count(word, sort = TRUE)
```

```
# A tibble: 8,100 x 2
     word     n
 1  morty  1842
 2   rick  1625
 3  jerry   621
 4   yeah   484
 5  gonna   421
 6    hey   391
 7 summer   389
 8     uh   331
 9   time   319
10   beth   295
# ... with 8,090 more rows
```

Sentiment analysis can be done as an inner join. Three sentiment lexicons are in the tidytext package in the sentiments dataset. Let's examine how sentiment changes during each episode. Let's find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each episode.

```r
bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)

bing
```

```
# A tibble: 6,788 x 3
          word sentiment lexicon
 1     2-faced  negative    bing
 2     2-faces  negative    bing
 3          a+  positive    bing
 4    abnormal  negative    bing
 5     abolish  negative    bing
 6  abominable  negative    bing
 7  abominably  negative    bing
 8   abominate  negative    bing
 9 abomination  negative    bing
10       abort  negative    bing
# ... with 6,778 more rows
```

```r
rick_and_morty_sentiment <- rick_and_morty_subs_tidy %>%
  inner_join(bing) %>%
  count(episode_name, index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(rick_and_morty_subs_tidy[, c("episode_name", "season", "episode")] %>% distinct()) %>%
  arrange(season, episode) %>%
  mutate(episode_name = paste(season, episode, "-", episode_name),
         season = factor(season, labels = c("Season 1", "Season 2", "Season 3"))) %>%
  select(episode_name, season, everything(), -episode)

rick_and_morty_sentiment
```

```
# A tibble: 431 x 6
      episode_name   season index negative positive sentiment
 1 S01 E01 - Pilot Season 1     0        6        3        -3
 2 S01 E01 - Pilot Season 1     1       10        0       -10
 3 S01 E01 - Pilot Season 1     2        3        1        -2
 4 S01 E01 - Pilot Season 1     3       10        4        -6
 5 S01 E01 - Pilot Season 1     4        2        5         3
 6 S01 E01 - Pilot Season 1     5        8        4        -4
 7 S01 E01 - Pilot Season 1     6        6        1        -5
 8 S01 E01 - Pilot Season 1     7        7        4        -3
 9 S01 E01 - Pilot Season 1     8       14        5        -9
10 S01 E01 - Pilot Season 1     9        3        2        -1
# ... with 421 more rows
```

Now we can plot these sentiment scores across the plot trajectory of each episode. In the second plot I'm only showing Dan Harmon's favourite episodes; at the moment the show has 31 episodes in total.

```r
ggplot(rick_and_morty_sentiment, aes(index, sentiment, fill = season)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~season, nrow = 3, scales = "free_x", dir = "v") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment in Rick and Morty", y = "Sentiment") +
  scale_fill_viridis(end = 0.75, discrete = TRUE) +
  scale_x_discrete
```
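As a sanity check on the indexing above: `linenumber %/% 50` buckets every 50 subtitle lines into one block, and the sentiment score is simply the positive count minus the negative count per block. A minimal base-R sketch with made-up line numbers (not the show's real data; the positive/negative counts are taken from the first three Pilot rows of the table above):

```r
# Integer division assigns each subtitle line to a 50-line block:
# lines 0-49 -> block 0, lines 50-99 -> block 1, and so on.
linenumber <- c(1, 49, 50, 99, 100, 249)
index <- linenumber %/% 50
index
#> [1] 0 0 1 1 2 4

# Net sentiment per block: positive minus negative word counts.
positive <- c(3, 0, 1)
negative <- c(6, 10, 3)
sentiment <- positive - negative
sentiment
#> [1] -3 -10 -2
```

This is the same arithmetic the count/spread/mutate pipeline performs, just without the grouping by episode.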

Link:

http://feedproxy.google.com/~r/RBloggers/~3/pch-GlhmCrY/

From feeds:

Statistics and Visualization » R-bloggers

Tags:

Authors:

Mauricio Vargas S. 帕夏

Date tagged:

10/14/2017, 01:57

Date published:

10/12/2017, 12:00