Text Mining with R

Namitha Deshpande
Published in Analytics Vidhya · Jul 5, 2020


Feelings are complicated, but sentiment analysis need not be. Words have always been important when it comes to communicating concepts and emotions. Given the short attention span with which we now consume words on social media platforms, the choice of what words to use has become even more pressing.

Recently, I happened to come across “Text Mining with R” by Julia Silge and David Robinson while exploring different datasets and packages available in R, and I was immediately drawn to it. I had no prior knowledge of text mining or sentiment analysis, so I decided to read the book.

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

In this article, I have decided to apply the concepts to the book I am currently reading which is “Wuthering Heights” by Emily Brontë. This and 60,000 other books can be downloaded from Project Gutenberg using the gutenbergr package.

I will start the sentiment analysis by installing and loading the required packages. The most common way is to install from the CRAN repository with install.packages("package"), and then load the installed package with library(package).
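For example, the packages used in this article can be installed in one call (a one-time step; skip it if they are already on your machine):

```r
# Install the packages used in this article from CRAN (one-time step)
install.packages(c("dplyr", "tidytext", "gutenbergr", "wordcloud", "ggplot2"))
```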

The following packages are required for the analysis in this article:

library(dplyr)
library(tidytext)
library(gutenbergr)
library(wordcloud)
library(ggplot2)

Now it is time to download the book for analysis using its Project Gutenberg book ID number (768 for Wuthering Heights).

WHeights_Bronte <- gutenberg_download(768)
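gutenberg_download() returns a tibble with a gutenberg_id column and a text column holding one printed line of the book per row, so a quick inspection confirms the download worked (the exact lines shown depend on the Gutenberg edition):

```r
# Inspect the downloaded text: one row per printed line of the book
head(WHeights_Bronte)
glimpse(WHeights_Bronte)
```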

The next step of the analysis is tidying the data for text mining and sentiment analysis. For this step, I will use the unnest_tokens() function to split each line (row) into individual words (tokens), and anti_join() with the stop_words data set to eliminate stop words (e.g., “the”, “of”, “to”), which are not informative.

Tidy_Bronte <- WHeights_Bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

Word Frequencies

The first step was to quantify how often words were used across the 34 chapters of the novel, to get an initial idea of its content. So, I counted the number of occurrences of each word and kept only the most common ones (i.e., those occurring more than 100 times).
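A minimal sketch of that count and a corresponding bar chart (the 100-occurrence cutoff comes from the text; the plot styling is my own assumption):

```r
# Count word occurrences and keep words appearing more than 100 times
word_counts <- Tidy_Bronte %>%
  count(word, sort = TRUE) %>%
  filter(n > 100)

# Bar chart of the most frequent words, ordered by count
word_counts %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")
```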

Unsurprisingly, the name of the protagonist is the most frequent word in the novel, followed by Catherine and Earnshaw. It is quite interesting to see the word Linton appear between Heathcliff and Catherine, fittingly so, since the love story of Heathcliff and Catherine ends when Linton marries Catherine.

Sentiment Analysis

The genre of this book is described as Dark, Tragedy, and Gothic Fiction, and I will verify this with the help of sentiment analysis. In the previous visualization, we saw the most frequent words in the book; next, you will see how these words contribute to different sentiments. I will use one of the several sentiment lexicons provided by the tidytext package, namely “NRC”.

nrc_emotions <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("joy", "anger", "fear", "sadness"))

TB_emotions <- Tidy_Bronte %>%
  inner_join(nrc_emotions) %>%
  count(word, sentiment) %>%
  arrange(sentiment)

I extracted sentiment classifications from the NRC Word-Emotion Association Lexicon and selected four emotions: anger, fear, joy, and sadness. Indeed, tragedy seems to be the main theme in the lives of the characters of Wuthering Heights.
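One way to visualize the result is a faceted bar chart of the top words per emotion; this is a sketch, and the top-10 cutoff and faceted layout are my own choices rather than the article's exact plot:

```r
# Top 10 words for each of the four emotions, shown in a faceted bar chart
TB_emotions %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = NULL, y = "Contribution to emotion")
```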

Word Cloud

Finally, I created a word cloud of these words, visualizing positive vs. negative sentiment of frequently occurring words using the “bing” sentiment lexicon.
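A sketch of that word cloud, following the comparison-cloud approach shown in Text Mining with R; the reshape2 package (used here for acast()), the colors, and the 100-word limit are assumptions on my part:

```r
library(reshape2)  # for acast(), which pivots the counts into a word-by-sentiment matrix

# Tag each word as positive or negative with the "bing" lexicon,
# then draw a comparison cloud contrasting the two sentiment groups
Tidy_Bronte %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```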

Conclusion

This was so much fun! You have now seen a basic implementation of sentiment analysis using the tidytext package. If you want to get more insights, look up the book Text Mining with R. Thank you for reading!

The code for the entire article can be found here.
