Lexicon-Based Sentiment Analysis Using R | by Okan Bulut | Feb, 2024

For the sake of simplicity, we will focus on the first wave of the pandemic (March 2020 to June 2020). The transcripts of all media briefings were publicly available on the Government of Alberta's COVID-19 pandemic website (https://www.alberta.ca/covid). This dataset comes with an open data license that allows the public to access and use the information, including for commercial purposes. After importing these transcripts into R, I converted all the text to lowercase and then applied word tokenization using the tidytext and tokenizers packages. Word tokenization split the sentences in the media briefings into individual words for each entry (i.e., each day of media briefings). Next, I applied lemmatization to resolve each token into its canonical form using the textstem package. Finally, I removed common stopwords, such as "my," "for," "that," and "with," using the stopwords package. The final dataset is available here. Now, let's import the data into R and review its content.
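For reference, here is a minimal sketch of that preprocessing pipeline, assuming the raw transcripts sit in a data frame called briefings with columns month, date, and text (these names are hypothetical; the import of the raw transcripts is not shown in this post):

library("dplyr")
library("tidytext")    # word tokenization (built on the tokenizers package)
library("textstem")    # lemmatization
library("stopwords")   # stopword lists

wave1_alberta <- briefings %>%
  # lowercase the text (unnest_tokens() also lowercases by default)
  mutate(text = tolower(text)) %>%
  # split each briefing into individual word tokens
  unnest_tokens(word, text) %>%
  # resolve each token to its canonical (dictionary) form
  mutate(word = lemmatize_words(word)) %>%
  # remove common stopwords
  filter(!word %in% stopwords::stopwords("en"))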

load("wave1_alberta.RData")

head(wave1_alberta, 10)

A preview of the dataset (Image by author)

The dataset has three columns:

  • month (the month of the media briefing)
  • date (the exact date of the media briefing), and
  • word (the words or tokens used in the media briefing)

Descriptive Analysis

Now, we can calculate some descriptive statistics to better understand the content of our dataset. We will begin by finding the top 5 words (based on their frequency) for each month.

library("dplyr")

wave1_alberta %>%
group_by(month) %>%
depend(phrase, type = TRUE) %>%
slice_head(n = 5) %>%
as.information.body()

Top 5 words by month (Image by author)

The output shows that words such as health, continue, and test were commonly used in the media briefings during this 4-month period. We can also expand our list to the 10 most common words and examine the results visually:

library("tidytext")
library("ggplot2")

wave1_alberta %>%
# Group by month
group_by(month) %>%
depend(phrase, type = TRUE) %>%
# Discover the highest 10 phrases
slice_head(n = 10) %>%
ungroup() %>%
# Order the phrases by their frequency inside every month
mutate(phrase = reorder_within(phrase, n, month)) %>%
# Create a bar graph
ggplot(aes(x = n, y = phrase, fill = month)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~ month, scales = "free_y") +
labs(x = "Frequency", y = NULL) +
theme(legend.place = "none",
axis.textual content.x = element_text(measurement = 11),
axis.textual content.y = element_text(measurement = 11),
strip.background = element_blank(),
strip.textual content = element_text(color = "black", face = "daring", measurement = 13))

Most common words based on frequency (Image by author)

Since some words are common across all four months, the plot above may not necessarily show us the important words that are unique to each month. To find such words, we can use Term Frequency-Inverse Document Frequency (TF-IDF), a widely used technique in NLP for measuring how important a term is within a document relative to a collection of documents (for more detailed information about TF-IDF, check out my previous blog post). In our example, we will treat the media briefings for each month as a document and calculate TF-IDF for the tokens (i.e., words) within each document. The first part of the R code below creates a new dataset, wave1_tf_idf, by calculating TF-IDF for all tokens and selecting the tokens with the highest TF-IDF values within each month. Next, we use this dataset to create a bar plot with the TF-IDF values to view the common words unique to each month.
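As a quick reminder of the arithmetic behind bind_tf_idf(), treating each month as a document (tidytext uses the natural logarithm for the IDF part):

# For a token t in month (document) d:
#   tf(t, d)     = (count of t in d) / (total number of tokens in d)
#   idf(t)       = log(number of documents / number of documents containing t)
#   tf_idf(t, d) = tf(t, d) * idf(t)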

# Calculate TF-IDF for the words for each month
wave1_tf_idf <- wave1_alberta %>%
  count(month, word, sort = TRUE) %>%
  bind_tf_idf(word, month, n) %>%
  arrange(month, -tf_idf) %>%
  group_by(month) %>%
  top_n(10) %>%
  ungroup()

# Visualize the results
wave1_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, month)) %>%
  ggplot(aes(word, tf_idf, fill = month)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ month, scales = "free", ncol = 2) +
  scale_x_reordered() +
  coord_flip() +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 13),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11)) +
  labs(x = NULL, y = "TF-IDF")

Most common words based on TF-IDF (Image by author)

These results are more informative because the tokens shown in the figure reflect unique topics discussed each month. For example, in March 2020, the media briefings were mostly about limiting travel, returning from crowded conferences, and COVID-19 cases on cruise ships. In June 2020, the focus of the media briefings shifted towards mask requirements, people protesting pandemic-related restrictions, and so on.

Before we return to the sentiment analysis, let's take a look at another descriptive variable: the length of each media briefing. This will show us whether the media briefings became longer or shorter over time.

wave1_alberta %>%
  # Save "day" as a separate variable
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  # Count the number of words
  summarize(n = n()) %>%
  ggplot(aes(day, n, color = month, shape = month, group = month)) +
  geom_point(size = 2) +
  geom_line() +
  labs(x = "Days", y = "Number of Words") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, size = 11),
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.y = element_text(size = 11)) +
  ylim(0, 800) +
  facet_wrap(~ month, scales = "free_x")

Number of words in the media briefings by day (Image by author)

The figure above shows that the length of the media briefings varied considerably over time. Especially in March and May, there are large fluctuations (i.e., very long or very short briefings), whereas in June the daily media briefings are quite similar in length.

Sentiment Analysis with tidytext

After analyzing the dataset descriptively, we are ready to begin the sentiment analysis. In the first part, we will use the tidytext package to perform sentiment analysis and compute sentiment scores. We will first import the lexicons into R and then merge them with our dataset. Using the Bing lexicon, we need to find the difference between the number of positive and negative words to produce a sentiment score (i.e., sentiment = the number of positive words minus the number of negative words).

# Of the three lexicons, Bing is already available in the tidytext package
# For AFINN and NRC, install the textdata package by uncommenting the next line
# install.packages("textdata")
get_sentiments("bing")
get_sentiments("afinn")
get_sentiments("nrc")

# We will need the spread function from tidyr
library("tidyr")

# Sentiment scores with Bing (based on frequency)
wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("bing")) %>%
  count(month, day, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(day, sentiment, fill = month)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-50, 50) +
  theme(legend.position = "none", axis.text.x = element_text(angle = 90)) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the Bing lexicon (Image by author)

The figure above shows that the sentiments delivered in the media briefings were generally negative, which is not necessarily surprising given that the media briefings were all about how many people passed away, hospitalization rates, potential outbreaks, etc. On certain days (e.g., March 24, 2020 and May 4, 2020), the media briefings were particularly negative in terms of sentiment.

Next, we will use the AFINN lexicon. Unlike Bing, which labels words as positive or negative, AFINN assigns a numerical weight to each word. The sign of the weight indicates the polarity of the sentiment (i.e., positive or negative), while its value indicates the intensity. Now, let's see if these weighted values produce different sentiment scores.
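For example, we can peek at the AFINN weights of a few words to see how intensity is encoded (the exact values depend on the installed lexicon version):

# Sign = polarity, magnitude = intensity
get_sentiments("afinn") %>%
  filter(word %in% c("dislike", "hate", "good", "great"))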

wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(month, day) %>%
  summarize(sentiment = sum(value),
            type = ifelse(sentiment >= 0, "positive", "negative")) %>%
  ggplot(aes(day, sentiment, fill = type)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-100, 100) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11, angle = 90),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the AFINN lexicon (Image by author)

The results based on the AFINN lexicon look quite different! Once we take the "weight" of the tokens into account, most media briefings turn out to be positive (see the green bars), although there are still some days with negative sentiment (see the red bars). The two analyses we have done so far have yielded very different results, for two reasons. First, as I mentioned above, the Bing lexicon focuses on the polarity of words but ignores their intensity (dislike and hate are considered negative words of equal intensity). Unlike the Bing lexicon, the AFINN lexicon takes intensity into account, which affects the calculation of the sentiment scores. Second, the Bing lexicon (6,786 words) is considerably larger than the AFINN lexicon (2,477 words). Therefore, it is likely that some tokens in the media briefings are included in the Bing lexicon but not in the AFINN lexicon. Disregarding these tokens may have affected the results.
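We can verify the size difference between the two lexicons directly:

# Compare lexicon sizes (the counts cited above)
nrow(get_sentiments("bing"))    # 6,786 words
nrow(get_sentiments("afinn"))   # 2,477 words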

The final lexicon we are going to try with the tidytext package is NRC. As I mentioned earlier, this lexicon uses Plutchik's psychoevolutionary theory to label tokens with basic emotions such as anger, fear, and anticipation. We are going to count the number of words or tokens associated with each emotion and then visualize the results.
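Before running the full analysis, a quick way to see how the NRC lexicon is organized (each word can map to one or more categories: eight basic emotions plus the positive/negative polarity labels) is:

# List the NRC categories and the number of words in each
get_sentiments("nrc") %>%
  count(sentiment)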

wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(month, day, sentiment) %>%
  group_by(month, sentiment) %>%
  summarize(n_total = sum(n)) %>%
  ggplot(aes(n_total, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Frequency", y = "") +
  xlim(0, 2000) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the NRC lexicon (Image by author)

The figure shows that the media briefings were mostly positive each month. Dr. Hinshaw used many words associated with "trust", "anticipation", and "fear". Overall, the pattern of these emotions remains very similar over time, indicating the consistency of the media briefings in terms of the type and intensity of the emotions delivered.

Another package for lexicon-based sentiment analysis is sentimentr (Rinker, 2021). Unlike the tidytext package, sentimentr takes valence shifters (e.g., negation) into account, which can easily flip the polarity of a sentence with a single word. For example, the sentence "I am not sad" is actually positive, but if we analyze it word by word, it may appear to carry negative sentiment because of the words "not" and "sad". Similarly, "I hardly like this book" is a negative sentence, but the analysis of the individual words "hardly" and "like" may yield a positive sentiment score. The sentimentr package addresses these limitations of sentiment detection around valence shifters (see package author Tyler Rinker's GitHub page for further details on sentimentr: https://github.com/trinker/sentimentr).
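As a quick sanity check (exact scores will vary slightly by package version), we can score the two example sentences above with sentimentr's sentence-level sentiment() function:

library("sentimentr")

# "not" flips the polarity of "sad"; "hardly" acts as a negator for "like"
sentiment("I am not sad.")
sentiment("I hardly like this book.")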

To benefit from the sentimentr package, we need the actual sentences from the media briefings rather than the individual tokens. Therefore, I had to create an untokenized version of the dataset, which is available here. We will first import this dataset into R, extract the individual sentences of each media briefing using the get_sentences() function, and then calculate sentiment scores by day and month via sentiment_by().

library("sentimentr")
library("magrittr")

load("wave1_alberta_sentence.RData")

# Calculate sentiment scores by day and month
wave1_sentimentr <- wave1_alberta_sentence %>%
  mutate(day = substr(date, 9, 10)) %>%
  get_sentences() %$%
  sentiment_by(text, list(month, day))

# View the dataset
head(wave1_sentimentr, 10)

A preview of the dataset (Image by author)

In the dataset we created, "ave_sentiment" is the average sentiment score for each day in March, April, May, and June (i.e., the days on which a media briefing was held). Using this dataset, we can visualize the sentiment scores.

wave1_sentimentr %>%
  group_by(month, day) %>%
  ggplot(aes(day, ave_sentiment, fill = ave_sentiment)) +
  scale_fill_gradient(low = "red", high = "blue") +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-0.1, 0.3) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11, angle = 90),
        axis.text.y = element_text(size = 11))

Sentiment scores based on sentimentr (Image by author)

In the figure above, the blue bars represent highly positive sentiment scores, while the red bars depict comparatively lower sentiment scores. The patterns in the sentiment scores generated by sentimentr closely resemble those derived from the AFINN lexicon. Notably, this analysis is based on the original media briefings rather than individual tokens, and valence shifters were taken into account in the computation of the sentiment scores. The convergence between the sentiment patterns identified by sentimentr and those from AFINN is not entirely unexpected: both approaches incorporate similar weighting strategies and mechanisms that account for word intensity. This alignment reinforces our confidence in the initial findings obtained through AFINN, validating the consistency and reliability of our analyses with sentimentr.


