Analyst Extra: Using word frequencies to guide social media rumour analysis

by Camille Fassett – Internews Data Analyst

When analyzing social media rumours about coronavirus, I often find it illuminating to break down text into the words that appear most frequently. Doing so, particularly within sub-categories such as language group, social media platform, or type of rumour, can be a valuable way to see which topics are especially prevalent in the rumours without reading every post.

Due to the diversity of human language and expression, some extra steps are required before a word frequency count becomes a useful piece of analysis. In social media text, as in spoken language, people use variants of words that essentially mean the same thing: “walks”, “walking”, and “walked” are all different forms of the same word.

Lemmatization is a process in which different inflected forms of the same word are grouped together into one form, so that all forms can be analyzed as the same word. (In the case above, all instances of those words would be replaced with their lemma: “walk”.)

Since I am not a linguist, and there is only so much time in the day, I use a Python library to do the job for me. I lemmatize the social media rumour text, which has already been translated into English by the Internews and Translators without Borders social media monitors. Then, I strip the text of stop words, or words that are not especially meaningful, such as “the”, “is” and “and”. Next, I join all of the social media rumour text that I am interested in into one block of text, and it is ready for word frequency analysis.
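The steps above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the tiny `TOY_LEMMAS` lookup and `STOP_WORDS` set are hypothetical stand-ins for what an NLP library would provide, and the function names and sample posts are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins: a real lemmatizer and stop word list would
# normally come from an NLP library rather than hand-written tables.
TOY_LEMMAS = {"walks": "walk", "walking": "walk", "walked": "walk"}
STOP_WORDS = {"the", "is", "and", "a", "to", "in", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def clean_post(post):
    """Lemmatize one post, then drop its stop words."""
    lemmas = [TOY_LEMMAS.get(token, token) for token in tokenize(post)]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


def word_frequencies(posts):
    """Join the cleaned tokens of every post and count each word."""
    all_tokens = [token for post in posts for token in clean_post(post)]
    return Counter(all_tokens)


# Made-up sample posts, used only to exercise the functions.
posts = [
    "She walks to the market",
    "He walked and she is walking in the park",
]
freqs = word_frequencies(posts)
```

Here `freqs.most_common(10)` would give the ten most frequent lemmas, which is the kind of list a bar chart can be drawn from.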

Before generating the following graphs, I also removed some words that do not appear frequently in social media text overall, but are very common in rumours about coronavirus. These include “coronavirus”, “corona”, “covid”, and “virus”. I also removed several others that colleagues determined were not significant for this analysis, such as “people”, “could”, and “said”.
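One way to apply this kind of extra filtering is to drop the unwanted words from the counts just before charting. The sketch below assumes the frequencies are held in a `collections.Counter`; the word sets come from the paragraph above, while the counts and the helper name `drop_words` are invented for illustration.

```python
from collections import Counter

# Words named in the text above, grouped by why they were removed.
DOMAIN_WORDS = {"coronavirus", "corona", "covid", "virus"}
REVIEWED_WORDS = {"people", "could", "said"}


def drop_words(freqs, *word_sets):
    """Return a copy of the counts without any word in the given sets."""
    excluded = set().union(*word_sets)
    return Counter({w: c for w, c in freqs.items() if w not in excluded})


# Made-up counts, used only to show the effect of the filter.
freqs = Counter(
    {"coronavirus": 40, "mask": 12, "people": 9, "garlic": 5, "said": 4}
)
filtered = drop_words(freqs, DOMAIN_WORDS, REVIEWED_WORDS)
top_words = filtered.most_common(2)
```

Keeping the two word sets separate makes it easy to review the list agreed with colleagues without touching the domain-specific one.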

This graph shows the most frequent words in social media rumour text for all language groups, before the rumours were lemmatized.

While both graphs had the same words removed from the original text, the word frequencies are different for the lemmatized text because all forms of words like “infect”, “prevent”, and “spread” are counted as one.
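A toy comparison shows why the two rankings can differ. The token list and the `TOY_LEMMAS` lookup below are invented for illustration; a real lemmatizer would replace the lookup.

```python
from collections import Counter

# Hypothetical lemma lookup covering only the words in this example.
TOY_LEMMAS = {
    "infected": "infect",
    "infecting": "infect",
    "infects": "infect",
    "spreads": "spread",
    "spreading": "spread",
}

# Made-up tokens standing in for cleaned rumour text.
tokens = [
    "infected", "mask", "infects", "spreading", "mask",
    "infect", "spreads", "mask", "infecting",
]

# Count the surface forms as they appear.
raw = Counter(tokens)

# Count again after mapping each form to its lemma.
lemmatized = Counter(TOY_LEMMAS.get(token, token) for token in tokens)

raw_top = raw.most_common(1)[0][0]
lemmatized_top = lemmatized.most_common(1)[0][0]
```

In the raw counts the four forms of “infect” are split into four separate entries, so “mask” ranks first; once they are merged, “infect” overtakes it.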

Word frequencies can be especially helpful within a sub-category. For example, here are the word frequency graphs for the sub-categories of “treatment/cure” and “preventative”.
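Counting within each sub-category can be sketched as follows, assuming every rumour has already been reduced to a list of cleaned tokens and tagged with its sub-category. The records and the helper name `top_words_by_group` are made up for the example.

```python
from collections import Counter, defaultdict


def top_words_by_group(records, n=3):
    """Return the n most frequent words for each group label."""
    grouped = defaultdict(Counter)
    for group, tokens in records:
        grouped[group].update(tokens)
    return {group: counts.most_common(n) for group, counts in grouped.items()}


# Made-up (sub-category, tokens) pairs, used only for illustration.
records = [
    ("treatment/cure", ["hydroxychloroquine", "cure", "drug"]),
    ("treatment/cure", ["hydroxychloroquine", "tablet"]),
    ("preventative", ["mask", "wash", "hand"]),
    ("preventative", ["mask", "wear"]),
]
top = top_words_by_group(records, n=1)
```

The same function works for any grouping column, such as platform or language group, because it only relies on the label attached to each record.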

The themes differ starkly. For example, the names of several drugs, such as hydroxychloroquine and Fabunan, are quite prevalent among treatment rumours. When I noticed how often hydroxychloroquine appeared, I plotted mentions of the drug in rumours over time and found that rumours about hydroxychloroquine spiked on days in March when President Trump promoted the drug. We included this in our bulletin this week.

1: March 19, 2020 – Trump says he is pushing the Food and Drug Administration to approve treatments for coronavirus patients and calls chloroquine a “game changer” https://www.washingtonpost.com/business/2020/03/19/trump-calls-anti-malarial-drug-game-changer-coronavirus-fda-says-it-needs-study/

2: March 21, 2020 – Trump tweets that hydroxychloroquine and azithromycin, taken together, “have a real chance to be one of the biggest game changers in the history of medicine” https://twitter.com/realDonaldTrump/status/1241367239900778501?s=20

3: April 4, 2020 – In a press briefing, Trump announces that the U.S. government has begun distributing stockpiled hydroxychloroquine to labs and hospitals across the country. “What do you have to lose?” he said. https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-20/
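The mentions-over-time check described above amounts to counting, for each day, the rumours that contain the term. A minimal sketch, assuming each rumour is stored as a (date, text) pair; the rows below are invented and do not come from the rumour dataset.

```python
from collections import Counter
from datetime import date


def daily_mentions(rumours, term):
    """Count, per day, the rumours whose text contains the term."""
    term = term.lower()
    counts = Counter()
    for day, text in rumours:
        if term in text.lower():
            counts[day] += 1
    # Sort by date so the result can be plotted as a time series.
    return sorted(counts.items())


# Made-up rows, used only to show the shape of the input and output.
rumours = [
    (date(2020, 3, 18), "Garlic water cures the virus"),
    (date(2020, 3, 19), "Hydroxychloroquine is a cure"),
    (date(2020, 3, 19), "Doctors say hydroxychloroquine works"),
    (date(2020, 3, 21), "Take hydroxychloroquine with azithromycin"),
]
series = daily_mentions(rumours, "hydroxychloroquine")
```

The resulting list of (date, count) pairs can be passed to any plotting tool and compared against the dates of the events listed above. Days with no mentions are absent from the list, so they would need to be filled with zeros before drawing a continuous line.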

“Mask” is so prevalent among rumours about the prevention of coronavirus that in our analysis this week, I dove deeper into rumours about masks. In these ways, frequencies of words in social media rumours can drive effective and responsive analysis.

Further, breaking down word counts across language groups can be helpful in determining rumours that are trending in specific regions.

“Wuhan” is the most prevalent word in Chinese rumour data, while it is not nearly as common among rumours in Thai, Indonesian, Vietnamese, and Tagalog.
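When language groups contain different numbers of rumours, raw counts are hard to compare, so one option is to look at a word's share of all tokens in each group. The sketch below is one such approach, not necessarily the one behind the graphs; the token lists and the helper name `relative_frequency` are made up for illustration.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Return the share of tokens equal to word (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


# Made-up cleaned tokens per language group.
tokens_by_language = {
    "Chinese": ["wuhan", "lab", "wuhan", "market"],
    "Thai": ["mask", "wuhan", "herb", "tea"],
    "Tagalog": ["mask", "fabunan", "cure", "drug"],
}
shares = {
    language: relative_frequency(tokens, "wuhan")
    for language, tokens in tokens_by_language.items()
}
```

Comparing shares rather than counts keeps a large language group from dominating the comparison simply because it has more posts.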

As the world changes rapidly in response to coronavirus, we expect that rumours will evolve in response to changing events, new guidelines by public health officials, and reports from the press. Please don’t hesitate to get in touch if your organization is interested in collaboration or has tips on rumours to look for in our collection of social media data.

