In the social network analysis part, we explored a model that exploits the links between entities to help us find the key players in the data. Here, we focus on the text of the tweets to better understand what users are talking about. We move away from the network model used previously and discuss other methods for text analysis. We first explore topic modeling, an approach that discovers natural topics within the text. We then move on to sentiment analysis, the practice of associating a document with a sentiment score.
Finding topics
The data we collected from Twitter is a relatively small sample, but attempting to read each individual tweet is a hopeless cause. A more attainable goal is to get a high-level understanding of what users are talking about. One way to do this is to identify the topics the users discuss in their tweets. In this section we cover the automatic discovery of topics in the text through topic modeling with Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm.
In LDA, every topic is a distribution over words. Each topic contains all of the words in the corpus, together with the probability that each word belongs to that topic. So, while the words are the same in every topic, the weights they are given differ from topic to topic. LDA only finds the most probable words for each topic; associating a topic with a human-readable theme is left to the user.
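To make this concrete, gensim lets us inspect the word distribution of a fitted topic model directly. The snippet below is just a sketch: `lda` refers to the model we train later in this post, and the topic index and word count are arbitrary.

```python
# A topic is a distribution over the whole vocabulary; show_topic returns
# the `topn` most probable words of topic 0 as (word, probability) pairs.
print(lda.show_topic(0, topn=10))
```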
LDA with Gensim
To perform the LDA computation in Python, we will use the gensim library ("topic modelling for humans"). As we will see, most of the work is done for us; the real effort is in preprocessing the documents. The preprocessing steps we will perform are:
- Lowercasing - Strip the casing of all words in the document (e.g. "@thevoiceafrique #TheVoiceAfrique est SUPERB! :) https://t.co/2ty" becomes "@thevoiceafrique #thevoiceafrique est superb! :) https://t.co/2ty")
- Tokenizing - Convert the string to a list of tokens based on whitespace. This process also removes punctuation marks from the text, giving the list ["@thevoiceafrique", "#thevoiceafrique", "est", "superb", ":)", "https://t.co/2ty"]
- Stopword Removal - Remove stopwords, words so common that their presence does not tell us anything about the dataset. This also removes smileys, emoticons, mentions, hashtags and links, leaving ["superb"]
```python
import re
import string

import numpy as np
import emoji

from twitter.parse_tweet import Emoticons
from pymongo import MongoClient
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from gensim.models import LdaModel
from gensim.corpora import TextCorpus

# Fix the random seed for reproducibility.
np.random.seed(42)

# Connect to the MongoDB database holding the collected tweets.
host = "localhost"
port = 27017
db = MongoClient(host, port).search
```
The stopwords-fr.txt file can be downloaded here.
```python
# Punctuation, French stopwords, emoticons and a few corpus-specific words.
stop_words = set()
stop_words.update(list(string.punctuation))
stop_words.update(stopwords.words("french"))
stop_words.update(Emoticons.POSITIVE)
stop_words.update(Emoticons.NEGATIVE)
stop_words.update(["’", "…", "ca", "°", "çà", "»", "«", "•", "the", "voice",
                   "afrique", "voix", "–", "::", "“", "₩", "🤣"])

with open("data/stopwords-fr.txt") as f:
    stop_words.update(map(str.strip, f.readlines()))

tokenize = TweetTokenizer().tokenize
```
Little helpers
```python
def parse(text):
    """Yield the tokens of `text`, skipping emoji, numbers, mentions,
    hashtags and links."""
    text = text.strip()
    text = text.strip("...")
    # demojize turns emoji into :name_with_underscores: tokens; drop them.
    found = emoji.demojize(text).split(" ")
    text = " ".join([t for t in found if "_" not in t])
    # Remove words containing digits.
    text = " ".join(re.split(r"\w*\d+\w*", text)).strip()
    tokens = tokenize(text)
    for token in tokens:
        cond = (token.startswith(("#", "@", "http", "www"))
                or "." in token
                or "'" in token)
        if not cond:
            yield token


def preprocess(text):
    """Lowercase and parse `text`, then drop the stopwords."""
    text = text.lower()
    for token in parse(text):
        if token not in stop_words:
            yield token


class Corpus(TextCorpus):
    """A gensim text corpus that applies our preprocessing to each tweet."""

    def __len__(self):
        return len(self.input)

    def get_texts(self):
        for tweet in self.input:
            tweet = preprocess(tweet)
            yield list(tweet)
```
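As a quick sanity check, we can run the pipeline on the sample tweet from the preprocessing list above. The exact output depends on the stopword and emoticon lists, but the mention, hashtag, link, smiley and French stopword should all be filtered out.

```python
sample = "@thevoiceafrique #TheVoiceAfrique est SUPERB! :) https://t.co/2ty"
print(list(preprocess(sample)))  # expected: ['superb']
```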
Load the tweets.
```python
# Keep only original tweets, skipping retweets.
tweets = [tweet["text"]
          for tweet in db.thevoice.find()
          if "retweeted_status" not in tweet]
```
Enrich the stopwords set.
```python
# Add every emoji appearing in the corpus to the stopword set.
regexp = emoji.get_emoji_regexp().findall
for tweet in tweets:
    stop_words.update(regexp(tweet))
```
Build the corpus.
```python
corpus = Corpus(tweets)
print("Number of documents: {}\nNumber of tokens: {}".format(
    len(corpus), len(corpus.dictionary)))
```
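Before training, it can be helpful to peek at how a single tweet is represented. A minimal sketch, reusing the `corpus` built above: the first preprocessed tweet as a list of tokens, and its bag-of-words encoding as (token id, count) pairs.

```python
first_doc = next(corpus.get_texts())
print(first_doc)
print(corpus.dictionary.doc2bow(first_doc))
```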
Build the model.
```python
lda = LdaModel(corpus, num_topics=5, id2word=corpus.dictionary)
```
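With the model trained, every tweet can be mapped to a mixture over the five topics. The sketch below shows one way to query that mixture for the first tweet; the result is a list of (topic id, probability) pairs.

```python
# Encode the first tweet as a bag-of-words vector, then ask the model
# for its topic distribution.
bow = corpus.dictionary.doc2bow(list(preprocess(tweets[0])))
print(lda.get_document_topics(bow))
```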
A helper for printing the topics
```python
def show_topics(n=5, n_words=10, fmt="simple"):
    """Show `n` randomly selected topics and their top words."""
    from tabulate import tabulate
    topics = {}
    ids = np.arange(lda.num_topics)
    ids = np.random.choice(ids, n, replace=False)
    for i in ids:
        topic = lda.show_topic(i, n_words)
        words, prop = zip(*topic)
        topics[i + 1] = list(words)
    tabular = tabulate(topics, headers="keys", tablefmt=fmt)
    print(tabular)
```
Show the topics
```python
show_topics()
```
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| singuila | gars | chante | asalfo | fire |
| coachs | lokua | nadia | shayden | famille |
| chante | charlotte | pub | singuila | faut |
| lol | go | chanson | grâce | vrai |
| congolais | soir | grace | deh | retourne |
| asalfo | asalfo | choix | belle | faire |
| charlotte | super | candidats | talent | pro |
| talent | déjà | belle | soir | coach |
| albert | ndem | heroine | ans | nadia |
| frère | chante | soirée | soeur | gars |
The table above shows the most probable words within each of the five topics. From it, we can see that viewers are talking about the different candidates and coaches. In the next post, we will use sentiment analysis to see which sentiment is most present in the data.
Thanks for following.