In the social network analysis part, we explored a model that exploits the links between entities to help us find the key players in the data. Here, we focus on the text of the tweets to better understand what users are talking about. We move away from the network model used previously and discuss other methods for text analysis. We first explore topic modeling, an approach that discovers natural topics within the text. We then move on to sentiment analysis, the practice of associating a document with a sentiment score.
Finding topics
The data we collected from Twitter is a relatively small sample, but attempting to read each individual tweet is a hopeless cause. A more attainable goal is to get a high-level understanding of what users are talking about. One way to do this is to identify the topics the users discuss in their tweets. In this section we cover the automatic discovery of topics in the text through topic modeling with Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm.
In LDA, every topic is a distribution over words. Each topic contains all of the words in the corpus, together with the probability that each word belongs to that topic. So, while the words are the same in every topic, the weights they are given differ from topic to topic. LDA only finds the most probable words for each topic; associating a topic with a human-readable theme is left to the user.
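To make this concrete, gensim lets us inspect the word distribution of a fitted topic model directly. The snippet below is just a sketch: `lda` refers to the model we train later in this post, and the topic index and word count are arbitrary.

```python
# A topic is a distribution over the whole vocabulary; show_topic returns
# the `topn` most probable words of topic 0 as (word, probability) pairs.
print(lda.show_topic(0, topn=10))
```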
LDA with Gensim
To perform the LDA computation in Python, we will use the gensim library ("topic modelling for humans"). As we will see, most of the work is done for us; the real effort is in preprocessing the documents. The preprocessing steps we will perform are:
- Lowercasing - Strip the casing of all words in the document (e.g. "@thevoiceafrique #TheVoiceAfrique est SUPERB! :) https://t.co/2ty" becomes "@thevoiceafrique #thevoiceafrique est superb! :) https://t.co/2ty")
- Tokenizing - Convert the string to a list of tokens based on whitespace. This process also removes punctuation marks from the text, giving the list ["@thevoiceafrique", "#thevoiceafrique", "est", "superb", ":)", "https://t.co/2ty"]
- Stopword Removal - Remove stopwords, words so common that their presence does not tell us anything about the dataset. This also removes smileys, emoticons, mentions, hashtags and links, leaving ["superb"]
```python
import re
import string

import numpy as np
import emoji

from twitter.parse_tweet import Emoticons
from pymongo import MongoClient
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from gensim.models import LdaModel
from gensim.corpora import TextCorpus

# Fix the random seed for reproducibility.
np.random.seed(42)

# Connect to the MongoDB database holding the collected tweets.
host = "localhost"
port = 27017
db = MongoClient(host, port).search
```
The stopwords-fr.txt file can be downloaded here.
```python
# Punctuation, French stopwords, emoticons and a few corpus-specific words.
stop_words = set()
stop_words.update(list(string.punctuation))
stop_words.update(stopwords.words("french"))
stop_words.update(Emoticons.POSITIVE)
stop_words.update(Emoticons.NEGATIVE)
stop_words.update(["’", "…", "ca", "°", "çà", "»", "«", "•", "the", "voice",
                   "afrique", "voix", "–", "::", "“", "₩", "🤣"])

with open("data/stopwords-fr.txt") as f:
    stop_words.update(map(str.strip, f.readlines()))

tokenize = TweetTokenizer().tokenize
```
Little helpers
```python
def parse(text):
    """Yield the tokens of `text`, skipping emoji, numbers, mentions,
    hashtags and links."""
    text = text.strip()
    text = text.strip("...")
    # demojize turns emoji into :name_with_underscores: tokens; drop them.
    found = emoji.demojize(text).split(" ")
    text = " ".join([t for t in found if "_" not in t])
    # Remove words containing digits.
    text = " ".join(re.split(r"\w*\d+\w*", text)).strip()
    tokens = tokenize(text)
    for token in tokens:
        cond = (token.startswith(("#", "@", "http", "www"))
                or "." in token
                or "'" in token)
        if not cond:
            yield token


def preprocess(text):
    """Lowercase and parse `text`, then drop the stopwords."""
    text = text.lower()
    for token in parse(text):
        if token not in stop_words:
            yield token


class Corpus(TextCorpus):
    """A gensim text corpus that applies our preprocessing to each tweet."""

    def __len__(self):
        return len(self.input)

    def get_texts(self):
        for tweet in self.input:
            tweet = preprocess(tweet)
            yield list(tweet)
```
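As a quick sanity check, we can run the pipeline on the sample tweet from the preprocessing list above. The exact output depends on the stopword and emoticon lists, but the mention, hashtag, link, smiley and French stopword should all be filtered out.

```python
sample = "@thevoiceafrique #TheVoiceAfrique est SUPERB! :) https://t.co/2ty"
print(list(preprocess(sample)))  # expected: ['superb']
```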
Load the tweets.
```python
# Keep only original tweets, skipping retweets.
tweets = [tweet["text"]
          for tweet in db.thevoice.find()
          if "retweeted_status" not in tweet]
```
Enrich the stopwords set.
```python
# Add every emoji appearing in the corpus to the stopword set.
regexp = emoji.get_emoji_regexp().findall
for tweet in tweets:
    stop_words.update(regexp(tweet))
```
Build the corpus.
```python
corpus = Corpus(tweets)
print("Number of documents: {}\nNumber of tokens: {}".format(
    len(corpus), len(corpus.dictionary)))
```
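Before training, it can be helpful to peek at how a single tweet is represented. A minimal sketch, reusing the `corpus` built above: the first preprocessed tweet as a list of tokens, and its bag-of-words encoding as (token id, count) pairs.

```python
first_doc = next(corpus.get_texts())
print(first_doc)
print(corpus.dictionary.doc2bow(first_doc))
```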
Build the model.
```python
lda = LdaModel(corpus, num_topics=5, id2word=corpus.dictionary)
```
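With the model trained, every tweet can be mapped to a mixture over the five topics. The sketch below shows one way to query that mixture for the first tweet; the result is a list of (topic id, probability) pairs.

```python
# Encode the first tweet as a bag-of-words vector, then ask the model
# for its topic distribution.
bow = corpus.dictionary.doc2bow(list(preprocess(tweets[0])))
print(lda.get_document_topics(bow))
```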
A helper for printing the topics
```python
def show_topics(n=5, n_words=10, fmt="simple"):
    """Show `n` randomly selected topics and their top words."""
    from tabulate import tabulate
    topics = {}
    ids = np.arange(lda.num_topics)
    ids = np.random.choice(ids, n, replace=False)
    for i in ids:
        topic = lda.show_topic(i, n_words)
        words, prop = zip(*topic)
        topics[i + 1] = list(words)
    tabular = tabulate(topics, headers="keys", tablefmt=fmt)
    print(tabular)
```
Show the topics
```python
show_topics()
```
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| singuila | gars | chante | asalfo | fire |
| coachs | lokua | nadia | shayden | famille |
| chante | charlotte | pub | singuila | faut |
| lol | go | chanson | grâce | vrai |
| congolais | soir | grace | deh | retourne |
| asalfo | asalfo | choix | belle | faire |
| charlotte | super | candidats | talent | pro |
| talent | déjà | belle | soir | coach |
| albert | ndem | heroine | ans | nadia |
| frère | chante | soirée | soeur | gars |
The table above shows the most probable words within each of the five topics. From it, we can see that viewers are talking about the different candidates and coaches. In the next post, we will use sentiment analysis to see which sentiment is most present in the data.
Thanks for following.