Topic Modeling on the Super User Forum

Categories: nlp

Let’s imagine you are given a dump of a user forum, or of someone’s email. How do you analyze the contents?

One way might be to look at top words in the whole corpus.
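As a minimal sketch of that first idea, the snippet below counts words across a few made-up posts. The `posts` list and the `tokenize` helper are hypothetical stand-ins for illustration, not the real forum dump or the preprocessing used for this analysis.

```python
from collections import Counter
import re

# Hypothetical stand-ins for forum posts; the real corpus is the Super User dump.
posts = [
    "My USB drive is not detected after boot",
    "Windows will not boot from the USB drive",
    "How do I share a folder on the network",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count every token across the whole corpus.
word_counts = Counter(token for post in posts for token in tokenize(post))

# Show the most frequent words in the corpus.
print(word_counts.most_common(5))
```

Even on this tiny sample, filler words such as "not" and "the" rank alongside "usb" and "drive", which hints at why raw counts alone say little.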

However, word counts without context might not be very meaningful. Words that co-occur might signal a theme, concept, or topic.

In the world of natural language processing, there is a technique called Latent Dirichlet Allocation (LDA).

At a high level, LDA assumes that each document is a mixture of topics and that each topic is a group of words that tend to co-occur.
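To make that assumption concrete, here is a rough sketch of the generative story LDA has in mind, using only the standard library. The two `topics`, their word probabilities, and the `alpha` value are made up for illustration. In full LDA the per-topic word distributions are themselves learned, not fixed by hand as they are here.

```python
import random

random.seed(0)

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "hardware": {"usb": 0.4, "driver": 0.3, "device": 0.3},
    "network": {"router": 0.5, "server": 0.3, "connection": 0.2},
}

def sample_dirichlet(alpha, size):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha=0.5):
    """Generate a document the way LDA assumes documents are produced."""
    topic_names = list(topics)
    # 1. Pick the document's mixture over topics.
    mixture = sample_dirichlet(alpha, len(topic_names))
    words = []
    for _ in range(num_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = random.choices(topic_names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(random.choices(vocab, weights=weights)[0])
    return mixture, words

mixture, words = generate_document(num_words=8)
print(mixture, words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that could have produced them.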

Below is my analysis of the Super User forum using the amazing Python tool gensim.

Super User Analysis

I asked gensim to build a model with 8 topics.

Topics

Here are the 8 topics it learned, each shown by its top 4 words.

  • file, files, text, like
  • network, server, router, connection
  • windows, screen, problem, time
  • card, device, usb, driver
  • windows, user, folder, account
  • drive, windows, disk, boot
  • would, like, data, excel
  • error, command, file, root

Labels

After looking at just the words, it can be difficult to assign a label.

One way to identify labels is to look at the documents that score highest for a topic and assign a label based on them. After looking at several sample docs, here are the labels and sample question titles for the strongest docs.
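The ranking step can be sketched in a few lines. The `doc_topics` mapping below is made-up data standing in for the per-document topic weights a trained model would produce, and `strongest_docs` is a hypothetical helper, not part of gensim.

```python
# Hypothetical per-document topic weights, e.g. as produced by a trained model.
doc_topics = {
    "Question A": {0: 0.90, 1: 0.10},
    "Question B": {0: 0.20, 1: 0.80},
    "Question C": {0: 0.60, 1: 0.40},
}

def strongest_docs(doc_topics, topic_id, top_n=2):
    """Return the top_n document titles ranked by their weight for topic_id."""
    ranked = sorted(
        doc_topics,
        key=lambda title: doc_topics[title].get(topic_id, 0.0),
        reverse=True,
    )
    return ranked[:top_n]

# List the strongest documents for each topic so a human can label it.
for topic_id in (0, 1):
    print(topic_id, strongest_docs(doc_topics, topic_id))
```

Reading the titles returned for each topic is usually easier than guessing a label from the top words alone.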

New instance

Now that we have a topic model, let’s see how well it identifies the topics of an unseen document.

{% blockquote outlook separation of 2 email accounts http://superuser.com/questions/991960/outlook-separation-of-2-email-accounts %} I have 2 email accounts on outlook : account A and account B.

When i send a message or receive any msg on account B, the same msg appears on account A. As far as account A is concerned, when i send any msg from this account or when i receive sth, it doesn’t appear on the account B.

i would like separate these 2 accounts so that they work independently.

I was looking for an answer and i found that it shouldn’t be set on outlook but on http://www.windowslive.fr/livemail/ .

If you have any suggestions how to do it or any advice, I would be very greatful! {% endblockquote %}

Result:

  • Outlook/Email Woes (0.85)
  • Excel/Word hacks (0.08)
  • Video Woes (0.05)

From the result, we see that the strongest topic (Outlook/Email Woes) best describes the question.

However, the next two strongest topics don’t really describe the document.

This is most likely because the model was not trained enough, or because of the labels I came up with.

Visualizations

I have deployed an instance of the site here.

Visualize all the topics and the top 4 words in them {% asset_img topics.png [Topics] %}

Visualize the words in a topic as a word cloud, and top docs for a topic {% asset_img topic.png [Analysis on topic] %}

Visualize the distribution of topics across the corpus and similarities among topics {% asset_img pca.png [Topic PCA] %}

Analyze a new document {% asset_img analyze.png [Analyze new topic] %}

Things I need to learn more about

Evaluating Model

How do I evaluate whether the model has converged? How do I evaluate whether the number of topics is correct? One approach seems to be topic model perplexity.

Visualizing topics

LDAvis implements some nice visualizations of the topic space. I need to understand that model better.

There are also other options

Links

  • Deployed Site
  • Slide
  • Code

References

  • gensim
  • pyLDAvis