
Beginner’s Guide To Latent Dirichlet Allocation


It is relatively easy for humans to learn a language. Through years of subconscious practice, we pick up nuances and build up sophistication with the help of localised cultural cues. We have a complex mechanism by which we meticulously derive deep meanings from very few words.

For machines, which operate on inferences of a binary nature, human language is an almost impossible task.

One way to approach it is by predetermining the groups to which certain words belong, segregating the useful words from stop words and appending a score to the relationship between two words in a sentence.

Latent Dirichlet Allocation (LDA) is one such technique, designed to help model data consisting of a large corpus of words. There is some terminology one needs to be familiar with to understand LDA:

Document: Probability distributions over latent topics

Topic: Probability distributions over words.

The word ‘topic’ refers to associating a certain word with a definition. For instance, when the machine reads ‘the horse is black’, it tokenizes the sentence and comes to the conclusion that there are two topics: horse, which is an animal, and black, a colour.

Plate Notation: A way of visually representing dependencies among the model parameters.
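
As a rough illustration with made-up numbers, each document can be written as a probability distribution over topics, and each topic as a probability distribution over words:

# Hypothetical numbers, for illustration only
doc_over_topics = {'animals': 0.7, 'colours': 0.3}          # a document
topic_animals   = {'horse': 0.5, 'dog': 0.3, 'cat': 0.2}    # a topic
topic_colours   = {'black': 0.6, 'white': 0.4}              # another topic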

How Does LDA Work

What LDA actually does is topic modelling. It is an unsupervised algorithm used to spot semantic relationships between the words in a group with the help of associated indicators.

When a document needs to be modelled by LDA, the following steps are carried out first (a short code sketch follows the list):

  • The number of words in the document is determined.
  • A topic mixture for the document over a fixed set of topics is chosen.
  • A topic is selected based on the document’s multinomial distribution.
  • Now a word is picked based on the topic’s multinomial distribution.
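
A minimal sketch of this generative process in Python, using numpy and made-up topic-word distributions (vocab and topic_word below are illustrative assumptions, not part of the original article), might look like this:

import numpy as np

K = 2                                            # fixed number of topics
vocab = ['horse', 'dog', 'black', 'white']
# Assumed word distribution for each topic (each row sums to 1)
topic_word = np.array([[0.5, 0.5, 0.0, 0.0],     # topic 0: animals
                       [0.0, 0.0, 0.6, 0.4]])    # topic 1: colours

n_words = 6                                      # 1. number of words in the document
doc_topics = np.random.dirichlet([0.2] * K)      # 2. topic mixture for the document

document = []
for _ in range(n_words):
    t = np.random.choice(K, p=doc_topics)              # 3. pick a topic
    w = np.random.choice(len(vocab), p=topic_word[t])  # 4. pick a word from that topic
    document.append(vocab[w])
print(document)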

This visualisation by David Lettier serves as a very good representation of the distribution of a certain topic in a document. The edges, or apex points, indicate where the probability of a word belonging to a topic reduces to zero.

Looking At LDA From The Other End

LDA can be made to go backwards as well:

  • First, each word in each document is randomly assigned to one of the topics.
  • Now, it is assumed that all topic assignments except for the current one are correct.
  • The proportion of words in a document ‘d’ that are currently assigned to topic ‘t’ gives p(topic t | document d), and the proportion of assignments to topic ‘t’, over all documents, that come from word ‘w’ gives p(word w | topic t).
  • These two proportions are multiplied, and the word is assigned a new topic based on the resulting probability.

LDA assumes that the words in each document are related. After running through the steps above, it figures out how a certain document might have been created. This very solution is then used to generate the topic and word distributions over a corpus.
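
To make the update concrete, here is a minimal sketch of resampling the topic of a single word ‘w’ in a document ‘d’, with hypothetical counts and no smoothing:

import numpy as np

# Hypothetical counts for two topics, for illustration only
doc_topic_count   = np.array([3., 7.])    # words in d assigned to each topic
word_topic_count  = np.array([4., 1.])    # assignments of w to each topic, over all documents
topic_total_count = np.array([50., 40.])  # total words assigned to each topic

p_topic_given_doc  = doc_topic_count / doc_topic_count.sum()   # p(topic t | document d)
p_word_given_topic = word_topic_count / topic_total_count      # p(word w | topic t)

weights = p_topic_given_doc * p_word_given_topic
new_topic = np.random.choice(2, p=weights / weights.sum())     # new topic for w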

LDA Implementation In Python

Step 1: Initialise the hyperparameters in LDA with alpha = 0.2 and beta = 0.001

import numpy as np

# LDA smoothing hyperparameters
alpha = 0.2      # document-topic prior
beta = 0.001     # topic-word prior

# Text corpus iterations
corpus_iter = 200
K = 2                    # number of topics
V = len(vocab_total)     # vocabulary size (vocab_total built during preprocessing)
D = len(text_ID)         # number of documents (text_ID holds word IDs per document)
word_topic_count = np.zeros((K, V))
topic_doc_assign = [np.zeros(len(sublist)) for sublist in text_ID]
doc_topic_count = np.zeros((D, K))

Step 2: Generate word-topic count matrix with randomly assigned topics

for doc in range(D):
    for word in range(len(text_ID[doc])):
        # Randomly assign one of the K topics to each word
        topic_doc_assign[doc][word] = np.random.choice(K)
        word_topic = int(topic_doc_assign[doc][word])
        word_doc_ID = text_ID[doc][word]
        word_topic_count[word_topic][word_doc_ID] += 1
print('Word-topic count matrix with random topic assignment: \n%s' % word_topic_count)

Output:

Word-topic count matrix with random topic assignment:
[[ 1.  0. 2. …,  5. 0. 0.]
[ 0.  1. 0. …,  7. 1. 1.]]
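
The snippets above stop after the random initialisation. Purely as an assumption rather than the article’s own code, the corpus_iter Gibbs sampling iterations could continue along these lines, reusing the variables defined in Step 1 and smoothing the two proportions with alpha and beta:

# Fill the document-topic counts from the random assignments above
for doc in range(D):
    for t in topic_doc_assign[doc]:
        doc_topic_count[doc][int(t)] += 1

for it in range(corpus_iter):
    for doc in range(D):
        for word in range(len(text_ID[doc])):
            t_old = int(topic_doc_assign[doc][word])
            w = text_ID[doc][word]
            # Remove the current assignment before resampling
            word_topic_count[t_old][w] -= 1
            doc_topic_count[doc][t_old] -= 1
            # p(topic t | document d) * p(word w | topic t), smoothed
            p = (doc_topic_count[doc] + alpha) * (word_topic_count[:, w] + beta) \
                / (word_topic_count.sum(axis=1) + V * beta)
            t_new = np.random.choice(K, p=p / p.sum())
            # Record the new assignment
            topic_doc_assign[doc][word] = t_new
            word_topic_count[t_new][w] += 1
            doc_topic_count[doc][t_new] += 1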

Check the full code here.

Conclusion

Latent Dirichlet Allocation was introduced back in 2003 to tackle the problem of modelling text corpora and collections of discrete data. Initially, the goal was to find short descriptions of small samples from a collection, the results of which could be extrapolated to the larger collection while preserving the basic statistical relationships of relevance.

Apart from detecting topics in text and supporting sentiment analysis, LDA has also found applications in bioinformatics, harmonic analysis for music and even object localisation in images.


