Latent Dirichlet Allocation, also known as LDA, is one of the most popular methods for topic modelling. Using LDA, we can easily discover the topics that a document is made of. LDA assumes that documents are a mixture of topics and that each topic contains a set of words with certain probabilities.
For example, consider the sentences below:
- Apple and banana are fruits.
- I bought a bicycle recently. In less than two years, I will buy a bike.
- The colour of the apple and the bicycle is red.
The output of LDA would look like this:
Topic 1: 0.7*apple + 0.3*banana
Topic 2: 0.6*bicycle + 0.4*bike
Sentence 1: [(Topic 1, 1), (Topic 2, 0)]
Sentence 2: [(Topic 1, 0), (Topic 2, 1)]
Sentence 3: [(Topic 1, 0.5), (Topic 2, 0.5)]
Please note that the above probabilities are made-up numbers, given only for intuition.
To extract the topics and the word probabilities using LDA, we have to decide the number of topics (k) beforehand. Based on that, LDA discovers the topic distribution of the documents and clusters the words into topics. Let us understand how LDA works.
The below explanation of how LDA works is taken from the blog post Introduction to Latent Dirichlet Allocation. I have restated it here in simple terms.
Let’s say we have k topics.
- Assign a topic randomly to each word in every document. This gives us a topic distribution for the documents and a clustering of the words into topics, but the distribution is purely random at this point. To improve it, we do the following.
- Iterate through each word in all the documents. Stop at each word and compute two probabilities for each topic t at that word:
P(topic | document) = proportion of words in the document that are assigned to topic t.
P(word | topic) = proportion of the assignments to topic t, over all documents, that are for this word.
Reassign this word to the topic t for which P(topic | document) * P(word | topic) is maximum. Here, we are assuming that all assignments except the current one are correct.
- Repeat the above two steps until the topic assignments become stable.
Now, we have a topic distribution for each document, and the words are clustered into topics. The toy sketch below makes this loop concrete.
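To make the two probabilities concrete, here is a minimal toy sketch of the loop above in plain Python. Note that this is only for intuition: holding out the current assignment, the fixed iteration cap, and the greedy reassignment are my own simplifications, and the Gensim model we train later uses a different (variational) training procedure internally, not this loop.

# Toy sketch of the reassignment loop described above (intuition only)
import random

random.seed(0)
docs = [['apple', 'banana', 'fruits'],
        ['bought', 'bicycle', 'recently', 'less', 'two', 'years', 'buy', 'bike'],
        ['colour', 'apple', 'bicycle', 'red']]
k = 2  # number of topics, decided beforehand

# Step 1: assign a random topic to every word in every document.
assignments = [[random.randrange(k) for _ in doc] for doc in docs]

for _ in range(50):  # Step 3: repeat until the assignments stabilise
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            assignments[d][i] = None  # hold the current assignment out
            scores = []
            for t in range(k):
                # P(topic | document): share of this document's other words on topic t
                p_t_d = sum(a == t for a in assignments[d]) / max(len(doc) - 1, 1)
                # P(word | topic): share of topic t's assignments (all docs) for this word
                topic_n = sum(a == t for dd in assignments for a in dd)
                word_n = sum(a == t and docs[j][w] == word
                             for j, dd in enumerate(assignments)
                             for w, a in enumerate(dd))
                p_w_t = word_n / topic_n if topic_n else 0.0
                scores.append(p_t_d * p_w_t)
            # Step 2: reassign the topic maximising P(topic | document) * P(word | topic).
            assignments[d][i] = max(range(k), key=scores.__getitem__)

print(assignments)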
I hope you got the intuition of how LDA works. Now, let us use a proper implementation in Python: the Gensim library's LdaModel. Let's start by importing the libraries.
# Import libraries
import gensim
from gensim import corpora
I have created a list of documents by removing the stopwords from the above sentences. Using corpora.Dictionary, I have mapped the words to integers.
# Create data
documents = [['apple', 'banana', 'fruits'],
             ['bought', 'bicycle', 'recently', 'less', 'two', 'years', 'buy', 'bike'],
             ['colour', 'apple', 'bicycle', 'red']]
mapping = corpora.Dictionary(documents)
data = [mapping.doc2bow(doc) for doc in documents]
Let us print data and see how it looks. Each word is mapped to a unique integer, and each document becomes a list of (word_id, count) pairs: the Bag of Words representation of the document.
# Print data
data
[[(0, 1), (1, 1), (2, 1)],
[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
[(0, 1), (3, 1), (11, 1), (12, 1)]]
Now, we will train an LdaModel from Gensim. The inputs are the data, the number of topics (num_topics), the id-to-word mapping (id2word), and the number of training passes over the corpus (passes).
# Train LDA model
ldamodel = gensim.models.ldamodel.LdaModel(data, num_topics=2, id2word=mapping, passes=15)
The model has been trained. Let us see the topic distribution of words.
# Show topics
topics = ldamodel.show_topics()
print(topics)
[(0, '0.167*"apple" + 0.154*"banana" + 0.154*"fruits" + 0.054*"colour" + 0.054*"red" + 0.053*"bicycle" + 0.052*"less" + 0.052*"bought" + 0.052*"recently" + 0.052*"years"'), (1, '0.136*"bicycle" + 0.082*"buy" + 0.082*"bike" + 0.082*"two" + 0.082*"years" + 0.082*"recently" + 0.082*"bought" + 0.082*"less" + 0.081*"red" + 0.081*"colour"')]
This is how the words are distributed across the topics, each with a probability. If we observe carefully, the first topic (id 0) is about fruits and the second topic (id 1) is about vehicles. Try this with larger datasets and analyze the topic distributions. It will be fun!!
# Distribution of topics for the first document
print(ldamodel.get_document_topics(data[0]))
[(0, 0.8676003), (1, 0.13239971)]
Looking at the topic distribution of this document, it is mostly about fruits (the first topic, with probability of about 0.87).
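As a follow-up, we can print the topic mixture of every training document, and also score a new, unseen document by converting it to Bag of Words with the same mapping first. The new sentence below is a made-up example of my own.

# Topic distribution for every training document
for i, bow in enumerate(data):
    print('Document', i, '->', ldamodel.get_document_topics(bow))

# Infer topics for an unseen document using the same dictionary
new_doc = ['apple', 'banana', 'red']  # hypothetical example tokens
new_bow = mapping.doc2bow(new_doc)
print('New document ->', ldamodel.get_document_topics(new_bow))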
Please try with different datasets and analyze the output. Post your results in the comment section below.
This is the end of the article. Thank you so much for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog. If you have any queries, please do comment in the comment section below. I will be more than happy to help you. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.