Thinking Allowed

medical / technology / education / art / flub

Understanding Latent Dirichlet Allocation with Gibbs Sampling by coding it from scratch.

Latent Dirichlet Allocation (LDA) is a machine-learning technique that, by the magic of many (many, many) small calculations, can detect patterns in data and, for example, cluster documents into similar topics.

I previously used LDA at OnExamination to provide an experimental function for finding similar MCQs, to easily detect duplicates (or very near copies) in the database, and to attempt to automatically classify new questions into topics (I didn't get that to happen). At that time I used the C++ code for GibbsLDA++, which is available on SourceForge, and adapted it slightly with a Python pre-processor. It had to run offline, which was achieved by downloading the data in a format that the code could use and then uploading the model data back for the website to use.

However, I was aware that I didn't really understand exactly how the algorithm worked, so I had a mini battle to construct one in PHP/MySQL and got it to work. Yes, I know it is clearly the wrong language to do it in, but the principle was the challenge. The article by Ethen Liu really helped. Why not use a language like PHP, without any built-in matrix or sampling functions, I thought. Obviously it only works on very small collections of documents, but by coding something you really get to understand exactly what is happening and why. A good example of 'active learning'. The main lesson learnt: C++ is definitely a lot faster for this task than PHP.
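To give a flavour of what those many small calculations look like, here is a minimal sketch of collapsed Gibbs sampling for LDA. It is written in Python rather than PHP/MySQL, it is not the code described above, and it is only meant for toy corpora. The function name lda_gibbs, the hyperparameter values for alpha and beta, and the four tiny example documents are all assumptions made up for illustration. Like the PHP version, it deliberately avoids any matrix library and uses plain lists of counts.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a tiny corpus (illustrative sketch).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic and topic_word hold raw assignment counts.
    alpha and beta are assumed smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []                                       # topic of every token

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Take this token out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalised weight of each topic given all other tokens:
                # (how much this doc likes topic k) * (how much topic k likes word v).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                # Draw a new topic in proportion to those weights.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, vocab


# Made-up example corpus: two "medical" and two "programming" documents.
docs = [
    "heart valve murmur heart".split(),
    "valve murmur heart failure".split(),
    "php mysql code php".split(),
    "code mysql php loop".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)
```

Each row printed at the end shows how many of a document's tokens ended up in each topic. The inner loop is the whole trick: remove one token, ask which topic suits it best given everything else, put it back, and repeat a great many times. Which topic number ends up meaning what is arbitrary and depends on the random seed.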

Source: https://ethen8181.github.io/machine-learning/clustering_old...
