Differential Diagnosis

Medical / Technology / Education feed

Using the right tools for the job

Since this blog went up I've fiddled with some text analysis: analysing the text of each post and recommending similar blog entries. Did it all in PHP and MySQL just to understand how the algorithms work. Eventually it started to take about 5 hours to:

  • tokenise and stem the text
  • calculate the tf and idf values
  • calculate the document similarity using cosine similarity

It involved 3 home-grown PHP classes and hundreds of lines of code. (Even did Latent Dirichlet Allocation and Gibbs sampling in PHP for a laugh, but that's another story.)

Thought this must be easier in Python. Turns out it takes just 5 seconds and about 12 lines of code. Rediscovering Python has been great and it really is the right tool for the job.
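For a feel of what those three steps look like in Python, here is a minimal sketch using only the standard library. It is not the code behind this blog: stemming is skipped, the function names and the sample posts are made up for illustration, and a version built on a text-analysis library would likely be shorter.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document."""
    tokenised = [tokenise(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample posts, purely for illustration.
posts = [
    "Rediscovering python for text analysis",
    "Text analysis in python with numpy",
    "Notes on teaching medical students",
]
vectors = tfidf_vectors(posts)
similarity = [[cosine(a, b) for b in vectors] for a in vectors]
```

Here similarity[i][j] is the score between post i and post j, so the recommendations for a post are just the other posts sorted by that row.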

Then added in the whole (not stemmed) keywords, but foolishly iterated over an entire NumPy array (2000 x 25000), looking at every value, when I should have been using ndenumerate() ... what took 10 minutes now takes 20 seconds.
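Roughly what that change looks like, as a sketch on a small made-up array. The `weights` matrix, the threshold and the variable names are all assumptions for illustration, not the blog's real data or code. The last line shows a vectorised alternative with np.argwhere, which avoids the Python-level loop entirely.

```python
import numpy as np

# Hypothetical stand-in for the document-by-term matrix
# (far smaller than the real 2000 x 25000 one).
rng = np.random.default_rng(0)
weights = rng.random((20, 50))
threshold = 0.95  # illustrative cut-off, not a value from the blog

# Index-by-index pattern: nested Python loops with explicit lookups.
nested = []
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        if weights[i, j] > threshold:
            nested.append((i, j))

# np.ndenumerate yields ((row, col), value) pairs from a single loop.
enumerated = [(i, j) for (i, j), value in np.ndenumerate(weights) if value > threshold]

# Vectorised alternative: let NumPy find the matching indices itself.
vectorised = [(int(i), int(j)) for i, j in np.argwhere(weights > threshold)]
```

All three lists hold the same (row, column) pairs in the same row-major order; only the amount of work done in Python differs.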

php blog text using right similarity numpy ndenumerate