Using the right tools for the job
Since this blog went up I've been fiddling with some text analysis: analysing the text of each post and recommending similar blog entries. I did it all in PHP and MySQL just to understand how the algorithms work. Eventually a full run started to take about 5 hours to:
- tokenise and stem the text
- calculate the tf and idf values
- calculate the document similarity using cosine similarity
I thought this must be easier in Python. Turns out it takes just 5 seconds and about 12 lines of code. Rediscovering Python has been great, and it really is the right tool for the job.
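The three steps above can be sketched roughly as follows. This is not the code from this blog, just a minimal standard-library illustration: the `crude_stem` helper is a hypothetical stand-in for a real stemmer, the `posts` list is made-up sample data, and the weighting assumes plain tf * log(N / df).

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document."""
    counts = [Counter(crude_stem(t) for t in tokenise(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    idf = {term: math.log(n / df[term]) for term in df}
    return [
        {t: (f / sum(c.values())) * idf[t] for t, f in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample posts, purely for illustration.
posts = [
    "Analysing blog text with PHP and MySQL",
    "Analysing text similarity in Python with numpy",
    "Holiday photos from the coast",
]
vectors = tfidf_vectors(posts)
sim_01 = cosine(vectors[0], vectors[1])  # posts sharing some terms
sim_02 = cosine(vectors[0], vectors[2])  # posts sharing no terms
```

The first two sample posts share a few stemmed terms, so they score above zero, while the third shares none with the first and scores exactly zero.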
Then I added in the whole (unstemmed) keywords, but foolishly iterated over an entire NumPy array (2000 x 25000), looking at every value, when I should have been using ndenumerate(). What took 10 minutes now takes 20 seconds.
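For anyone who hasn't met it, here is a small sketch of how `np.ndenumerate` is used. The `weights` array is a tiny made-up stand-in for the real 2000 x 25000 matrix, and collecting the non-zero entries is only an assumed example of what the loop might be doing.

```python
import numpy as np

# Hypothetical stand-in for the document-by-term weight matrix.
weights = np.array([
    [0.0, 0.4, 0.0],
    [0.7, 0.0, 0.2],
])

# np.ndenumerate yields ((row, col), value) pairs, so there is no need for
# nested range() loops and a separate weights[row, col] lookup per cell.
nonzero = [
    (row, col, float(value))
    for (row, col), value in np.ndenumerate(weights)
    if value > 0
]
```

Each entry in `nonzero` is a `(row, col, value)` tuple, so the document index and term index come along with the weight.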