Tuesday, February 17, 2009

Which Stemming Algorithm to Use?

Suppose you want to search for something in a full database of words. Exact matching is one solution, but it will miss some data you may need.

For example, if you search for "fishes", it would be nice to also retrieve documents that have "fish" as one of their words. That's why it is suggested that you use a stemmer, so that the words in your documents are reduced to their root form, making retrieval much wider than before.

A stemmer will take "fishes" and reduce it to "fish". Your search keyword is stemmed the same way, and the search then matches on stems to return the documents you want.
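
Here is a minimal sketch of that idea in Ruby. It assumes a toy stemmer (toy_stem) that only strips a few plural endings, so it is not Porter or Snowball, and the helper names and sample documents are made up for the example.

```ruby
require 'set'

# Toy stemmer: strips a few common plural endings. Illustration only;
# a real stemmer (Porter, Snowball) applies many more rules.
def toy_stem(word)
  w = word.downcase
  return w[0...-2] if w.length > 4 && w.end_with?('shes', 'ches', 'sses', 'xes')
  return w[0...-1] if w.length > 3 && w.end_with?('s') && !w.end_with?('ss')
  w
end

# Build an inverted index from stem to the set of document ids containing it.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  docs.each do |id, text|
    text.scan(/[a-zA-Z]+/) { |word| index[toy_stem(word)] << id }
  end
  index
end

# Stem the query the same way as the documents, then look it up.
def search(index, query)
  index.fetch(toy_stem(query), Set.new).to_a.sort
end

# Hypothetical documents, made up for this example.
docs = {
  1 => 'I caught a fish today',
  2 => 'Fishes swim in the river',
  3 => 'The weather was nice'
}

index = build_index(docs)
p search(index, 'fishes')  # => [1, 2]
```

With exact matching, the same query would only have returned document 2.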

There are many types of automatic stemming techniques. The ones I found are:
  • Brute Force
  • Suffix Stripping
  • Lemmatisation
  • Stochastic
  • Hybrid
  • Affix
  • Matching
This is a long list of algorithms, but the question I couldn't find an answer to is which one to choose, in which cases, and why.
I didn't find a quick answer on the internet, and I think that to get one I will have to dig into the papers published in that field.

Anyway, in my two projects I was using Ruby, and in both I used a stemmer belonging to the Suffix Stripping class.
I found two solutions, one called Porter and another called Snowball.

I tried both on a small list of 116 terms that I have in my DB, and the results were the same except in these cases:
  • emotionality -- emotion -- emot
  • joy -- joi -- joy
  • negativity -- neg -- negat
The first is the actual word, the second is the Porter result, and the third is the Snowball result.
In my opinion, the score on this small sample is 2 for Snowball and 1 for Porter.
BUT I still can't say which one I should use. That's why I have shown these cases in this post, to get your opinions on the matter.
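
If you want to repeat this kind of comparison on your own term list, here is a rough sketch in Ruby. The two lambdas are only stand-ins: they are lookup tables holding the outputs listed above and return any other word unchanged, so you would replace them with calls into whatever stemmer libraries you actually use. The name stem_differences and the extra term "fish" are my own for the example.

```ruby
# Return [term, stem_a, stem_b] for every term where the two stemmers disagree.
def stem_differences(terms, stemmer_a, stemmer_b)
  terms.map { |t| [t, stemmer_a.call(t), stemmer_b.call(t)] }
       .reject { |_term, a, b| a == b }
end

# Stand-ins for the real stemmers: lookup tables holding only the outputs
# reported in this post; any other word is returned unchanged.
# In a real run, replace these lambdas with calls into your stemmer libraries.
porter_results   = { 'emotionality' => 'emotion', 'joy' => 'joi', 'negativity' => 'neg' }
snowball_results = { 'emotionality' => 'emot',    'joy' => 'joy', 'negativity' => 'negat' }
porter   = ->(word) { porter_results.fetch(word, word) }
snowball = ->(word) { snowball_results.fetch(word, word) }

terms = %w[emotionality joy negativity fish]
stem_differences(terms, porter, snowball).each do |term, a, b|
  puts "#{term} -- #{a} -- #{b}"
end
```

Only the terms where the two results differ are printed, in the same "word -- porter -- snowball" layout as the list above.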

One last point: Porter overstemmed "negativity" and Snowball overstemmed "emotionality".

I am waiting for your comments, as I need to make a decision and choose one to use in my new project.
