For example, if you search for "fishes", it would be nice to also retrieve documents that contain "fish" as one of their words. That is why it is suggested that you use a stemmer: the words in your documents are reduced to their roots, which makes retrieval much wider than before.
A stemmer will take "fishes" and reduce it to "fish". It stems your search keyword in the same way, so the search then returns the documents you want.
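To make the idea concrete, here is a minimal Ruby sketch of stemming both the documents and the query before matching. The `toy_stem` method is a made-up, deliberately naive suffix stripper (it is not Porter or Snowball), and `build_index`, `search` and the sample documents are names and data I invented for illustration.

```ruby
require "set"

# Toy suffix stripper for illustration only; NOT the Porter or Snowball algorithm.
def toy_stem(word)
  w = word.downcase
  if w.end_with?("es") && w.length > 4
    w[0...-2]
  elsif w.end_with?("s") && w.length > 3
    w[0...-1]
  else
    w
  end
end

# Map each stem to the set of ids of the documents that contain it.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  docs.each do |id, text|
    text.scan(/[a-z]+/i).each { |token| index[toy_stem(token)] << id }
  end
  index
end

# Stem the query the same way as the documents, then look it up.
def search(index, query)
  index.fetch(toy_stem(query), Set.new).to_a.sort
end

# Hypothetical sample documents.
docs = {
  1 => "I caught a fish in the lake",
  2 => "Two fishes swam past the boat",
  3 => "The boat was empty"
}
index = build_index(docs)
p search(index, "fishes") # prints [1, 2]
```

Because the query goes through the same stemmer as the documents, searching for "fish" or "fishes" gives the same result. A real stemmer would simply replace `toy_stem`.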
There are many types of automatic stemming techniques. The ones I found are:
- Brute Force
- Suffix Stripping
- Lemmatisation
- Stochastic
- Hybrid
- Affix
- Matching
I didn't find a quick solution on the internet, and I think that to get one I will have to dig into the papers published in this field.
Anyway, in my two projects I was using Ruby, and in both I used a stemmer belonging to the Affix class.
I found two solutions, one called Porter and another called Snowball.
I tried both on a small list of 116 terms that I have in my DB, and the results were the same except in these cases (listed as term -- Porter -- Snowball):
- emotionality -- emotion -- emot
- joy -- joi -- joy
- negativity -- neg -- negat
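A comparison like this is easy to automate. The sketch below assumes each stemmer can be wrapped in something that responds to `call`. The two lookup tables only replay the outputs listed above (reading the columns as term, Porter, Snowball) and stand in for the real stemmers. "anger" is a made-up extra term, and its unchanged output is a placeholder, not a real stemmer result.

```ruby
# Outputs copied from the three disagreements listed above; these tables are
# stand-ins for the real stemmers, not calls into any stemming library.
PORTER_OUTPUT   = { "emotionality" => "emotion", "joy" => "joi", "negativity" => "neg" }.freeze
SNOWBALL_OUTPUT = { "emotionality" => "emot", "joy" => "joy", "negativity" => "negat" }.freeze

# Return [term, stem_a, stem_b] for every term where the two stemmers disagree.
def disagreements(terms, stemmer_a, stemmer_b)
  terms.map { |term| [term, stemmer_a.call(term), stemmer_b.call(term)] }
       .reject { |_, a, b| a == b }
end

# Stand-in stemmers: unknown terms are returned unchanged (placeholder behaviour).
porter   = ->(term) { PORTER_OUTPUT.fetch(term, term) }
snowball = ->(term) { SNOWBALL_OUTPUT.fetch(term, term) }

# "anger" is a hypothetical term on which the two stand-ins agree.
terms = %w[emotionality joy negativity anger]
disagreements(terms, porter, snowball).each do |term, a, b|
  puts "#{term} -- #{a} -- #{b}"
end
```

With real stemmers in place of the two lambdas, the same loop can be run over the full term list.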
My opinion is that on this small sample the score is 2 for Snowball and 1 for Porter.
But I still can't say which one I should use. That's why I have shown these cases in this post, to get your opinions on the matter.
One last point: Porter overstemmed "negativity" and Snowball overstemmed "emotionality".
I am waiting for your comments, as I need to make a decision and choose one to use in my new project.