Information Retrieval - VSM vs LM
Some tips for implementing VSM & LM
VSM vs LM:
- VSM: Vector Space Model, uses TF-IDF as term weights
- LM: Language Model, built on a probability foundation and the Naive Bayes model
Recently I have been writing a Python program that collects a given user's bookmark information and then analyzes incoming webpages to decide whether they fit the user's taste (i.e., whether they are similar to his/her bookmarks).
I chose the TF-IDF + VSM model, and it works well; a rough sketch of that kind of scoring is below.
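Here is a minimal sketch of the general TF-IDF + cosine-similarity idea (the function and variable names are just illustrative, not the exact code of my program):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: tf-idf weight} dict per document
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(float(n) / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    # cosine similarity between two sparse {term: weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

Each bookmarked page and each incoming page becomes one weight vector, and the incoming page is scored by its cosine similarity to the bookmark vectors.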
But I have been wondering whether an LM (probabilistic) approach could also work. Since it is based on Naive Bayes theory, it requires computing a great many probabilities. For example, to score a webpage against the user's profile, we may need to consider every term in the page. If I keep the top 300 terms by frequency, I have to estimate P(t1, t2, t3, ..., t300 | W). Computing that directly is a nightmare: it is a product of 300 small fractions, and even in double precision the result underflows and all accuracy is lost.
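To see the problem concretely, here is a toy example (1e-4 is just a made-up per-term probability, not a real estimate): multiplying 300 of them underflows double precision to exactly 0.

result = 1.0
for _ in range(300):
    result *= 1e-4        # pretend each term probability is about 0.0001
print(result)             # prints 0.0 -- the product has underflowed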
So I thought of a way to work around this problem - use LOG, as in the following example:
result += math.log(float(training_data_count) / total)   # sum log-probabilities
instead of
result *= float(training_data_count) / total              # multiply raw probabilities
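The same toy example from above stays perfectly representable in log space:

import math
log_result = 0.0
for _ in range(300):
    log_result += math.log(1e-4)    # same made-up per-term probability, summed in log space
print(log_result)                   # roughly -2763.1, no underflow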
Also, since some probabilities will be 0 (when a term never appears), we need to add 1 to every term's count, whether it appears or not, so that no probability ends up being 0.
A single 0 would wipe out the whole result (and make the log undefined). This trick is known as Laplace smoothing (add-one smoothing).
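Putting the two ideas together, here is a minimal sketch of smoothed log-probability scoring (again, the names are placeholders for illustration, not my exact program):

import math
from collections import Counter

def log_prob_page_given_profile(page_terms, profile_counts, vocab_size):
    # profile_counts: Counter of term frequencies built from the user's bookmarks
    # add-one (Laplace) smoothing keeps unseen terms from zeroing the score
    total = sum(profile_counts.values())
    score = 0.0
    for t in page_terms:
        p = float(profile_counts[t] + 1) / (total + vocab_size)
        score += math.log(p)
    return score

Incoming pages can then be ranked by this score; since everything is a sum of logs, even a page with 300 terms causes no underflow.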
end. 2014-02-14