Yjfox.github.io

Jun Yin's Personal Blog

Follow me onGitHub

Information Retrieval - VSM vs LM

Some tips for VSM & LM when implementing them


VSM vs LM:

  • VSM: Vector Space Model, use TF-IDF as weight
  • LM: Language Model, based on Probability foundation and Naive Bayesian Model

Recently I am coding a program in python that can collect a given user's bookmark information, and then analyze the incoming webpage to find out whether it fits the user's flavor (similar with his/her bookmark).
I chose TFIDF+VSM model, it works good, but I am thinking about if LM(Probability) can work well. As based on naive bayesian theory, we need to calculate tons of Probability.
For example, if I want calculate the similarity between one webpage to another, may need to calculate all terms in the given webpage. If I choose top 300 terms with high frequency, then get P(t1,t2,t3...t300|W). It will be a nightmare when doing probability calculation as there will be 300 fraction, even we do all calculation in Double format, the computer for now still cannot get a high accuracy.
So I just think of a new way to figure out this problem - use LOG as following example

result += math.log(float(training data number) / total)

instead of

result *= float(training data number) / total

Also, as some probability will be 0 (no term appears), we may need to add 1 for all terms no matter it appears or not to avoid any probability be '0'.
As 0 will influence the result significantly. I think it is a kind of Laplace Transform

end. 2014-02-14

blog comments powered by Disqus