Question : What is Term Frequency ?
Answer : Term frequency is the number of times a term occurs in a document.
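As a minimal sketch of this idea, the snippet below counts raw term frequencies for a single document. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; real text would need more careful tokenization.

```python
from collections import Counter

def term_frequency(document):
    """Count how often each term occurs in one document."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    tokens = document.lower().split()
    return Counter(tokens)

tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2
print(tf["cat"])  # 1
```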

Question : What is IDF ?
Answer : With term frequency alone, very common words such as "a" and "the" get high counts in almost every document. If we compare two documents using these raw counts only, the results will be poor, because every document is dominated by the same common words. Inverse document frequency (IDF) corrects for this by giving less weight to words that occur in many documents.
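To make the effect concrete, here is a rough sketch that uses one common form of IDF, log(N / n_t), where N is the number of documents and n_t is the number of documents containing the term. The function name, the toy documents and the choice to return 0 for unseen terms are all assumptions of this sketch.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / n_t) for a term over a list of documents."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [set(doc.lower().split()) for doc in documents]
    containing = sum(1 for tokens in tokenized if term in tokens)
    if containing == 0:
        return 0.0  # convention chosen for this sketch only
    return math.log(len(documents) / containing)

docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
print(inverse_document_frequency("the", docs))   # 0.0, "the" is in every document
print(inverse_document_frequency("bird", docs))  # log(3), "bird" is in one document
```

A word that appears in every document gets a weight of zero, while a word that appears in only one document gets the largest weight.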

Question: Interesting. But how does Term Frequency – Inverse Document Frequency (tf-idf) work?
Answer: Term frequency, written f(t,d), counts the number of times a term t occurs in a document d, so it captures how important a word is within that document. Where it fails is with words that are common throughout the language, such as "a", "an" and "the". So how do we handle them? A brute-force way is to build a dictionary of such common words, or a dictionary that assigns each word a manually annotated commonality score, and use it. However, such a dictionary will not adapt to changes in the language, and it requires human labor, which might not be accurate enough.
A better, statistical approach is to compare the number of documents that contain the word with the total number of documents. This is called inverse document frequency.
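A common form is idf(t,D) = log(N / n_t), where N is the total number of documents in D and n_t is the number of documents that contain t.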
To get the best of both worlds, we combine the two into tf-idf:
tfidf(t,d,D) = tf(t,d) * idf(t,D)
A high tf-idf score is obtained when the term t occurs often in the document d and rarely across the whole collection of documents D.
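The product above can be sketched in a few lines. This is an illustrative version only: it assumes raw counts for tf, log(N / n_t) for idf, and whitespace tokenization, and the function name `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf score} dict per document."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        scores.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(corpus)
print(scores[0]["the"])                      # 0.0
print(scores[2]["bird"] > scores[0]["cat"])  # True
```

Here "the" scores zero even though it is the most frequent word, while "bird", which occurs in only one document, scores higher than "cat", which occurs in two.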

Question : Can tf-idf be used in scenarios other than words?
Answer: Yes, it can be used in situations other than simple word weighting. It can be extended to sentences or n-grams. Tf-idf has also been applied to citations and to object identification. However, tf-idf is not better in every case, as was observed with citations.
In the citation case, the researchers' hypothesis was that if a document has some uncommon citations and a common citation shared by many documents, then the uncommon citations should be weighted more. See the Wikipedia article on tf-idf for references.
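As a small illustration of the n-gram extension mentioned above, the sketch below turns a document into word bigrams. The helper name `ngrams` and the whitespace tokenization are assumptions; the resulting n-grams could then be counted and weighted in the same way as single words.

```python
def ngrams(document, n=2):
    """Return the word n-grams of a document as space-joined strings."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokens = document.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the cat sat on the mat"))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```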
