Term Frequency - Inverse Document Frequency
TF-IDF is the acronym for Term Frequency - Inverse Document Frequency.
TF-IDF is a numerical statistic, commonly used for information retrieval in text-based documents, that reflects how important a word is to a document in a collection or corpus. It is based on two main factors:
- Term Frequency: The number of times a particular word (or term) appears in a document.
- Inverse Document Frequency: The logarithmically scaled inverse of the fraction of documents in the corpus that contain the word.
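Written as formulas, the two factors might look like the following. This is one common formulation, not the only one. The symbols are assumptions introduced here for illustration: t is a term, d a document, D the corpus, N the number of documents in D, and f_{t,d} the raw count of t in d. The base of the logarithm and any smoothing are left unspecified, since many variants exist.

```latex
\mathrm{tf}(t, d) = f_{t,d}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```

Under this definition, a term that occurs in every document has an inverse document frequency of log(1) = 0.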
The idea behind TF-IDF is that if a word appears frequently in a document but in few other documents in the corpus, then that word is likely to be important to that document. Conversely, if a word appears frequently in many documents, it is probably not very important for distinguishing between them.
The TF-IDF value for a word in a document is calculated as the product of its TF and IDF values. The resulting TF-IDF score gives a high weight to terms that are frequent in the document but rare in the corpus, and a low weight to terms that are frequent in the corpus but not in the document. This helps to identify words that are important for distinguishing between documents and for information retrieval purposes.
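The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes raw counts for term frequency, the natural logarithm for inverse document frequency, no smoothing, and a corpus that is already split into word tokens. The function name `tf_idf` and the three-sentence sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_counts = Counter(doc)  # term frequency as a raw count
        scores.append({
            # TF-IDF is the product of term frequency and
            # inverse document frequency (natural log, unsmoothed).
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


# Hypothetical sample corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(corpus)
```

In this sample, "the" occurs in all three documents, so its score is zero everywhere even though it is the most frequent word. In the first document, "mat" occurs in no other document and therefore scores higher than "cat", which also appears in the third.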
- Abbreviation: TF-IDF