Package

count

package count

Type Members

trait DocumentTermVectoriser extends AnyRef

Created by cd on 12/1/17.
case class TfIdfCount() extends DocumentTermVectoriser with Product with Serializable

TfIdfCount creates a matrix containing the tf-idf value of each term in each document.
The computation takes two steps: the first counts the occurrences of each term in each document; the second computes the total number of occurrences of each term across all documents.
Documents are not explicitly listed by id, but their index in the supplied list is treated as the document id.
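The two counting steps can be sketched as follows. This is a minimal illustration, not the code behind TfIdfCount: the object `TermCountSketch` and its methods `termCounts` and `totalCounts` are hypothetical names, and documents are assumed to arrive already tokenised as `Seq[String]`.

```scala
object TermCountSketch {
  // Step 1: occurrences of each term within each document.
  // The position of a document in `docs` serves as its document id.
  def termCounts(docs: Seq[Seq[String]]): Seq[Map[String, Int]] =
    docs.map(_.groupBy(identity).map { case (term, occ) => term -> occ.size })

  // Step 2: total occurrences of each term across all documents.
  def totalCounts(perDoc: Seq[Map[String, Int]]): Map[String, Int] =
    perDoc.flatten.groupBy(_._1).map { case (term, cs) => term -> cs.map(_._2).sum }
}
```

Under these assumptions, `TermCountSketch.termCounts(docs)(i)` holds the counts for the document at index `i` of the supplied list.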
The tf-idf is computed as described in the Wikipedia article:
https://en.wikipedia.org/wiki/Tf–idf
For the term frequency we use the augmented frequency, to prevent bias towards longer documents:
$$ tf(t, d) = 0.5 + 0.5 \times \frac{f_{t,d}}{\max\left\{f_{t',d} : t' \in d\right\}} $$
The inverse document frequency is the logarithm of the total number of documents N divided by the number of documents that contain the term:
$$ idf(t, D) = \log{\left[ \frac{N}{\left| \left\{ d \in D : t \in d \right\} \right|} \right]} $$
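The two formulas can be combined as in the sketch below. It is an illustration under stated assumptions rather than the actual TfIdfCount implementation: `TfIdfSketch`, `augmentedTf`, `idf` and `tfIdf` are hypothetical names, the input is assumed to be one term-count map per document (list index as document id), and the natural logarithm is assumed because the documentation does not state a base.

```scala
object TfIdfSketch {
  // Augmented term frequency: 0.5 + 0.5 * f(t,d) / max f(t',d).
  // An empty document yields an empty map.
  def augmentedTf(counts: Map[String, Int]): Map[String, Double] =
    if (counts.isEmpty) Map.empty
    else {
      val maxCount = counts.values.max.toDouble
      counts.map { case (term, f) => term -> (0.5 + 0.5 * f / maxCount) }
    }

  // Inverse document frequency: log(N / number of documents containing the term).
  def idf(perDoc: Seq[Map[String, Int]]): Map[String, Double] = {
    val n = perDoc.size.toDouble
    perDoc.flatMap(_.keys).groupBy(identity).map {
      case (term, docsWithTerm) => term -> math.log(n / docsWithTerm.size)
    }
  }

  // tf-idf value of every term in every document.
  def tfIdf(perDoc: Seq[Map[String, Int]]): Seq[Map[String, Double]] = {
    val idfs = idf(perDoc)
    perDoc.map(counts => augmentedTf(counts).map { case (term, tf) => term -> tf * idfs(term) })
  }
}
```

Note that a term occurring in every document gets an idf of zero, so its tf-idf value is zero regardless of how often it occurs.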
Created by cd on 7/1/17.
