Find the maximum count across all terms in the supplied document.
Compute the count for the sequence of tokens found in each document. Each record in the outer sequence is considered a document. Each inner sequence is considered the collection of tokens within the document.
Returns a tuple (termIndexMap, TF-IDF matrix) of type (mutable.Map[Int, (String, Int, Int)], DenseMatrix[Double]). The term index map indicates which column each term is mapped to: the key is the column index, and the value holds the term, its hashcode, and the number of documents containing the term (columnIndex -> (term, hashcode, number of documents containing term)). The second item in the tuple is the TF-IDF matrix. Each row represents a document, and each column contains the TF-IDF for the corresponding term within that document.
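The shape of the term index map can be illustrated with a minimal sketch. The object name `TermIndexSketch`, the method `build`, and the choice to assign columns in sorted term order are all assumptions made for this example, as is the use of `String.hashCode` for the hashcode; the original implementation may assign columns and hashcodes differently.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of the original source.
object TermIndexSketch {
  // Builds columnIndex -> (term, term.hashCode, number of documents containing the term).
  // Columns are assigned in sorted term order (an assumption) so the result is deterministic.
  def build(docs: Seq[Seq[String]]): mutable.Map[Int, (String, Int, Int)] = {
    // Count each term at most once per document to get the document frequency.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, hits) => t -> hits.size }
    val index = mutable.Map.empty[Int, (String, Int, Int)]
    docFreq.keys.toSeq.sorted.zipWithIndex.foreach { case (term, col) =>
      index(col) = (term, term.hashCode, docFreq(term))
    }
    index
  }
}
```

Each row of the TF-IDF matrix would then use these column indices to locate the weight of a term.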
Count the terms of the single query array and create a vector that places the tfidf term weights for the query at the matching column indices of the term matrix that was constructed from the LSI model.
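As a rough illustration of the query step, the sketch below counts the query terms and writes a weight into the column of each term found in the index. The names `QueryVectorSketch` and `queryVector`, the `numDocs` parameter, the natural logarithm, and the decision to ignore query terms that are missing from the index are assumptions for this example. It also assumes the column indices run from 0 to size - 1.

```scala
// Hypothetical helper, not part of the original source.
object QueryVectorSketch {
  // termIndex: columnIndex -> (term, hashcode, number of documents containing the term).
  def queryVector(
      query: Seq[String],
      termIndex: scala.collection.Map[Int, (String, Int, Int)],
      numDocs: Int
  ): Array[Double] = {
    // Count occurrences of each term in the query.
    val counts = query.groupBy(identity).map { case (t, hits) => t -> hits.size }
    val maxCount = if (counts.isEmpty) 1 else counts.values.max
    val vec = Array.fill(termIndex.size)(0.0)
    termIndex.foreach { case (col, (term, _, docsWithTerm)) =>
      // Terms absent from the query keep a weight of 0.0 (an assumption).
      counts.get(term).foreach { c =>
        val tf = 0.5 + 0.5 * c.toDouble / maxCount
        vec(col) = tf * math.log(numDocs.toDouble / docsWithTerm)
      }
    }
    vec
  }
}
```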
count the number of occurrences of the term in each document.
update the counts and return the tuple of total and document counts
update the count of terms.
find the matching index for the supplied term.
term -> count
term -> document x count
The TF-IDF count creates a matrix containing the tfidf values for each document.
It requires two steps: the first computes the number of occurrences of each term in each document, and the second computes the total number of occurrences of each term across all documents.
Documents are not explicitly listed by id, but their index in the supplied list is treated as the document id.
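The two counting steps can be sketched as follows, using the position of each document in the input sequence as its id. The object and method names (`TermCounts`, `perDocumentCounts`, `totalCounts`) and the use of immutable maps are assumptions for illustration only.

```scala
// Hypothetical helper, not part of the original source.
object TermCounts {
  // Step 1: term -> (document index -> occurrences of the term in that document).
  def perDocumentCounts(docs: Seq[Seq[String]]): Map[String, Map[Int, Int]] =
    docs.zipWithIndex
      .flatMap { case (tokens, docId) => tokens.map(t => (t, docId)) }
      .groupBy(_._1)
      .map { case (term, pairs) =>
        term -> pairs.groupBy(_._2).map { case (docId, hits) => docId -> hits.size }
      }

  // Step 2: term -> total occurrences across all documents.
  def totalCounts(perDoc: Map[String, Map[Int, Int]]): Map[String, Int] =
    perDoc.map { case (term, byDoc) => term -> byDoc.values.sum }
}
```

For example, with the documents `Seq("a", "b", "a")` and `Seq("b", "c")`, the term "b" maps to a count of 1 in document 0 and 1 in document 1, for a total of 2.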
Compute the tf-idf as per the Wikipedia article:
https://en.wikipedia.org/wiki/Tf–idf
For the term frequency we use the augmented frequency to prevent bias towards longer documents
$$ tf(t, d) = 0.5 + 0.5 \times \frac{f_{t,d}}{\max\left\{ f_{t',d} : t' \in d \right\}} $$
For the inverse document frequency we use the logarithm of the total number of documents N divided by the number of documents that contain the term:
$$ idf(t, D) = \log{\left[ \frac{N}{\left| \left\{ d \in D : t \in d \right\} \right|} \right]} $$
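The two formulas translate fairly directly into code. The sketch below is one possible reading: the names `TfIdfSketch`, `tf`, `idf`, and `tfIdf` are hypothetical, and the natural logarithm is an assumption, since the base is not specified here.

```scala
// Hypothetical helper, not part of the original source.
object TfIdfSketch {
  // Augmented term frequency: 0.5 + 0.5 * f / (largest term count in the same document).
  def tf(count: Int, maxCountInDoc: Int): Double =
    0.5 + 0.5 * count.toDouble / maxCountInDoc

  // Inverse document frequency: log of N over the number of documents containing the term.
  // Natural logarithm is assumed.
  def idf(numDocs: Int, docsWithTerm: Int): Double =
    math.log(numDocs.toDouble / docsWithTerm)

  // The tf-idf weight is the product of the two factors.
  def tfIdf(count: Int, maxCountInDoc: Int, numDocs: Int, docsWithTerm: Int): Double =
    tf(count, maxCountInDoc) * idf(numDocs, docsWithTerm)
}
```

A term that occurs once in a document whose most frequent term occurs twice has tf = 0.5 + 0.5 × 1/2 = 0.75. A term that appears in every document has idf = log(1) = 0, so its tf-idf weight is 0 regardless of its frequency.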
Created by cd on 7/1/17.