TfIdfCount

TfIdfCount creates a matrix containing the TF-IDF weight of each term in each document.

The computation takes two steps. The first computes the number of occurrences of each term in each document. The second computes the total number of occurrences of each term across all documents.

Documents are not explicitly listed by id, but their index in the supplied list is treated as the document id.
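
The two counting steps can be sketched as follows. This is a minimal illustration, not the library's implementation: the object and method names (TwoPassCountSketch, perDocumentCounts, totalCounts) are hypothetical, and immutable maps are assumed for simplicity.

```scala
object TwoPassCountSketch {
  // Step 1: occurrences of each term within each document.
  // The document id is the position of the document in the input sequence.
  def perDocumentCounts(docs: Seq[Array[String]]): Map[Int, Map[String, Double]] =
    docs.zipWithIndex.map { case (tokens, docId) =>
      val counts = tokens.groupBy(identity).map { case (term, occ) => term -> occ.length.toDouble }
      docId -> counts
    }.toMap

  // Step 2: total occurrences of each term across all documents.
  def totalCounts(perDoc: Map[Int, Map[String, Double]]): Map[String, Double] =
    perDoc.values.toSeq
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (term, pairs) => term -> pairs.map(_._2).sum }
}
```

Because the id is derived from the position, reordering the input sequence changes which id each document receives.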

The tf-idf is computed as described in the Wikipedia article:

https://en.wikipedia.org/wiki/Tf–idf

For the term frequency we use the augmented frequency, to prevent bias towards longer documents:

$$ tf(t, d) = 0.5 + 0.5 \times \frac{f_{t,d}}{\max\left\{ f_{t',d} : t' \in d \right\}} $$
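
As a worked illustration of the augmented frequency, the sketch below computes it for the terms of a single tokenised document. The names AugmentedTfSketch and augmentedTf are hypothetical and not part of this class. Only terms present in the document are covered.

```scala
object AugmentedTfSketch {
  // tf(t, d) = 0.5 + 0.5 * f(t, d) / max over t' in d of f(t', d)
  def augmentedTf(tokens: Array[String]): Map[String, Double] = {
    // Raw count f(t, d) of every term that occurs in the document.
    val raw = tokens.groupBy(identity).map { case (t, occ) => t -> occ.length.toDouble }
    if (raw.isEmpty) Map.empty[String, Double]
    else {
      // Count of the most frequent term in this document.
      val maxCount = raw.values.max
      raw.map { case (t, f) => t -> (0.5 + 0.5 * f / maxCount) }
    }
  }
}
```

For the document ("a", "a", "b"), the most frequent term "a" receives 1.0 and "b" receives 0.5 + 0.5 × 1/2 = 0.75, so every term that occurs has a weight between 0.5 and 1.0 regardless of the document's length.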

For the inverse document frequency we take the logarithm of the total number of documents N divided by the number of documents that contain the term:

$$ idf(t, D) = \log{\left[ \frac{N}{\left| \left\{ d \in D : t \in d \right\} \right|} \right]} $$
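
The inverse document frequency can be sketched in the same way. The names IdfSketch and idf are hypothetical, and the natural logarithm is an assumption, since the base is not stated above.

```scala
object IdfSketch {
  // idf(t, D) = log(N / number of documents that contain t)
  def idf(docs: Seq[Array[String]]): Map[String, Double] = {
    val n = docs.length.toDouble
    // Document frequency: each document contributes at most once per term.
    val docFreq = docs
      .flatMap(_.distinct.toSeq)
      .groupBy(identity)
      .map { case (t, occ) => t -> occ.length.toDouble }
    // Natural logarithm (assumed base).
    docFreq.map { case (t, df) => t -> math.log(n / df) }
  }
}
```

A term that appears in every document gets log(N/N) = 0, so it contributes nothing to the final weight.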

Created by cd on 7/1/17.

Linear Supertypes

Serializable, Serializable, Product, Equals, DocumentTermVectoriser, AnyRef, Any

Instance Constructors

new TfIdfCount()

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def checkMaxCount(idx: Int, cnt: Double, docCounts: Map[Int, Double]): Map[Int, Double]

Check the maximum count over all terms for the supplied document.
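
One plausible reading of this signature is sketched below: keep, per document index, the largest term count seen so far. This is an assumption about the behaviour, not the actual implementation; the wrapper object MaxCountSketch is hypothetical and an immutable map is used.

```scala
object MaxCountSketch {
  // Returns docCounts with the entry for idx raised to cnt
  // when cnt exceeds the maximum recorded so far (assumed behaviour).
  def checkMaxCount(idx: Int, cnt: Double, docCounts: Map[Int, Double]): Map[Int, Double] =
    if (cnt > docCounts.getOrElse(idx, 0.0)) docCounts.updated(idx, cnt)
    else docCounts
}
```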
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def count(data: Seq[Array[String]]): (Map[Int, (String, Int, Int)], DenseMatrix[Double])

Compute the count for the sequence of tokens found in each document. Each record in the outer sequence is considered a document. Each inner array is considered the collection of tokens within the document.
returns
(termIndexMap, TF-IDF matrix) of type (mutable.Map[Int, (String, Int, Int)], DenseMatrix[Double]). The first item is the term index map, which indicates the column each term is mapped to: the key is the column index and the value is the tuple (term, hash code, number of documents containing the term). The second item is the TF-IDF matrix: each row represents a document, and each column contains the TF-IDF of the corresponding term within that document.
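
The shape of the returned pair can be illustrated with a self-contained sketch. It is not the library's code and rests on several assumptions: a plain Array[Array[Double]] stands in for DenseMatrix[Double], columns are assigned in sorted term order, the natural logarithm is used, and a term absent from a document is given weight 0.0. The object name TfIdfMatrixSketch is hypothetical.

```scala
object TfIdfMatrixSketch {
  def count(data: Seq[Array[String]]): (Map[Int, (String, Int, Int)], Array[Array[Double]]) = {
    val n = data.length.toDouble

    // Raw term counts for each document, in input order.
    val perDoc: Seq[Map[String, Double]] = data.map { tokens =>
      tokens.groupBy(identity).map { case (t, occ) => t -> occ.length.toDouble }
    }

    // Number of documents that contain each term.
    val docFreq: Map[String, Int] =
      perDoc.flatMap(_.keys.toSeq).groupBy(identity).map { case (t, occ) => t -> occ.length }

    // Column index -> (term, hash code, number of documents containing the term).
    // Sorted term order is an assumption made to keep the sketch deterministic.
    val columns: Map[Int, (String, Int, Int)] =
      docFreq.keys.toSeq.sorted.zipWithIndex.map { case (t, col) =>
        col -> ((t, t.hashCode, docFreq(t)))
      }.toMap

    // One row per document, one column per term.
    val matrix: Array[Array[Double]] = perDoc.map { counts =>
      val maxCount = if (counts.isEmpty) 1.0 else counts.values.max
      Array.tabulate(columns.size) { col =>
        val (term, _, df) = columns(col)
        counts.get(term) match {
          case Some(f) => (0.5 + 0.5 * f / maxCount) * math.log(n / df)
          case None    => 0.0 // assumed weight for a term absent from the document
        }
      }
    }.toArray

    (columns, matrix)
  }
}
```

For the two documents ("a", "b", "a") and ("a"), the sketch yields two columns ("a" and "b") and two rows. The entry for "b" in the first row is 0.75 × log 2, and every entry for "a" is 0 because "a" occurs in both documents.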

Definition Classes
TfIdfCount → DocumentTermVectoriser
def countQuery(query: Array[String], lsi: LatentSemanticIndex): DenseVector[Double]

Count the single query array and create a vector that assigns the tf-idf term weights for the query to the indices of the term matrix that was constructed from the lsi model.

Definition Classes
TfIdfCount → DocumentTermVectoriser
def countTermDocument(row: Array[String], idx: Int, termDocs: Map[String, Map[Int, Double]], docCounts: Map[Int, Double]): (Map[String, Map[Int, Double]], Map[Int, Double])

Count the number of occurrences of the term in each document.
def countTermRow(row: Array[String], idx: Int, terms: Map[String, Double], termDocs: Map[String, Map[Int, Double]], docCounts: Map[Int, Double]): (Map[String, Double], Map[String, Map[Int, Double]], Map[Int, Double])

Update the counts and return the tuple of total and document counts.
def countTerms(row: Array[String], terms: Map[String, Double]): Map[String, Double]

Update the count of terms.
val docMaxCount: Map[Int, Double]
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def findIndex(colTermMap: Map[Int, (String, Int, Int)], term: String): Option[(Int, (String, Int, Int))]

Find the matching index for the supplied term.
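
A lookup of this shape could be sketched as below. The behaviour is assumed from the signature and the term index map layout (column index -> (term, hash code, document count)); comparing the stored hash code before the string is an assumption, and the wrapper object FindIndexSketch is hypothetical.

```scala
object FindIndexSketch {
  // Returns the (column index, entry) pair whose term equals the supplied term, if any.
  def findIndex(colTermMap: Map[Int, (String, Int, Int)],
                term: String): Option[(Int, (String, Int, Int))] = {
    val wanted = term.hashCode
    // Cheap hash comparison first, then an exact string comparison.
    colTermMap.find { case (_, (t, hash, _)) => hash == wanted && t == term }
  }
}
```

Returning an Option lets the caller skip query terms that never appeared in the indexed documents.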
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
val termCount: Map[String, Double]

term -> count
val termDocumentCount: Map[String, Map[Int, Double]]

term -> document x count
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Doc: package count

case class TfIdfCount() extends DocumentTermVectoriser with Product with Serializable

