Class

au.id.cxd.text.count

TfIdfCount

case class TfIdfCount() extends DocumentTermVectoriser with Product with Serializable

The TF-IDF count creates a matrix of tf-idf weights, with one row for each document.

It requires two steps. The first computes the number of occurrences of each term in each document. The second computes the total number of occurrences of each term across all documents.

Documents are not explicitly listed by id, but their index in the supplied list is treated as the document id.
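
As a rough illustration of the two counting steps, the sketch below uses only the Scala standard library. The object `CountSketch` and its methods are hypothetical names chosen for this example and are not members of `TfIdfCount`; the actual implementation may organise its counts differently.

```scala
object CountSketch {
  // Hypothetical helper, not part of TfIdfCount.
  // Step 1: occurrences of each term within each document.
  // The document id is simply the position of the document in the input.
  def perDocumentCounts(docs: Seq[Array[String]]): Map[Int, Map[String, Double]] =
    docs.zipWithIndex.map { case (tokens, idx) =>
      idx -> tokens.groupBy(identity).map { case (term, occ) => term -> occ.length.toDouble }
    }.toMap

  // Hypothetical helper, not part of TfIdfCount.
  // Step 2: total occurrences of each term across all documents.
  def totalCounts(perDoc: Map[Int, Map[String, Double]]): Map[String, Double] =
    perDoc.values.flatten
      .groupBy(_._1)
      .map { case (term, pairs) => term -> pairs.map(_._2).sum }
}
```

For the input `Seq(Array("a", "b", "a"), Array("b", "c"))`, document 0 has a count of 2 for "a", and the total count for "b" across both documents is 2.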

The tf-idf is computed as described in the Wikipedia article:

https://en.wikipedia.org/wiki/Tf–idf

For the term frequency we use the augmented frequency, which scales each raw count by the count of the most frequent term in the same document, to prevent a bias towards longer documents:

$$ tf(t, d) = 0.5 + 0.5 \times \frac{f_{t,d}}{\max\left\{ f_{t',d} : t' \in d \right\}} $$
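
A minimal sketch of the augmented term frequency for a single document, assuming the tokens have already been split. `AugmentedTf.tf` is a hypothetical helper written for illustration and is not a member of this class.

```scala
object AugmentedTf {
  // Hypothetical helper, not part of TfIdfCount.
  // tf(t, d) = 0.5 + 0.5 * f(t, d) / max f(t', d), for the terms present in d.
  // Terms that do not occur in the document are not included in the result.
  def tf(tokens: Array[String]): Map[String, Double] = {
    // raw count f(t, d) of each term in the document
    val raw = tokens.groupBy(identity).map { case (t, occ) => t -> occ.length.toDouble }
    if (raw.isEmpty) Map.empty
    else {
      // count of the most frequent term in the document
      val maxCount = raw.values.max
      raw.map { case (t, f) => t -> (0.5 + 0.5 * f / maxCount) }
    }
  }
}
```

For the tokens `Array("a", "b", "a")` the most frequent term "a" receives 1.0 and "b" receives 0.5 + 0.5 × 1/2 = 0.75, so every term present in a document has a weight between 0.5 and 1 regardless of the document's length.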

The inverse document frequency is the logarithm of the total number of documents N divided by the number of documents that contain the term:

$$ idf(t, D) = \log{\left[ \frac{N}{\left| \left\{ d \in D : t \in d \right\} \right|} \right]} $$
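
A minimal sketch of the inverse document frequency over a small corpus. `IdfSketch.idf` is a hypothetical helper and not part of this class. The natural logarithm is an assumption here, since the base of the logarithm is not stated above.

```scala
object IdfSketch {
  // Hypothetical helper, not part of TfIdfCount.
  // idf(t, D) = log(N / |{d in D : t in d}|), using the natural logarithm (assumed).
  def idf(docs: Seq[Array[String]]): Map[String, Double] = {
    val n = docs.length.toDouble
    // document frequency: number of documents containing each term at least once
    val df = docs
      .flatMap(_.distinct.toSeq)
      .groupBy(identity)
      .map { case (t, occ) => t -> occ.length.toDouble }
    df.map { case (t, d) => t -> math.log(n / d) }
  }
}
```

With this definition a term that occurs in every document has an idf of log(1) = 0, so its tf-idf weight is zero however often it occurs.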

Created by cd on 7/1/17.

Linear Supertypes
Serializable, Serializable, Product, Equals, DocumentTermVectoriser, AnyRef, Any

Instance Constructors

  1. new TfIdfCount()


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def checkMaxCount(idx: Int, cnt: Double, docCounts: Map[Int, Double]): Map[Int, Double]

    Check the maximum count of all terms for the supplied document.

  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. def count(data: Seq[Array[String]]): (Map[Int, (String, Int, Int)], DenseMatrix[Double])

    Compute the count for the sequence of tokens found in each document. Each record in the outer sequence is considered a document. Each inner sequence is considered the collection of tokens within the document.

    returns

    A tuple (termIndexMap, TF-IDF matrix) of type (mutable.Map[Int, (String, Int, Int)], DenseMatrix[Double]). The first item is the term index map, which indicates the column each term is mapped to. The map key is the column index and the value holds the term, its hashcode and the number of documents containing the term (columnIndex x (Term x Hashcode x Number of Documents containing term)). The second item is the TF-IDF matrix. Each row represents a document, and each column contains the TF-IDF for the corresponding term within the document.

    Definition Classes
    TfIdfCount → DocumentTermVectoriser
  8. def countQuery(query: Array[String], lsi: LatentSemanticIndex): DenseVector[Double]

    Count the single query array and create a vector that assigns the tf-idf term weights for the query to the indices of the term matrix that was constructed from the LSI model.

    Definition Classes
    TfIdfCount → DocumentTermVectoriser
  9. def countTermDocument(row: Array[String], idx: Int, termDocs: Map[String, Map[Int, Double]], docCounts: Map[Int, Double]): (Map[String, Map[Int, Double]], Map[Int, Double])

    Count the number of occurrences of the term in each document.

  10. def countTermRow(row: Array[String], idx: Int, terms: Map[String, Double], termDocs: Map[String, Map[Int, Double]], docCounts: Map[Int, Double]): (Map[String, Double], Map[String, Map[Int, Double]], Map[Int, Double])

    Update the counts and return the tuple of total and document counts.

  11. def countTerms(row: Array[String], terms: Map[String, Double]): Map[String, Double]

    Update the count of terms.

  12. val docMaxCount: Map[Int, Double]

  13. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def findIndex(colTermMap: Map[Int, (String, Int, Int)], term: String): Option[(Int, (String, Int, Int))]

    Find the matching index for the supplied term.

  16. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  17. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  18. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. final def notify(): Unit

    Definition Classes
    AnyRef
  20. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  21. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  22. val termCount: Map[String, Double]

    term -> count

  23. val termDocumentCount: Map[String, Map[Int, Double]]

    term -> document x count

  24. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  25. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
