
au.id.cxd.text.model

LatentSemanticIndex

Related Docs: class LatentSemanticIndex | package model


object LatentSemanticIndex extends LatentSemanticIndexWriter with LatentSemanticIndexReader with LatentSemanticIndexBuilder with LsiDocumentSearch with Serializable

Linear Supertypes
  1. Serializable
  2. Serializable
  3. LsiDocumentSearch
  4. LatentSemanticIndexBuilder
  5. LatentSemanticIndexReader
  6. LatentSemanticIndexWriter
  7. AnyRef
  8. Any

Type Members

  1. type LsiSearchSpace = (DenseMatrix[Double], DenseMatrix[Double], DenseMatrix[Double])


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def apply(docIdMap: Map[Int, Seq[String]], colTermMap: Map[Int, (String, Int, Int)], tfIdf: DenseMatrix[Double], svD: SVD[DenseMatrix[Double], DenseVector[Double]]): LatentSemanticIndex

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def buildFromCsv(inputCsv: String, docIdCols: Seq[Int], skipHeader: Boolean = true, stemTerms: Boolean = true): (Double, DenseVector[Double], LatentSemanticIndex)

    Recomputes the entire latent semantic index from the data stored in the supplied CSV file. Each row of the CSV file must include one or more document ids, with the document text content in the remaining column.

    E.g.:

        DocID1,SubID2,Text
        1, 111, "This is some text data"

    The fields should be quoted with either " or '.

    returns

    A tuple containing the entropy of the terms calculated from the data set, the contribution of each singular component to the total entropy of the document set, and the latent semantic index.

    Definition Classes
    LatentSemanticIndexBuilder
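
    The row layout expected by buildFromCsv can be sketched as follows. This is a minimal illustration and not the library's parser: `splitRow` is a hypothetical helper, it assumes the row has already been split into unquoted fields, and it assumes that all non-id columns are joined with a space to form the document text.

    ```scala
    object IndexedCsvSketch {
      // Hypothetical helper (not part of the library): split one parsed CSV row
      // into its document ids (the columns listed in docIdCols) and its text
      // (all remaining columns, joined with a single space).
      def splitRow(fields: Seq[String], docIdCols: Seq[Int]): (Seq[String], String) = {
        val ids = docIdCols.map(i => fields(i).trim)
        val text = fields.zipWithIndex
          .collect { case (f, i) if !docIdCols.contains(i) => f.trim }
          .mkString(" ")
        (ids, text)
      }
    }
    ```

    For the example row above, `docIdCols` would be `Seq(0, 1)`, leaving the third column as the text.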
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def computeSvd(tfIdf: DenseMatrix[Double]): (DenseSVD, Double, DenseVector[Double])

    Computes the SVD along with the entropy of the data set and the contribution of each singular value component.

    Definition Classes
    LatentSemanticIndexBuilder
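
    This page does not spell out how the entropy is derived from the singular values. The sketch below shows one common definition, purely as an assumption: each squared singular value is normalised to a proportion p_i, the contribution of component i is -p_i ln(p_i), and the entropy is the sum of the contributions. The object and method names are hypothetical.

    ```scala
    object SingularEntropySketch {
      // Assumed definition (not confirmed by the library documentation):
      //   p_i            = s_i^2 / sum_j s_j^2
      //   contribution_i = -p_i * ln(p_i)
      //   entropy        = sum_i contribution_i
      def entropy(singularValues: Seq[Double]): (Double, Seq[Double]) = {
        val squares = singularValues.map(s => s * s)
        val total = squares.sum
        val contributions = squares.map { v =>
          val p = v / total
          // A zero singular value contributes nothing (limit of p * ln p as p -> 0).
          if (p > 0.0) -p * math.log(p) else 0.0
        }
        (contributions.sum, contributions)
      }
    }
    ```

    Under this definition, components with near-zero contribution are natural candidates to drop when choosing how many dimensions to retain.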
  9. def computeTfIdf(terms: Seq[Array[String]]): (Map[Int, (String, Int, Int)], DenseMatrix[Double])

    Computes the tf-idf matrix from the supplied term sequences, returning it together with the column term map.

    Definition Classes
    LatentSemanticIndexBuilder
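
    As an illustration of what a tf-idf matrix contains, here is a minimal sketch in plain Scala collections. It assumes one common weighting (raw term count multiplied by ln(N / df)); the library may weight terms differently, and its column term map carries more information than the simple index-to-term map returned here. All names are hypothetical.

    ```scala
    object TfIdfSketch {
      // Returns (column index -> term, one row of tf-idf weights per document).
      // Assumed weighting: tf = raw count in the document, idf = ln(N / df),
      // where N is the number of documents and df the number containing the term.
      def tfIdf(docs: Seq[Array[String]]): (Map[Int, String], Seq[Array[Double]]) = {
        val vocab = docs.flatMap(_.toSeq).distinct.sorted
        val n = docs.size.toDouble
        // Document frequency of each term.
        val df = vocab.map(t => t -> docs.count(_.contains(t)).toDouble).toMap
        val rows = docs.map { doc =>
          val counts = doc.groupBy(identity).map { case (t, ts) => t -> ts.length.toDouble }
          vocab.map(t => counts.getOrElse(t, 0.0) * math.log(n / df(t))).toArray
        }
        (vocab.zipWithIndex.map { case (t, i) => i -> t }.toMap, rows)
      }
    }
    ```

    With this weighting, a term that appears in every document receives a weight of zero, since ln(N / N) = 0.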
  10. val docMapWriter: CsvWriter { ... /* 2 definitions in type refinement */ }

    Definition Classes
    LatentSemanticIndexWriter
  11. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  13. def extractAndReadFromPath(zipFile: String)(path: String): Try[LatentSemanticIndex]

    Reads the SVD from the supplied path.

    Definition Classes
    LatentSemanticIndexReader
  14. def extractStemmedTerms(stopwords: Seq[String], lines: Seq[String]): Seq[Array[String]]

    Extracts stemmed terms from the supplied sequence of lines.

    Definition Classes
    LatentSemanticIndexBuilder
  15. def extractTerms(stopwords: Seq[String], lines: Seq[String]): Seq[Array[String]]

    Extracts terms from the supplied sequence of lines without stemming.

    Definition Classes
    LatentSemanticIndexBuilder
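
    The general shape of term extraction without stemming can be sketched as below. The tokenisation rules are assumptions for illustration (lower-casing and splitting on runs of non-alphanumeric characters); the library's own tokeniser may behave differently, and `tokenise` is a hypothetical name.

    ```scala
    object ExtractTermsSketch {
      // Hypothetical tokeniser: lower-case each line, split on runs of
      // non-alphanumeric characters, then drop empty tokens and stopwords.
      def tokenise(stopwords: Seq[String], lines: Seq[String]): Seq[Array[String]] = {
        val stop = stopwords.map(_.toLowerCase).toSet
        lines.map { line =>
          line.toLowerCase
            .split("[^a-z0-9]+")
            .filter(t => t.nonEmpty && !stop.contains(t))
        }
      }
    }
    ```

    Each input line yields one array of terms, matching the `Seq[Array[String]]` shape that computeTfIdf accepts.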
  16. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  17. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  18. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  19. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  20. def loadStopWords(): Seq[String]

    Loads the default set of stopwords using the embedded stopwords loader.

    Definition Classes
    LatentSemanticIndexBuilder
  21. def makeSearchSpace(lsi: LatentSemanticIndex): (DenseMatrix[Double], DenseMatrix[Double], DenseMatrix[Double])

    Generates the search space for the U and V components by multiplying them by the square root of the diagonal matrix of singular values.

    Note that the search space may need to be cached.

    Definition Classes
    LsiDocumentSearch
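
    Multiplying a matrix on the right by the square root of a diagonal matrix amounts to scaling each column by the square root of the corresponding singular value. The sketch below shows that operation on plain arrays rather than Breeze matrices; `scaleColumns` and the `Matrix` alias are hypothetical names, and the matrix is assumed to have one column per singular value.

    ```scala
    object SearchSpaceSketch {
      type Matrix = Array[Array[Double]]

      // Computes m * diag(sqrt(s)): column j of m is scaled by sqrt(s(j)).
      // Assumes every row of m has exactly s.length entries.
      def scaleColumns(m: Matrix, s: Array[Double]): Matrix =
        m.map(row => row.zip(s).map { case (x, sj) => x * math.sqrt(sj) })
    }
    ```

    Because the scaled matrices depend only on the index and not on the query, they can be computed once and reused across searches, which is why caching is worth considering.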
  22. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  23. final def notify(): Unit

    Definition Classes
    AnyRef
  24. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  25. def performSearch(searchSpace: (DenseMatrix[Double], DenseMatrix[Double], DenseMatrix[Double]), vectoriser: DocumentTermVectoriser, query: Array[String], stopWords: Seq[String], lsi: LatentSemanticIndex, stemQuery: Boolean = true): Buffer[(Int, Double, Seq[String])]

    Searches the LSI model with a query supplied as an array of terms.

    Definition Classes
    LsiDocumentSearch
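
    This page does not state which similarity measure the search uses. Cosine similarity is a common choice for LSI, so the sketch below ranks document vectors against a query vector on that assumption. It works on plain arrays that are presumed to be already projected into the search space, and all names are hypothetical.

    ```scala
    object LsiRankSketch {
      // Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
      def cosine(a: Array[Double], b: Array[Double]): Double = {
        val dot = a.zip(b).map { case (x, y) => x * y }.sum
        val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
        if (norm == 0.0) 0.0 else dot / norm
      }

      // Scores every document vector against the query and returns
      // (document index, score) pairs, best match first.
      def rank(docs: Seq[Array[Double]], query: Array[Double]): Seq[(Int, Double)] =
        docs.zipWithIndex
          .map { case (d, i) => (i, cosine(d, query)) }
          .sortBy { case (_, score) => -score }
    }
    ```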
  26. def preprocessQuery(vectoriser: DocumentTermVectoriser)(query: Array[String], stopwords: Seq[String], lsi: LatentSemanticIndex, stemQuery: Boolean = true): DenseVector[Double]

    Converts the query into a term vector.

    lsi

    the LatentSemanticIndex model

    returns

    The query term vector. Currently this is a tf-idf vector for the query, based on the LSI model. Because other term-weighting methods exist, this is expected to change so that a counting trait can be supplied to calculate the term weights.

    Definition Classes
    LsiDocumentSearch
  27. def readBinary(filePath: String): Option[LatentSemanticIndex]

    Reads the model from a binary file.

    Definition Classes
    LatentSemanticIndexReader
  28. def readColTermMap(path: String): Map[Int, (String, Int, Int)]

    Reads the column term map from CSV.

    Definition Classes
    LatentSemanticIndexReader
  29. def readDocMap(path: String): Map[Int, Seq[String]]

    Reads the document index map.

    Definition Classes
    LatentSemanticIndexReader
  30. def readIndexedCsv(path: String, docIdCols: Seq[Int], skipHeader: Boolean = true): (Map[Int, Seq[String]], ListBuffer[String])

    Reads an indexed CSV file containing document ids for each row and the text of each document.

    Definition Classes
    LatentSemanticIndexBuilder
  31. def readZip(zipFile: String): Try[LatentSemanticIndex]

    Reads the model from the zip archive.

    Definition Classes
    LatentSemanticIndexReader
  32. def reduceToDimensons(lsi: LatentSemanticIndex, k: Int): LatentSemanticIndex

    The singular values can dictate how many dimensions we should retain. After manual analysis it may be decided to retain only a fixed number of dimensions. This results in an SVD of reduced dimensionality, where k is the number of principal components.

    The original SVD is

    $$ \hat{X} = U S V' $$

    where $U$ has dimension $(m \times n)$, $S$ has size $n$ and $V'$ has dimension $(n \times n)$.

    If we choose dimension $k < n$ then we have

    - $U$ with dimension $(m \times k)$
    - $S$ with size $k$
    - $V'$ with dimension $(k \times n)$

    Note that this does not reduce the original tf-idf matrix, but it does reduce the dimensions of the projections of the tf-idf matrix into the search space.

    Note that $k$ must be less than $n$.

    Definition Classes
    LsiDocumentSearch
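
    The truncation described above can be sketched on plain arrays as follows. This is an illustration only and not the library's implementation, which operates on Breeze matrices. It assumes the singular values are ordered from largest to smallest, so that the first k components are the ones worth keeping. The names are hypothetical.

    ```scala
    object TruncateSvdSketch {
      type Matrix = Array[Array[Double]]

      // Keeps the first k columns of U (m x n -> m x k), the first k singular
      // values (n -> k), and the first k rows of V' (n x n -> k x n).
      def truncate(u: Matrix, s: Array[Double], vt: Matrix, k: Int): (Matrix, Array[Double], Matrix) = {
        require(k >= 1 && k < s.length, "k must satisfy 1 <= k < n")
        (u.map(_.take(k)), s.take(k), vt.take(k))
      }
    }
    ```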
  33. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  34. val termMapWriter: CsvWriter { ... /* 2 definitions in type refinement */ }

    Definition Classes
    LatentSemanticIndexWriter
  35. def toString(): String

    Definition Classes
    AnyRef → Any
  36. def unapply(lsi: LatentSemanticIndex): (Map[Int, Seq[String]], Map[Int, (String, Int, Int)], DenseMatrix[Double], SVD[DenseMatrix[Double], DenseVector[Double]])

  37. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  39. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. def writeBinary(lsi: LatentSemanticIndex)(path: String): Option[Boolean]

    Writes the model in binary format instead of as a zip archive.

    Definition Classes
    LatentSemanticIndexWriter
  41. def writeDocIdMap(path: String, docIdMap: Map[Int, Seq[String]]): Try[Boolean]

    Writes the document id map.

    Definition Classes
    LatentSemanticIndexWriter
  42. def writeTermMap(path: String, termMap: Map[Int, (String, Int, Int)]): Try[Boolean]

    Definition Classes
    LatentSemanticIndexWriter
  43. def writeZip(index: LatentSemanticIndex)(workingPath: String)(targetPath: String): Try[Boolean]

    Definition Classes
    LatentSemanticIndexWriter
  44. def writeZipTemp(index: LatentSemanticIndex)(path: String): Try[Boolean]

    Writes the latent semantic index structure to the supplied path.

    Definition Classes
    LatentSemanticIndexWriter
