recompute the entire latent semantic index from the data stored in the supplied CSV file. The CSV file must include one or more document id columns in each row, with the document text content in the remaining column.
E.g. a header DocID1,SubID2,Text followed by rows such as
1,111,"This is some text data"
The fields should be quoted with either " or '.
The result includes the latent semantic index, the entropy of the terms calculated from the data set, and the contribution of each singular component to the total entropy of the document set.
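As a sketch of the expected input handling (the function name and header handling here are illustrative assumptions, not the library's API), every column except the last is treated as a document id and the last column as the text:

```python
import csv
import io

def read_indexed_csv(text, has_header=True):
    """Parse rows where every column except the last is a document id
    and the last column is the document text."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)  # skip the header row
    docs = {}
    for row in reader:
        if not row:
            continue
        *ids, content = row  # leading columns are ids, last is the text
        docs[tuple(ids)] = content
    return docs

sample = 'DocID1,SubID2,Text\n1,111,"This is some text data"\n'
docs = read_indexed_csv(sample)
```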
compute the SVD along with the entropy of the data set and the contribution of each singular value component.
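One common way to derive an entropy and per-component contributions from the singular values (a sketch under the assumption that the squared singular values are normalised into a probability distribution; the library may normalise differently) is:

```python
import math

def singular_value_entropy(singular_values):
    """Entropy of the normalised squared singular values, plus the
    per-component contributions p_i = s_i^2 / sum_j s_j^2."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    contributions = [q / total for q in squares]
    entropy = -sum(p * math.log(p) for p in contributions if p > 0.0)
    return entropy, contributions

entropy, contribs = singular_value_entropy([3.0, 2.0, 1.0])
```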
compute the TfIdf matrix
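A minimal tf-idf sketch in pure Python (raw term counts and idf = log(N / df) are assumptions; the library may use a different weighting variant):

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Rows are documents, columns are the sorted vocabulary terms.
    tf is the raw count; idf = log(N / df)."""
    n = len(documents)
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted({t for c in counts for t in c})
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    return vocab, [
        [c[t] * math.log(n / df[t]) for t in vocab] for c in counts
    ]

vocab, matrix = tfidf_matrix(["cat sat", "cat ran", "dog ran"])
```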
read the SVD from the supplied path.
extract stemmed terms from the supplied sequence of lines
extract terms without stemming.
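The unstemmed path can be sketched as a plain tokenizer (the function name and the split-on-letters rule are illustrative assumptions):

```python
import re

def extract_terms(lines, stopwords=frozenset()):
    """Lowercase, split on runs of letters, drop stopwords; no stemming."""
    terms = []
    for line in lines:
        for token in re.findall(r"[a-z]+", line.lower()):
            if token not in stopwords:
                terms.append(token)
    return terms

terms = extract_terms(["This is some text data"], stopwords={"this", "is"})
# terms == ["some", "text", "data"]
```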
load the default set of stopwords using the embedded stopwords loader.
generate the search space for the U and V components by multiplying each by the square root of the singular value diagonal matrix.
Note that the search space may need to be cached.
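Multiplying a factor matrix by the square root of the singular value diagonal amounts to scaling column j by sqrt(s_j). A dependency-free sketch (the helper name is an assumption):

```python
import math

def scale_by_sqrt_singular_values(factor, singular_values):
    """Scale column j of a factor matrix (U or V) by sqrt(s_j),
    i.e. factor @ sqrt(diag(S)) without a matrix library."""
    roots = [math.sqrt(s) for s in singular_values]
    return [[x * r for x, r in zip(row, roots)] for row in factor]

u = [[1.0, 2.0], [3.0, 4.0]]
u_space = scale_by_sqrt_singular_values(u, [4.0, 9.0])
# column 0 scaled by sqrt(4) = 2, column 1 scaled by sqrt(9) = 3
```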
search in the LSI model with an array query.
convert the query into a term vector.
: LatentSemanticIndex
the query term vector. Currently the term vector is a TF-IDF vector for the query based on the LSI model. However, there are other methods of weighting terms, so this will be changed to accept a counting trait that calculates the term weights.
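The conversion can be sketched as mapping query terms onto the model vocabulary and weighting counts by the model's idf values (the function name, the term-to-index map, and the idf lookup are illustrative assumptions):

```python
from collections import Counter

def query_term_vector(query_terms, term_index, idf):
    """Build a vector over the model vocabulary, weighting each query
    term count by the model's idf value; unknown terms are ignored."""
    vector = [0.0] * len(term_index)
    for term, count in Counter(query_terms).items():
        if term in term_index:
            vector[term_index[term]] = count * idf[term]
    return vector

term_index = {"cat": 0, "dog": 1}
idf = {"cat": 0.4, "dog": 1.1}
vec = query_term_vector(["cat", "cat", "fish"], term_index, idf)
# vec[0] is 2 * idf["cat"]; the unknown term "fish" is ignored
```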
read from binary
read the column term map from CSV.
read the document index map
read an indexed CSV file containing document Ids for each row and string text for each document.
read from the zip archive.
The singular values can dictate how many dimensions we should retain. After manual analysis it may be decided that only a fixed number of dimensions should be retained.
Hence this results in an SVD of reduced dimensionality, where $k$ is the number of principal components.
The original SVD is
$$ \hat{X} = U S V' $$
where $U$ has dimension $(m \times n)$, $S$ has size $n$, and $V'$ has dimension $(n \times n)$.
If we choose dimension $k < n$ then we have
$U$ of dimension $(m \times k)$, $S$ of size $k$, and $V'$ of dimension $(k \times n)$.
Note that this does not reduce the original tf-idf matrix, but it does reduce the dimensionality of the projections of the tf-idf matrix into the search space.
Note that $k < n$.
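Truncation to $k$ components is just a slice of each factor: the first $k$ columns of $U$, the first $k$ singular values, and the first $k$ rows of $V'$. A sketch with plain lists (the function name is an assumption):

```python
def truncate_svd(u, s, vt, k):
    """Keep the first k principal components: U -> (m x k),
    S -> k values, V' -> (k x n)."""
    u_k = [row[:k] for row in u]   # first k columns of U
    s_k = s[:k]                    # first k singular values
    vt_k = vt[:k]                  # first k rows of V'
    return u_k, s_k, vt_k

u = [[1, 2, 3], [4, 5, 6]]                      # m=2, n=3
s = [9.0, 4.0, 1.0]
vt = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
u_k, s_k, vt_k = truncate_svd(u, s, vt, 2)
# u_k is 2x2, s_k has 2 values, vt_k is 2x3
```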
write the model in binary format instead of a zip archive.
write the document Id Map
write the latent semantic index structure to the supplied path.