Package

au.id.cxd.text

model

package model

Type Members

  1. class LatentSemanticIndex extends Serializable

    A class that holds the data of a latent semantic index constructed by an indexing pipeline.

    This type has a companion object that can read and write the associated data to and from a zip archive.

    The data comprises a TF-IDF document-term matrix, a term-to-column mapping, a document-id-to-row mapping, and the SVD of the document-term matrix.

    Created by cd on 10/1/17.
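
    As an illustration of the zip-based persistence described above, here is a minimal sketch that round-trips only the term-to-column mapping through an in-memory zip archive using `java.util.zip`. The object name `TermMapZipSketch`, the entry name `termColumnMap.csv`, and the CSV layout are assumptions made for this example; the archive layout used by the actual companion object is not documented on this page.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object TermMapZipSketch {
  // Hypothetical entry name; the real archive layout is an assumption here.
  val entryName = "termColumnMap.csv"

  /** Write a term -> column index mapping as one CSV entry in a zip archive. */
  def write(termToColumn: Map[String, Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(bytes)
    zip.putNextEntry(new ZipEntry(entryName))
    val text = termToColumn.toSeq
      .sortBy { case (_, col) => col }
      .map { case (term, col) => s"$term,$col" }
      .mkString("\n")
    zip.write(text.getBytes(StandardCharsets.UTF_8))
    zip.closeEntry()
    zip.close()
    bytes.toByteArray
  }

  /** Read the mapping back from the archive bytes. */
  def read(archive: Array[Byte]): Map[String, Int] = {
    val zip = new ZipInputStream(new ByteArrayInputStream(archive))
    try {
      // Skip forward to the entry that holds the mapping.
      var entry = zip.getNextEntry
      while (entry != null && entry.getName != entryName) entry = zip.getNextEntry
      if (entry == null) Map.empty[String, Int]
      else {
        val out = new ByteArrayOutputStream()
        val buf = new Array[Byte](4096)
        var n = zip.read(buf)
        while (n != -1) { out.write(buf, 0, n); n = zip.read(buf) }
        new String(out.toByteArray, StandardCharsets.UTF_8)
          .split("\n")
          .filter(_.nonEmpty)
          .map { line =>
            // Split on the last comma so terms containing commas survive.
            val i = line.lastIndexOf(',')
            (line.substring(0, i), line.substring(i + 1).toInt)
          }
          .toMap
      }
    } finally zip.close()
  }
}
```

    The matrices and the document-id mapping could be stored as further entries in the same archive in the same way; terms containing line breaks are not handled by this sketch.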

  2. trait LatentSemanticIndexBuilder extends AnyRef

    the pipeline that is used to build the index.

    - build the TF-IDF matrix
    - perform the SVD
    - capture the entropy in the data set and the component contribution for each singular component
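
    The first and last pipeline steps can be sketched in plain Scala. This is an illustration under stated assumptions, not the builder's actual code: the object and method names are hypothetical, TF-IDF is taken as raw term count times log(N / document frequency), the component contribution is taken as each squared singular value divided by the sum of squares, and entropy is taken as the Shannon entropy of those contributions. The SVD step itself is left to a linear algebra library and is not shown.

```scala
object LsiPipelineSketch {
  /** TF-IDF weights: rows are documents, columns follow `vocab` order. */
  def tfidf(docs: Seq[Seq[String]], vocab: Seq[String]): Array[Array[Double]] = {
    val n = docs.size.toDouble
    // Document frequency: the number of documents that contain each term.
    val df = vocab.map(t => docs.count(_.contains(t)).toDouble)
    docs.map { doc =>
      vocab.zip(df).map { case (t, d) =>
        val tf = doc.count(_ == t).toDouble
        if (d == 0.0) 0.0 else tf * math.log(n / d)
      }.toArray
    }.toArray
  }

  /** Share of the total squared singular value carried by each component.
    * Assumes at least one singular value is non-zero.
    */
  def contribution(singularValues: Seq[Double]): Seq[Double] = {
    val total = singularValues.map(s => s * s).sum
    singularValues.map(s => s * s / total)
  }

  /** Shannon entropy (natural log) of the contribution distribution. */
  def entropy(singularValues: Seq[Double]): Double = {
    val ps = contribution(singularValues).filter(_ > 0.0)
    0.0 - ps.map(p => p * math.log(p)).sum
  }
}
```

    With these definitions, a term that appears in every document receives a weight of zero, and a flat singular value spectrum gives the maximum entropy for its number of components.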

  3. trait LatentSemanticIndexReader extends AnyRef

    a reader trait for the latent semantic index.

  4. trait LatentSemanticIndexWriter extends AnyRef

    a writer trait for the latent semantic index data set.

  5. trait LsiComponentCluster extends AnyRef

    The LsiComponentCluster approach uses the selected set of k components to generate k clusters.

    Documents are associated with clusters through the U matrix, and terms are associated with clusters through the Vt matrix.

    If there are k components, the U matrix has dimension (m x k), where m is the number of documents, and the Vt matrix has dimension (k x n), where n is the number of attributes.

    Additionally, the correlation matrices for attributes can be used to select highly correlated attributes once they have been clustered.

    This produces a cluster labelling for documents and terms based on the components defined in S.

    Created by cd on 13/1/17.
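
    One plausible way to turn the U (m x k) and Vt (k x n) matrices described above into cluster labels is to assign each document or term to the component on which it has the largest absolute loading. The sketch below does exactly that with plain arrays. The assignment rule and all names are assumptions made for illustration; the trait's actual rule is not stated on this page.

```scala
object ComponentClusterSketch {
  /** Index of the entry with the largest absolute value (first one on ties).
    * Assumes the array is non-empty, that is, k >= 1.
    */
  def argMaxAbs(xs: Array[Double]): Int =
    xs.indices.maxBy(i => math.abs(xs(i)))

  /** Label each document (a row of U, m x k) with its dominant component. */
  def documentClusters(u: Array[Array[Double]]): Array[Int] =
    u.map(argMaxAbs)

  /** Label each term (a column of Vt, k x n) with its dominant component. */
  def termClusters(vt: Array[Array[Double]]): Array[Int] = {
    val n = if (vt.isEmpty) 0 else vt(0).length
    Array.tabulate(n)(j => argMaxAbs(vt.map(row => row(j))))
  }
}
```

    Because both label sets index the same k components, documents and terms that share a label can be read as belonging to the same cluster.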

  6. trait LsiDocumentSearch extends AnyRef

    Implementation of document search for a supplied input query array. Note that the input query should contain terms that already exist in the LSI model. The LSI model contains a term map; any query terms not found in the vocabulary are returned with the result. The cosine distance is returned unnormalised.
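
    The query handling described above can be sketched as follows. This is a simplified illustration, not the trait's implementation: it scores documents directly in term space rather than in the reduced LSI space, it reads "unnormalised" as a plain dot product without dividing by the vector norms, and it assumes the term map assigns column indices 0 until the map size. All names are hypothetical.

```scala
object DocumentSearchSketch {
  /** Map a query onto term columns; returns (query vector, unknown terms). */
  def queryVector(query: Array[String],
                  termToColumn: Map[String, Int]): (Array[Double], Seq[String]) = {
    val (known, unknown) = query.toSeq.partition(termToColumn.contains)
    val vec = new Array[Double](termToColumn.size)
    // Count each known query term in its column.
    known.foreach(t => vec(termToColumn(t)) += 1.0)
    (vec, unknown)
  }

  /** Unnormalised cosine score: the plain dot product of two vectors. */
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Score every document row against the query, best match first.
    * Returns (ranked (row index, score) pairs, query terms not in the term map).
    */
  def search(query: Array[String],
             termToColumn: Map[String, Int],
             docs: Array[Array[Double]]): (Seq[(Int, Double)], Seq[String]) = {
    val (vec, unknown) = queryVector(query, termToColumn)
    val ranked = docs.toSeq.zipWithIndex
      .map { case (row, i) => (i, dot(vec, row)) }
      .sortBy { case (_, score) => -score }
    (ranked, unknown)
  }
}
```

    Returning the unknown terms alongside the ranking lets a caller report which parts of the query had no effect on the scores.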
