recompute the entire latent semantic index from the data stored in the supplied CSV file. The CSV file must include one or more document id columns in each row, with the document text content in the remaining column.
E.g. a header DocID1,SubID2,Text followed by rows such as
1,111,"This is some text data"
The fields should be quoted with either " or '.
The result includes the latent semantic index, the entropy of the terms calculated from the data set, and the contribution of each singular component to the total entropy of the document set.
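As a sketch of the expected input handling (the function name and header handling here are illustrative assumptions, not the library's API), every column except the last is treated as a document id and the last column as the text:

```python
import csv
import io

def read_indexed_csv(text, has_header=True):
    """Parse rows where every column except the last is a document id
    and the last column is the document text."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)  # skip the header row
    docs = {}
    for row in reader:
        if not row:
            continue
        *ids, content = row  # leading columns are ids, last is the text
        docs[tuple(ids)] = content
    return docs

sample = 'DocID1,SubID2,Text\n1,111,"This is some text data"\n'
docs = read_indexed_csv(sample)
```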
compute the SVD along with the entropy of the data set and the contribution of each singular value component.
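One common way to derive an entropy and per-component contributions from the singular values (a sketch under the assumption that the squared singular values are normalised into a probability distribution; the library may normalise differently) is:

```python
import math

def singular_value_entropy(singular_values):
    """Entropy of the normalised squared singular values, plus the
    per-component contributions p_i = s_i^2 / sum_j s_j^2."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    contributions = [q / total for q in squares]
    entropy = -sum(p * math.log(p) for p in contributions if p > 0.0)
    return entropy, contributions

entropy, contribs = singular_value_entropy([3.0, 2.0, 1.0])
```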
compute the TfIdf matrix
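A minimal tf-idf sketch in pure Python (raw term counts and idf = log(N / df) are assumptions; the library may use a different weighting variant):

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Rows are documents, columns are the sorted vocabulary terms.
    tf is the raw count; idf = log(N / df)."""
    n = len(documents)
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted({t for c in counts for t in c})
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    return vocab, [
        [c[t] * math.log(n / df[t]) for t in vocab] for c in counts
    ]

vocab, matrix = tfidf_matrix(["cat sat", "cat ran", "dog ran"])
```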
read the SVD from the supplied path.
extract stemmed terms from the supplied sequence of lines
extract terms without stemming.
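The unstemmed path can be sketched as a plain tokenizer (the function name and the split-on-letters rule are illustrative assumptions):

```python
import re

def extract_terms(lines, stopwords=frozenset()):
    """Lowercase, split on runs of letters, drop stopwords; no stemming."""
    terms = []
    for line in lines:
        for token in re.findall(r"[a-z]+", line.lower()):
            if token not in stopwords:
                terms.append(token)
    return terms

terms = extract_terms(["This is some text data"], stopwords={"this", "is"})
# terms == ["some", "text", "data"]
```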
load the default set of stopwords using the embedded stopwords loader.
generate the search space for the U and V components by multiplying each by the square root of the singular value diagonal matrix.
Note that the search space may need to be cached.
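Multiplying a factor matrix by the square root of the singular value diagonal amounts to scaling column j by sqrt(s_j). A dependency-free sketch (the helper name is an assumption):

```python
import math

def scale_by_sqrt_singular_values(factor, singular_values):
    """Scale column j of a factor matrix (U or V) by sqrt(s_j),
    i.e. factor @ sqrt(diag(S)) without a matrix library."""
    roots = [math.sqrt(s) for s in singular_values]
    return [[x * r for x, r in zip(row, roots)] for row in factor]

u = [[1.0, 2.0], [3.0, 4.0]]
u_space = scale_by_sqrt_singular_values(u, [4.0, 9.0])
# column 0 scaled by sqrt(4) = 2, column 1 scaled by sqrt(9) = 3
```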
search in the LSI model with an array query.
convert the query into a term vector.
: LatentSemanticIndex
the query term vector. Currently the term vector is a TF-IDF vector for the query based on the LSI model. However, there are other methods of weighting terms, so this will be changed to accept a counting trait that calculates the term weights.
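The conversion can be sketched as mapping query terms onto the model vocabulary and weighting counts by the model's idf values (the function name, the term-to-index map, and the idf lookup are illustrative assumptions):

```python
from collections import Counter

def query_term_vector(query_terms, term_index, idf):
    """Build a vector over the model vocabulary, weighting each query
    term count by the model's idf value; unknown terms are ignored."""
    vector = [0.0] * len(term_index)
    for term, count in Counter(query_terms).items():
        if term in term_index:
            vector[term_index[term]] = count * idf[term]
    return vector

term_index = {"cat": 0, "dog": 1}
idf = {"cat": 0.4, "dog": 1.1}
vec = query_term_vector(["cat", "cat", "fish"], term_index, idf)
# vec[0] is 2 * idf["cat"]; the unknown term "fish" is ignored
```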
read from binary
read the column term map from CSV.
read the document index map
read an indexed CSV file containing document Ids for each row and string text for each document.
read from the zip archive.
The singular values can dictate how many dimensions we should retain. After manual analysis it may be decided that only a fixed number of dimensions should be retained.
Hence this results in an SVD of reduced dimensionality, where $k$ is the number of principal components.
The original SVD is
$$ \hat{X} = U S V' $$
where $U$ has dimension $(m \times n)$, $S$ has size $n$, and $V'$ has dimension $(n \times n)$.
If we choose dimension $k < n$ then we have
$U$ of dimension $(m \times k)$, $S$ of size $k$, and $V'$ of dimension $(k \times n)$.
Note that this does not reduce the original tf-idf matrix, but it does reduce the dimensionality of the projections of the tf-idf matrix into the search space.
Note that $k < n$.
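Truncation to $k$ components is just a slice of each factor: the first $k$ columns of $U$, the first $k$ singular values, and the first $k$ rows of $V'$. A sketch with plain lists (the function name is an assumption):

```python
def truncate_svd(u, s, vt, k):
    """Keep the first k principal components: U -> (m x k),
    S -> k values, V' -> (k x n)."""
    u_k = [row[:k] for row in u]   # first k columns of U
    s_k = s[:k]                    # first k singular values
    vt_k = vt[:k]                  # first k rows of V'
    return u_k, s_k, vt_k

u = [[1, 2, 3], [4, 5, 6]]                      # m=2, n=3
s = [9.0, 4.0, 1.0]
vt = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
u_k, s_k, vt_k = truncate_svd(u, s, vt, 2)
# u_k is 2x2, s_k has 2 values, vt_k is 2x3
```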
write the model in binary format instead of a zip archive.
write the document Id Map
write the latent semantic index structure to the supplied path.