Generate the search space for the U and V components by multiplying each of them by the square root of the diagonal matrix of singular values.
Note that the search space may need to be cached.
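The scaling step can be sketched as follows. This is a minimal illustration, not the model's actual code: the function name `scale_by_sqrt_singular_values` and the sample values are hypothetical, and `v` is assumed to hold $V$ (not $V'$), with one row per column of the original matrix.

```python
import math

def scale_by_sqrt_singular_values(matrix, singular_values):
    """Multiply column j of `matrix` by sqrt(singular_values[j])."""
    roots = [math.sqrt(s) for s in singular_values]
    return [[value * roots[j] for j, value in enumerate(row)] for row in matrix]

# Hypothetical factors (assumption): u is (3 x 2), v is (2 x 2), s has 2 singular values.
u = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
s = [4.0, 9.0]

# Computed once here; the results could be stored and reused between searches.
u_space = scale_by_sqrt_singular_values(u, s)
v_space = scale_by_sqrt_singular_values(v, s)
```

Because each side carries one square root, the product of a row of `u_space` with a row of `v_space` reproduces the corresponding entry of $U S V'$.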
Search the LSI model with an array query.
Convert the query into a term vector.
: LatentSemanticIndex
The query term vector. Currently the term vector is a TF-IDF vector for the query, based on the LSI model. However, there are other methods of weighting terms, so this will be changed to accept a counting trait that calculates the term weights.
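A sketch of how the weighting could be made pluggable, assuming the weighting scheme is passed in as a callable. All names here (`query_to_term_vector`, `term_index`, `idf`, `tfidf_weight`, `binary_weight`) and the sample values are hypothetical and are not taken from the model.

```python
from collections import Counter

def query_to_term_vector(query, term_index, weigh):
    """Build a dense term vector; `weigh(position, count)` returns the term weight."""
    # Count only the terms that exist in the vocabulary.
    counts = Counter(term for term in query if term in term_index)
    vector = [0.0] * len(term_index)
    for term, count in counts.items():
        position = term_index[term]
        vector[position] = weigh(position, count)
    return vector

# Hypothetical vocabulary and inverse document frequencies (assumption).
term_index = {"cat": 0, "dog": 1, "fish": 2}
idf = [1.0, 2.0, 0.5]

def tfidf_weight(position, count):
    # Term count multiplied by the model's IDF for that term.
    return count * idf[position]

def binary_weight(position, count):
    # Alternative scheme: presence only.
    return 1.0

query = ["dog", "dog", "fish", "bird"]
tfidf_vector = query_to_term_vector(query, term_index, tfidf_weight)
binary_vector = query_to_term_vector(query, term_index, binary_weight)
```

Swapping `tfidf_weight` for `binary_weight` changes the weights without touching the vector construction, which is the role the counting trait would play.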
The singular values can dictate how many dimensions we should retain. After manual analysis, it may be decided that only a fixed number of dimensions should be retained.
This results in an SVD of reduced dimensionality, where $k$ is the number of principal components.
The original SVD
$$ \hat{X} = U S V' $$
where $U$ has dimension $(m \times n)$, $S$ has size $n$, and $V'$ has dimension $(n \times n)$.
If we choose a dimension $k < n$, then we have
$U$ with dimension $(m \times k)$, $S$ with size $k$, and $V'$ with dimension $(k \times n)$.
Note that this does not reduce the original TF-IDF matrix; it reduces the number of dimensions that the projections of the TF-IDF vectors into the search space will have.
Note that $k < n$
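The truncation can be sketched as plain slicing. This assumes the singular values are sorted in descending order, so the first $k$ entries are the ones to keep. The function name `truncate_svd` and the sample values are hypothetical.

```python
def truncate_svd(u, s, vt, k):
    """Keep the first k columns of U, the first k singular values, and the first k rows of V'."""
    if not 0 < k < len(s):
        raise ValueError("k must satisfy 0 < k < n")
    u_k = [row[:k] for row in u]   # (m x k)
    s_k = s[:k]                    # k values
    vt_k = vt[:k]                  # (k x n)
    return u_k, s_k, vt_k

# Hypothetical factors (assumption): m = 3, n = 2.
u = [[0.5, 0.1], [0.3, 0.2], [0.1, 0.9]]
s = [3.0, 1.0]
vt = [[0.6, 0.8], [0.8, -0.6]]

u_k, s_k, vt_k = truncate_svd(u, s, vt, 1)
```

The original factors are left untouched; only the copies used for projection are reduced.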
Implementation of document search for the supplied input query array. Note that the input query should contain terms that already exist within the LSI model. The LSI model contains a term map; the set of query terms not found in the vocabulary is also returned with the result. The cosine distance is returned unnormalised.
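A minimal sketch of such a search, under several assumptions that may not match the real implementation: "unnormalised" is read as a raw dot product without division by vector norms, the query is placed in the search space by summing the scaled rows of its known terms (raw counts rather than TF-IDF weights), and `term_space` has one row per term. All names and values are hypothetical.

```python
def search(query, term_index, term_space, doc_space):
    """Return (scores, unknown_terms) for a tokenised query."""
    k = len(doc_space[0])
    query_point = [0.0] * k
    unknown = []
    for term in query:
        position = term_index.get(term)
        if position is None:
            # Term is not in the model's term map; report it back to the caller.
            unknown.append(term)
            continue
        for j in range(k):
            query_point[j] += term_space[position][j]
    # Raw dot product per document, with no division by the vector norms.
    scores = [sum(q * d for q, d in zip(query_point, doc)) for doc in doc_space]
    return scores, unknown

# Hypothetical model data (assumption): 2 terms, 3 documents, 2 dimensions.
term_index = {"cat": 0, "dog": 1}
term_space = [[2.0, 0.0], [0.0, 3.0]]
doc_space = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores, unknown = search(["dog", "emu"], term_index, term_space, doc_space)
```

Callers that need a true cosine value would divide each score by the product of the query and document norms.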