?
Using Annotated Suffix Trees for Fuzzy Full Text Search
A method for fuzzy full text search is proposed. The method
follows a popular two-stage scheme with a novel second stage: a prelim-
inary search stage using an n-gram inverted index and, at the second
stage, relevance checking between the query and documents using fre-
quency annotated suffix trees (ASTs). The ASTs are built for all docu-
ments of the collection off-line. The method is compared with two pop-
ular fuzzy text retrieval techniques, one using n-gram inverted indexing
with Levenshtein distance checking and signature hashing, and the other
being Lemur, a popular toolkit for language modelling and information
retrieval. For computational experiments we use ”Reuters 21578” text
collection and a collection of USPTO patents. Our AST-based method
generally leads to accuracy scores that are similar to those obtained
by the winner, the Levenshtein distance-based method. However, our
method significantly outperforms the Levenshtein distance based method
over speed. Therefore, when using both criteria, the accuracy and speed,
simultaneously, the AST-based method has shown significant advantages.