Construction of Efficient V-Gram Dictionary for Sequential Data Analysis
This paper presents a new method for constructing an optimal feature set from sequential data. It creates a dictionary of variable-length n-grams (we call them v-grams) based on the minimum description length principle. The proposed method is a dictionary coder and works simultaneously as a compression algorithm and as an unsupervised feature extractor. The length of the constructed v-grams is not limited by any bound and exceeds 100 characters in our experiments. The constructed v-grams can be used for any sequential data analysis and allow bag-of-words techniques to be transferred to non-text data types. The method demonstrates a high compression rate on various real-life datasets. The extracted features form a practical basis for text classification and show competitive results on standard text classification collections without using the text structure. Combining the extracted character v-grams with the words from the original text, we achieved substantially better classification quality than with words or v-grams alone.
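To make the idea of a dictionary that serves as both a coder and a feature extractor concrete, the following is a minimal sketch and not the construction algorithm of this paper. It assumes a v-gram dictionary is already given, parses text by greedy longest match (an assumption made here for simplicity), measures the Shannon code length of the resulting token sequence, and reuses the same parse as bag-of-v-grams features. The function names `greedy_parse`, `code_length_bits`, and `bag_of_vgrams` are hypothetical, and the cost of storing the dictionary itself, which a full minimum description length criterion would include, is ignored.

```python
import math
from collections import Counter


def greedy_parse(text, dictionary):
    """Split text into v-grams by longest match (assumed parsing rule).

    Single characters are always allowed as a fallback, so any text parses.
    """
    max_len = max((len(g) for g in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


def code_length_bits(tokens):
    """Shannon code length of the tokens under their empirical distribution.

    Dictionary storage cost is deliberately left out of this sketch.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in counts.values())


def bag_of_vgrams(text, dictionary):
    """Use the same parse as an unsupervised bag-of-v-grams feature vector."""
    return Counter(greedy_parse(text, dictionary))


if __name__ == "__main__":
    sample = "abababab"
    for vgrams in (set(), {"ab"}):
        parsed = greedy_parse(sample, vgrams)
        print(sorted(vgrams), parsed, code_length_bits(parsed))
```

Comparing the code length of the same text under different candidate dictionaries illustrates why a v-gram that captures a frequent pattern shortens the description, which is the intuition behind selecting v-grams by description length.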