Abstract

Top-k sequence pattern mining with non-overlapping condition

Xin Chai, Dan Yang, Jingyu Liu, Yan Li, Youxi Wu

Pattern mining has been widely applied in many fields. Users often mine a large number of patterns, but most of these are difficult to apply in real applications. Top-k pattern mining, which finds the k most frequent patterns, is an effective strategy, because the more frequently a pattern occurs, the more likely it is to be important to users. However, in mining tasks that satisfy the Apriori property, top-k mining can only find short patterns, since the support of a pattern never exceeds that of its sub-patterns. It is well known that short patterns contain less information than long patterns. In this paper, we focus on mining the top-k sequence patterns of each pattern length. We propose an effective algorithm named NOSTOPK (non-overlapping sequence pattern mining for top-k). The algorithm calculates the support of a pattern using a Nettree data structure, which has been introduced to tackle various types of pattern matching and sequence pattern mining problems. We find the top-k patterns of length len, and calculate the supports of the corresponding k × |Σ| super-patterns of length len + 1 to discover the new top-k super-patterns of length len + 1. Experimental results demonstrate that the algorithm achieves a better performance than comparable algorithms,
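The level-wise search described in the abstract can be illustrated with a minimal Python sketch. It is not the NOSTOPK implementation: the function names `count_occurrences` and `top_k_per_length` are hypothetical, super-patterns are assumed to be formed by appending one character of Σ to a pattern, and the support function is a simple stand-in that counts contiguous occurrences rather than the Nettree-based non-overlapping support used in the paper. Any other support function with the same signature could be passed in instead.

```python
from typing import Callable, Dict, List, Tuple


def count_occurrences(pattern: str, sequence: str) -> int:
    """Stand-in support (an assumption, NOT the Nettree computation):
    the number of start positions at which `pattern` occurs contiguously."""
    m = len(pattern)
    return sum(
        1 for i in range(len(sequence) - m + 1) if sequence[i:i + m] == pattern
    )


def top_k_per_length(
    sequence: str,
    alphabet: str,
    k: int,
    max_len: int,
    support: Callable[[str, str], int] = count_occurrences,
) -> Dict[int, List[Tuple[str, int]]]:
    """For each length 1..max_len, keep the k best-supported patterns, then
    extend only those k patterns by every symbol to get k * |alphabet|
    candidate super-patterns for the next length."""
    result: Dict[int, List[Tuple[str, int]]] = {}
    candidates = list(alphabet)  # all patterns of length 1
    for length in range(1, max_len + 1):
        scored = [(p, support(p, sequence)) for p in candidates]
        scored = [(p, s) for p, s in scored if s > 0]  # drop absent patterns
        # Highest support first; ties broken alphabetically for determinism.
        scored.sort(key=lambda ps: (-ps[1], ps[0]))
        top = scored[:k]
        if not top:
            break
        result[length] = top
        # Assumed extension rule: append one symbol to each top-k pattern.
        candidates = [p + c for p, _ in top for c in alphabet]
    return result


if __name__ == "__main__":
    for length, patterns in top_k_per_length("abababcab", "abc", 2, 3).items():
        print(length, patterns)
```

Because only the k best patterns of each length are extended, the sketch evaluates at most k × |Σ| candidates per length instead of enumerating every possible pattern, which is the cost-saving idea the abstract describes.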