A new Chinese text compression scheme combining dictionary coding and adaptive alphabet-character grouping

This paper proposes a new scheme for Chinese text compression. Two factors, compression rate and decompression speed, receive special attention in order to support applications such as full-text searching. The scheme is based on LZ77, with two modifications: alphabet augmenting, to obtain a better compression rate, and adaptive grouping, to achieve faster processing when the alphabet is large. Alphabet augmenting places the 32 control characters defined here and about 6,000 frequently used Chinese characters into the alphabet. Adaptive grouping dynamically moves a referenced alphabet character to another character group that uses fewer bits to encode it. The alphabet characters are initially divided into 8 groups. To implement adaptive grouping, a new strategy is proposed and compared with the conventional move-to-front strategy. All schemes compared in this paper were implemented as ready-to-use software packages rather than simulated. The experiments show that the proposed strategy for adaptive grouping not only yields significant improvements in compression rate but also runs much faster than move-to-front. In addition, the compression rates obtained by our scheme are better, by 5.4% and 7.5% respectively, than those obtained with the popular software package ARJ on two example files containing text from Chinese primary-school textbooks and from newspaper editorials.
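The general idea of adaptive grouping can be illustrated with a small sketch. The abstract does not spell out the proposed promotion strategy, so everything below is an assumption made for illustration only: the class name `GroupedAlphabet`, the group sizes, the fixed-width group prefix, and the round-robin "victim" swap are hypothetical and are not claimed to be the authors' method. The sketch only shows how a referenced character could migrate toward a smaller group, where its code is shorter.

```python
class GroupedAlphabet:
    """Toy adaptive grouping: symbols in smaller groups cost fewer bits.

    Hypothetical illustration; the promotion policy is an assumption.
    """

    def __init__(self, symbols, group_sizes):
        assert sum(group_sizes) == len(symbols)
        self.groups = []
        start = 0
        for size in group_sizes:
            self.groups.append(list(symbols[start:start + size]))
            start += size
        # symbol -> (group index, slot within that group)
        self.pos = {s: (g, i)
                    for g, grp in enumerate(self.groups)
                    for i, s in enumerate(grp)}
        # Next slot to evict in each group (round-robin, assumed policy).
        self.victim = [0] * len(self.groups)
        # Fixed-width prefix that selects the group.
        self.prefix_bits = max(1, (len(self.groups) - 1).bit_length())

    def cost_bits(self, symbol):
        """Bits needed now: group prefix plus slot index within the group."""
        g, _ = self.pos[symbol]
        return self.prefix_bits + (len(self.groups[g]) - 1).bit_length()

    def reference(self, symbol):
        """Return the current (group, slot) code, then promote one group."""
        g, i = self.pos[symbol]
        code = (g, i)
        if g > 0:
            h = g - 1
            j = self.victim[h]
            self.victim[h] = (j + 1) % len(self.groups[h])
            other = self.groups[h][j]
            # Swap the referenced symbol with the victim of the smaller group.
            self.groups[h][j], self.groups[g][i] = symbol, other
            self.pos[symbol] = (h, j)
            self.pos[other] = (g, i)
        return code


# Three groups of sizes 2, 4 and 8 over a 14-symbol toy alphabet.
alphabet = GroupedAlphabet(list("abcdefghijklmn"), [2, 4, 8])
before = alphabet.cost_bits("n")
first_code = alphabet.reference("n")
after = alphabet.cost_bits("n")
```

In this sketch a promotion swaps only two entries, whereas a naive array-based move-to-front shifts every entry ahead of the referenced one. A decoder would have to apply the same update after each decoded symbol so that both sides stay in step.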