A large-alphabet oriented scheme for Chinese and English text compression

In this paper, a large-alphabet-oriented scheme is proposed for both Chinese and English text compression. Our scheme parses Chinese text with the alphabet defined by the Big-5 code, and parses English text with a set of rules designed in this work; thus, the alphabet used for English is not a word alphabet. After the text is parsed into tokens, zero-, first-, and second-order Markov models are used to estimate the occurrence probability of each token to be compressed. The estimated probabilities are then blended and accumulated to perform arithmetic coding. To implement arithmetic coding under a large alphabet with blended probabilities, a method for partitioning the count-value range is studied. Our scheme has been implemented as practically executable packages, and typical Chinese and English text files are compressed to study the influence of alphabet size and prediction order. On average, our compression scheme reduces a text file to 33.9% of its original size for Chinese and 23.3% for English. These rates are comparable with or better than those obtained by well-known data compression packages.
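The core idea described above — blending occurrence probabilities from zero-, first-, and second-order Markov models and partitioning a count-value range for arithmetic coding — can be illustrated with a minimal sketch. The class and function names, the fixed blending weights, and the Laplace smoothing below are illustrative assumptions, not the paper's actual formulation; the paper's own blending and range-partitioning method may differ.

```python
from collections import defaultdict

class BlendedModel:
    """Blend order-0/1/2 Markov estimates into one token distribution.

    The blending weights and Laplace smoothing are illustrative choices,
    not the scheme actually used in the paper.
    """

    def __init__(self, alphabet, weights=(0.2, 0.3, 0.5)):
        self.alphabet = list(alphabet)
        self.weights = weights
        self.c0 = defaultdict(int)  # order-0 counts: tok -> count
        self.c1 = defaultdict(int)  # order-1 counts: (prev1, tok) -> count
        self.c2 = defaultdict(int)  # order-2 counts: (prev2, prev1, tok) -> count

    def _prob(self, counts, total, key):
        # Laplace-smoothed estimate so every token keeps nonzero probability
        return (counts[key] + 1) / (total + len(self.alphabet))

    def distribution(self, prev2, prev1):
        """Return the blended probability of each token given the last two tokens."""
        t0 = sum(self.c0.values())
        t1 = sum(v for (p, _), v in self.c1.items() if p == prev1)
        t2 = sum(v for (q, p, _), v in self.c2.items() if (q, p) == (prev2, prev1))
        w0, w1, w2 = self.weights
        return {
            tok: (w0 * self._prob(self.c0, t0, tok)
                  + w1 * self._prob(self.c1, t1, (prev1, tok))
                  + w2 * self._prob(self.c2, t2, (prev2, prev1, tok)))
            for tok in self.alphabet
        }

    def update(self, prev2, prev1, tok):
        """Record one observed token in all three model orders."""
        self.c0[tok] += 1
        self.c1[(prev1, tok)] += 1
        self.c2[(prev2, prev1, tok)] += 1

def cumulative_ranges(dist, total=1 << 16):
    """Accumulate blended probabilities into contiguous integer count ranges,
    as an arithmetic coder needs to narrow its interval per token."""
    ranges, low = {}, 0
    items = sorted(dist.items())
    for i, (tok, p) in enumerate(items):
        # Last token absorbs rounding error so the ranges exactly cover [0, total)
        hi = total if i == len(items) - 1 else low + max(1, int(p * total))
        ranges[tok] = (low, hi)
        low = hi
    return ranges
```

With the weights summing to one and each smoothed component summing to one over the alphabet, the blended distribution itself sums to one, so the accumulated ranges tile the full count interval; this contiguous tiling is what lets the arithmetic coder map a token to a subinterval and back unambiguously.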