Large-alphabet Chinese text compression using adaptive Markov model and arithmetic coder

In this paper, an approach to compressing Chinese text is proposed. The approach first extends the alphabet to include the Chinese characters of the Big-5 code. An adaptive Markov model is then used to capture the contextual dependency between characters, and arithmetic coding is used to encode the data compactly. Because the alphabet is large, a practical implementation method for the adaptive Markov model is studied, and a data structure for fast arithmetic coding is proposed. Furthermore, to demonstrate its practicality, the approach has been implemented as a ready-to-use software package.
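The abstract does not specify which data structure makes arithmetic coding fast over a large alphabet. One widely used structure for this purpose is the Fenwick (binary indexed) tree, which maintains adaptive symbol frequencies and answers both cumulative-frequency queries and symbol lookups in O(log n) time. The sketch below is an illustration of that general technique, not necessarily the structure proposed in the paper; the class and method names are our own.

```python
class FenwickTree:
    """Adaptive cumulative-frequency table over symbols 0..n-1."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-based internal array

    def add(self, symbol, delta):
        """Increase the frequency of `symbol` by `delta` (adaptive update)."""
        i = symbol + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, k):
        """Total frequency of symbols 0..k-1 (the cumulative count
        an arithmetic coder needs to narrow its interval)."""
        s = 0
        while k > 0:
            s += self.tree[k]
            k -= k & -k
        return s

    def find(self, target):
        """Decoder-side lookup: the symbol s whose cumulative range
        [prefix_sum(s), prefix_sum(s+1)) contains `target`."""
        pos, rem = 0, target
        bit = 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= rem:
                rem -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos
```

For a Big-5-sized alphabet (over 13,000 characters), each encode or decode step costs roughly 14 array probes instead of a linear scan over all symbol counts, which is what makes the adaptive model practical at this alphabet size.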

A series of experiments has been conducted on two example texts: CX1 (drawn from primary-school Chinese textbooks) and CX2 (drawn from newspaper editorials). The experiments show that the compression ratios (output file length divided by input file length) obtained by our approach are 42.3% for CX1 and 48.4% for CX2. These are considerably better than the 52.8% and 60.6% obtained by the popular software package ARJ, and the 53.0% and 60.8% obtained by PKZIP. For shorter files, our approach still achieves much improved compression ratios.
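The compression-ratio metric above (output length divided by input length, so lower is better) can be computed as follows. This sketch uses Python's zlib merely as a stand-in codec to illustrate the metric; it is not the paper's Markov/arithmetic coder, and the sample text is our own.

```python
import zlib


def compression_ratio(data: bytes, codec=zlib.compress) -> float:
    """Output file length divided by input file length; lower is better."""
    return len(codec(data)) / len(data)


# Hypothetical sample: repetitive text compresses well, so the ratio is < 1.
sample = ("the quick brown fox jumps over the lazy dog " * 50).encode("utf-8")
ratio = compression_ratio(sample)
print(f"compression ratio: {ratio:.1%}")
```

By this definition, the paper's reported 42.3% for CX1 means the compressed file is less than half the size of the original.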