Large-alphabet Chinese text compression using
adaptive Markov model and arithmetic coder
In this paper, an approach to compressing Chinese text is proposed. The
approach first extends the alphabet to include the Chinese characters of
the Big-5 code. An adaptive Markov model is then used to capture the contextual
dependency between characters, and arithmetic coding is used to encode the data compactly.
For such a large alphabet, a practical implementation method for the
adaptive Markov model is studied, and a data structure for fast arithmetic
coding is proposed. Furthermore, to demonstrate its practicality, the approach
has been implemented as a ready-to-use software package.
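The abstract does not specify the fast data structure; a common choice for maintaining adaptive cumulative frequencies over a large alphabet is a binary indexed (Fenwick) tree, which gives the arithmetic coder its symbol interval in O(log n) per query and update. The sketch below is illustrative only: the class and function names (`Fenwick`, `interval`) and the per-context usage are assumptions, not the paper's actual implementation.

```python
class Fenwick:
    """Binary indexed tree over adaptive symbol counts.

    An order-1 Markov model would keep one such table per context
    (i.e., per preceding character); the arithmetic coder narrows its
    current range by the interval returned for each encoded symbol.
    """

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)
        # Initialize every symbol with count 1 so no symbol
        # ever has zero probability (the zero-frequency problem).
        for s in range(size):
            self.update(s, 1)

    def update(self, symbol, delta):
        """Add delta to the count of `symbol` in O(log size)."""
        i = symbol + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & (-i)

    def cum(self, symbol):
        """Total count of all symbols with index < symbol."""
        i, total = symbol, 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

    def total(self):
        return self.cum(self.size)


def interval(model, symbol):
    """Probability interval [low, high) the coder assigns to `symbol`."""
    t = model.total()
    return model.cum(symbol) / t, model.cum(symbol + 1) / t
```

After each symbol is encoded, `model.update(symbol, 1)` adapts the statistics, so encoder and decoder stay synchronized without transmitting a frequency table, which matters when the alphabet covers the full Big-5 repertoire.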
A series of experiments has been conducted. The two example texts
used are CX1 (drawn from Chinese primary-school textbooks) and CX2 (drawn from newspaper
editorials). The experiments show that the compression ratios (output file
length divided by input file length) obtained by our approach are 42.3%
and 48.4% for CX1 and CX2 respectively, considerably better than the
52.8% and 60.6% from the popular software package ARJ and the 53.0% and 60.8%
from the package PKZIP. Even for shorter files, our approach still obtains
a much improved compression ratio.