Journal of Applied Sciences

Year: 2001 | Volume: 1 | Issue: 4 | Page No.: 446-451
DOI: 10.3923/jas.2001.446.451
Content Based Compression of Turkish Documents
Banu Diri

Abstract: The main goal of this study is to analyse the morphological structure of Turkish documents. The proposed method losslessly compresses monograms, digrams, trigrams, roots/stems and suffixes individually, using a statistical approach. Monogram (1-gram), digram (2-gram), trigram (3-gram), root-stem and suffix frequencies have been computed for the Turkish language, and a tuned template has been prepared for each group. Turkish documents have been compressed using static Huffman coding, and the performance relative to word-based dynamic Huffman coding has been measured.
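The statistical step summarised in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it only counts character n-grams and builds a static Huffman code from the resulting frequency table. The function names (`ngram_counts`, `huffman_code`, `encode`, `decode`) and the sample string are assumptions made for illustration, and the root-stem and suffix analysis described in the paper is omitted entirely.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, n):
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def huffman_code(freqs):
    """Build a static Huffman code (symbol -> bit string) from a frequency table."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs one bit so it can be decoded.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of a sequence of symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode a bit string using the prefix-free code table."""
    inverse = {b: s for s, b in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out


if __name__ == "__main__":
    sample = "evlerimizden evlerinizden"  # hypothetical sample text
    template = huffman_code(ngram_counts(sample, 1))  # monogram template
    bits = encode(sample, template)
    print(len(bits), "bits instead of", 8 * len(sample))
    print("".join(decode(bits, template)) == sample)
```

In this sketch the code table plays the role of a per-group template: it is built once from frequency counts and then reused unchanged, which is what makes the coding static rather than dynamic. Separate tables for digrams or trigrams could be built the same way by passing `n=2` or `n=3` to `ngram_counts`.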

How to cite this article
Banu Diri, 2001. Content Based Compression of Turkish Documents. Journal of Applied Sciences, 1: 446-451.

Keywords: Huffman coding, language modelling, text compression, n-gram models, Turkish corpus
