There exist different methods to model natural languages; among these, statistical techniques are the most common. Statistical language modeling aims to build a model that estimates the distribution of a natural language as accurately as possible. Such models have proved very successful in speech and text recognition. A statistical language model (SLM) also has applications in information retrieval and NLP systems. Despite the many research efforts devoted to statistical modeling of English, to the best of our knowledge there is no published work on Farsi available on the internet. On this page, we present our new results and show that approaches to statistical modeling of English apply to Farsi with no, or at most slight, modifications.

Related files:

Farsi Statistical Modeling Report: slm.pdf, Presentation File: farsi_slm.pres.pdf.
Farsi 41,000-word dictionary (in Unicode format): 41k_dic.
This dictionary was built by retaining the 41,000 most frequent words of a 100 MB corpus.
Some of the Farsi words generated by the word model (fun and more):
    -  3-letter words: 1K3L
    -  4-letter words: 1K4L
    -  5-letter words: 1K5L
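
The dictionary described above keeps only the most frequent words of a corpus. The following is a minimal sketch of that idea, not the procedure actually used to produce 41k_dic: the function name `build_vocabulary`, the whitespace tokenization, and the alphabetical tie-breaking rule are all assumptions made for illustration. Real Farsi text would likely also need normalization (for example, unifying Arabic and Farsi forms of the same letter) before counting.

```python
from collections import Counter


def build_vocabulary(lines, size):
    """Return the `size` most frequent whitespace-separated tokens.

    `lines` is any iterable of strings (e.g. an open text file).
    Ties in frequency are broken by code-point order so that the
    result is deterministic (an assumption, not part of the source).
    """
    counts = Counter()
    for line in lines:
        # Hypothetical tokenizer: split on whitespace only.
        counts.update(line.split())
    # Sort by descending count, then by the token itself.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


# Tiny illustrative corpus; a real run would iterate over the corpus file.
sample = ["این یک متن است", "این متن کوتاه است", "این یک آزمایش است"]
vocab = build_vocabulary(sample, 3)
print(vocab)
```

For a corpus on disk, the same function can be called with an open file, e.g. `build_vocabulary(open(path, encoding="utf-8"), 41000)`, where `path` stands for the corpus location.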
 

Supporting literature:

  • S. Young, G. Evermann, T. Hain, D. Kershaw, et al., "The HTK Book", Cambridge University Press, 2002.

  • C. Jauvin and Y. Bengio, "A Sense-Smoothed Bigram Language Model", Technical Report 1232, Montreal University, 2003.

  • Raymond Lau, "Adaptive Statistical Language Modeling", SM Thesis, May 1994.

  • J. Goodman and J. Gao, "Language Model Size Reduction by Pruning and Clustering", Proc. of ICSLP 2000, Beijing, China, vol. 3, pp. 110-113, Oct. 2000.

  • X. Zhu and R. Rosenfeld, "Improving Trigram Language Modeling with the World Wide Web", Proc. of IEEE International Conference on ASSP, pp. 533-536, 2001.

  • A. Berger and R. Miller, "Just-In-Time Language Modeling", Proc. of IEEE International Conference on ASSP, Seattle, Washington, vol. II, pp. 705-708, 1998.

  • William B. Cavnar and John M. Trenkle, "N-Gram-Based Text Categorization", Proc. of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, 1994.

  • R. Dugad and U. B. Desai, "A Tutorial on Hidden Markov Models", Technical Report, Indian Institute of Technology, May 1996.

  • I. H. Witten and T. C. Bell, "The Zero Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression", IEEE Trans. on Information Theory, vol. 37(4), July 1991.

  • Slava M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Trans. on ASSP, vol. 35(3), pp. 400-401, March 1987.

Do not hesitate to contact me for further information: mehdi.haji@gmail.com