There are different methods for modeling natural languages; among
these, statistical techniques are the most common. Statistical language modeling
aims at building a model that estimates the distribution of a natural
language as accurately as possible. These models have proved to be
very successful in the field of speech and text recognition. A statistical
language model (SLM) also has applications in information retrieval and NLP
systems. Despite the many research efforts devoted to statistical modeling of
English, to the best of our knowledge there is no published work on the Farsi
language available on the internet. On this page, we hope to offer our new results and demonstrate that approaches to statistical modeling of English
apply to Farsi with no (or at most slight) modifications.
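To make "estimating the distribution of a language" concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. It is an illustration only, not the method used in the report; the function names, the `<s>` and `</s>` boundary markers, and the choice of add-one smoothing are all assumptions made for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        # "<s>" and "</s>" are hypothetical sentence-boundary markers.
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])                  # counts of each history word
        bigrams.update(zip(padded[:-1], padded[1:]))  # counts of adjacent pairs
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))
```

Because the model works on token sequences and makes no assumption about the script, the same sketch runs unchanged on Farsi text once it has been split into words.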
Related files:
Farsi Statistical Modeling Report: slm.pdf, Presentation File:
farsi_slm.pres.pdf.
Farsi 41,000-word dictionary (in Unicode format):
41k_dic.
This dictionary was built by retaining the 41,000 most frequent words of a 100 MB
corpus.
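The frequency cut-off described above can be sketched as follows. This is a minimal illustration, not the script that produced 41k_dic: the function name `build_vocabulary` is hypothetical, and splitting on whitespace is an assumption (real Farsi text may need extra normalization before counting).

```python
from collections import Counter

def build_vocabulary(lines, size):
    """Return the `size` most frequent whitespace-separated words in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # assumption: words are whitespace-delimited
    return [word for word, _ in counts.most_common(size)]
```

For a corpus stored in a UTF-8 text file, passing the open file object as `lines` and `41000` as `size` would mirror the cut-off mentioned above.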
Some of the Farsi words generated by the word model (fun and more):
- 3-letter words: 1K3L
- 4-letter words: 1K4L
- 5-letter words: 1K5L
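One way such words could be generated is by sampling from a letter-level bigram model trained on the dictionary. The sketch below assumes this setup; the actual word model behind the lists above may differ, and the function names and the empty-string start marker are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_letter_bigrams(words):
    """Count letter-to-letter transitions; "" marks the start of a word."""
    transitions = defaultdict(Counter)
    for word in words:
        prev = ""
        for letter in word:
            transitions[prev][letter] += 1
            prev = letter
    return transitions

def generate_word(transitions, length, rng):
    """Sample up to `length` letters, each conditioned on the previous one."""
    letters, prev = [], ""
    for _ in range(length):
        options = transitions.get(prev)
        if not options:  # no observed continuation: stop early
            break
        choices, weights = zip(*options.items())
        prev = rng.choices(choices, weights=weights)[0]
        letters.append(prev)
    return "".join(letters)
```

Calling `generate_word(transitions, 3, random.Random())` repeatedly would yield candidate 3-letter words; since Python strings are Unicode, the same code applies to Farsi letters.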
Supporting literature:
S. Young, G. Evermann, T. Hain, D. Kershaw, et al., "The HTK
Book", Cambridge University Press, 2002.
C. Jauvin and Y. Bengio, "A Sense-Smoothed Bigram Language
Model", Technical Report 1232, Montreal University, 2003.
Raymond Lau, "Adaptive Statistical Language Modeling", SM
Thesis, May 1994.
J. Goodman and J. Gao, "Language Model Size Reduction by Pruning
and Clustering", Proc. of ICSLP 2000, Beijing, China, vol. 3, pp. 110-113, Oct.
2000.
X. Zhu and R. Rosenfeld, "Improving Trigram Language Modeling
with the World Wide Web", Proc. of IEEE International Conference on ASSP, pp.
533-536, 2001.
A. Berger and R. Miller, "Just-In-Time Language Modeling", Proc.
of IEEE International Conference on ASSP, Seattle, Washington, vol. II, pp.
705-708, 1998.
William B. Cavnar and John M. Trenkle, "N-Gram-Based Text
Categorization", Proc. of SDAIR-94, 3rd Annual Symposium on
Document Analysis and Information Retrieval, pp. 161-175, 1994.
R. Dugad and U. B. Desai, "A Tutorial on Hidden Markov Models", Technical Report, Indian Institute of Technology, May 1996.
Witten and Bell, "The Zero Frequency Problem: Estimating the
Probabilities of Novel Events in Adaptive Text Compression", IEEE Trans. on
Information Theory, vol. 37(4), July 1991.
Slava M. Katz, "Estimation of Probabilities from Sparse Data for
the Language Model Component of a Speech Recognizer", IEEE Trans. on ASSP, vol.
35(3), pp. 400-401, March 1987.
Do not hesitate to contact me for further information: mehdi.haji@gmail.com