
Book chapter

Scalable Modified Kneser-Ney Language Model Estimation

P. 690-696.
Heafield K., Pouzyrevsky I., Clark J. H., Koehn P.

We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enable the algorithm to scale to much larger models by using a fixed amount of RAM and a variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show an improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
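Modified Kneser-Ney estimation, as described in the abstract, derives its discounts per n-gram order in closed form from counts-of-counts. A minimal sketch of those discount formulas follows (the standard Chen-Goodman estimates, not the paper's streaming pipeline; the function name and input format are illustrative):

```python
def kn_discounts(counts_of_counts):
    """Closed-form modified Kneser-Ney discounts.

    counts_of_counts maps a count value c to the number of distinct
    n-grams of a given order seen exactly c times; only c = 1..4 matter.
    Returns the three discounts applied to counts of 1, 2, and >= 3.
    """
    n1, n2, n3, n4 = (counts_of_counts.get(c, 0) for c in (1, 2, 3, 4))
    y = n1 / (n1 + 2 * n2)          # discounting ratio Y = n1 / (n1 + 2*n2)
    d1 = 1 - 2 * y * n2 / n1        # discount for n-grams seen once
    d2 = 2 - 3 * y * n3 / n2        # discount for n-grams seen twice
    d3plus = 3 - 4 * y * n4 / n3    # discount for n-grams seen 3+ times
    return d1, d2, d3plus
```

For example, counts-of-counts `{1: 4, 2: 2, 3: 2, 4: 1}` yield discounts `(0.5, 0.5, 2.0)`. Because the discounts depend only on these small aggregate statistics, they can be computed in a single streaming pass over sorted counts, which is what lets estimation run in fixed RAM with disk-backed sorting.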