An N-gram Language Model Library from UC Berkeley
This project provides a library for storing n-gram language models in memory and accessing them efficiently. It is described in this paper. Its data structures are faster and smaller than SRILM's and nearly as fast as KenLM's, despite being written in Java instead of C++. It also achieves the best published lossless encoding of the Google n-gram corpus.
See here for some documentation.
July 16, 2014: The project has been migrated to github. Any future updates will happen there.
December 6, 2014: Since Google has deprecated downloads, I will no longer be uploading new versions for the time being. You can build the same tarball that I create by running the "export" target on the build.xml file in SVN.
June 9, 2013: version 1.1.5 has been released, which fixes a small bug with floating point rounding. Thanks to Giampiero Recco for finding this bug.
May 11, 2013: version 1.1.4 has been released, which fixes a bug with Kneser-Ney estimation on long sentences.
September 14, 2012: version 1.1.2 has been released, including some bug fixes and documentation improvements. One particularly bad bug with stupid backoff LMs was fixed, so please update if you use that code. Also, binaries for Google n-gram-style LMs have been created from the Google Books corpora. You can download them here.
July 26, 2012: version 1.1.0 has been released, including improved memory usage and thread-safe caching. Prior to version 1.1, to make caching thread-safe, the programmer had to ensure that each thread had its own local copy of a caching wrapper. Version 1.1 provides a thread-safe cache that internally manages thread-local caches using Java's
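The general idea of per-thread caches can be sketched in Java as follows. This is a minimal illustration, not BerkeleyLM's implementation: it assumes the standard `java.lang.ThreadLocal` class, and the class name `ThreadLocalCacheSketch`, its methods, and the `scorer` function are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper (not part of BerkeleyLM): every thread lazily gets
// its own private cache map, so no locking is needed on lookups.
final class ThreadLocalCacheSketch {
    // Stand-in for an expensive language model lookup. Must not return null.
    private final Function<String, Float> scorer;

    // One HashMap per thread, created the first time that thread calls get().
    private final ThreadLocal<Map<String, Float>> cache =
            ThreadLocal.withInitial(HashMap::new);

    ThreadLocalCacheSketch(Function<String, Float> scorer) {
        this.scorer = scorer;
    }

    // Returns the cached score for this thread, computing it on a miss.
    float score(String ngram) {
        return cache.get().computeIfAbsent(ngram, scorer);
    }

    // Number of entries cached by the calling thread only.
    int cachedEntriesForCurrentThread() {
        return cache.get().size();
    }
}
```

A single `ThreadLocalCacheSketch` instance can then be shared across threads. The trade-off is that each thread warms up its own cache, so memory use grows with the number of threads.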
June 4, 2012: This paper claims that BerkeleyLM chops the mantissa of the floats it stores to 12 bits. This is incorrect for current versions: the truncation was inadvertent behavior that was fixed in 1.0b2. Because of the way BerkeleyLM encodes the floats it stores, correcting this behavior only added roughly an extra bit per value in our experiments. See this erratum for more details.
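To make concrete what chopping a float's mantissa to 12 bits means, here is a small Java sketch. It is only an illustration of the arithmetic on a standard IEEE 754 single-precision float. It does not reproduce BerkeleyLM's storage encoding, and the class and method names are hypothetical.

```java
// Hypothetical helper (not part of BerkeleyLM): zeroes the low-order
// mantissa bits of an IEEE 754 single-precision float.
final class MantissaTruncationSketch {
    private static final int MANTISSA_BITS = 23; // explicit mantissa bits in a float

    // Keeps the sign, the exponent, and the top `keptBits` mantissa bits.
    static float truncateMantissa(float value, int keptBits) {
        if (keptBits < 0 || keptBits > MANTISSA_BITS) {
            throw new IllegalArgumentException("keptBits must be in [0, 23]");
        }
        int bits = Float.floatToIntBits(value);
        int droppedBits = MANTISSA_BITS - keptBits;
        int mask = -1 << droppedBits; // ones everywhere except the dropped low bits
        return Float.intBitsToFloat(bits & mask);
    }
}
```

For a normal (non-denormal) float, keeping 12 mantissa bits rounds the magnitude toward zero with a relative error below 2^-12, which is why the truncation matters for stored log-probabilities.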
April 9, 2012: version 1.0.1 has been released. Fixes an occasional crash in estimation of Kneser-Ney models.
January 20, 2012: version 1.0.0 has been released. Fixes a bug in estimation of Kneser-Ney probabilities starting with the
August 14, 2011: version 1.0b3 has been released. This version can handle ARPA LM files which contain missing suffixes and prefixes. Also, we have released pre-built binaries for the Google N-Gram corpora. These can be downloaded here.
June 24, 2011: version 1.0b2 has been released, with bug fixes from Kenneth Heafield, and some performance improvements.