
An N-gram Language Model Library from UC Berkeley

2016-05-18 12:54:33


A library for storing n-gram language models in memory and accessing them efficiently. It is described in this paper. Its data structures are faster and smaller than SRILM's and nearly as fast as KenLM's, despite being written in Java instead of C++. It also achieves the best published lossless encoding of the Google n-gram corpus.

See here for some documentation.


July 16, 2014: The project has been migrated to GitHub. Any future updates will happen there.

December 6, 2014: Since Google has deprecated downloads, I will no longer be uploading new versions for the time being. You can build the same tarball that I create by running the "export" target on the build.xml file in SVN.

June 9, 2013: version 1.1.5 has been released, which fixes a small bug with floating point rounding. Thanks to Giampiero Recco for finding this bug.

May 11, 2013: version 1.1.4 has been released, which fixes a bug with Kneser-Ney estimation on long sentences.

September 14, 2012: version 1.1.2 has been released, including some bug fixes and documentation improvements. One particularly bad bug with stupid backoff LMs was fixed, so if you are using that code, please update. Also, binaries for Google n-gram-style LMs have been created from the Google Books corpora. You can download them here.

July 26, 2012: version 1.1.0 has been released, including improved memory usage and thread-safe caching. Prior to version 1.1, to make caching thread-safe, the programmer had to ensure that each thread had its own local copy of a caching wrapper. Version 1.1 provides a thread-safe cache that internally manages thread-local caches using Java's ThreadLocal class. This incurs some performance overhead relative to the programmer manually ensuring thread-locality, but it is still significantly faster than not using the cache at all.
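
The thread-local caching idea can be illustrated with a minimal sketch. This is not BerkeleyLM's actual code: the NgramScorer interface and the ThreadLocalCachingScorer class are hypothetical names invented for this example, and a plain HashMap stands in for whatever cache structure the library really uses. The only real API relied on is java.lang.ThreadLocal.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical scorer interface for illustration; not part of BerkeleyLM's API. */
interface NgramScorer {
    float score(String ngram);
}

/**
 * Wraps a scorer with one private cache per thread. Because no map is ever
 * shared between threads, lookups need no locking.
 */
final class ThreadLocalCachingScorer implements NgramScorer {
    private final NgramScorer delegate;

    // Each thread lazily receives its own map the first time it calls get().
    private final ThreadLocal<Map<String, Float>> cache =
            ThreadLocal.withInitial(HashMap::new);

    ThreadLocalCachingScorer(NgramScorer delegate) {
        this.delegate = delegate;
    }

    @Override
    public float score(String ngram) {
        Map<String, Float> local = cache.get();
        Float cached = local.get(ngram);
        if (cached == null) {
            // Cache miss: ask the underlying scorer and remember the answer
            // for this thread only.
            cached = delegate.score(ngram);
            local.put(ngram, cached);
        }
        return cached;
    }
}
```

A single ThreadLocalCachingScorer instance can then be shared across threads. The extra ThreadLocal lookup on every call is the kind of overhead the note above refers to, compared with handing each thread its own wrapper by hand.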

June 4, 2012: This paper claims that Berkeley LM chops the mantissa of floats it stores to 12 bits. This is incorrect. This was inadvertent behavior that was fixed in 1.0b2. Because of the way BerkeleyLM encodes the floats it stores, correcting this behavior only added roughly an extra bit per value in our experiments. See this erratum for more details.
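
For readers unfamiliar with the term, the sketch below shows what truncating a float's mantissa to 12 bits means in general. It is an illustration only and does not reproduce BerkeleyLM's float encoding: the class and method names are hypothetical, and the code simply zeroes the low-order bits of the 23-bit IEEE 754 single-precision mantissa using the standard Float.floatToIntBits and Float.intBitsToFloat methods.

```java
/** Hypothetical helper for illustration; not part of BerkeleyLM. */
final class MantissaTruncation {
    private static final int FLOAT_MANTISSA_BITS = 23;

    private MantissaTruncation() {}

    /**
     * Keeps the sign, the exponent, and the top keptBits bits of the
     * mantissa, zeroing the remaining low-order mantissa bits.
     */
    static float truncateMantissa(float value, int keptBits) {
        if (keptBits < 0 || keptBits > FLOAT_MANTISSA_BITS) {
            throw new IllegalArgumentException("keptBits must be in [0, 23]");
        }
        int droppedBits = FLOAT_MANTISSA_BITS - keptBits;
        // All ones except the lowest droppedBits positions.
        int mask = -1 << droppedBits;
        return Float.intBitsToFloat(Float.floatToIntBits(value) & mask);
    }
}
```

Calling MantissaTruncation.truncateMantissa(x, 12) on a finite value moves it toward zero by a relative error of less than 2^-12, which is why such truncation makes stored values lossy.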

April 9, 2012: version 1.0.1 has been released. Fixes an occasional crash in estimation of Kneser-Ney models.

January 20, 2012: version 1.0.0 has been released. Fixes a bug in estimation of Kneser-Ney probabilities starting with the <s> tag. Also, several performance improvements, particularly in estimating Kneser-Ney probabilities. Note that binary compatibility was broken, so you will need to re-download all Google n-gram binaries.

August 14, 2011: version 1.0b3 has been released. This version can handle ARPA LM files which contain missing suffixes and prefixes. Also, we have released pre-built binaries for the Google N-Gram corpora. These can be downloaded here.

June 24, 2011: version 1.0b2 has been released, with bug fixes from Kenneth Heafield, and some performance improvements.
