NGRAMS - String Frequency in English Text

NGRAMS is a dataset directory which contains information about the observed frequency of "ngrams" (particular sequences of n letters) in English text.

In particular, a "monogram" is a single letter, and the file "english_monograms.txt" lists the number of occurrences of each of the 26 letters, with the most frequent letter given first.

The file "english_bigrams.txt" lists all 676 possible two-letter sequences and the observed number of occurrences of each, with the most frequent sequence listed first.
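As a rough illustration of how such a file might be used, the Python sketch below converts raw counts into relative frequencies. It assumes each line holds an ngram and its count separated by whitespace (for example "TH 1234"); that layout is an assumption, not something stated on this page, so check the actual files first. The function names and the sample values are hypothetical and made up for the example.

```python
def parse_ngram_counts(lines):
    """Parse lines of the ASSUMED form 'NGRAM COUNT' into a dict of counts."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines or lines that do not match the assumed layout
        ngram, count = fields
        counts[ngram.upper()] = int(count)
    return counts


def relative_frequencies(counts):
    """Divide each count by the total so that the values sum to 1."""
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}


# Made-up sample lines, NOT values taken from the real data files.
sample = ["E 120", "T 90", "A 80"]
counts = parse_ngram_counts(sample)
freqs = relative_frequencies(counts)
```

To process a real file, the same function could be given an open file object, for instance `parse_ngram_counts(open("english_bigrams.txt"))`, provided the file follows the assumed layout.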

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Related Data and Programs:

GERMAN, a dataset directory which contains some short German texts;

TEXT, a dataset directory which contains actual "texts", such as the Gettysburg Address;

WORDS, a dataset directory which contains lists of words.

Datasets:

english_monograms.txt, 26 lines.
english_bigrams.txt, 676 lines = 26x26.
english_trigrams.txt, 17,556 lines < 17,576 = 26x26x26.
english_quadgrams.txt, 389,373 lines < 456,976 = 26x26x26x26.
english_quintgrams.txt, 4,354,914 lines < 11,881,376 = 26x26x26x26x26.
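The comparison between listed line counts and the number of possible ngrams can be checked with a few lines of Python. This sketch takes the line counts exactly as listed above and assumes one ngram per line, which is an assumption about the file layout; the names `observed_lines` and `possible_ngrams` are hypothetical.

```python
ALPHABET_SIZE = 26

# Line counts as listed above, keyed by ngram length n.
observed_lines = {1: 26, 2: 676, 3: 17556, 4: 389373, 5: 4354914}


def possible_ngrams(n, alphabet_size=ALPHABET_SIZE):
    """Number of distinct sequences of n letters over the alphabet."""
    return alphabet_size ** n


# Report how many of the possible ngrams appear in each file.
for n, observed in observed_lines.items():
    possible = possible_ngrams(n)
    print(f"n={n}: {observed:,} of {possible:,} possible ({observed / possible:.1%})")
```

The share of possible ngrams that actually appear drops as n grows, since most long letter sequences never occur in English text.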


Last revised on 12 February 2016.