NGRAMS - String Frequency in English Text


NGRAMS is a dataset directory which contains information about the observed frequency of "ngrams" (particular sequences of n letters) in English text.

In particular, a "monogram" is a single letter, and the file "english_monograms.txt" lists the number of occurrences of each of the 26 letters, with the most frequent letter given first.

The file "english_bigrams.txt" lists all 676 (26 x 26) possible two-letter sequences together with their observed occurrence counts, with the most frequent sequence listed first.
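
As an illustration of how such tables could be produced, the following Python sketch counts n-grams in a short sample text and lists every possible n-gram with its count, most frequent first. It is not the procedure used to build the files in this directory. The function names count_ngrams and ranked_table are hypothetical, and two choices are assumptions: case is ignored, and an n-gram is never allowed to span a space or punctuation mark.

```python
from collections import Counter
from itertools import product
from string import ascii_uppercase


def count_ngrams(text, n):
    """Count every run of n consecutive letters in text, ignoring case.

    Assumption: non-letters break a run, so an n-gram never spans
    a space or punctuation mark.
    """
    counts = Counter()
    run = []
    for ch in text.upper():
        if ch in ascii_uppercase:
            run.append(ch)
            if len(run) >= n:
                counts["".join(run[-n:])] += 1
        else:
            run = []
    return counts


def ranked_table(counts, n):
    """List all 26**n possible n-grams with their counts, most frequent first."""
    all_ngrams = ("".join(p) for p in product(ascii_uppercase, repeat=n))
    table = [(g, counts.get(g, 0)) for g in all_ngrams]
    # Sort by descending count; ties fall back to alphabetical order.
    table.sort(key=lambda item: (-item[1], item[0]))
    return table


sample = "Four score and seven years ago"
monograms = ranked_table(count_ngrams(sample, 1), 1)
bigrams = ranked_table(count_ngrams(sample, 2), 2)

print(len(monograms), len(bigrams))  # number of rows in each table
print(monograms[:3])
print(bigrams[:3])
```

Listing all 26**n candidates, including those with a count of zero, mirrors the description of the bigram file as covering all 676 possible sequences. A realistic table would of course need a far larger body of text than the one-line sample used here.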

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Related Data and Programs:

GERMAN, a dataset directory which contains some short German texts;

TEXT, a dataset directory which contains actual "texts", such as the Gettysburg Address;

WORDS, a dataset directory which contains lists of words.

Datasets:


Last revised on 12 February 2016.