NGRAMS is a dataset directory which contains information about the observed frequency of "ngrams" (particular sequences of n letters) in English text.
In particular, a "monogram" is a single letter, and the file "english_monograms.txt" lists the number of occurrences of each of the 26 letters, with the most frequent letter given first.
The file "english_bigrams.txt" lists all 676 (26 x 26) possible two-letter sequences and the observed number of occurrences of each, with the most frequent sequence listed first.
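As a rough illustration of how such frequency lists might be used, the following Python sketch parses ngram counts and converts them to relative frequencies. It assumes each line holds an ngram and an integer count separated by whitespace; that layout is an assumption, not something stated on this page, so check the actual files before relying on it. The function names parse_ngram_counts and relative_frequencies are hypothetical, and the sample counts are made up for the example.

```python
from itertools import product
from string import ascii_uppercase


def parse_ngram_counts(lines):
    """Parse lines of the ASSUMED form 'NGRAM COUNT' into (ngram, count) pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines or lines that do not match the assumed layout
        ngram, count = fields
        pairs.append((ngram.upper(), int(count)))
    return pairs


def relative_frequencies(pairs):
    """Convert raw counts to fractions of the total count."""
    total = sum(count for _, count in pairs)
    return {ngram: count / total for ngram, count in pairs}


# Every possible two-letter sequence over a 26-letter alphabet: 26 * 26 = 676.
all_bigrams = ["".join(p) for p in product(ascii_uppercase, repeat=2)]

# Made-up sample counts, for illustration only (NOT taken from the data files).
sample = ["E 120", "T 90", "A 80", "", "O 10"]
pairs = parse_ngram_counts(sample)
freqs = relative_frequencies(pairs)
```

To apply this to one of the files, the lines could be read with something like `open("english_monograms.txt").read().splitlines()` and passed to parse_ngram_counts, provided the file really follows the assumed two-column layout.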
The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.
GERMAN, a dataset directory which contains some short German texts;
TEXT, a dataset directory which contains actual "texts", such as the Gettysburg Address;
WORDS, a dataset directory which contains lists of words.