text_strip, a Python code which uses the "re" regular expression library to strip a text file of unwanted characters.
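As a minimal sketch of the idea, and not the actual text_strip source, the following shows one way "re" can remove unwanted characters from a string. The function name strip_text, the default pattern (keep only letters and whitespace), and the whitespace cleanup are all assumptions made for illustration.

```python
import re


def strip_text(text, unwanted=r"[^A-Za-z\s]"):
    """Remove every character matching the `unwanted` pattern.

    The default pattern is an assumption: it treats anything that is
    not an ASCII letter or whitespace as unwanted.
    """
    # Delete the unwanted characters.
    stripped = re.sub(unwanted, "", text)
    # Collapse runs of whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", stripped).strip()


if __name__ == "__main__":
    print(strip_text("Hello, World! 123"))
```

To process a file, the same function could be applied to the string returned by reading the file, with the result written to a new file.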
The information on this web page is distributed under the MIT license.
text_strip is available in a Python version.
markov_text, a Python code which uses a Markov Chain Monte Carlo (MCMC) process to sample an existing text file and create a new text that is randomized, but retains some of the structure of the original one.
ngrams, a dataset directory which contains information about the observed frequency of "ngrams" (particular sequences of n letters) in English text.
text_to_wordlist, a Python code which shows how to start with a text file, read its information into a single long string, and divide that string into individual words. This allows an investigator to analyze the text for patterns.
text, a dataset directory which contains some short English texts, such as Alice in Wonderland, the Gettysburg Address, Hamlet, Moby Dick, Robinson Crusoe, and the Wizard of Oz.
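The step described for text_to_wordlist above, reading a text into one long string and dividing it into words, might be sketched as follows. This is an illustration under stated assumptions, not the text_to_wordlist code itself: the function names string_to_wordlist and file_to_wordlist are hypothetical, words are split on whitespace only, and the sample sentence stands in for a file from the text directory.

```python
from collections import Counter


def string_to_wordlist(text):
    """Split one long string into lowercase words (whitespace-delimited)."""
    return text.lower().split()


def file_to_wordlist(filename):
    """Read an entire text file into a single string, then split it."""
    with open(filename, "r", encoding="utf-8") as f:
        return string_to_wordlist(f.read())


if __name__ == "__main__":
    # A short sample string is used here in place of a file.
    sample = "Four score and seven years ago our fathers brought forth"
    words = string_to_wordlist(sample)
    # Word frequencies are one simple pattern an investigator might examine.
    counts = Counter(words)
    print(words)
    print(counts.most_common(3))
```

Splitting on whitespace leaves punctuation attached to words, so a text would typically be stripped of unwanted characters first.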