Apply a Dictionary Code to a Text File

**DICTIONARY_CODE** is a MATLAB library which can apply a dictionary code to a text file.

A common feature of lossless compression schemes is the construction of a "dictionary" of the symbols or words occurring in the file, and the replacement of those symbols by their dictionary indices.

These functions illustrate the idea, starting with a version of the Gettysburg Address. To simplify our work, we remove punctuation and capitalization. Using MATLAB's textread() function, we can create a cell array where each entry is a word in the file. Using MATLAB's unique() function, we can construct a "dictionary" that lists, in alphabetical order, every word occurring in the file. Using a surprisingly obscure MATLAB function, we can then replace every word in the text file by its dictionary index. This is the operation of the dictionary_encode() function.
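The encoding step can be sketched in a few lines. This is only an illustration, not the source of dictionary_encode(): the text above does not name the function used for the index replacement, so the sketch assumes the third output of unique() for that step, and it works on a short phrase held in a string rather than on a file read with textread(). The variable names are made up for the example.

```matlab
% Illustrative sketch; an assumption about how the encoding could be done,
% not the library's dictionary_encode() itself.
text = 'of the people by the people for the people';

% Split the cleaned text (no punctuation, no capitals) into words.
words = strsplit ( text, ' ' );

% unique() returns the distinct words in sorted order.  Its third output
% gives, for each word, the position of that word in the sorted list.
[ dictionary, ~, index ] = unique ( words );

% Store both results as row vectors for convenient display.
dictionary = reshape ( dictionary, 1, [] );
index = reshape ( index, 1, [] );

disp ( dictionary );   % 'by'  'for'  'of'  'people'  'the'
disp ( index );        % 3 5 4 1 5 4 2 5 4
```

The nine words of the phrase are replaced by nine integers, and only five distinct words need to be stored in the dictionary.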

To decode or uncompress the file, we need both the encoded file and the dictionary. For our example, the dictionary is stored as a separate file, although practical compression schemes typically pack the encoded text and the dictionary together. The function dictionary_decode() can then recover the original message.
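Decoding amounts to looking up each index in the dictionary. The following is a minimal sketch under the assumption that the dictionary is a cell array of words and the encoded text is a vector of integer indices; it is not the source of dictionary_decode(), and the data is typed in directly rather than read from the two files.

```matlab
% Illustrative sketch; not the library's dictionary_decode() itself.
% The dictionary and the encoded message are assumed to be already in memory.
dictionary = { 'by', 'for', 'of', 'people', 'the' };
index = [ 3, 5, 4, 1, 5, 4, 2, 5, 4 ];

% Indexing the dictionary with the vector of indices recovers the words
% in their original order.
words = dictionary ( index );

% Join the words with single spaces to rebuild the message.
message = strjoin ( words, ' ' );

disp ( message );   % of the people by the people for the people
```

Because punctuation and capitalization were removed before encoding, the recovered message matches the simplified text, not the original formatted version.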

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

**DICTIONARY_CODE** is available in a MATLAB version.

ATBASH, a MATLAB library which applies the Atbash substitution cipher to a string of text.

CAESAR, a MATLAB library which can apply a Caesar Shift Cipher to a string of text.

CHRPAK, a MATLAB library which works with characters and strings.

FILUM, a MATLAB library which can work with information in text files.

MONOALPHABETIC, a MATLAB library which can apply a monoalphabetic substitution cipher to a string of text.

ROT13, a MATLAB library which can encipher a string using the ROT13 cipher for letters, and the ROT5 cipher for digits.

- dictionary_decode.m, decodes our encoded file.
- dictionary_encode.m, encodes our file.
- timestamp.m, prints the current YMDHMS date as a timestamp.

- dictionary_code_test.m, runs the tests.
- dictionary_code_test_output.txt, the test output.
- gettysburg_address.txt, a plain text version of the Gettysburg Address.
- gettysburg_address2.txt, a version of the Gettysburg Address with punctuation and capitalization removed.
- gettysburg_address_encoded.txt, an encoded version of the Gettysburg Address.
- gettysburg_address_dictionary.txt, the dictionary used to encode the Gettysburg Address.
