Pattern-Based String Compression Function:
C source code

By Frank Cox
(April 28, 2010)


December 15, 2013 Update: Click here to download an implementation of this function written in RFO-Basic for Android.


This is an implementation of a very clever string compression function that I originally came across several years ago. The original implementation was written for Microsoft QuickBasic; I rewrote it as a C function.

The header on the original QuickBasic function states:

' Super Text/String Packer/Unpacker for QB4.5
' Originally by Greg Estabrooks
' Slightly changed by Hauke Daempfling

The purpose of this function is to compress a text string; it does a pretty good job of it as most text will compress to between 55-65% of the original size. This is, of course, not as good as the compression you will get out of a zip file or something similar to that, but the advantage of this function is that it is very small and self-contained, so it's easy to add into any program that you wish include text compression into. I have used it for storing screen layout templates, help files and even a custom spelling checker dictionary. I'm sure it has other uses too; I just haven't come across them yet.

THEORY OF OPERATION

This text compression function contains a dictionary string that contains common two-letter combinations. The function compares the original text string letter-by-letter to the dictionary combinations and if the combination is found in the original string it saves the location of the combination in the dictionary to the result (compressed) string. If the combination is not found, it saves the letter that is not found to the result string.

Therefore, the maximum compression that can possibly be achieved by this program is 50%, if every letter in the string contains matches within the dictionary string. The worst compression ratio is none at all, if every letter in the string fails to match within the dictionary string. Depending on the nature of what you are compressing, you may find that a different dictionary string may give better performance; however, the dictionary string that came with the original QuickBasic program seems to work reasonably well.

Click here to download a small archive containing both the original QuickBasic version and the C version of the packtext function. Due to the fact that the spaces in dictionary string are critical to the compression, I recommend that you use the downloaded version rather than cutting-and-pasting the following source code.

IMPLEMENTATION NOTE

When decompressing a compressed string, be sure that both the text[] string that you feed the packtext function and RESULTSTRINGMAXLENGTH are dimensioned to contain enough characters to hold the uncompressed text string.

/* packtext.c
 * 
' Super Text/String Packer/Unpacker for QB4.5
'     Originally by Greg Estabrooks
'  Slightly changed by Hauke Daempfling
* 
* C version by Frank Cox <theatre@melvilletheatre.com>
* April 28, 2010
*
* packtext.c 
 Copyright (c) 2010, Frank Cox
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY FRANK COX ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL FRANK COX BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE. 
 */

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define RESULTSTRINGMAXLENGTH 255

int packtext(char *text, const bool compress);

int main(void)
        {
        int original;
        char text[]="Quidquid latine dictum sit, altum viditur.";
        original=strlen(text);
        printf("%s length=%ld\n",text,strlen(text));
        packtext(text,true);
        printf("compressed length=%ld\nCompression Ratio=%.0f%%\n",strlen(text),(float)strlen(text)/original*100);
        packtext(text,false);
        printf("String Restored: %s length=%ld",text,strlen(text));
        return 0;
        }

int packtext(char *text, const bool compress)
        {
        const char dictionary[]="  e  as  tinthouerhet anreesr d onn or o i y wo tontyo. neisarteed,  ctiy  bat snd fal pensestvengitu talehaurllcousa  mf dfoof siril  hmeg om Icehironsasiossbedepe rli Tetel nicho lilprcactutThpaeceachh wige ebuaisursulmawaotowtsmploI solyee Cunm rtieno Sdiwhs.rafincademe.irplk  ury Pwoacos gams,duayavucColamowe Aoopu";
        int counter,resultcounter=0;
        if (strlen(dictionary) != 320)
                {
                printf("ERROR: Dictionary has the wrong size\n");
                return 2;
                }
        if (compress)
                {
                int dictsearch;
                for (counter=0; counter < strlen(text); counter++)
                        if (text[counter] < 32 || text[counter]==127)
                                {
                                printf("ERROR: String contains characters with values out of range");
                                return 1;
                                }
                for (counter=0; counter < strlen(text); counter++, resultcounter++)
                        {
                        for (dictsearch=0; dictsearch < strlen(dictionary); dictsearch+=2)
                                if (text[counter]==dictionary[dictsearch] && text[counter+1]==dictionary[dictsearch+1])
                                        break;
                        if (dictsearch < strlen(dictionary)) // we found the characters that we want
                                {
                                text[resultcounter]=dictsearch/2+96;
                                counter++;
                                }
                        else
                                text[resultcounter]=text[counter]-31;
                        }
                text[resultcounter]='\0';
                }
        else //uncompress
                {
                char result[RESULTSTRINGMAXLENGTH];
                for (counter=0; counter < strlen(text); counter++,resultcounter++)
                        {
                        if (text[counter]<96 && text[counter] > 0)
                                result[resultcounter]=text[counter]+31;
                        else
                                {
                                if (text[counter]>95)
                                        {
                                        result[resultcounter]=dictionary[((int)text[counter]-96)*2];
                                        resultcounter++;
                                        result[resultcounter]=dictionary[((int)text[counter]-96)*2+1];
                                        }
                                else
                                        {
                                        result[resultcounter]=dictionary[(160+text[counter])*2];
                                        resultcounter++;
                                        result[resultcounter]=dictionary[(160+text[counter])*2+1];
                                        }
                                }
                        }
                result[resultcounter]='\0';
                strcpy(text,result);
                }
        return 0;
        }
        

Other articles written by Frank Cox can be found here.

Frank Cox owns and operates the Melville Theatre in Melville, Saskatchewan, Canada, and has been playing with computers for over 30 years.


This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Canada License.