INTERNET-DRAFT A. Costanzo draft-costanzo-lzju90-mime-01.txt AKC Computer Services Corp. Expires: March 8th, 1999 September 3, 1998 Definition of the LZJU90 MIME Content Transfer Encoding Type 1. Status of this Memo This document is an Internet-Draft. Internet-Drafts are working docu- ments of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet- Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast, or ftp.isi.edu (US West Coast). Distribution of this memo is unlimited. 2. Abstract This memo defines a new transfer encoding type for MIME, namely LZJU90. LZJU90 specifies a section consisting of an encoded binary or text object. The encoding provides both compression and representation in a text format. The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. 3. Introduction The Multipurpose Internet Message Extensions (MIME) define a facility whereby messages may use a specific transfer encoding scheme to survive the process of transport through various mail delivery systems and transport agents. This Draft specifies and defines the LZJU90 transfer encoding for MIME compliant mail systems. This transfer encoding was first defined in RFC 1505 [1]. Costanzo [Page 1] EXPIRES IN SIX MONTHS September 3, 1998 LZJU90 specifies a section consisting of an encoded binary or text object. The encoding (defined below) provides both compression and representation in a text format. This encoding is advantageous and superior over other encoding schemes in that the resulting object is compressed, usually much smaller than an object using a transfer encoding of some other type. The resulting compressed object, is in the character set ISO-10646-UTF-8. [2] [3] 3. Definition of the LZJU90 Compressed Encoding LZJU90 is an encoding for a binary or text object to be sent in an Internet mail message. The encoding provides both compression and representation in a text format that will successfully survive transmission through the many different mailers and gateways that comprise the Internet and connected mail networks. 3.1 Overview The encoding first compresses the binary object, using a modified LZ77 algorithm, called LZJU90. It then encodes each 6 bits of the output of the compression as a text character, using a character set chosen to survive any translations between codes, such as ASCII to EBCDIC. The 64 six-bit strings 000000 through 111111 are represented by the characters "+", "-", "0" to "9", "A" to "Z", and "a" to "z". The output text begins with a line identifying the encoding. This is for visual reference only, the content transfer encoding header identifies the section to the user program. It also names the object that was encoded, usually by a file name. The format of this line is: * LZJU90 where is optional. For example: * LZJU90 foobar This is followed by the compressed and encoded data, broken into lines where convenient. It is recommended that lines be broken about every 76 characters length. The decoder must accept lines with 1 to 1000 characters on each line. After this, there is one final line that gives the number of bytes in the original data and a CRC of the original data. This should match the byte count and CRC found during decompression. This line has the format: * Costanzo [Page 2] EXPIRES IN SIX MONTHS September 3, 1998 where is a decimal number, and CRC is 8 hexadecimal digits. For example: * 4128076 5AC2D50E The count used in encoding the object. This numeral is the total number of lines, including the start and end lines that begin with *. 3.2 Specification of the LZJU90 compression The Lempel-Ziv-Storer-Szymanski model of mixing pointers and literal characters is used in the compression algorithm. Repeat occurrences of strings of octets are replaced by pointers to the earlier occurrence. The data compression is defined by the decoding algorithm. Any encoder that emits symbols which cause the decoder to produce the original input is defined to be valid. There are many possible strategies for the maximal-string matching that the encoder does, section 3.2.2 gives an example of one such algorithm. Regardless of which algorithm is used, and what tradeoffs are made between compression ratio and execution speed or space, the result can always be decoded by the simple decoder. The compressed data consists of a mixture of unencoded literal characters and copy pointers which point to an earlier occurrence of the string to be encoded. Compressed data contains two types of codewords: LITERAL pass the literal directly to the uncompressed output. COPY length, offset go back offset characters in the output and copy length characters forward to the current position. To distinguish between codewords, the copy length is used. A copy length of zero indicates that the following codeword is a literal codeword. A copy length greater than zero indicates that the following codeword is a copy codeword. To improve copy length encoding, a threshold value of 2 has been subtracted from the original copy length for copy codewords, because the minimum copy length is 3 in this compression scheme. Costanzo [Page 3] EXPIRES IN SIX MONTHS September 3, 1998 The maximum offset value is set at 32255. Larger offsets offer extremely low improvements in compression (less than 1 percent, typically). No special encoding is done on the LITERAL characters. However, unary encoding is used for the copy length and copy offset values to improve compression. A start-step-stop unary code is used. A (start, step, stop) unary code of the integers is defined as follows: The Nth codeword has N ones followed by a zero followed by a field of size START + (N * STEP). If the field width is equal to STOP then the preceding zero can be omitted. The integers are laid out sequentially through these codewords. For example, (0, 1, 4) would look like: Codeword Range 0 0 10x 1-2 110xx 3-6 1110xxx 7-14 1111xxxx 15-30 Following are the actual values used for copy length and copy offset: The copy length is encoded with a (0, 1, 7) code leading to a maximum copy length of 256 by including the THRESHOLD value of 2. Codeword Range 0 0 10x 3-4 110xx 5-8 1110xxx 9-16 11110xxxx 17-32 111110xxxxx 33-64 1111110xxxxxx 65-128 1111111xxxxxxx 129-256 The copy offset is encoded with a (9, 1, 14) code leading to a maximum copy offset of 32255. Offset 0 is reserved as an end of compressed data flag. Codeword Range 0xxxxxxxxx 0-511 10xxxxxxxxxx 512-1535 110xxxxxxxxxxx 1536-3583 1110xxxxxxxxxxxx 3485-7679 11110xxxxxxxxxxxxx 7680-15871 11111xxxxxxxxxxxxxx 15872-32255 Costanzo [Page 4] EXPIRES IN SIX MONTHS September 3, 1998 The 0 has been chosen to signal the start of the field for ease of encoding. (The bit generator can simply encode one more bit than is significant in the binary representation of the excess.) The stop values are useful in the encoding to prevent out of range values for the lengths and offsets, as well as shortening some codes by one bit. The worst case compression using this scheme is a 1/8 increase in size of the encoded data. (One zero bit followed by 8 character bits). After the character encoding, the worst case ratio is 3/2 to the original data. The minimum copy length of 3 has been chosen because the worst case copy length and offset is 3 bits (3) and 19 bits (32255) for a total of 22 bits to encode a 3 character string (24 bits). 3.2.1 Sample Decoder As mentioned previously, the compression is defined by the decoder. Any encoder that produced output that is correctly decoded is by definition correct. The following is an implementation of the decoder, written more for clarity and as much portability as possible, rather than for maximum speed. When optimized for a specific environment, it will run significantly faster. /* LZJU 90 Decoding program */ /* Written By Robert Jung and Robert Ullmann, 1990 and 1991. */ /* This code is NOT COPYRIGHT, not protected. It is in the true Public Domain. */ #include #include typedef unsigned char uchar; typedef unsigned int uint; #define N 32255 #define THRESHOLD 3 #define STRTP 9 #define STEPP 1 #define STOPP 14 #define STRTL 0 #define STEPL 1 #define STOPL 7 Costanzo [Page 5] EXPIRES IN SIX MONTHS September 3, 1998 static FILE *in; static FILE *out; static int getbuf; static int getlen; static long in_count; static long out_count; static long crc; static long crctable[256]; static uchar xxcodes[] = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\ abcdefghijklmnopqrstuvwxyz"; static uchar ddcodes[256]; static uchar text[N]; #define CRCPOLY 0xEDB88320 #define CRC_MASK 0xFFFFFFFF #define UPDATE_CRC(crc, c) \ crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \ ^ (crc >> 8) #define START_RECD "* LZJU90" void MakeCrctable() /* Initialize CRC-32 table */ { uint i, j; long r; for (i = 0; i <= 255; i++) { r = i; for (j = 8; j > 0; j--) { if (r & 1) r = (r >> 1) ^ CRCPOLY; else r >>= 1; } crctable[i] = r; } } int GetXX() /* Get xxcode and translate */ { int c; do { if ((c = fgetc(in)) == EOF) c = 0; } while (c == '\n'); in_count++; return ddcodes[c]; } Costanzo [Page 6] EXPIRES IN SIX MONTHS September 3, 1998 int GetBit() /* Get one bit from input buffer */ { int c; while (getlen <= 0) { c = GetXX(); getbuf |= c << (10-getlen); getlen += 6; } c = (getbuf & 0x8000) != 0; getbuf <<= 1; getbuf &= 0xFFFF; getlen--; return(c); } int GetBits(int len) /* Get len bits */ { int c; while (getlen <= 10) { c = GetXX(); getbuf |= c << (10-getlen); getlen += 6; } if (getlen < len) { c = (uint)getbuf >> (16-len); getbuf = GetXX(); c |= getbuf >> (6+getlen-len); getbuf <<= (10+len-getlen); getbuf &= 0xFFFF; getlen -= len - 6; } else { c = (uint)getbuf >> (16-len); getbuf <<= len; getbuf &= 0xFFFF; getlen -= len; } return(c); } int DecodePosition() /* Decode offset position pointer */ { int c; int width; int plus; int pwr; plus = 0; pwr = 1 << STRTP; for (width = STRTP; width < STOPP; width += STEPP) { c = GetBit(); if (c == 0) Costanzo [Page 7] EXPIRES IN SIX MONTHS September 3, 1998 break; plus += pwr; pwr <<= 1; } if (width != 0) c = GetBits(width); c += plus; return(c); } int DecodeLength() /* Decode code length */ { int c; int width; int plus; int pwr; plus = 0; pwr = 1 << STRTL; for (width = STRTL; width < STOPL; width += STEPL) { c = GetBit(); if (c == 0) break; plus += pwr; pwr <<= 1; } if (width != 0) c = GetBits(width); c += plus; return(c); } void InitCodes() /* Initialize decode table */ { int i; for (i = 0; i < 256; i++) ddcodes[i] = 0; for (i = 0; i < 64; i++) ddcodes[xxcodes[i]] = i; return; } main(int ac, char **av) /* main program */ { int r; int j, k; int c; int pos; char buf[80]; char name[3]; long num, bytes; Costanzo [Page 8] EXPIRES IN SIX MONTHS September 3, 1998 if (ac < 3) { fprintf(stderr, "usage: judecode in out\n"); return(1); } in = fopen(av[1], "r"); if (!in){ fprintf(stderr, "Can't open %s\n", av[1]); return(1); } out = fopen(av[2], "wb"); if (!out) { fprintf(stderr, "Can't open %s\n", av[2]); fclose(in); return(1); } while (1) { if (fgets(buf, sizeof(buf), in) == NULL) { fprintf(stderr, "Unexpected EOF\n"); return(1); } if (strncmp(buf, START_RECD, strlen(START_RECD)) == 0) break; } in_count = 0; out_count = 0; getbuf = 0; getlen = 0; InitCodes(); MakeCrctable(); crc = CRC_MASK; r = 0; while (feof(in) == 0) { c = DecodeLength(); if (c == 0) { c = GetBits(8); UPDATE_CRC(crc, c); out_count++; text[r] = c; fputc(c, out); if (++r >= N) r = 0; } Costanzo [Page 9] EXPIRES IN SIX MONTHS September 3, 1998 else { pos = DecodePosition(); if (pos == 0) break; pos--; j = c + THRESHOLD - 1; pos = r - pos - 1; if (pos < 0) pos += N; for (k = 0; k < j; k++) { c = text[pos]; text[r] = c; UPDATE_CRC(crc, c); out_count++; fputc(c, out); if (++r >= N) r = 0; if (++pos >= N) pos = 0; } } } fgetc(in); /* skip newline */ if (fscanf(in, "* %ld %lX", &bytes, &num) != 2) { fprintf(stderr, "CRC record not found\n"); return(1); } else if (crc != num) { fprintf(stderr, "CRC error, expected %lX, found %lX\n", crc, num); return(1); } else if (bytes != out_count) { fprintf(stderr, "File size error, expected %lu, found %lu\n", bytes, out_count); return(1); } else fprintf(stderr, "File decoded to %lu bytes correctly\n", out_count); Costanzo [Page 10] EXPIRES IN SIX MONTHS September 3, 1998 fclose(in); fclose(out); return(0); } 3.2.2 An Example of an Encoder Many algorithms are possible for the encoder, with different tradeoffs between speed, size, and complexity. The following is an example program which is fairly efficient; more sophisticated implementations will run much faster, and produce somewhat better compression. This example also shows that the encoder need not use the entire window available. Not using the full window costs a small amount of compression, but can greatly increase the speed of some algorithms. /* LZJU 90 Encoding program */ /* Written By Robert Jung and Robert Ullmann, 1990 and 1991. */ /* This code is NOT COPYRIGHT, not protected. It is in the true Public Domain. */ #include typedef unsigned char uchar; typedef unsigned int uint; #define N 24000 /* Size of window buffer */ #define F 256 /* Size of look-ahead buffer */ #define THRESHOLD 3 #define K 16384 /* Size of hash table */ #define STRTP 9 #define STEPP 1 #define STOPP 14 #define STRTL 0 #define STEPL 1 #define STOPL 7 #define CHARSLINE 78 static FILE *in; static FILE *out; static int putlen; static int putbuf; static int char_ct; static long in_count; static long out_count; Costanzo [Page 11] EXPIRES IN SIX MONTHS September 3, 1998 static long crc; static long crctable[256]; static uchar xxcodes[] = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\ abcdefghijklmnopqrstuvwxyz"; uchar window_text[N + F + 1]; /* text contains window, plus 1st F of window again (for comparisons) */ uint hash_table[K]; /* table of pointers into the text */ #define CRCPOLY 0xEDB88320 #define CRC_MASK 0xFFFFFFFF #define UPDATE_CRC(crc, c) \ crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \ ^ (crc >> 8) void MakeCrctable() /* Initialize CRC-32 table */ { uint i, j; long r; for (i = 0; i <= 255; i++) { r = i; for (j = 8; j > 0; j--) { if (r & 1) r = (r >> 1) ^ CRCPOLY; else r >>= 1; } crctable[i] = r; } } void PutXX(int c) /* Translate and put xxcode */ { c = xxcodes[c & 0x3F]; if (++char_ct > CHARSLINE) { char_ct = 1; fputc('\n', out); } fputc(c, out); out_count++; } Costanzo [Page 12] EXPIRES IN SIX MONTHS September 3, 1998 void PutBits(int c, int len) /* Put rightmost "len" bits of "c" */ { c <<= 16 - len; c &= 0xFFFF; putbuf |= (uint) c >> putlen; c <<= 16 - putlen; c &= 0xFFFF; putlen += len; while (putlen >= 6) { PutXX(putbuf >> 10); putlen -= 6; putbuf <<= 6; putbuf &= 0xFFFF; putbuf |= (uint) c >> 10; c = 0; } } void EncodePosition(int ch) /* Encode offset position pointer */ { int width; int prefix; int pwr; pwr = 1 << STRTP; for (width = STRTP; ch >= pwr; width += STEPP, pwr <<= 1) ch -= pwr; if ((prefix = width - STRTP) != 0) PutBits(0xffff, prefix); if (width < STOPP) width++; /* else if (width > STOPP) abort(); do nothing */ PutBits(ch, width); } void EncodeLength(int ch) /* Encode code length */ { int width; int prefix; int pwr; pwr = 1 << STRTL; for (width = STRTL; ch >= pwr; width += STEPL, pwr <<= 1) ch -= pwr; if ((prefix = width - STRTL) != 0) PutBits(0xffff, prefix); if (width < STOPL) width++; /* else if (width > STOPL) abort(); do nothing */ PutBits(ch, width); } Costanzo [Page 13] EXPIRES IN SIX MONTHS September 3, 1998 main(int ac, char **av) /* main program */ { uint r, s, i, c; uchar *p, *rp; int match_position; int match_length; int len; uint hash, h; if (ac < 3) { fprintf(stderr, "usage: juencode in out\n"); return(1); } in = fopen(av[1], "rb"); if (!in) { fprintf(stderr, "Can't open %s\n", av[1]); return(1); } out = fopen(av[2], "w"); if (!out) { fprintf(stderr, "Can't open %s\n", av[2]); fclose(in); return(1); } char_ct = 0; in_count = 0; out_count = 0; putbuf = 0; putlen = 0; hash = 0; MakeCrctable(); crc = CRC_MASK; fprintf(out, "* LZJU90 %s\n", av[1]); /* The hash table inititialization is somewhat arbitrary */ for (i = 0; i < K; i++) hash_table[i] = i % N; r = 0; s = 0; /* Fill lookahead buffer */ for (len = 0; len < F && (c = fgetc(in)) != EOF; len++) { UPDATE_CRC(crc, c); in_count++; window_text[s++] = c; } Costanzo [Page 14] EXPIRES IN SIX MONTHS September 3, 1998 while (len > 0) { /* look for match in window at hash position */ h = ((((window_text[r] << 5) ^ window_text[r+1]) << 5) ^ window_text[r+2]); p = window_text + hash_table[h % K]; rp = window_text + r; for (i = 0, match_length = 0; i < F; i++) { if (*p++ != *rp++) break; match_length++; } match_position = r - hash_table[h % K]; if (match_position <= 0) match_position += N; if (match_position > N - F - 2) match_length = 0; if (match_position > in_count - len - 2) match_length = 0; /* ! :-) */ if (match_length > len) match_length = len; if (match_length < THRESHOLD) { EncodeLength(0); PutBits(window_text[r], 8); match_length = 1; } else { EncodeLength(match_length - THRESHOLD + 1); EncodePosition(match_position); } for (i = 0; i < match_length && (c = fgetc(in)) != EOF; i++) { UPDATE_CRC(crc, c); in_count++; window_text[s] = c; if (s < F - 1) window_text [s + N] = c; if (++s > N - 1) s = 0; hash = ((hash << 5) ^ window_text[r]); if (r > 1) hash_table[hash % K] = r - 2; if (++r > N - 1) r = 0; } while (i++ < match_length) { if (++s > N - 1) s = 0; hash = ((hash << 5) ^ window_text[r]); if (r > 1) hash_table[hash % K] = r - 2; if (++r > N - 1 ) r = 0; len--; } } Costanzo [Page 15] EXPIRES IN SIX MONTHS September 3, 1998 /* end compression indicator */ EncodeLength(1); EncodePosition(0); PutBits(0, 7); fprintf(out, "\n* %lu %08lX\n", in_count, crc); fprintf(stderr, "Encoded %lu bytes to %lu symbols\n", in_count, out_count); fclose(in); fclose(out); return(0); } 3.3.1 Example of a MIME LZJU90 Compressed Object The following is an example of a MIME LZJU90 compressed object. Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: LZJU90 * LZJU90 example 8-mBtWA7WBVZ3dEBtnCNdU2WkE4owW+l4kkaApW+o4Ir0k33Ao4IE4kk bYtk1XY618NnCQl+OHQ61d+J8FZBVVCVdClZ2-LUI0v+I4EraItasHbG VVg7c8tdk2lCBtr3U86FZANVCdnAcUCNcAcbCMUCdicx0+u4wEETHcRM 7tZ2-6Btr268-Eh3cUAlmBth2-IUo3As42laIE2Ao4Yq4G-cHHT-wCEU 6tjBtnAci-I++ * 190 081E2601 4. Security Consideration The security (or lack) is responsibility of the application domain controlling the decoder of the LZJU90 object. 5. References [1] Costanzo, A. Robinson, D. and R. Ullmann, "Encoding Header Field for Internet Messages", RFC 1505, AKC Consulting, Prime Computer, Inc., August 1993. [2] International Organization for Standardization, Information Technology -- Universal Coded Character Set (UCS). ISO/IEC 10646-1:1993, June 1993. [3] International Organization for Standardization, Information Technology -- Universal Coded Character Set (UCS). ISO/IEC 10646-1: 1993/AMD.2: 1996 (E) Costanzo [Page 16] EXPIRES IN SIX MONTHS September 3, 1998 6. Acknowledgments The author would like to thank Robert Jung, David Robinson and Robert Ullmann for their past contributions to this work. 7. Author's Address Al Costanzo AKC Computer Services Corp. P.O. Box 4031 Roselle Park, NJ 07204-0531 Phone: +1 908 298 9000 Email: AL@AKC.COM Costanzo [Page 17]