Tikalon Blog by Dev Gualtieri

Data Compression

June 21, 2021

I took a course on rhetoric as a college freshman. It was taught by a rotund humanities professor who had the unenviable task of teaching this subject to a group of scientists and engineers whose only goal was to complete a humanities requirement. As you can imagine, not many textbooks were available for such a niche topic, and the thin volume we had was quite uninspiring. I did, however, glean a few important principles from this course, one of which was the need to express an idea clearly, but in as few words as possible.

Demosthenes on the Seashore by Eugène Delacroix (1859)

Rhetoric, the art of persuasion, is most commonly associated with the Greek orator, Demosthenes (Δημοσθένης, 384-322 BC).

Plutarch wrote that Demosthenes had a speech impediment as a boy, and it's thought that it was one in which he mispronounced r as an l. This is a condition known as de-rhotacism, named after the Greek letter for r, ρ (rho). This doesn't seem like that much of a problem, and I'm reminded of the way that news anchor, Tom Brokaw, pronounces his l.

It's written that Demosthenes cured himself of this speech impediment by shouting above the noise of ocean waves with pebbles in his mouth.

(Portion of Demosthenes on the Seashore, an 1859 painting by Eugène Delacroix (1798–1863), presently at the National Gallery of Ireland, via Wikimedia Commons)

Such brevity in speech is desirable, and this idea is applied technically in the various ways that language is coded into more compact forms. Prior to the development of rapid and accurate speech-to-text software, symbolic writing methods known as shorthand were used to allow people to write as rapidly as others would speak. One of my aunts learned shorthand in secretarial school. A mechanical means of recording shorthand known as stenotyping can be seen in courtroom scenes in older movies and television shows.

It's all Greek to me.

An example of shorthand from page 153 the 1916 book, "Gregg Shorthand. A Light-Line Phonography for the Million", by John Robert Gregg.

The idiom, "It's all Greek to me," can be found in Shakespeare's, The Tragedy of Julius Caesar (1599), in which one of the characters, Casca, describes a speech made in Greek by Cicero, a Roman scholar of Greek literature. Casca, who doesn't understand Greek, says that "... those that understood him smiled at one another and shook their heads; but, for mine own part, it was Greek to me."

(Portion of a Wikimedia Commons image.)

The first principal method of electrical signal compression was the telegraph Morse code, invented by portrait painter, Samuel F. B. Morse.[1-2] Morse was inspired to invent the telegraph after the death of his wife while he was absent from his New Haven, Connecticut, home while painting a portrait of the Marquis de Lafayette in Washington, D.C.. Morse, who had received news via a horseback messenger that his wife was ill, immediately returned home, but his wife died before he arrived. He reasoned that if he had received the news earlier, he would have been at her bedside, and at that point he devoted himself to the problem of telecommunication.

Morse was assisted in the invention of the telegraph by Alfred Vail,who also assisted in development of the code. Since certain letters are used more often than others, Morse created his code to make the more common letters easier to transmit. Thus "e" (frequency of occurrence 12.7%) is a single dot, a "t" (9.1%) is a single dash, while the less common "z" (0.074%) is dash-dash-dot-dot. The Federal Communications Commission abandoned its Morse code requirement for amateur radio operator licensing in 2005, and the United States Coast Guard stopped monitoring Morse Code distress calls in 1993. An interesting Morse code generating program was created by Frans van Dorsselaer in 1998, and I've included his C programming language source code here.

Binary tree of Morse code letters and numbers. Dot-heavy letters are to the left, and dash-heavy letters are to the right. In 1854, The US Supreme Court ruled (O'Reilly v. Morse) against applicability of the Morse Code apart from the telegraph implementation. This ruling confirmed the idea that an abstract idea is not patentable, only its implementation. (Created by the author using Inkscape, and also found at Wikimedia Commons. Click for larger image.)

Another example of useful data compression is the barcode, which is an automated method for retrieving data from a lookup table. The barcode was invented by Norman J. Woodland and Bernard Silver and patented in 1952.[3] The available electronics in 1952 were primitive, and this invention was demonstrated using vacuum tubes and a non-laser light source. The patent was eventually sold to Philco for just $15,000. Barcodes became useful only after the availability of inexpensive computers and laser light sources. Advances in machine vision have made the similar QR code possible.

Early mainframe computers had limited data storage, so data compression became an important topic. An early method of lossless data compression was Huffman coding, published in 1952 by computer scientist, David A. Huffman (1925-1999), when he was a doctoral student at MIT.[4] It works much like Morse code by creating a table of symbols for data items that's generated based on their frequency of occurrence. Instead of the dots and dashes of Morse code, binary ones and zeros are used.

At the advent of ubiquitous computing and the Internet, text was easy to store and send, but images were a problem. The saying that a picture is worth a thousand words was replaced by the need to represent even simple images by tens of thousands of word words of computer data. The solution was data compression using the Graphics Interchange Format (GIF). GIF was based on Lempel–Ziv–Welch (LZW) data compression, pioneered by Jacob Ziv (b. 1931) and Abraham Lempel (b. 1936) in 1977-78, and improved by Terry Welch (1939-1988) in 1983.[5-6] This algorithm was deemed significant enough to achieve IEEE Milestone status in 2004, an achievement reserved for such important things as the invention of the transistor.

The difference between Huffman coding and LZW is that LZW looks for the frequency of occurrence of repetitive sequences of data bytes, rather than individual data bytes, and uses these as table entries. For this reason it is especially effective in image encoding in which such repetition is common, such as for a uniform white background. The LZW patent (US Patent no. 4,558,302) was filed on June 20, 1983, and it was issued on December 10, 1965.[6] The importance of this algorithm was such that the patent was licensed to more than a hundred companies, the most important of which was the license to CompuServe in connection with GIF image encoding in 1993. Unisys, the patent owner, did not require licensing, or payment of fees, for non-commercial GIF usage, including GIFs on the Internet.

Portion of fig. 6 from US Patent 4,558,302

Some computer nostalgia. A snippet of FORTRAN source code, fig. 6 of US Patent No. 4,558,302, "High speed data compression and decompression apparatus and method," by Terry A. Welch, December 10, 1985.[6] (Via Google Patents.)

The Institute of Electrical and Electronics Engineers (IEEE) recently announced that Jacob Ziv, the sole member of the LZW team who was trained as an electrical engineer, has been awarded the 2021 IEEE Medal of Honor.[7] This award, first given in 1917, has been given to such luminaries as Edwin Armstrong (1890-1954), the inventor of Frequency modulation (1917), and Claude Elwood Shannon (1916-2001), the father of information theory," (1966).

Jacob Ziv in 2015.

Ziv experimented with electronic circuits as a boy, and his education was in electrical engineering. He was awarded the BSc (1954), Dip-Eng (1955), and MSc (1957), each in electrical engineering from Technion (Israel Institute of Technolog). His 1962 Sc.D from MIT was for research in message transmission in a noisy channel.[7]

Ziv is presently Technion Distinguished Professor Emeritus in its Faculty of Electrical Engineering,[7]

(Portion of a Wikimedia Commons photograph by Mark Neyman, image courtesy of the Spokesperson unit of the President of Israel.)

As with other awards, the IEEE Medal of Honor isn't given for just a single achievement, but for a life's work in a particular field. Beyond LZW, Ziv made other compression innovations. The citation for the IEEE medal is for his "... fundamental contributions to information theory and data compression technology, and for distinguished research leadership."[7] He is also the recipient of another IEEE awards, the Claude E. Shannon Award of the IEEE Information Theory Society.[7] Ziv, who is an IEEE fellow, has published about 100 peer-reviewed papers, and he is a member of the U.S. National Academy of Engineering, and the U.S. National Academy of Sciences.[7]

One complaint that Ziv made of his electrical engineering education was its primary focus on power systems and not electronics. As a consequence, he and his peers who were working to convert telemetry systems from vacuum tubes to transistors needed to teach themselves electronics.[7] He also had a complaint about computing early in his career. He says that he hated these punch card systems, and that's why he didn’t go into computer science.[7] That's also the reason why I never took a computer programming course during my own college days. I can still envision all those poor students trudging around campus with huge boxes of punch cards under their arms.

Ziv met Abraham Lempel as a fellow faculty member at Technion. Ziv says he and Lempel were the perfect match to tackle lossless data compression because of their complementary knowledge.[7] Ziv and Lempel were on simultaneous sabbaticals in the US, Ziv at Bell Labs, and Lempel at Sperry Rand, when their compression work was completed[7] Bell decided not to patent their algorithm, since software patents were not possible at the time, but Sperry decided to patent a hardware implementation.[7] Sperry subsequently patented Terry Welch's LZW algorithm.[6]

Data Compression

References: