Overview of JPEG
As mentioned earlier, JPEG actually involves 5 different compression methods.
Let's look at the process of creating a JPEG, start to finish.
1. Sub-sampling. First up, you have to know that most computer images
are stored as "RGB", which stands for red/green/blue. In a 24-bit
image, each color component is assigned 8 bits, for a total of 256 different
possible values. That means that every color you see on your monitor is
constructed by picking one of 256 values of pure red, one of 256 of pure green,
one of 256 of pure blue, and then combining those colors into one. 256 x 256 x
256 = ~ 16 million, so your "average" computer image has 16 million
colors to choose from for each pixel. OK, so what? Well, JPEG doesn't use RGB,
strictly speaking. The first step in JPEG is converting RGB to another
"color space" called YCC. YCC is very similar to the color scheme used
by color televisions. The 'Y' component is the black-and-white signal. If you
looked at the Y component alone, it would look just like a B&W version of
your original image. The two 'C' color components ('Cr' and 'Cb' are their
actual names) represent the color in the image. Collectively, the 'C' components
are called the "chrominance," and the 'Y' is called the
'luminance.'
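If you're curious what that conversion looks like in practice, here's a minimal
sketch in Python, using the BT.601-style coefficients commonly used for
JPEG/JFIF files (the function name is just for illustration):

    # Convert one RGB pixel (values 0..255) to YCbCr.
    def rgb_to_ycbcr(r, g, b):
        y  =  0.299    * r + 0.587    * g + 0.114    * b
        cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
        cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
        return y, cb, cr

    # A pure gray pixel carries no color information: Cb and Cr sit at the
    # midpoint value of 128, and Y is just the brightness.
    print(rgb_to_ycbcr(200, 200, 200))   # (200.0, 128.0, 128.0)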
OK, back to JPEG. The first step in the process is to convert from RGB to
YCC. This step is lossless, and does not involve any actual compression. But
right after converting to YCC, JPEG does do some compression. First up, you
split the image into 16x16 blocks of pixels, called 'macroblocks.' Each 16x16
macroblock consists of 4 smaller blocks of 8x8 pixels each. Here's where the
compression comes in: JPEG "subsamples" the chrominance blocks. You
should have 4 blocks of Y and 4 blocks of Cr and 4 blocks of Cb per macroblock,
right? Well, JPEG throws away 3 of the Cr blocks, stretching 1 8x8 block of Cr
to fit the whole 16x16 macroblock. The same is true for the Cb block. So you
started with 12 blocks of data per macroblock (4 each of Y, Cr, and Cb), but now
you only have 6 blocks (4 of Y, 1 of Cr, and 1 of Cb). Right off the bat, JPEG
has achieved a 2-to-1 compression...at the cost of throwing away some
information, though. Is this good or bad?
It turns out that your eye is very sensitive to changes in luminance, but not
so sensitive to changes in chrominance. As a result, you can "stretch"
the color information from an 8x8 block to fit an entire 16x16 macroblock
without your eye being able to detect the difference.
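Here's a rough sketch of that subsampling in Python. In practice most encoders
average each 2x2 group of chroma samples rather than literally keeping one
block and discarding three, but either way you end up with one 8x8 block of Cb
and one of Cr per 16x16 macroblock:

    import numpy as np

    # One macroblock's worth (16x16) of Cb or Cr samples in, 8x8 out.
    def subsample_420(chroma):
        c = np.asarray(chroma, dtype=float)
        # Average each 2x2 neighborhood down to a single sample.
        return (c[0::2, 0::2] + c[0::2, 1::2] +
                c[1::2, 0::2] + c[1::2, 1::2]) / 4.0

    cb = np.random.randint(0, 256, size=(16, 16))
    print(subsample_420(cb).shape)   # (8, 8) -- a quarter of the samples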
2. DCT. DCT stands for "discrete cosine transform,"
and this is the heart of the JPEG algorithm. If you're familiar with a
Fourier transform, DCT is the same idea except that it uses only cosines for the
basis functions. If you've never encountered Fourier transforms, the basic idea
is that you can transform a signal to determine what frequency components make
up that signal. Think of striking two notes on a piano at the same time; your
ear can detect that there are two different notes in there, and if you have a
good ear you can figure out which exact notes. A Fourier transform is a
mathematical way of performing the same process. DCT does a similar thing to the
blocks of image data, except that it changes a 2-D block of pixels into a 2-D
block of frequency components. The DCT stage of JPEG is "almost"
lossless, but it does have some losses. In principle, you need
infinite-precision arithmetic to do this sort of conversion exactly. But JPEG
implementations work with finite-precision (often integer) math on just 64
values (i.e. an 8x8 block), so the DCT introduces small rounding errors.
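To make the DCT concrete, here's the 8x8 forward DCT written out straight from
the textbook formula in Python. Real encoders use much faster factored
versions (and first subtract 128 from each sample so values center on zero),
but the result is the same kind of 8x8 block of frequency coefficients:

    import math

    # out[0][0] is the "DC" term (the block's average level); the other 63
    # coefficients represent progressively higher spatial frequencies.
    def dct_8x8(block):
        def c(k):                       # normalization factor
            return 1 / math.sqrt(2) if k == 0 else 1.0
        out = [[0.0] * 8 for _ in range(8)]
        for u in range(8):
            for v in range(8):
                s = 0.0
                for x in range(8):
                    for y in range(8):
                        s += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
                out[u][v] = 0.25 * c(u) * c(v) * s
        return out

    # A perfectly flat block has only a DC component; everything else is ~0.
    flat = [[100] * 8 for _ in range(8)]
    print(round(dct_8x8(flat)[0][0]))   # 800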
3. Quantization. This is where the major compression takes place
in JPEG. After the DCT stage, you have an 8x8 block of frequency components.
Then you build an 8x8 "quantization" block, which is just an 8x8 block
of numbers. You take each number in the 8x8 DCT block and divide it by the
corresponding number in the quantization block. Obviously, the larger the number
in the quant block, the smaller the resulting dividend will be. But wait,
there's more. The division is an integer process. For instance, if you divide 8
by 3, normally you'd expect 2.6666 to be your answer. But in 'integer math,' the
answer is 2 with a remainder of 2...except we don't keep remainders, so 8 / 3 =
2 in integer math. That's how the quantization is done. But there's an
implication there: any dividend that is smaller than 1 will be set to zero.
Later on, the zeros will be discarded.
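A small sketch of that step in Python, following the integer-division
description above (real encoders usually round to the nearest integer rather
than truncating, but the effect is the same: coefficients smaller than their
divisors collapse to zero):

    # Divide each DCT coefficient by the matching entry in the quant block,
    # keeping only the integer part.
    def quantize(dct_block, quant_block):
        return [[int(dct_block[u][v] / quant_block[u][v])
                 for v in range(8)]
                for u in range(8)]

    print(int(8 / 3))    # 2  (the remainder is dropped)
    print(int(5 / 16))   # 0  (small coefficient, big divisor -> gone)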
The quantization stage is where you get to pick your image quality. When you
choose "high quality," the software supplies a quantization block
whose values are close to 1, so fewer coefficients get truncated to
zero. When you choose "low quality," the software supplies a
quantization block with large values, so more and more cells of image data get
divided down to zero. This is the reason that it's difficult to compare the image
quality settings of different programs; JPEG doesn't specify the values of the
quantization blocks, so software developers are free to choose their own. The
quant block gets written into the final JPEG file so the image-display program
will know exactly what values were used in the quant block -- but the actual
values are up to the software developers. This gives the developers tremendous
freedom, but it also means that two different software packages can have
radically different quant blocks (and thus radically different quantization
results), even at the same quality setting.
The quantization step is the most lossy of all the steps, but how lossy it is
depends entirely on the quant block, i.e. on the image quality setting.
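As one example of how a quality slider can be turned into a quant block, here's
(roughly) the scaling convention used by the IJG libjpeg reference code; other
programs do it differently, which is exactly why the same 'quality' number
means different things in different programs:

    # Scale a base quant block by a 1..100 quality setting.
    def scale_quant_table(base_table, quality):
        quality = max(1, min(100, quality))
        scale = 5000 // quality if quality < 50 else 200 - 2 * quality
        return [[min(255, max(1, (q * scale + 50) // 100)) for q in row]
                for row in base_table]

    # At quality 50 the base table is used as-is; at quality 100 it collapses
    # toward all 1s (almost nothing gets truncated); at quality 10 the
    # divisors get huge and most coefficients go to zero.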
4. Run-length Encoding. This is the step that removes the zeros from
the quantized DCT data. Run-length encoding is a very old idea. Here's an
example: let's say we have a data stream like this: ABBBAACCCCD. Instead of
writing that stream directly, you could write A3B2A4CD, which you should read as
"A" then "3 Bs" then "2 As" then "4 Cs"
then "D." Our original stream of data was 11 bytes long, and the
compressed one is just 8 bytes long. We saved 3 bytes. Clearly, though, the
savings are dependent on how often you have repeated strings of values, and how
long those strings are. You want longer strings of repeated values to make
run-length encoding work better. Well, JPEG really concerns itself only with
strings of zeroes, so in order to get good performance from the run-length
encoding stage, you want long strings of zeroes. JPEG "encourages"
this with the quant matrix. If you use larger and larger values as you go
"down" and "across" the matrix, you pretty much ensure that
you'll get a general trend of more and more zeroes in the process. This is
"OK" from a visual standpoint, because the later values represent
higher-frequency components. In other words, the DCT values that get truncated
are the ones that allow for very sharp transitions between one color (or
brightness) and the next in adjacent pixels. This is why your images start to
look blurry at lower-quality settings; you are throwing away the information
that allows them to be sharp!
Run-length encoding is lossless.
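Here's the toy run-length encoder from the example above, sketched in Python
(JPEG's own scheme is specialized for runs of zeros, but the principle is
identical):

    # Runs of two or more identical symbols become <count><symbol>;
    # single symbols are written as-is.
    def run_length_encode(data):
        out = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            run = j - i
            out.append(data[i] if run == 1 else "%d%s" % (run, data[i]))
            i = j
        return "".join(out)

    print(run_length_encode("ABBBAACCCCD"))   # A3B2A4CD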
5. Huffman Coding. This is the last stage, and it's difficult to
explain. Basically, Huffman coding squeezes out statistical redundancy in the data.
It does this by using short bit-strings to represent commonly occurring values,
and longer strings to represent less common values. Imagine how much space you
could save if you could store the letter 'e' in half the space! Well, Huffman
coding is a way to do just that. An example: ASCII coding is the
more-or-less universal method for storing text written in English. It uses 7
bits to store each & every character. In ASCII coding, a lower-case 'e' has
a value of 101, which in binary is 1100101. What if we chose to represent 'e'
with a single binary bit, say '1', instead of the 7-bit string listed above?
Great! But then we're stuck, aren't we, because we could only have two letters
in our alphabet, right? Well, no. The next most common letter is 't'. What if we
used '01' for 't'? And how about '001' for 'a'? And '0001' for 'o'? (Notice that
no code is the beginning of any other code, so the stream can still be decoded
unambiguously.) You get
the idea. By using smaller bit-strings to represent more common values, you can
use fewer bits overall to encode the whole data stream. JPEG uses Huffman coding
to encode the run-length values from stage 4. It's a small increase in
compression, maybe 10-20% at best, but it's cheap, easy, and better than
nothing.
Huffman coding is lossless.
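To see the idea in code, here's a small sketch of building a Huffman code in
Python. Common symbols end up with short bit-strings, rare ones with long
bit-strings, and no code is a prefix of another, so the stream decodes
unambiguously:

    import heapq
    from collections import Counter

    # Repeatedly merge the two least-frequent groups, prepending '0' to the
    # codes on one side and '1' to the codes on the other.
    def huffman_code(data):
        heap = [(freq, i, {sym: ""})
                for i, (sym, freq) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, lo = heapq.heappop(heap)
            f2, _, hi = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in lo.items()}
            merged.update({s: "1" + code for s, code in hi.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    codes = huffman_code("this is an example of a huffman tree")
    print(codes[" "], codes["x"])   # space gets a short code, 'x' a long one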
So there's JPEG in a nutshell. There are 5 stages in the compression. 3 are
lossless (counting the DCT, whose rounding losses are tiny), 2 are lossy, and of
the 2 lossy stages you can control the 'lossiness' in only one: the quantization step.