[Surfside] New York Times -- Fun With Your Zip Program: Sort Through Texts,
and More
Roman Zimmermann
roman@waldenweb.com
Sat, 18 May 2002 15:57:50 -0500
http://www.nytimes.com/2002/04/30/science/physical/30ZIP.html
(requires free registration)
Fun With Your Zip Program: Sort Through Texts, and More
By BRUCE SCHECHTER
One of the basic truths of the digital age is that almost anything --
the plays of Shakespeare, the genetic sequence of DNA, or the twitching
of a seismograph needle -- can be reduced to a sequence of ones and
zeroes. More striking is the discovery that these sequences are largely
full of hot air -- redundancies that add nothing to their meaning.
Clever computer programs can "zip" or compress these files, streamlining
them for speedier transmission. Zipping programs have long been a boon
to computer users with slow modem connections.
But now a group of Italian physicists has shown how these same programs
can be used to analyze and categorize text quickly. Using little more
than the zipping programs found on most personal computers, they can
easily distinguish between texts written in 10 different languages and
almost unfailingly tell which of a large group of texts were written by
the same author.
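
A minimal sketch of how a zip program might be used this way, assuming
Python's zlib module as a stand-in for whatever zipping program the
physicists used. The article does not give their exact procedure; the
function names and the "smallest growth wins" rule are illustrative
assumptions. The idea is to append the unknown text to each reference text,
compress, and see which reference makes the compressed size grow the least.

```python
import zlib


def zipped_size(data: bytes) -> int:
    """Size in bytes of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, unknown: str) -> int:
    """How many bytes the compressed reference grows when unknown is appended."""
    ref = reference.encode("utf-8")
    return zipped_size(ref + unknown.encode("utf-8")) - zipped_size(ref)


def closest_reference(unknown: str, references: dict[str, str]) -> str:
    """Label of the reference text that compresses the unknown text best.

    references maps a label (a language or an author, for instance) to a
    sample text. This selection rule is an assumption, not the published method.
    """
    return min(references, key=lambda label: extra_bytes(references[label], unknown))
```

With one reference text per language or per author, closest_reference
returns the label whose text shares the most reusable patterns with the
unknown text. Longer reference texts should give steadier answers, since
the compressor has more material to learn from.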
[ ... ]
"Transmitting an Italian text with a Morse code optimized for English
will result in the need of transmitting an extra number of bits," they
wrote. They conjectured that just how many extra bits it takes would be
a measure of the distance between English and Italian.
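
The "extra bits" idea can be made concrete with a toy calculation. The
sketch below is not the researchers' method (they used zip programs); it
assumes a much simpler model in which a code is tuned only to single
character frequencies, so that a character with probability p ideally
costs -log2(p) bits. The function name, the add-one smoothing for
characters the code has never seen, and the character-level model are all
assumptions made for illustration.

```python
import math
from collections import Counter


def extra_bits(message: str, code_text: str) -> float:
    """Extra bits needed to send message with a code tuned to code_text.

    The baseline is an ideal code tuned to the message's own character
    frequencies. The result is never negative: a code tuned to the wrong
    text cannot beat a code tuned to the message itself.
    """
    alphabet = set(message) | set(code_text)
    message_counts = Counter(message)
    code_counts = Counter(code_text)
    n = len(message)
    # Add-one smoothing (an assumption) so unseen characters get a finite cost.
    m = len(code_text) + len(alphabet)

    total = 0.0
    for ch, count in message_counts.items():
        ideal_cost = -math.log2(count / n)
        actual_cost = -math.log2((code_counts[ch] + 1) / m)
        total += count * (actual_cost - ideal_cost)
    return total
```

Calling extra_bits(italian_text, english_text) with two sample texts of
your own gives the penalty, in bits, for sending the Italian sample with
the English-tuned code. Dividing by the message length gives a per-character
figure that can be compared across pairs of languages.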
[ ... ]
The scientists performed a further test of their technique by analyzing
a single text that has been translated into many different languages --
in this case the Universal Declaration of Human Rights. The researchers
used their method to measure the linguistic "distance" between more than
50 translations of this document. From these distances, they constructed
a family tree of languages that is virtually identical to the one
constructed by linguists.
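
A rough sketch of how pairwise distances could be turned into a tree. The
article does not say which distance formula or tree-building method the
researchers used, so two stand-ins are swapped in here and should be read as
assumptions: a symmetrised compression distance computed with Python's zlib,
and plain average-linkage clustering that repeatedly joins the two closest
groups. All function names are hypothetical.

```python
import zlib
from itertools import combinations


def zipped_size(text: str) -> int:
    """Size in bytes of the text after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def distance(a: str, b: str) -> float:
    """Symmetrised compression distance (a stand-in, not the published formula)."""
    size_a, size_b = zipped_size(a), zipped_size(b)
    # Average both concatenation orders so distance(a, b) == distance(b, a).
    joint = (zipped_size(a + b) + zipped_size(b + a)) / 2
    return (joint - min(size_a, size_b)) / max(size_a, size_b)


def build_tree(texts: dict[str, str]):
    """Join the closest groups until one nested tuple of labels remains."""
    if not texts:
        raise ValueError("need at least one text")
    pair_dist = {
        frozenset((x, y)): distance(texts[x], texts[y])
        for x, y in combinations(texts, 2)
    }
    # Each cluster is (subtree, list of leaf labels under that subtree).
    clusters = [(name, [name]) for name in texts]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            leaves_i, leaves_j = clusters[i][1], clusters[j][1]
            # Average distance over every pair of leaves across the two groups.
            avg = sum(
                pair_dist[frozenset((x, y))] for x in leaves_i for y in leaves_j
            ) / (len(leaves_i) * len(leaves_j))
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Given a dictionary that maps language names to translations of the same
document, build_tree returns nested pairs in which closely grouped names are
the ones whose texts compress well together. Short samples give noisy
distances, so whether the result resembles a linguist's tree depends on
having long enough translations.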
[ ... ]