[Surfside] New York Times -- Fun With Your Zip Program: Sort Through Texts, and More

Roman Zimmermann roman@waldenweb.com
Sat, 18 May 2002 15:57:50 -0500


http://www.nytimes.com/2002/04/30/science/physical/30ZIP.html
(requires free registration)

Fun With Your Zip Program: Sort Through Texts, and More
By BRUCE SCHECHTER

One of the basic truths of the digital age is that almost anything -- 
the plays of Shakespeare, the genetic sequence of DNA, or the twitching 
of a seismograph needle -- can be reduced to a sequence of ones and 
zeroes. More striking is the discovery that these sequences are largely 
full of hot air -- redundancies that add nothing to their meaning. 
Clever computer programs can "zip" or compress these files, streamlining 
them for speedier transmission. Zipping programs have long been a boon 
to computer users with slow modem connections.

But now a group of Italian physicists has shown how these same programs 
can be used to analyze and categorize text quickly. Using little more 
than the zipping programs found on most personal computers, they can 
easily distinguish between texts written in 10 different languages and 
almost unfailingly tell which of a large group of texts were written by 
the same author.

[ ... ]

"Transmitting an Italian text with a Morse code optimized for English 
will result in the need of transmitting an extra number of bits," they 
wrote. They conjectured that just how many extra bits it takes would be 
a measure of the distance between English and Italian.
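
The "extra bits" idea can be tried with any general-purpose compressor. 
What follows is a minimal sketch, not necessarily the physicists' exact 
procedure: it assumes Python's standard zlib module as a stand-in for a 
zip program, and the function names (compressed_size, extra_bytes, 
closest_reference) are invented here for illustration. It measures how 
many additional compressed bytes a sample costs when appended to a 
reference text, and picks the reference for which that cost is lowest.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed to append sample to reference.

    The compressor finds repeated strings in reference first, so a sample
    that resembles reference tends to cost fewer additional bytes.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def closest_reference(sample: bytes, references: Mapping[str, bytes]) -> str:
    """Label of the reference against which sample is cheapest to encode."""
    return min(references, key=lambda label: extra_bytes(references[label], sample))
```

Given a dict mapping language names (or author names) to reference 
texts as bytes, closest_reference(sample, references) returns the label 
whose text makes the sample cheapest to encode. The sample should be 
short relative to the reference, and because DEFLATE only looks back 
32 KB, reference material earlier than that has no effect on the result.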

[ ... ]

The scientists performed a further test of their technique by analyzing 
a single text that has been translated into many different languages -- 
in this case the Universal Declaration of Human Rights. The researchers 
used their method to measure the linguistic "distance" between more than 
50 translations of this document. From these distances, they constructed 
a family tree of languages that is virtually identical to the one 
constructed by linguists.
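
The excerpt does not say which tree-building method the researchers 
used. As a stand-in, the sketch below uses plain average-linkage 
agglomerative clustering: start with every language on its own, then 
repeatedly merge the two closest groups. The function name build_tree 
and the shape of the distance table are assumptions made for this 
illustration only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple, Union

# A tree is either a leaf label or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def build_tree(labels: Sequence[str],
               distance: Dict[FrozenSet[str], float]) -> Tree:
    """Average-linkage clustering over pairwise distances.

    distance maps frozenset({a, b}) to the distance between labels a and b.
    labels must contain at least one entry.
    """
    # Each cluster is (subtree, list of leaf labels under it).
    clusters: List[Tuple[Tree, List[str]]] = [(label, [label]) for label in labels]

    def cluster_distance(x, y) -> float:
        # Mean distance over all cross-cluster pairs of leaves.
        pairs = [(a, b) for a in x[1] for b in y[1]]
        return sum(distance[frozenset((a, b))] for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them into one node.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return clusters[0][0]
```

The result is a nested tuple in which languages that merge early sit 
deepest in the nesting. The distances fed to it could come from any 
pairwise measure, including a compression-based one.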

[ ... ]