Authentic

Everyone knows spam when they see it, but no one wants to sift through the stuff manually (or trust crude Bayesian filters, which tend to lose real messages along the way). A skillfully doctored digital photo, on the other hand, can be hard to spot at all.

Two new reports suggest that strategies built on complex pattern recognition can pick out the authentic stuff for us.

Alert reader Ingrid Jones pointed me to this story on spam and DNA. Researchers have taken software for looking at conserved DNA sequences across species, and adapted it for sifting through spam. Interesting stuff, and more details can be found in the original piece in New Scientist:

Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as "Viagra".

Each pattern represented a common sequence of letters and numbers that had appeared in more than one unsolicited message. The researchers then ran a collection of known non-spam (dubbed "ham") through the same process, and removed the patterns that occurred in both groups...

...its rate of misidentifying genuine email as spam was just 1 in 6000 messages. Losing a single email in a torrent of spam is a greater failing in a filter than letting the occasional spam email through...

...Just as in genetic analysis, Teiresias could be taught that CCC and CCU codons both produce the same amino acid, proline, the anti-spam system can be trained to accept $ and s as identical.
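For the curious, the recipe in that excerpt can be sketched in a few lines of Python. To be clear, this is not Teiresias, which discovers patterns of varying length; it is a much cruder stand-in that uses fixed-length character chunks, and every function name, the chunk length, and the two-message threshold are my own assumptions for illustration. It shows the three steps described: collect patterns that recur across spam, throw out any that also appear in ham, and treat $ and s as the same character.

```python
from collections import Counter

# Assumed equivalence table: "$" is read as "s", as in the excerpt.
# Other look-alike pairs could be added here; none are from the article.
EQUIVALENTS = str.maketrans({"$": "s"})


def normalize(text):
    """Lowercase the text and collapse equivalent characters."""
    return text.lower().translate(EQUIVALENTS)


def chunks(text, n):
    """Return the set of all length-n substrings of the normalized text."""
    t = normalize(text)
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def spam_patterns(spam, ham, n=6, min_messages=2):
    """Chunks seen in at least min_messages spam emails and in no ham."""
    counts = Counter()
    for message in spam:
        # chunks() returns a set, so each chunk counts once per message.
        counts.update(chunks(message, n))
    recurring = {p for p, c in counts.items() if c >= min_messages}

    ham_seen = set()
    for message in ham:
        ham_seen |= chunks(message, n)

    # Drop anything that also shows up in genuine mail.
    return recurring - ham_seen


def spam_score(message, patterns, n=6):
    """Count the distinct spam-only chunks that appear in the message."""
    return len(chunks(message, n) & patterns)


if __name__ == "__main__":
    spam = ["Buy Viagra now, cheap pills",
            "VIAGRA discount: best pills online"]
    ham = ["Lunch on Friday? Bring the pills for grandma"]
    patterns = spam_patterns(spam, ham)
    print(sorted(patterns))
    print(spam_score("Cheap viagra here", patterns))
    print(spam_score("See you at lunch", patterns))
```

A real filter would need a threshold on the score and far more training mail than three made-up messages, but the subtraction step is the heart of it: a chunk only counts against a message if it has never turned up in ham.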

I remember when I spent time on GenBank, searching for human alpha-4 integrin splice variants, building "consensus sequences" and designing probes. Good times.

The story reminded me of a finding this spring on the authenticity of digital images:

Hany Farid, a professor of computer science at Dartmouth's Institute for Security Technology Studies, has developed a method to analyze the mathematical data behind digital images. Each pixel has corresponding code which instructs the software to display colors. Farid's research found that unaltered images have naturally-occurring patterns. Altered portions of images, on the other hand, ''make new statistics in the image, which can be detected,'' Farid said.

Isn't it comforting to know there's an underlying pattern to real things, and that with a little effort we can uncover it?

Medical aside: Authentic comes from the Greek (surprise) authentikos, meaning original or genuine, that is, unadulterated or undoctored. It makes you wonder, though, why "doctor" has such negative connotations as a verb. If I may say so, I did a pretty good job "doctoring" a long thigh laceration yesterday.