Handling Email with a Tcl assist

Clif Flynt

Noumena Corporation,

8888 Black Pine Ln,

Whitmore Lake, MI 48189,

http://www.noucorp.com

clif at noucorp dot com

September 30, 2010

Abstract

Handling email has become an ever greater burden on the average computer user. The user must be able to read mail safely, including examining non-obvious fake mail. There is a need to sort valid mail from invalid and urgent from non-critical, or simply to sort mail by source: mailing lists, friends, relatives, etc.

This problem is worse for the systems administrator who also has to

deal with the quantities of mail clogging the networks, ﬁlling the disks and

ﬁlling the administrator mailbox with failure notices.

There are many ways to mitigate these problems: using a webmail service like Gmail, letting Microsoft solve your problems by using Outlook, or using *ix tools like procmail.

All of these tools are written in compiled languages. The speed is good,

but the ability to reconﬁgure and adapt is low.

Tools written in Tcl are as fast as the average user, and are more easily

adapted to the ever-changing email environment.

1 Introduction

I’ve been dealing with email since the early 1980s, when the address clif!clif

was a unique address, and ihnp4!clif!clif would guarantee delivery.

When a big email day was 5 messages, mail and mailx were great. As

time passed, and email volume approached 20 messages a day, elm and pine

were even better.

The graph in Figure 1 shows the email server activity logged over a period

of a few weeks. The black bar is the number of connections made to the server.

The dark gray bar is the number that were rejected by hard-coded rules (in-

valid sender, invalid recipient, etc). The light gray bar is the number of valid

messages sent to users.

Figure 1: Connections, Rejections and Valid emails

The number of valid messages is much smaller than the number of invalid

attempts, but still tops 1000 emails / day for each user on the system.

Obviously, the modern connected user needs tools that give easy access to potentially hundreds of emails per day and offer protection while reading them.

My preference is to host my mail server and read mail locally on a *ix sys-

tem.

This paper will discuss tools that I use to handle the volumes of mail I

receive, avoid viruses, avoid spam and control the system.

2 Reading Mail

I happen to like the mutt mailreader for general use. It provides a simple

user interface, handles MIME attachments and doesn’t put too much of its own

philosophy on how I read mail.

However, it does become awkward when receiving over 200 emails per day,

spread across several mailing lists.

One feature of mutt and other text-based mail readers is that they will read any file, as long as it is in the standard Unix mail format. This hook allows sorting mail into multiple mailboxes and then reading the email via mutt with a command like mutt -f tcl_core_list.

2.1 Trivial solutions aren’t

My presorting application started life as an application for a client who needed a simple email application to accept email, parse it for simple commands like "send info" and generate canned responses. The client wanted something cheap and fast, so it was necessary to use tools I knew would work.

It seems simple to open the Unix mail ﬁle, read in the data, process it and

clear the ﬁle.

This isn’t quite the case.

The system may be putting new mail into the mail file while your application is emptying it, for instance. Your application needs to understand exactly how the file locking is implemented. The existing email readers already have locking code that works.

Having been warned that the procmail developers went through several iterations before they got the locking right, I side-stepped the issue by using expect to drive the traditional text-mode mail reader, mail, and letting it read and process the mail.

The mail program is a first-generation mail reader designed for use on a printing terminal. It has single-letter commands to find out how many messages exist, read a message and delete a message.

An expect script to read email looks a bit like this:

    spawn mail

    # First message resembles:
    #   "/var/spool/mail/clif": 10 messages 6 unread
    # Look for the : and a set of numbers.
    expect {
        -re {: *([0-9]+) } {
            set messageCount $expect_out(1,string)
        }
    }

    # Step through messages
    for {set i 1} {$i <= $messageCount} {incr i} {
        exp_send "$i\n"
        # Read message
        # Process message
    }

In the real world it gets a bit more complex.

• Messages get large

The mail application is pure text. If a friend sends you a video, expect will need to read several megabytes of text. This will overflow the expect input buffer unless you use the match_max command to reset the buffer size. I’m currently using 4 million as the buffer size.

• Large messages take a long time to be read

Using mail as a front end means that text is ﬂowing through stdin and

stdout. Pushing a few megabytes through standard I/O can take time.

The expect command set timeout will increase the length of the

timeout interval. This is currently set for 500 seconds.

• Trivial parsing won’t work.

There are dozens of email systems out there. I doubt that any of them

conform to the complete RFC standard. For example, an email address

may be reported in several ways including:

From: Your Friend <[email protected]>

From: Facebook <[email protected]>

From: User Name [mailto:[email protected]]

From: Local User

From: [email protected]

The parsing code needs to understand all of them, and perhaps a few

more. (Note that the + format means disregard the part of the email ad-

dress between + and @ when delivering email.)

Tcl’s regexp command returns a 1 if it matches a pattern in the string,

and a 0 if the string does not match a pattern. Assuming that an email

address will not include the illegal characters you can use a regexp to

identify the address portion from one of those strings.

    if {[regexp {([^!^;^:^ ^\[^<]+@[^:^ ^>]*)} $string m1 addr] == 0} {
        # Must be local address
        set addr $string
    }

2.2 Out of sorts

The main functionality needed from a mail sorting program is to sort the mail

into different known bins. The mail reading program starts by parsing the

headers into a (relatively) consistent format and storing the data in a Tcl array

where the index is a header ﬁeld.

% parray Parts

Parts(from) = [email protected]

Parts(subject) = New Project

Parts(to) = [email protected]
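As a rough illustration of the header-parsing step, the sketch below fills an array like Parts from the header block of one message. The procedure name parseHeaders, the sample message and the example.com addresses are assumptions made for this illustration; this is not the code from the sorting application. It lower-cases the field names and joins folded continuation lines.

```tcl
# Hypothetical helper: parse the header block of one message into an array
# indexed by lower-cased header field name.
proc parseHeaders {message arrayName} {
    upvar 1 $arrayName Parts
    array unset Parts
    set current ""
    foreach line [split $message \n] {
        # A blank line ends the headers
        if {$line eq ""} break
        if {[regexp {^[ \t]+(.*)$} $line -> rest] && $current ne ""} {
            # Continuation of the previous header field
            append Parts($current) " " [string trim $rest]
        } elseif {[regexp {^([^:]+):(.*)$} $line -> name value]} {
            set current [string tolower [string trim $name]]
            set Parts($current) [string trim $value]
        }
    }
    return [array size Parts]
}

set sample "From: alice@example.com\nSubject: New\n Project\nTo: bob@example.com\n\nBody text"
parseHeaders $sample Parts
parray Parts
```

Keeping the index lower-case means the rules never have to care whether a sender wrote Reply-To: or reply-to:.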

This data structure makes it easy to write a set of rules that looks like this:

    set Actions {
        { reply-to {
            {{[firstNsave [email protected] %s LTSP]} {}}
            {{[firstNsave googlegroups %s GSoC]} {}}
            {{[firstNsave [email protected] %s tk-nextgen]} {}}
        }}
        { sender {
            {{[firstNsave [email protected] %s ev]} {}}
            {{[firstNsave tcl-webmaste %s TclWeb]} {}}
            {{[firstNsave snort-users %s snort]} {}}
            {{[firstNsave starkit %s starkit]} {}}
        }}
    }

The rule engine steps through the rules in the order they appear until one succeeds. The firstNsave procedure examines the provided string and, if a pattern is matched, saves the message and returns true.
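The rule-stepping loop can be sketched in a few lines. Everything here is an assumption made for illustration: the firstNsave stand-in only does a substring match and records the folder name, the applyRules name is invented, the second (empty) element of each rule is ignored, and the sender address is made up. The real application is considerably longer.

```tcl
# Hypothetical stand-in for firstNsave: substring match, then "save".
proc firstNsave {pattern value folder} {
    if {[string first $pattern $value] < 0} {
        return 0
    }
    # A real version would append the message to the mailbox file;
    # this sketch only records which folder was chosen.
    set ::lastFolder $folder
    return 1
}

# Step through the rules in order until one succeeds.
proc applyRules {actions partsName} {
    upvar 1 $partsName Parts
    foreach group $actions {
        lassign $group field rules
        if {![info exists Parts($field)]} continue
        foreach rule $rules {
            # The first element of a rule is a test such as
            # {[firstNsave googlegroups %s GSoC]}; %s becomes the header
            # value, list-quoted so it stays a single argument.
            set test [format [lindex $rule 0] [list $Parts($field)]]
            if {[subst $test]} {
                return 1
            }
        }
    }
    return 0
}

set Rules {
    { reply-to {
        {{[firstNsave googlegroups %s GSoC]} {}}
    }}
    { sender {
        {{[firstNsave snort-users %s snort]} {}}
        {{[firstNsave starkit %s starkit]} {}}
    }}
}
array set Parts {sender starkit-owner@lists.example.org subject Hello}
set matched [applyRules $Rules Parts]
puts "$matched $lastFolder"
```

Because the rules are plain data, adding a mailing list is a one-line change to the list rather than a change to the engine.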

This started life as a trivial application. At this point, it’s become about 700

lines of Tcl code sorting on about 80 criteria. It runs fast enough to keep up

with the mail stream, taking under a second to identify most messages.

If a message is not matched by any rule, it’s checked to see if it’s spam and

if it passes that test, it’s put into an unknown folder.

2.3 Spam, Spam, Spam, Spam

Hard rules are great for ﬁguring out if an email came from a particular mailing

list or a friend with just a few email addresses.

Hard rules don’t always work for detecting spam.

One rule that does work (frequently) is to check whether or not the To: field of the header matches the expected user. This doesn’t work for many mailing lists where the To: field is modified. (The Google Summer of Code To: field is sometimes To: Google Summer of Code Mentors List.) Mail from mailing lists can usually be matched by checking the Sender: and Reply-To: fields. Once these have been weeded out, the test for To: works.

2.4 Open the pod Bayesian door, Hal

Given the dynamic and changing spam contents, you need a dynamic spam

detection technique, not hard-coded algorithms or even sets of keywords.

The Bayesian statistical tests work by checking word counts in a suspect piece of mail against good mail and known spam mail. Put simply, the test mail gets a point for every word that only appears in good mail, and loses a point for every word that only appears in spam mail.

In the implementation, each word is assigned a positive value between 0

and 1 based on how often it appears in spam or good email, then the average

value for the words in an email is calculated. If the value exceeds 0.7, it’s very

likely to be spam.
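A minimal sketch of the averaging step, assuming the per-word values are already in a Tcl array. The procedure name spamScore, the decision to skip words that are not in the dictionary, and the rounded sample values are assumptions for illustration; the paper does not say how unknown words are handled.

```tcl
# Average the per-word spam values of the words in a message.
# Words missing from the dictionary are skipped (an assumption).
proc spamScore {words scoreName} {
    upvar 1 $scoreName Score
    set total 0.0
    set count 0
    foreach word $words {
        if {[info exists Score($word)]} {
            set total [expr {$total + $Score($word)}]
            incr count
        }
    }
    if {$count == 0} {
        return 0.0
    }
    return [expr {$total / $count}]
}

array set Score {purported 0.1 routine 0.2 funds 0.86 FUNDS 0.9 phonesex 0.9}
set score [spamScore {FUNDS funds phonesex routine lunch} Score]
if {$score > 0.7} {
    puts "probably spam ($score)"
} else {
    puts "probably good ($score)"
}
```

Here four of the five words are known, so the score is the mean of 0.9, 0.86, 0.9 and 0.2, which lands just above the 0.7 threshold.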

A sample of words from my dictionary looks like this:

purported 0.1

brownie 0.1

bytes 0.1

TANSTAAFL 0.1

listening 0.200642136694

routine 0.200838638462

automated 0.601255325782

warranties 0.601255325782

Funds 0.846210738272

funds 0.855801144836

FUNDS .9

Sexually .9

phonesex .9

As you can see, words that aren’t commonly used by spammers get low scores. Words in fairly common use get medium scores, and words that appear in requests for assistance in moving large funds, or in email suggesting illicit encounters, get the highest scores: mail containing them is most likely to be spam.

This proves that I’m a geek, not a playboy.

Before checking an email, it needs to be cleaned a bit. Removing punctu-

ation marks allows the code to use simple Tcl list commands and not treat a

word followed by period as a different word than one not followed by a pe-

riod.

This cleanup can be done with regular expressions, or with the Tcl string replace command. Both techniques worked, but both were unreasonably slow.

To improve the speed, I wrote a small C extension to convert punctuation to spaces. This extension was compiled with critcl. It was a few minutes’ work, and sped up the code greatly.

The addition of the string map command in more recent versions of Tcl

removed the need for compiled code to clean a message before processing it.
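A small sketch of the string map approach follows. The set of punctuation characters and the names punctMap and cleanWords are assumptions for illustration; the paper does not list the characters the real code converts.

```tcl
# Build the map once: each punctuation character becomes a space.
set punctMap {}
foreach ch [split {.,;:!?()"'} ""] {
    lappend punctMap $ch " "
}

proc cleanWords {text} {
    global punctMap
    set cleaned [string map $punctMap $text]
    # Collect the whitespace-separated words as a proper Tcl list
    return [regexp -all -inline {\S+} $cleaned]
}

puts [cleanWords "Move large funds. Reply now, friend!"]
```

Collecting the words with regexp -all -inline rather than treating the cleaned text directly as a list avoids errors when a message contains unbalanced braces or quotes.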

The word definition file contains about 80,000 words. It takes a significant amount of time to read this much data. This becomes an issue when many emails arrive in a clump (for instance, when all the spam-bot owners turn on their PCs in the morning).

The solution to this issue was to make the Bayesian comparison application a server application which the mail sorting program calls.

The simplicity of using Tcl sockets made this an obvious solution.

The guts of the server simply open a socket in server mode, wait for a connection and read lines into a buffer. When an end-of-message marker is received, the server processes the message and sends the score back to the client socket.

    proc readLine {channel} {
        global Server
        set len [gets $channel line]
        # If nothing was read, check for EOF. Close the
        # channel if we've hit the end of the road.
        if {($len <= 0) && [eof $channel]} {
            close $channel
            unset -nocomplain Server(input.$channel)
            return
        }
        # Watch for an unlikely end-of-message string.
        # If it appears, process the message and clear the buffer;
        # otherwise append this line to the message being processed.
        if {$line eq "EndOfMailEndOfMail---liaMfOdnEliamFodnE"} {
            set return [processMessage $Server(input.$channel)]
            puts $channel $return
            flush $channel
            set Server(input.$channel) ""
        } else {
            append Server(input.$channel) " " $line
        }
    }

    proc serverOpen {channel addr port} {
        global Server
        # Start with an empty message buffer for this connection
        set Server(input.$channel) ""
        # Set up fileevent to be called when input is available
        fileevent $channel readable [list readLine $channel]
        fconfigure $channel -buffering line
    }

    # Open the server for business
    set server [socket -server serverOpen $Server(port)]

This reduced the lag time for sorting mail signiﬁcantly.

2.5 Double, Double, Toil and Trouble

Occasionally, my system will end up on a spam-bot’s list. The first time this happened, I received over a thousand pieces of identical email each hour.

This event made it obvious that having an easily modified sorting system is a good thing. It was the work of a few minutes to add a new test to see if the body of incoming email matched the md5 of previously received email and discard duplicates. The tcllib md5 support just worked.

Making this the ﬁrst test in the sorting application reduced the system load

enough that other work could be done while the system was discarding email.

2.5.1 Subtracting with logs

One trick that reduces spam-bot email is greylisting. This technique refuses

email from a new sender with a ”temporary mail failure” message. The greylist

milter waits a given length of time (usually about 1/2 hour) before the email

from a new site will be accepted.

The justification for this technique is that legitimate email processors will accept a temporary failure from an email server and will retry. The spam-bots will either give up after one attempt, or the user may turn the machine off before the retry interval elapses.

When I first installed the greylist milter on my mail server it reduced the spam count to about one tenth of what I had been receiving.

The commonly used greylist milter provides an easy to read ASCII log that

includes the name of systems that have attempted to deliver email, the time of

the attempt and the recipient address (which may be [email protected]).

It’s simple with Tcl to read that ﬁle and look for sites that have tried multiple

times to send email to many different addresses (legit and otherwise). These

sites (particularly ones trying to send to invalid addresses) are probably spam

bots.
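The scan can be sketched as follows. The log layout used here (time, sending host, recipient, separated by whitespace) is a hypothetical stand-in, since the real greylist log format is not shown in this paper; the suspectHosts name, the threshold of three distinct recipients and the sample addresses are likewise assumptions.

```tcl
# Assumed log format (hypothetical): "<time> <sending host> <recipient>"
# Return the hosts that tried at least minRecipients distinct recipients.
proc suspectHosts {logText {minRecipients 3}} {
    set seen [dict create]
    foreach line [split $logText \n] {
        set fields [regexp -all -inline {\S+} $line]
        if {[llength $fields] < 3} continue
        lassign $fields when host rcpt
        # Record each distinct recipient once per host
        dict set seen $host $rcpt 1
    }
    set suspects {}
    dict for {host rcpts} $seen {
        if {[dict size $rcpts] >= $minRecipients} {
            lappend suspects $host
        }
    }
    return [lsort $suspects]
}

set log {
1285850000 203.0.113.9 aaron@example.com
1285850004 203.0.113.9 abby@example.com
1285850009 203.0.113.9 abel@example.com
1285850100 198.51.100.4 clif@example.com
1285850700 198.51.100.4 clif@example.com
}
puts [suspectHosts $log]
```

A host that retries the same recipient, as a legitimate server does under greylisting, counts only once and is not flagged.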

It’s nice to not receive spam email, but it’s even nicer to not even have to

process it. Most Linux and Unix systems have ﬁrewall support that allows the

system to reject any connection from a host.

The Tcl exec command provides a simple way to add a ﬁrewall rule that

will drop any future connection from a given IP address. This reduces the load

on the email server (it doesn’t need to process data) and potentially crashes the

spam-bot.

After a couple of hours the IPTables rules are removed. If a site is actually legit, but being misused (for instance, an ISP that has had a number of customers’ machines become infested), email will be received from that site again (until the site goes rogue again).
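One way to express the block-then-unblock cycle is sketched below. The procedure names, the choice of the INPUT chain and the two-hour default are assumptions for illustration, not the rules used on the author's server. Running blockHost needs root privileges and a running event loop so the after timer can fire.

```tcl
# Build the iptables commands as lists so exec passes each word
# as exactly one argument.
proc blockCommand {ip} {
    return [list iptables -I INPUT -s $ip -j DROP]
}
proc unblockCommand {ip} {
    return [list iptables -D INPUT -s $ip -j DROP]
}

# Block a host now and schedule removal of the rule later.
proc blockHost {ip {hours 2}} {
    if {![regexp {^\d{1,3}(\.\d{1,3}){3}$} $ip]} {
        error "not an IPv4 address: $ip"
    }
    exec {*}[blockCommand $ip]
    after [expr {$hours * 3600 * 1000}] [list exec {*}[unblockCommand $ip]]
}

puts [blockCommand 203.0.113.9]
```

Checking the address before calling exec keeps a malformed log entry from turning into an unexpected iptables argument.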

3 Mr. MUA Check and see - is there some email, some email for me?

These techniques work for reducing the amount of email that gets into the

system, but eventually a user would like to read the email that’s been sorted,

folded, spindled and mutilated.

The mail read/sort application puts the email into individual files in the Unix mail format, based on the sort criteria. All of the files go into the same folder.

Since the only ﬁles in this folder are mailboxes, a data-driven GUI can be

built to provide a simple MUA to access the mail. The GUI learns what to put

on the buttons using the glob command, and thus does not need to be modi-

ﬁed when a new rule (and new mailbox) is added to the read/sort application.

The main portion of a button based MUA is built from code like this.

    proc readMail {name} {
        global Internal
        exec rxvt -e mutt -f $Internal(holderDir)/$name
        fillGUI
    }

    proc fillGUI {} {
        global Internal
        eval destroy [winfo children .files]
        foreach f [lsort [glob -types f -nocomplain $Internal(holderDir)/*]] {
            set fileName [file tail $f]
            set date [clock format [file mtime $f] -format "%b %d %H:%M"]
            set displ [format "%-12s %10s" $fileName $date]
            set w [button .files.f_$fileName -text $displ -pady 2 \
                    -command [list readMail $fileName]]
            pack $w -side top -anchor w -expand y -fill x
        }
    }

    pack [frame .files]

3.1 A MIME is a terrible thing to waste

The mutt mail reader, like pine, elm and other modern text-based mail readers, works for many things and supports helper applications to read various MIME formats.

A glitch is reading HTML mail. The default choice for many folks is to in-

voke firefox, konqueror, opera or some other web browser to read HTML

mail.

The problem with this technique is that the spammers who send mail with HTML attachments get tricky, and just checking to see if this mail is really from Paypal can be enough to mark your email address as a valid address and ensure you years of overflowing mailboxes.

This problem led me to write my own HTML viewer. I did this initially

using Steve Uhler’s HTML widget, and later reworked it to use D. Richard

Hipp’s htmlwidget.

The advantage of this widget from a security/paranoia aspect is that it’s absolutely stupid. It does not have any HTTP support. It can’t load an image, follow a link, run Javascript or do anything that might alert the outside world that you’ve read the email.

What it will do is provide a highlighted version of the HTML code to make

it easy to ﬁnd the HREF tags and see if the Click Here actually goes to your

bank, or to some site in Nigeria.

Figure 2: A phish message as viewed in an HTML reader

Figure 2 shows the htmlview application displaying a phish attack for a

casual read. Figure 3 shows the same message in Examine mode, highlighting

the HREF and showing that the actual URL is abbeynational.co.uk.0e3f0bc994.com,

not abbeynational.co.uk.

This application is available at http://www.cwflynt.com/htmlview/.

4 Getting to the root of the problem

All of these tools were fairly easy to write and made handling large amounts

of email and larger amounts of spam possible.

To truly control the levels of spam, you need to catch it before it gets into

your Mail User Agent. It’s best if you can even drop something before it’s fully

sent to the mail server.

Sendmail and postfix support user-written Mail Filters (milters). The libmilter library has hooks so that external filtering applications can connect to the Mail Transfer Agent (MTA), receive information about the email as it arrives and send Go/No-Go messages back to the MTA as the connection evolves.

Figure 3: The same image with the actual URL highlighted

Architecturally, a sendmail milter is familiar to a Tcl programmer: a milter

registers a callback to be invoked when an event occurs. In this case events

include receiving a connection from a remote mail server, receiving envelope

information, reading a ﬁeld in an email header, reading the body of an email

and closing a session.

Because a sendmail application may be receiving many messages at the

same time it uses threads to control the conversations.

At this point, it looks like Tcl and Milters are made for each other.

4.1 It ain’t necessarily so

Several years ago, Michael Kirkham <[email protected]> wrote a Tcl Mil-

ter extension.

Unfortunately, libmilter has been a moving target, and his need for this extension evaporated, so it was not maintained. His extension is a good start for someone doing more work with libmilter, and is available at http://www.muonics.com/FreeStuff/TclMilter/.

I downloaded this extension, and discovered that there were some simple

issues with thread locking that could be ﬁxed by changing only a few lines of

code.

Then the fun started.

4.1.1 Hanging by a Thread

One feature of the libmilter architecture is that when an application calls smfi_main to enter the libmilter loop, the smfi_main function will not return until the application exits. After this call, all activity is done via the callbacks to new threads.

Each time a new email message arrives, a temporary thread is created and the filter callback is evaluated in that thread. The overhead in creating a new Tcl interpreter is fairly low, but initializing an interpreter (loading extensions, command files, etc.) can be slow.

It makes more sense for a Tcl oriented mail ﬁlter to maintain a single persis-

tent fully initialized interpreter and only use the per-message threads to redi-

rect control into the primary interpreter.

Each message that libmilter is processing gets a unique identification value that’s passed to the C layer, and can be passed to the Tcl layer.

For a Tcl programmer, this makes the multiple threads rather redundant. We’re used to dealing with this sort of construct with an upvar command:

proc doStuff {identifier value1 value2} {

upvar #0 State_$identifier State

global UniversalState

# Do what needs to be done

}

The thread::send command will let one thread send a script to another thread to be evaluated in the target thread. The value returned by evaluating that script in the target thread will be the return value of the thread::send command.

This feature could allow a procedure running in a per-message thread to send information to be retained to the primary, persistent thread. It also allows a per-message thread to pass control to a filtering procedure in the primary thread, and only return the final result to the libmilter code.
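The hand-off can be tried without libmilter at all. In this sketch the main thread plays the role of a per-message thread, and the filterMessage and perMessage procedures, the context numbers and the example addresses are all hypothetical; only thread::create, thread::wait, thread::send and thread::release come from the Tcl Thread extension, which must be installed.

```tcl
package require Thread

# Persistent "primary" thread: initialised once, then waits for scripts.
set primary [thread::create {
    # Hypothetical filter; a real one would load extensions and rules here.
    proc filterMessage {context sender} {
        if {[string match *@spam.example $sender]} {
            return SMFIS_REJECT
        }
        return SMFIS_CONTINUE
    }
    thread::wait
}]

# What a per-message thread would do: hand the work to the primary
# thread and pass its result back to the caller.
proc perMessage {primary context sender} {
    return [thread::send $primary [list filterMessage $context $sender]]
}

set r1 [perMessage $primary 1 friend@example.com]
set r2 [perMessage $primary 2 bulk@spam.example]
puts "$r1 $r2"
thread::release $primary
```

Because thread::send is synchronous here, the caller simply blocks until the primary thread has produced the verdict, so no extra signalling is needed.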

From a Tcl script writer’s perspective, this makes more sense than writing

a ﬁlter in a thread that has to initialize an interpreter, load extensions, load

command ﬁles, etc for each email that arrives.

The architecture I ended up with looks like Figure 4.

4.1.2 If I had a hammer

This pattern works well. It has been easy to write simple milters to do white,

gray and black listing.

One common spam pattern is mail that comes from outside your domain with an Envelope From field from inside the domain. This could be valid mail, if you are the size of IBM with mail servers spread around the planet. Since I only have one mail server, outside mail never comes from any of the domains I host.

Figure 4: Multi-Thread Flow for TclMilter

The code to reject a forged From: address is fairly simple:

    proc rejectForgedEnvFrom {context addressList} {
        upvar #0 mailState$context mailState
        global localNames
        global localIP

        # Return fail if we got here without the source being
        # identified.
        if {![info exists mailState(sourceIP)]} {
            return "SMFIS_TEMPFAIL"
        }

        # Assume it's good mail
        set rtn SMFIS_CONTINUE

        # Look at the addresses this email claims to be from.
        # This list *SHOULD* only contain one element.
        foreach address $mailState(envfrom) {
            # Split out the user and site portions of the From field
            lassign [getNameSite $address] nm site

            # If it didn't come from inside my network, see if it
            # claims to come from inside my network.
            if {[lsearch $localIP $mailState(sourceIP)] < 0} {
                foreach id $localNames {
                    if {[string first $id $site] >= 0} {
                        # It claims to be from my site, but it isn't.
                        # Fail it and log it.
                        set rtn SMFIS_TEMPFAIL
                        milter::log "$context: Set TEMPFAIL because \
                            from=$mailState(sourceIP) and ..$address.. \
                            is not valid sender"
                    }
                }
            }
        }

        # If it fails, set the SMTP reply
        if {$rtn eq "SMFIS_TEMPFAIL"} {
            setreply 451 4.9.8 "Invalid Sender"
        }
        return $rtn
    }

Another annoying spammer trick is sending email to every possible user on

the system - not valid users, but lists like [email protected], [email protected],

[email protected], etc.

These are never delivered, but the bounce messages clog the mail admin

mailbox (making it impossible to actually use the mailbox) and chew up band-

width.

A procedure similar to the rejectForgedEnvFrom proc removed all of

that garbage from my system.

Greylisting is not so simple a task. However, the procedure uses under 150

lines of code.

4.1.3 Making Memories

The architecture seems simple and obvious and in small tests the ﬁlters worked

just ﬁne.

When I put the code into production, it wasn’t so much fun.

Within an hour, the milter was using up almost 500 megabytes of RAM, and

it was still growing.

Libmilter doesn’t simply use threads to control email. It creates N threads

at startup, and reuses them as necessary, rather than add the overhead of cre-

ating new threads. To further complicate matters, since sendmail may need

to handle N+M threads at a time (to service N+M connections), it has facilities

to handle multiple conversations per thread. Each connection is given a thread

and a context within that thread to identify the message being processed.

The context construct that libmilter uses provides a clean just-like-new

thread for each conversation, whether the thread has been used before (or is in

use by another message currently) or not.

For something as simple as a pure C language milter, this isn’t a problem.

For an interpreter like Tcl, it’s a large issue. Tcl expects a thread to be cre-

ated, used and to die, or to hang around and have some persistent data avail-

able.

The pthread library includes two functions to set and retrieve persistent data.

• pthread_setspecific

Associates a key with a value within a given thread. Each thread can hold its own value for the same key.

• pthread_getspecific

Retrieves the value associated with a key for the calling thread.

The Tcl thread model requires a separate interpreter for each thread, and

allows multiple interpreters per thread.

The pthread_setspecific and pthread_getspecific functions are used internally by the Tcl thread extension (version 2.6.5) to see if an interpreter has initialized the ThreadSpecificData structure yet.

As part of libmilter providing a clean context for each message, it destroys data associated with the pthread_* calls, causing the Tcl interpreter to believe that it needs to initialize the ThreadSpecificData structure multiple times for a thread. This involves allocating new memory.

When a conversation is complete, the associated interpreter is destroyed,

and the thread extension cleans up one ThreadSpeciﬁcData structure, though

it may have created several.

This seems to be the mechanism behind the memory leaking.

4.1.4 The fat lady ain’t singin’ yet

Making libmilter and Tcl work together is a work in progress. Like Thomas

Edison, I’ve started by trying dozens of ideas that didn’t work. Unlike Thomas

Edison, I haven’t found the one idea that does work.

The most success to date has been to add a command to the thread exten-

sion to explicitly clean the ThreadSpeciﬁcData structure, and invoke that from

the Tcl code that is processing an email conversation.

This has reduced the memory leakage from hundreds of megabytes / hour

to merely thousands of bytes per hour. By restarting the TclMilter every 24

hours, it doesn’t swamp the email system.

The new code in the thread extension looks like this:

    static int
    ThreadInit(interp)
        Tcl_Interp *interp;         /* The current Tcl interpreter */
    {
        ...
        TCL_CMD(interp, THNS"deinit", ThreadDeInitObjCmd);
        ...
    }

    static void
    DeInit(interp)
        Tcl_Interp *interp;         /* Current interpreter. */
    {
        ThreadSpecificData *tsdPtr = TCL_TSD_INIT(&dataKey);
        ListRemove(tsdPtr);
    }

    static int
    ThreadDeInitObjCmd(dummy, interp, objc, objv)
        ClientData  dummy;          /* Not used. */
        Tcl_Interp *interp;         /* Current interpreter. */
        int         objc;           /* Number of arguments. */
        Tcl_Obj    *CONST objv[];   /* Argument objects. */
    {
        DeInit(interp);
        return TCL_OK;
    }

5 The Future isn’t quite Now

I’ve used several mail ﬁlters in the past. Despite the problems with memory

leakage, I’m having more success at controlling email with my homegrown Tcl

ﬁlters than with mailScanner, greyList and others.

It’s fast and easy to add a short Tcl script to perform some new test when

I need one. The email environment changes fast enough that yesterday’s solu-

tion won’t protect you from today’s problem.

However, writing email filters is not my paying job. Work on the libmilter project is a spare-time effort, and once the code reached a usable (if not releasable) state I had to move on to other projects.

It is my intention to return to this project. The ease of developing custom

milters in Tcl is too useful a technique to ignore.