Comparison of two schemes for email representation in spam filtering

Contact: Slavisa Sarafijanovic

Site: Comparison of two schemes for email representation in spam filtering

We are developing a novel antispam system based on the workings of the human immune system [1]. Among other techniques, our system uses collaborative spam filtering: to filter spam better, it exploits the similarity among emails that belong to the same spam bulk. In practice, spammers slightly change (obfuscate) the original spam message when creating the many copies that they send in bulk to victim recipients. Nilsimsa [2] is the best-known [3] and most widely used [4,5] method for creating binary similarity-hash signatures from emails; it transforms similar emails into similar binary strings (strings separated by a small Hamming distance). These signatures are then exchanged and used as evidence of whether an email belongs to a bulk. Recently, we proposed a scheme [6] for representing email content that should have better properties for collaborative spam filtering. In particular, it is expected to be highly resistant to heavy obfuscation of spam emails.
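To make the signature comparison concrete, here is a minimal sketch of how two binary signatures could be compared by Hamming distance. It does not implement Nilsimsa or our scheme [6]; the function names, the 4-byte signatures, and the threshold `max_distance` are assumptions chosen only for illustration.

```python
def hamming_distance(sig_a: bytes, sig_b: bytes) -> int:
    """Count the bit positions at which two equal-length signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(sig_a, sig_b))


def same_bulk(sig_a: bytes, sig_b: bytes, max_distance: int) -> bool:
    """Treat two emails as likely copies from one bulk if their signatures are close."""
    return hamming_distance(sig_a, sig_b) <= max_distance


# Hypothetical 4-byte signatures, for illustration only.
original = bytes.fromhex("a3f01c7e")
obfuscated = bytes.fromhex("a3f01d7e")  # differs from `original` in a single bit
unrelated = bytes.fromhex("5c0fe381")   # differs from `original` in every bit

print(hamming_distance(original, obfuscated))
print(same_bulk(original, obfuscated, max_distance=4))
print(same_bulk(original, unrelated, max_distance=4))
```

In a real filter the threshold would have to be chosen experimentally, since it trades missed bulk copies against false matches between unrelated emails.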

Goal of the project: Design and perform appropriate experiments to compare our scheme for representing email content to the Nilsimsa scheme.

Required skills: C programming (needed because existing code will be reused); knowledge of Python is a plus.


[1] MICS networked-software-systems project: collaborative spam filtering based on artificial immune systems approach. Web site:

[2] cmeclax/nilsimsa.html

[3] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, “An Open Digest-based Technique for Spam Detection,” in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.



[6] “METHOD TO FILTER ELECTRONIC MESSAGES IN A MESSAGE PROCESSING SYSTEM”, Slavisa Sarafijanovic and Jean-Yves Le Boudec, US patent No 11/515,063, filed Sept 5, 2006.

Benefits: Learn about email and spam, a popular topic. Develop skills in experiment design and performance analysis.

Domain: Formal analysis, methods, frameworks; Other; Security