Better Naive Bayes classification for high-precision spam detection

Yang Song, Aleksander Kołcz, C. Lee Giles

Research output: Contribution to journal › Article

30 Citations (Scopus)

Abstract

Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that potential solutions need to ensure a very low false-positive rate (FPR), which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain. Drawing from recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as ones with disproportionately poor performance and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade for which decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rates at high precision when compared with some of the best variants of naive Bayes proposed to date.
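
As a rough illustration of what low-FPR classification with naive Bayes involves, the sketch below trains a plain multinomial naive Bayes model, scores a message as a sum of per-term log-odds weights, and picks a decision threshold from held-out legitimate mail so that the false-positive rate stays under a target. This is the textbook baseline only, not the aggregation function, feature augmentation, or classifier cascade proposed in the article; the function names, the smoothing constant `alpha`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes; labels are 1 (spam) or 0 (legitimate).

    Returns per-term log-odds weights and the prior log-odds.
    alpha is a Laplace smoothing constant (an assumed default).
    """
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) + alpha * len(vocab) for y in (0, 1)}
    weights = {
        t: math.log((counts[1][t] + alpha) / totals[1])
        - math.log((counts[0][t] + alpha) / totals[0])
        for t in vocab
    }
    prior = math.log(n_docs[1] / n_docs[0])
    return weights, prior


def score(tokens, weights, prior):
    """Standard aggregation: prior plus the sum of term weights.

    Terms never seen in training contribute nothing.
    """
    return prior + sum(weights.get(t, 0.0) for t in tokens)


def threshold_for_fpr(ham_scores, max_fpr):
    """Pick a threshold so that at most max_fpr of legitimate
    messages score strictly above it (and would be flagged as spam)."""
    ranked = sorted(ham_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # false positives we tolerate
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


# Toy data, purely illustrative.
docs = [
    ["free", "money", "now"],
    ["meeting", "at", "noon"],
    ["free", "offer"],
    ["lunch", "at", "noon"],
]
labels = [1, 0, 1, 0]

weights, prior = train_nb(docs, labels)
ham_scores = [score(d, weights, prior) for d, y in zip(docs, labels) if y == 0]
threshold = threshold_for_fpr(ham_scores, max_fpr=0.0)
is_spam = score(["free", "money"], weights, prior) > threshold
```

A message is flagged only when its score is strictly above the threshold, so the threshold trades detection rate against false positives: lowering `max_fpr` pushes the threshold up and lets more spam through.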

Original language: English (US)
Pages (from-to): 1003-1024
Number of pages: 22
Journal: Software - Practice and Experience
Volume: 39
Issue number: 11
DOI: 10.1002/spe.925
State: Published - Aug 10 2009

Fingerprint

Collaborative filtering
Electronic mail
Learning systems
Classifiers
Agglomeration
Internet

All Science Journal Classification (ASJC) codes

  • Software

Cite this

Song, Yang; Kołcz, Aleksander; Giles, C. Lee. / Better Naive Bayes classification for high-precision spam detection. In: Software - Practice and Experience. 2009; Vol. 39, No. 11. pp. 1003-1024.
@article{2f4122bf667e4740af56fed5dd979266,
title = "Better Naive Bayes classification for high-precision spam detection",
abstract = "Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that the potential solutions need to ensure a very low false-positive rate, which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain. Drawing from the recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as ones with disproportionally poor performance and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade for which decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rate of precision when compared with some of the best variants of naive Bayes proposed to date.",
author = "Yang Song and Aleksander Kołcz and Giles, {C. Lee}",
year = "2009",
month = "8",
day = "10",
doi = "10.1002/spe.925",
language = "English (US)",
volume = "39",
pages = "1003--1024",
journal = "Software - Practice and Experience",
issn = "0038-0644",
publisher = "John Wiley and Sons Ltd",
number = "11",

}

TY - JOUR

T1 - Better Naive Bayes classification for high-precision spam detection

AU - Song, Yang

AU - Kołcz, Aleksander

AU - Giles, C. Lee

PY - 2009/8/10

Y1 - 2009/8/10

AB - Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that the potential solutions need to ensure a very low false-positive rate, which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain. Drawing from the recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as ones with disproportionally poor performance and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade for which decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rate of precision when compared with some of the best variants of naive Bayes proposed to date.

UR - http://www.scopus.com/inward/record.url?scp=67650834914&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650834914&partnerID=8YFLogxK

U2 - 10.1002/spe.925

DO - 10.1002/spe.925

M3 - Article

AN - SCOPUS:67650834914

VL - 39

SP - 1003

EP - 1024

JO - Software - Practice and Experience

JF - Software - Practice and Experience

SN - 0038-0644

IS - 11

ER -