Can I create a rule to detect disguised or misspelled words?

Guest

I routinely receive junk mail that contains disguised text in the body,
such as "C e I g A g L d I e S q" for Cialis or "V t I b A x G u R f A f"
for Viagra. The extra letters in the words are random, so it is impossible
to create a rule for every possible combination. Is there a way to write a
rule, for instance using wildcards, to filter this type of spam?
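
For illustration only, the pattern described above can be expressed outside of Outlook's rules as a short Python sketch. It assumes the disguise always alternates one real letter with one filler letter, separated by whitespace, exactly as in the two quoted examples; the function names and the watchlist are hypothetical, and a spammer who changes the spacing or the filler pattern would slip past it.

```python
import re

# A run of at least six single letters separated by whitespace,
# e.g. "C e I g A g L d I e S q". The minimum length is an assumption
# chosen to avoid touching ordinary text.
SPACED_RUN = re.compile(r"\b(?:[A-Za-z]\s+){5,}[A-Za-z]\b")


def undisguise(text: str) -> str:
    """Collapse spaced-out runs, keeping every other letter (1st, 3rd, ...)."""
    def strip_filler(match: re.Match) -> str:
        letters = match.group(0).split()
        return "".join(letters[0::2])  # drop the assumed filler letters
    return SPACED_RUN.sub(strip_filler, text)


def looks_disguised(text: str, watchlist=("cialis", "viagra")) -> bool:
    """Return True if a watchlist word appears once the filler is removed."""
    cleaned = undisguise(text).lower()
    return any(word in cleaned for word in watchlist)


if __name__ == "__main__":
    print(undisguise("C e I g A g L d I e S q"))
    print(looks_disguised("Buy V t I b A x G u R f A f today"))
```

This only covers the one disguise shown in the examples, which is the weakness of any fixed rule.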
 
Milly Staples [MVP - Outlook]

No, don't even try. Spammers are always one step ahead of rules.

Instead, get a good third-party spam blocker - I usually recommend SpamBayes
from SourceForge.net - it is trainable and very good. I have been using it
for almost 2 years and, in combination with Outlook 2003's built-in spam
filtering, I almost never see spam in my in-box. Maybe twice weekly, easily
dealt with.

And the price is right - free from open source.


--
Milly Staples [MVP - Outlook]

Post all replies to the group to keep the discussion intact. All
unsolicited mail sent to my personal account will be deleted without
reading.

After furious head scratching, ival50 asked:

| I routinely receive numerous junk mail that contain disguised text in
| the body such as "C e I g A g L d I e S q" for Cialis or "V t I b A x
| G u R f A f" for Viagra. The extra letters in the words are random
| so it is impossible to create a rule for every possible combination.
| Is there a way to write a rule, for instance using wildcards, to
| filter this type of spam?
 
Vanguard

"Milly Staples [MVP - Outlook]" wrote:
No, don't even try. Spammers are always one step ahead of rules.

Instead, get a good third party Spam blocker - I usually recommend
SpamBayes from Sourceforge.net - it is trainable and very good. I have
been using it for almost 2 years and, in combination with Outlook 2003
built-in spam filtering, I almost never see spam in my in-box. Maybe
twice weekly, easily dealt with.

And the price is right - free from open source.


Unless the Bayes filter is learning from some other anti-spam filter what
is spam and what is not, how does a one-time instance of a misspelled word
provide enough weighting for that word in its database? Bayesian filtering
doesn't mark a message as spam simply because you've never encountered a
particular word before. Bayesian filtering works by weighting words (it
would be nice if phrases were included, too), but you can't weight a word
that you've only encountered once except to give it a default weighting
(which means it is neutral).

Say I send you an e-mail containing "syzygy" (yep, it's a real word and not
misspelled in this case). How does Bayes filtering know it is spam when
that word has never been in your word-weighting database? In the next
e-mail, the word is misspelled as "sysygy", but it is still the very first
occurrence of that spelling in any of your mail, so it receives the default
(neutral) weighting. That's why it takes time for Bayes filtering to
*learn*: it actually has to encounter the words that it will learn. Bayes
doesn't work against a one-time occurrence of a word. More likely, OTHER
words within the spam have been encountered before, and their weighting
*changes* to provide an overall bias on the message that determines
whether it is spam or not.
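
The point about default weighting can be sketched in a few lines of Python. This is a simplified, Graham-style word-probability filter written only for illustration, not SpamBayes's actual implementation; the class name, the neutral value of 0.5, and the 0.01/0.99 clamps are all assumptions.

```python
from collections import Counter

NEUTRAL = 0.5  # assumed default score for a word never seen in training


class TinyBayes:
    """Toy word-weighting filter; every name here is hypothetical."""

    def __init__(self):
        self.spam_counts = Counter()  # messages per word, spam
        self.ham_counts = Counter()   # messages per word, non-spam
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        words = set(text.lower().split())
        if is_spam:
            self.spam_counts.update(words)
            self.n_spam += 1
        else:
            self.ham_counts.update(words)
            self.n_ham += 1

    def word_score(self, word):
        s, h = self.spam_counts[word], self.ham_counts[word]
        if s + h == 0:
            return NEUTRAL  # unseen word: no evidence either way
        p_spam = s / max(self.n_spam, 1)
        p_ham = h / max(self.n_ham, 1)
        score = p_spam / (p_spam + p_ham)
        return min(max(score, 0.01), 0.99)  # clamp to avoid 0 and 1

    def message_score(self, text):
        spam_product, ham_product = 1.0, 1.0
        for word in set(text.lower().split()):
            p = self.word_score(word)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)


if __name__ == "__main__":
    f = TinyBayes()
    f.train("cheap pills online", is_spam=True)
    f.train("meeting notes attached", is_spam=False)
    print(f.word_score("sysygy"))                 # unseen, so neutral
    print(f.message_score("cheap pills sysygy"))  # driven by the known words
```

In this sketch the misspelled word contributes nothing to the result; the message is scored entirely by the words the filter has already seen.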

While "it might work for you" may make Bayes filtering alone look
functional, it has to learn and keep learning, and misspellings defeat
the learning (i.e., weighting) for those words. That's why spammers use
the trick. You're hoping something *else* within the spam will achieve
high enough weighting (due to recurrence of those words) to mark the mail
as spam (which in turn changes the weighting of the other keywords used
for testing, and only maybe will the misspelled word be one of them). Of
course, all that Bayesian weighting is useless against spam that hides
its content inside attached .gif or .jpg files.

Over time, what I've noticed is that Bayesian filtering can be quite
useful but it can also generate too many false positives or false
negatives. After all, it is guessing! You'll probably want to
incorporate some other methods of spam detection than just Bayes.
 
John Blessing

ival50 said:
I routinely receive numerous junk mail that contain disguised text in the
body such as "C e I g A g L d I e S q" for Cialis or "V t I b A x G u R f A
f" for Viagra. The extra letters in the words are random so it is impossible
to create a rule for every possible combination. Is there a way to write a
rule, for instance using wildcards, to filter this type of spam?

Trying to block spam is a pointless waste of time. Try here for some advice:

http://www.lbetoolbox.com/how-to-stop-spam.htm


--
John Blessing

http://www.LbeHelpdesk.com - Help Desk software priced to suit all
businesses
http://www.room-booking-software.com - Schedule rooms & equipment bookings
for your meeting/class over the web.
http://www.lbetoolbox.com - Remove Duplicates from MS Outlook, find/replace,
send newsletters
 
