[gnso-ff-pdp-may08] Spam Filtering Tutorial - How to get high levels of accuracy
- To: "gnso-ff-pdp-May08@xxxxxxxxx" <gnso-ff-pdp-May08@xxxxxxxxx>
- Subject: [gnso-ff-pdp-may08] Spam Filtering Tutorial - How to get high levels of accuracy
- From: Marc Perkel <marc@xxxxxxxxxx>
- Date: Fri, 08 Aug 2008 07:31:52 -0700
This is an educational message to give non-technical people some insight
into how spam filtering becomes highly accurate and to address using
similar technology for automated domain take down in a way that doesn't
scare registrars and free speech advocates.
Spam usually isn't detected by triggering a single rule, unless the rule
is something that only spammers do, and there are some of those. So -
let's say for example that we look at a rule that is wrong only 1 time
in 1000. Is that good enough? No - it isn't. It means that if you process
a billion messages you are going to make a million mistakes.
However - suppose you have a second rule that also has a 1 in 1000 error
rate. Then if a message hits both rules you are up to 1 in one million.
With a third rule it's 1 in a billion. (yes - this is a simplification,
since it assumes the rules make their mistakes independently)
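The arithmetic above can be checked with a small sketch. It assumes the rules make their mistakes independently, which is the simplification already noted; real rules are usually correlated, so the true combined rate would be worse. The function names are made up for illustration.

```python
from fractions import Fraction

def combined_error_rate(rates):
    """Multiply per-rule error rates (assumes the rules err independently)."""
    result = Fraction(1)
    for rate in rates:
        result *= rate
    return result

def expected_mistakes(rate, messages):
    """Expected number of wrongly flagged messages at a given error rate."""
    return rate * messages

one_in_thousand = Fraction(1, 1000)
messages = 10**9  # a billion messages

for rule_count in (1, 2, 3):
    rate = combined_error_rate([one_in_thousand] * rule_count)
    print(rule_count, "rule(s):", rate, "->", expected_mistakes(rate, messages), "mistakes")
```

With one rule the billion messages produce a million mistakes, with two rules a thousand, and with three rules just one.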
Then there are white rules: indicators that the message is not spam.
There are many instances of "spammers never do this" or "spammers
can't do this" that you can look at to take good email out of spam
testing and pass it. So once you apply your "this is probably spam"
rules you then subtract out the "this isn't spam" rules, and what you
have left is highly accurate.
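Here is a minimal sketch of that "black rules minus white rules" idea. Everything in it is hypothetical: the rule names, the message fields and the required number of hits are invented for illustration and are not taken from any real filter.

```python
def classify(message, black_rules, white_rules, required_hits=2):
    """Return "pass" or "spam" for a message given two lists of rule functions."""
    # White rules come first: any hit takes the message out of spam testing.
    if any(rule(message) for rule in white_rules):
        return "pass"
    # Otherwise count how many "this is probably spam" rules were triggered.
    hits = sum(1 for rule in black_rules if rule(message))
    return "spam" if hits >= required_hits else "pass"

# Hypothetical rules operating on a plain dict of message facts.
def sender_in_address_book(msg):
    return msg.get("sender_in_address_book", False)

def forged_headers(msg):
    return msg.get("forged_headers", False)

def known_spam_url(msg):
    return msg.get("known_spam_url", False)

black = [forged_headers, known_spam_url]
white = [sender_in_address_book]

suspicious = {"forged_headers": True, "known_spam_url": True}
from_friend = {"forged_headers": True, "known_spam_url": True,
               "sender_in_address_book": True}

print(classify(suspicious, black, white))   # both black rules hit, no white rule
print(classify(from_friend, black, white))  # white rule overrides the black hits
```

Requiring more than one black hit is what gives the multiplying effect on accuracy, and checking the white rules first means a good message never reaches the spam test at all.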
It's like gambling in Vegas. In the long run the casino always wins if
you play long enough. There is no jackpot so large that the casino won't
win it back if you don't walk away and keep gambling. Spam filtering is
like that. The more information you have and the more rules you apply,
the more accurate it becomes.
So - here's how it applies to fast flux. FF is a strong indicator of
phishing. Probably fewer than 1 in 100,000 fluxing domains is legitimate.
But it is still very important to protect the free speech of the one in
100,000. I, for example, would not want to be suspended just because I
reduced my TTLs because I was going to move servers to a new data
center. And we wouldn't want to block people in Tibet from circumventing
the Chinese firewall. But - FF is still a strong indicator and would be
a valuable piece of information as part of a bigger picture.
If the FF is spam bot driven then that too is a strong indicator. And
the combination of spam bot driven and FF is a very strong indicator.
But is it strong enough? If the from address is on a list of banks that
are often spoofed and the sending host's forward-confirmed reverse DNS
(FCrDNS) points to an IP address in China, that makes a very strong case.
Then there are white rules: rules that prevent take down of good
domains. These rules help protect against false positives. If the
domain in question is 10 years old then it would be exempt from
automatic take down. We can come up with a lot of "white rules" for
domains that would never be available for automated take down no matter
what the "black rules" were. So we would have a narrow set of domains
that fall outside the white rules that are available for take down if
there are enough black indicators to do so.
So - at this point it looks very safe - but would we catch anyone? I
think so, because criminals are limited in what they can do. And most
phishing activity might still fall within the black rules minus the
white rules range. So this would still be very effective.
Even for domains that fall just outside that range (suspicious but not
quite there) a message can be sent to a real person alerting them to
look into possible abuse. Those domains can then be evaluated manually
to determine whether they need to be suspended.
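To make the whole decision process concrete, here is a rough sketch of how white rules, black indicators and the manual review band could fit together for domains. The field names, the thresholds and the choice of exactly three black indicators are assumptions made up for this example, not a proposed policy. The only white rule shown is the 10 year age example from above.

```python
from dataclasses import dataclass

@dataclass
class DomainReport:
    """Hypothetical set of facts gathered about one domain."""
    age_years: float
    fast_flux: bool
    spam_bot_driven: bool
    spoofs_bank_sender: bool

def decide(report, takedown_hits=3, review_hits=2, protected_age_years=10):
    """Return "exempt", "takedown", "manual review" or "no action"."""
    # White rule: old domains are never available for automated take down.
    if report.age_years >= protected_age_years:
        return "exempt"
    # Count the black indicators (True counts as 1 when summed).
    hits = sum([report.fast_flux, report.spam_bot_driven,
                report.spoofs_bank_sender])
    if hits >= takedown_hits:
        return "takedown"
    # Suspicious but not quite there: alert a real person instead.
    if hits >= review_hits:
        return "manual review"
    return "no action"

print(decide(DomainReport(0.1, True, True, True)))     # all indicators, new domain
print(decide(DomainReport(0.1, True, True, False)))    # suspicious, needs a human
print(decide(DomainReport(12.0, True, True, True)))    # protected by the white rule
print(decide(DomainReport(3.0, True, False, False)))   # e.g. someone lowering TTLs
```

The last case is the server move example: fast flux alone is only one indicator, so nothing happens automatically.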
And - you have to accept that you are never going to catch all of them.
So if you can cut phishing by 10%, that's progress. If you then come up
with another rule that cuts 5% more - that's more progress.
Anyhow - hope this educates those of you who don't understand some of
the technology and thinking about how this is done.