ICANN Email List Archives

[gnso-ff-pdp-may08]



[gnso-ff-pdp-may08] Spam Filtering Tutorial - How to get high levels of accuracy

  • To: "gnso-ff-pdp-May08@xxxxxxxxx" <gnso-ff-pdp-May08@xxxxxxxxx>
  • Subject: [gnso-ff-pdp-may08] Spam Filtering Tutorial - How to get high levels of accuracy
  • From: Marc Perkel <marc@xxxxxxxxxx>
  • Date: Fri, 08 Aug 2008 07:31:52 -0700


This is an educational message to give non-technical people some insight into how spam filtering becomes highly accurate and to address using similar technology for automated domain take down in a way that doesn't scare registrars and free speech advocates.

Spam usually isn't detected by triggering a single rule, unless the rule matches something that only spammers do (and there are a few such rules). So - let's say for example that we look at a rule with an error rate of 1 in 1000. Is that good enough? No - it isn't. It means that if you process a billion messages you are going to make a million mistakes.

However - suppose you have a second rule that also has a 1 in 1000 error rate. Then if a message hits both rules, you are up to 1 in a million. With a third rule it's 1 in a billion. (Yes - this is a simplification: it assumes the rules make their mistakes independently of each other.)
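To make the arithmetic above concrete, here is a small sketch. It assumes, as the simplification does, that each rule's mistakes are independent of the others, so the error rates multiply. The function name combined_error_rate and the billion-message volume are just illustrative choices, not part of any real filter.

```python
from fractions import Fraction


def combined_error_rate(rates):
    """Multiply per-rule error rates.

    Assumption: the rules make mistakes independently of each other.
    Real rules are usually correlated, so this is a best case.
    """
    result = Fraction(1)
    for rate in rates:
        result *= rate
    return result


per_rule = Fraction(1, 1000)   # each rule is wrong 1 time in 1000
messages = 1_000_000_000       # illustrative volume: a billion messages

for n in (1, 2, 3):
    rate = combined_error_rate([per_rule] * n)
    mistakes = int(messages * rate)
    print(f"{n} rule(s): 1 in {rate.denominator:,} "
          f"-> about {mistakes:,} mistakes per billion messages")
```

With one rule that works out to a million mistakes per billion messages, with two rules a thousand, and with three rules about one.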

Then there are white rules: indicators that the message is not spam. There are many instances of "spammers never do this" or "spammers can't do this" that you can look at to take good email out of spam testing and pass it. So once you apply your "this is probably spam" rules, you then subtract out the "this isn't spam" rules, and what you have left is highly accurate.

It's like gambling in Vegas. In the long run the casino always wins. There is no jackpot so large that the casino won't win it back if you don't walk away and keep gambling. Spam filtering is like that. The more information you have and the more rules you apply, the more accurate it becomes.

So - here's how it applies to fast flux. FF is a strong indicator of phishing. Probably fewer than 1 in 100,000 fluxing domains is legitimate. But it is still very important to protect the free speech of the one in 100,000. I, for example, would not want to be suspended just because I reduced my TTLs because I was going to move servers to a new data center. And we wouldn't want to block people in Tibet from circumventing the Chinese firewall. But - FF is still a strong indicator and would be a valuable piece of information as part of a bigger picture.

If the FF is spam-bot driven then that too is a strong indicator. And the combination of spam-bot driven and FF is a very strong indicator. But is it strong enough? If, on top of that, the from address is on a list of banks that are often spoofed and the FCrDNS (forward-confirmed reverse DNS) check on the sending host points to an IP address in China, that makes a very strong case.

Then there are white rules: rules that prevent take down of good domains. These rules are used to help protect against false positives. For example, if the domain in question is 10 years old then it would be blocked from automatic take down. We can come up with a lot of "white rules" for domains that would never be available for automated take down no matter what the "black rules" were. So we would have a narrow set of domains that fall outside the white rules and are available for take down if there are enough black indicators to do so.
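The "black rules minus white rules" idea can be sketched roughly like this. Everything here is hypothetical: the Domain fields, the single white rule (10 years old) and the requirement of three black indicators are illustrative assumptions, not a proposed policy. A real system would have many more rules on both sides.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    # All fields are hypothetical inputs that some other system would supply.
    name: str
    age_years: float
    fast_flux: bool          # black indicator: domain is fluxing
    spam_bot_driven: bool    # black indicator: hosted on spam bots
    spoofs_known_bank: bool  # black indicator: imitates an often-spoofed bank


def is_whitelisted(domain):
    """White rule (assumed): domains 10 or more years old are never auto-suspended."""
    return domain.age_years >= 10


def black_indicator_count(domain):
    """Count how many black rules the domain hits."""
    return sum([domain.fast_flux, domain.spam_bot_driven, domain.spoofs_known_bank])


def eligible_for_automatic_takedown(domain, required=3):
    """White rules win outright; otherwise enough black indicators are needed."""
    if is_whitelisted(domain):
        return False
    return black_indicator_count(domain) >= required


phish = Domain("example-phish.test", 0.1, True, True, True)
moving = Domain("example-moving.test", 12, True, False, False)
print(eligible_for_automatic_takedown(phish))   # True
print(eligible_for_automatic_takedown(moving))  # False
```

The point of checking the white rules first is that no pile of black indicators can override them, which is what keeps the long-established domain that merely lowered its TTLs safe.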

So - at this point it looks very safe - but would we catch anyone? I think so, because criminals are limited in what they can do. And most phishing activity might still be within the "black rules minus the white rules" range. So this would still be very effective.

Even for domains that fall just outside that range (suspicious but not quite there), a message can be sent to a real person alerting them to look into possible abuse. Those domains can then be evaluated manually to determine whether they need to be suspended or not.
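The three possible outcomes described so far (automatic take down, alert a human, do nothing) could be sketched as a simple triage function. The thresholds and the outcome names are made up for illustration. The sketch also assumes that a white-listed domain with many black indicators still goes to a human for review rather than being ignored; that is one possible reading, not something settled above.

```python
def triage(black_hits, whitelisted, takedown_at=3, review_at=2):
    """Decide what to do with a domain.

    black_hits  -- number of black rules the domain matched
    whitelisted -- True if any white rule protects the domain
    The thresholds are illustrative assumptions only.
    """
    if black_hits >= takedown_at and not whitelisted:
        return "automatic_takedown"
    if black_hits >= review_at:
        # Suspicious but not quite there, or protected by a white rule:
        # alert a real person instead of acting automatically.
        return "manual_review"
    return "no_action"


print(triage(3, whitelisted=False))  # automatic_takedown
print(triage(3, whitelisted=True))   # manual_review
print(triage(2, whitelisted=False))  # manual_review
print(triage(1, whitelisted=False))  # no_action
```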

And - you have to accept that you are never going to catch all of them. So if you can cut phishing by 10%, that's progress. If you then come up with another rule that cuts 5% more - that's more progress.

Anyhow - hope this educates those of you who don't understand some of the technology and thinking about how this is done.



