Black hole for Referrer Spam

After the kids go to bed, I listen to NPR podcasts while cleaning up the house before getting back to work. Tonight I heard a story about a computer program used to trap telemarketers that made me laugh. The whole thing got me thinking about an old idea I had for dealing with referer spam.

Image credit: NASA Goddard Space Flight Center, https://www.flickr.com/photos/gsfc/5740471915/

Real quick: “referer spam” is a technique spammers use to get their URLs to show up in your analytics. I believe the goal is that you’ll see some URL sending your site lots of traffic, visit it out of curiosity, and the spammers make a penny (or something). It also skews your analytics data, so you have to do extra work to filter out the noise.

Frustrating to say the least.

One recommended way to thwart referer spam is to add rewrite rules that abort any connection from known spammers. You can do this in IIS (with the URL Rewrite module) using something like:

<rewrite>
  <rules>
    <rule name="Abort referer spam" stopProcessing="true">
      <match url=".*" />
      <conditions>
        <add input="{HTTP_REFERER}" pattern="(example\.com)|(example\.org)|(example\.net)" />
      </conditions>
      <action type="AbortRequest" />
    </rule>
  </rules>
</rewrite>

Of course, maintaining the list takes constant effort (there are better ways to encode the list; this was just a quick example). Ultimately, it’s an arms race with a pig. And everyone knows if you do battle with a pig, you both get muddy and the pig likes it.
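On the aside about better ways to encode the list: one option is to keep the domains as plain data and compare the Referer's host against them, rather than growing one long regex. Here is a rough Python sketch of that matching logic. The function name and the blocklist set are made up for illustration, and this is not something IIS runs for you; it only shows the idea.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist: the placeholder domains from the rule above.
SPAM_DOMAINS = {"example.com", "example.org", "example.net"}


def is_spam_referer(referer, blocklist=SPAM_DOMAINS):
    """Return True if the Referer's host is a blocked domain or a subdomain of one."""
    try:
        host = urlsplit(referer).hostname or ""
    except ValueError:
        # Malformed Referer values are treated as "not on the list".
        return False
    labels = host.lower().split(".")
    # Check "www.example.com", then "example.com", then "com".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))
```

Because it compares whole host names, a host like notexample.com does not match the example.com entry, and adding a spammer is a one-line data change instead of a regex edit.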

So I started wondering if there was any way to make referer spam expensive for the spammer. The idea I came up with was to redirect the spammer to a “black hole server”. So, instead of using AbortRequest, use Redirect to send the spammer to a server that accepts the connection but never responds, or responds very, very slowly.

There are a lot of problems with the idea.

First, the black hole server will need to handle lots of connections. If I recall my network programming correctly, connections are a finite resource for a server. But a lot has changed since I wrote low-level networking code in college so maybe this is a solved problem today.
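To make the connection question a bit more concrete, here is a minimal sketch of what a black hole server could look like using Python's asyncio. Everything in it is an assumption for illustration: the port, the ten-second delay, and the choice to dribble out single bytes are mine, not part of any existing service.

```python
import asyncio

# Hypothetical setting: how long to wait between single-byte "responses".
DRIP_DELAY_SECONDS = 10.0


async def tarpit(reader, writer, delay=DRIP_DELAY_SECONDS, max_drips=None):
    """Accept a connection, then send one byte at a time, very slowly."""
    sent = 0
    try:
        # Start a plausible-looking HTTP response so the client keeps waiting.
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        await writer.drain()
        # max_drips=None means "keep going until the client gives up".
        while max_drips is None or sent < max_drips:
            await asyncio.sleep(delay)  # a suspended coroutine, not a blocked thread
            writer.write(b" ")
            await writer.drain()
            sent += 1
    except ConnectionError:
        pass  # the client hung up, which is fine
    finally:
        writer.close()


async def serve(host="127.0.0.1", port=8081):
    """Run the tarpit until interrupted (host and port are placeholders)."""
    server = await asyncio.start_server(tarpit, host, port)
    async with server:
        await server.serve_forever()

# To try it locally: asyncio.run(serve())
```

Each waiting client here is just a sleeping coroutine on one event loop, so the practical ceiling tends to be open file descriptors and memory rather than threads. That does not make connections free, but it suggests holding a lot of idle ones open is less of a problem than it used to be.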

Second, I’m not sure the spammers need to wait for the response. Some spam is done simply by connecting to the server. However, if I understand Google Analytics correctly, it’s all based on JavaScript executing in the page. If so, then the spammers do need to wait, and the black hole would be effective.

Third, spammers could start to recognize known black hole servers and abort their connection instead of following the redirect. The only solution is to have many, many black hole servers.

Finally, I understand some spammers bypass your website completely and throw spam directly at Google Analytics. Not much to be done here but let Google do their own anti-spam thing, I suppose.

It’d be cool if there was a service that offered black holes and kept the known spammer list up to date on my site somehow.

So what do you think? Would this work? Has someone already done it?

In the meantime, keep coding. You know I am.