Blocking a range of Spam Forum posts with Regex and Request Filters

Aug 28 2014

I had a period where the SMTP settings on my website weren’t working correctly.  This meant I was getting no notifications for things like new user registration, new forum posts, new blog comments and the like.

Leave anything out in the open like that for a while, and the bad elements sneak up on it.  And so when I was fixing the SMTP I noticed to my astonishment that a spammer had snuck up on a seldom-used forum and let loose with a bot that had created 4,000 spammy posts linking to all the usual hits of the internet link world.

I fixed that pretty easily by running a delete.  The easiest way to do that is to do it properly - I found the stored procedure that my forum software uses for delete, and created a T-SQL cursor around that procedure.  Doing it this way ensures you aren’t wrecking any relational links between tables.  It’s always tempting to just let rip a delete on the ‘posts’ table, but it’s better to take the time and work out how to do it properly.

I thought that was the end of it, but I was wrong.  A whole host of bots of greater and lesser nature had indexed all these spammy forum posts, and were returning to check on them (no doubt, these bots were excited by the fake handbags and other exciting offers).   But now my forum software was throwing exceptions for each of these bot return visits – perhaps I didn’t do that delete as well as I thought.

My preferred fix in any situation like this – you’re getting unwanted bot traffic for unwanted URLs – is to start returning 404 errors for the unwanted URLs.

The easy way to do this is to set up a request filter (this will go by many names in many different platforms) – and use a regex pattern.  But how to set up a pattern which will let through the ‘clean’ URLs and block the bad ones?
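As a rough illustration of the idea, here is a minimal sketch of a request filter written as Python WSGI middleware. This is not the platform the site runs on; the pattern, `forum_app` and `status_for` are hypothetical stand-ins used only to show the mechanism: test the requested path against a regex and answer 404 on a match.

```python
import re

# Hypothetical placeholder pattern, not the one used on the real site.
BLOCKED = re.compile(r"^/postid/999[0-9][0-9]$")

def request_filter(app, pattern=BLOCKED):
    """Wrap a WSGI app so that any path matching `pattern` gets a 404."""
    def filtered(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if pattern.search(path):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found"]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)
    return filtered

def forum_app(environ, start_response):
    # Stand-in for the real forum software.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"forum post"]

def status_for(path, app=request_filter(forum_app)):
    """Helper for trying the filter out: returns the status line for a path."""
    seen = []
    app({"PATH_INFO": path}, lambda status, headers: seen.append(status))
    return seen[0]

print(status_for("/postid/99912"))  # 404 Not Found
print(status_for("/postid/12345"))  # 200 OK
```

The filter answers before the forum code ever runs, which is the point: the bots get a clean 404 instead of triggering an exception.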

The way I approached this was by working out the range of the deleted posts.  This was as simple as selecting the top 10 posts – which I knew included the range of the now-deleted posts.  This gave me the following results:

13355

13718

13763

13764

13765

 

From that, I can tell that the bogus ranges were 13356-13717 and 13719-13762, with the legitimate post 13718 sitting between them.
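The same deduction can be sketched in a few lines of Python. The helper name `deleted_ranges` is made up for this example, and it assumes post IDs are handed out sequentially, so every gap between two surviving IDs is a run of deleted posts.

```python
def deleted_ranges(surviving_ids):
    """Return (first, last) pairs for each gap between consecutive surviving IDs."""
    ids = sorted(surviving_ids)
    # A gap exists wherever two neighbouring surviving IDs differ by more than 1.
    return [(a + 1, b - 1) for a, b in zip(ids, ids[1:]) if b - a > 1]

surviving = [13355, 13718, 13763, 13764, 13765]
print(deleted_ranges(surviving))  # [(13356, 13717), (13719, 13762)]
```

The surviving IDs themselves are never part of a gap, which is why 13718 has to be treated as a special case later on.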

So, how can you define a numeric range in a regular expression?

Here is what I did.

First, here’s the expression I used:

/postid/(?!(13718|1376[3-9]|137[7-9][0-9]))13[4-7][0-9][0-9]

Regular Expressions are inelegant in appearance at the best of times.  Let’s break it down:

The first part just sets up the pattern for the URL I want to block:

/postid/_a post id_

The last part gives a broad range that I want to block, using fixed-position numbers, then digit ranges.  So we have 13[4-7][0-9][0-9] which matches any number from 13400 to 13799.  The bad posts actually started from 13356 but I’m just trying to block the bulk of them, not get every last one – maybe I will come back to that.
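A quick way to convince yourself what the broad part covers is to enumerate it. This is a small Python check using the standard `re` module; it uses `fullmatch` so that only exact five-digit IDs count.

```python
import re

broad = re.compile(r"13[4-7][0-9][0-9]")

# Every ID in the 13xxx neighbourhood that the broad part matches in full.
matched = [n for n in range(13000, 14000) if broad.fullmatch(str(n))]
print(matched[0], matched[-1], len(matched))  # 13400 13799 400

# Deleted IDs below 13400 that this pattern deliberately leaves alone.
missed = [n for n in range(13356, 13400) if not broad.fullmatch(str(n))]
print(len(missed))  # 44
```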

This expression works great, but there is a problem – it also blocks the legitimate post in the middle of the deleted range (13718), and it blocks any new posts from 13763 up.

So I am going to exclude those from my blocked range by using a negative lookahead – this is the (?!  ) bracketed expression.  It means ‘don’t match if what follows matches the pattern in the brackets’.   Then, in my brackets, I have an exact match

13718

followed by a | – which means ‘OR’ in Regex

then I have:

1376[3-9]

which matches 13763 to 13769, and then finally:

137[7-9][0-9]

which matches any number from 13770 up to 13799 – the top end of the broad range.

So the regular expression is used to match a broad range from 13400 to 13799, then exempts the special case 13718, and the range from 13763-13799.
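To check that summary, here is a small Python sketch that runs the expression over a spread of post IDs. It assumes the request filter does a partial ("search") match against the URL path, which may not be how every platform applies its patterns; `is_blocked` is a name made up for the example.

```python
import re

# The expression from the post, unchanged.
PATTERN = re.compile(r"/postid/(?!(13718|1376[3-9]|137[7-9][0-9]))13[4-7][0-9][0-9]")

def is_blocked(path):
    """True if the filter pattern matches somewhere in the path."""
    return PATTERN.search(path) is not None

blocked = [n for n in range(13300, 13900) if is_blocked(f"/postid/{n}")]

print(blocked[0], blocked[-1])       # 13400 13762
print(13718 in blocked)              # False - the legitimate post survives
print(is_blocked("/postid/13763"))   # False - new posts are not blocked
print(is_blocked("/postid/13719"))   # True  - spam range is blocked
```

One thing to be aware of: the expression has no anchor after the last digit, so under a partial match a longer ID such as 134001 would also be caught. If that ever matters, appending (?![0-9]) to the end of the expression is one possible guard.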

And with that, the spammers’ URLs are destroyed and unreachable forever, while the legitimate forum posts are free to go about their business.

The power of Regex can be very fun sometimes – especially when you’re defeating the ranks of the spammers and moochers of the internet!

Hi, I'm Bruce Chapman, and this is my blog. You'll find lots of information here - my thoughts about business and the internet, technical information, things I'm working on and the odd strange post or two.