“Get a website,” they said. “It’ll get you heaps of new clients,” they said. You’ve invested in a website that acts as an online brochure, with the aim of bringing in clients and potential sales. It’s got a contact form, and maybe you have a blog on there to show you are still relevant… Isn’t it disheartening when it feels like the only contact you get through the site is spam? It plagues your inbox, it gets filtered to your spam folder, and then you never know what is legitimate and what isn’t… Aaaargh!
We hear it a lot: “I’ve started getting a lot of spam from my website…” First, we are going to go through how on earth all the spam is getting there in the first place. Then we’ll go through a list of preventive tools that you can use to avoid getting bogged down in the ‘noise’, allowing you to focus on your real clients, ideally without forcing them to jump through hoops to prove they are legitimate.
So… Where is all this spam coming from?
Nowadays, most spam generated on websites comes from automated processes, often referred to as bots. Basically, some clown somewhere decides it’d be great to get their message in front of the website owner or, in the case of blogs, even in front of your target audience by getting their comments published on your website. Yeah, most of the time you wouldn’t be daft enough to publish their comments, but if someone does, and they get their message / website link hosted on your site, then your SEO helps promote their SEO and they win the battle of Google-sberg. Not ideal for a clean internet. But that little bit of cheap bot code can be run against hundreds of websites, and keep on trying at no further cost to the people who developed it, with potential for payoff, so the spam keeps rolling in. Small tweaks to the bot code get around the little changes made to try to block it. So, we’ve got to get smart.
Your standard bot simply reads the code that is used to display a form on your page. It then plucks out all the input fields, populates them with some form of content, and fires them back at your website, which then emails the submitted data to you / someone. One of the simplest methods of detecting bogus entries on your site is to add an extra field to your forms that is hidden from normal users (e.g. using the CSS property “display:none” or similar, ideally applied via a class name so that it’s harder for the bot to recognise it as a hidden field). If content is submitted in the hidden field, i.e. content that no normal user would have been able to fill in, then we can pretty reliably say that the submission is bogus. This type of spam rejection is sometimes called a ‘honeypot’ – the bot sees the lure of another input to fill in, gets its hand in the jar, and is consequently found with honey stuck to it. Poor thing.
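To make the honeypot idea a bit more concrete, here is a minimal sketch of the server-side check in Python. It assumes the submitted form arrives as a simple dictionary of field names to values. The field name "company_website" and the function name are made up for illustration, and your form plugin or framework will have its own way of handing you the data.

```python
from typing import Mapping

# Hypothetical name for the hidden field. Pick something a bot will want to fill in.
HONEYPOT_FIELD = "company_website"


def is_probably_bot(form: Mapping[str, str], honeypot_field: str = HONEYPOT_FIELD) -> bool:
    """Return True when the hidden honeypot field came back with content."""
    # A real visitor never sees the field, so it should be missing or blank.
    value = form.get(honeypot_field) or ""
    return bool(value.strip())


# Two made-up submissions: one from a person, one from a bot that filled in everything.
human = {"name": "Ana", "message": "Can you quote me for a new site?", "company_website": ""}
bot = {"name": "x", "message": "cheap pills", "company_website": "http://spam.example"}

print(is_probably_bot(human))  # False
print(is_probably_bot(bot))    # True
```

The check only ever looks at the one hidden field, so a legitimate visitor has nothing extra to do. A bot that learns to skip the field will sail through, which is why this works best as a first line of defence rather than the only one.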
Many form plugins for popular web systems, like Gravity Forms for WordPress, have honeypot-style traps built in that you can enable on the forms you create with them. When evaluating form plugins, I’d recommend treating this as a quick win that helps sway your choice of the best fit.
Captcha, ReCaptcha, and annoying your users.
Sounds sinister eh? Don’t Captcha me! But what is a Captcha? You’ve likely seen them. It’s those funny wee ‘Type the text you see in the image’ questions that you get on some forms, and half of them aren’t even readable, and you just get that little bit frustrated ‘cos it’s effort. It’s not even for your benefit!
So Captcha is the term for those image recognition questions. Why do we have them? Well – because they are hard. Not just for humans – they are really hard for computers to figure out. How do I tell a line or shape from a letter of the alphabet? Humans are great at pattern recognition, especially when trained to do it since around the age of 5. Computers? The harder the image (e.g. warped text, lots of foreign objects, characters without solid borders etc) the less likely the computer / bot will be able to resolve it to a correct answer. This method works well at preventing spam… but also at putting off legitimate clients, unless they have good enough reason to contact you to get past the hurdles you put in front of them.
So then Google put some weight behind ReCaptcha – a similar concept, but with some extra smarts behind it. Instead of just throwing an image onto the page, it uses some code that only runs in your visitor’s web browser to add the verification image, and outsources validation of the answer to the ReCaptcha service. Pretty cool stuff. Still a pain for your users to fill in, but doing it this way gives the same powerful tools to more form systems out there on the web, in a consistent way, and has good rejection rates.
The latest versions of ReCaptcha mostly do away with the images – you get either a wee tickbox to tick to show you are a human (with an image puzzle only if it still has doubts), or nothing visible at all, with neat detection algorithms judging the visit in the background. Many websites rely on this method – it’s not perfect, but it does a pretty good job against most incoming spam.
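Whichever flavour you use, the tickbox or invisible check in the browser only produces a token. Your server still has to ask the ReCaptcha service whether that token is any good. Below is a rough sketch of that step in Python. It posts the token and your secret key to Google’s siteverify endpoint and reads the “success” flag plus, for the score-based invisible version, the “score”. The function names, the 0.5 threshold and the stand-in sender used in the demo are my own assumptions for illustration, not something your form plugin necessarily does.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def post_form(url: str, fields: dict) -> dict:
    """POST form-encoded fields and decode the JSON reply."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=10) as reply:
        return json.loads(reply.read().decode("utf-8"))


def passes_recaptcha(token: str, secret: str, min_score: float = 0.5,
                     send: Callable[[str, dict], dict] = post_form) -> bool:
    """Return True when the verification service accepts the browser's token."""
    if not token:
        return False  # nothing was submitted, so there is nothing to verify
    result = send(VERIFY_URL, {"secret": secret, "response": token})
    if not result.get("success"):
        return False
    # Only score-based keys return a "score"; tickbox keys do not.
    score = result.get("score")
    # min_score is an assumed threshold; tune it against your own traffic.
    return score is None or score >= min_score


def fake_send(url: str, fields: dict) -> dict:
    """Stand-in for the real network call so the demo runs offline."""
    return {"success": True, "score": 0.9}


print(passes_recaptcha("token-from-browser", "my-secret", send=fake_send))  # True
```

In real use you would leave out the send argument so the default post_form makes the actual request, and keep the secret key on the server only.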
Are there ways to weed out spam without relying on user entry / client side tricks?
How good of you to ask. Why yes. Yes there are. There are a number of services out there that you can forward your submitted data to, and they run filters on it to detect whether the content is obvious spam (anyone wanna buy some viagra or cialis?). In the WordPress world the most obvious one is Akismet. The great thing about these tools is that they can be run retroactively on previous comments in your system to weed out spam from them as well. Very helpful. Another we have had great success with, and which integrates with a variety of web systems, is Cleantalk.
These third party filtering systems use learning filters to target the ‘in season’ spam content trends and block them, so you don’t need to stay on top of them yourself. They aren’t perfect – you may get some false positives (legitimate messages that are seen as spam on content analysis alone) but typically they give you good interfaces for whitelisting content or users so the systems can learn from their mistakes.
So… What should we do?
The best approach to most problems is multi-faceted. The options presented above all attack spam submissions in different ways: traps, challenges, and filters. We have found our most reliable setups have been mixtures of these, depending on the context of what we are looking to protect. To prevent spam in blog comments and contact forms / calls to action, a honeypot to catch most of the bots plus Cleantalk to catch the ones that get through is a good fit. For user registration forms or user login protection, ReCaptcha works well, as your client already knows they have some work to do to get at the goodies in store once they get past your hurdles.
Find what won’t annoy your users, and use that. There are plenty of options out there. Still stuck, or not sure how to implement your changes? I know some people who could help.
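As a rough illustration of layering a trap and a filter on a contact form, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the field names, the function names, and especially the keyword check, which is only a stand-in for a call to a third party filtering service.

```python
from typing import Callable, Mapping


def handle_submission(form: Mapping[str, str],
                      looks_like_spam: Callable[[str], bool],
                      honeypot_field: str = "company_website") -> str:
    """Run a submission through the trap first, then the content filter."""
    # Layer 1, the trap: cheap, local, and catches the laziest bots.
    if (form.get(honeypot_field) or "").strip():
        return "rejected: honeypot"
    # Layer 2, the filter: only runs for submissions that dodged the trap.
    if looks_like_spam(form.get("message") or ""):
        # Held rather than deleted, so a false positive can still be rescued.
        return "held: filter"
    return "accepted"


def keyword_filter(text: str) -> bool:
    """Stand-in for a third party filter; a real service is far smarter."""
    return any(word in text.lower() for word in ("viagra", "cialis"))


print(handle_submission({"message": "hi", "company_website": "http://spam.example"}, keyword_filter))
# rejected: honeypot
print(handle_submission({"message": "Buy VIAGRA now"}, keyword_filter))
# held: filter
print(handle_submission({"message": "Can you quote me for a new site?"}, keyword_filter))
# accepted
```

The order matters a little: the honeypot costs nothing to check, so it goes first, and only the submissions that survive it get sent on to the filter.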