Learn how the development team at Metal Toad combats scraper bots and protects website performance.
The Internet has always been a difficult place to run a website. From the earliest days, there have always been malicious actors attempting to compromise websites for their own purposes, whether that is to take advantage of website users, spread their own propaganda, or use the site itself as an intermediary platform to launch attacks on other sites. Even success can be its own problem for a smaller site. A site brought to the attention of a large number of interested people can suffer the Slashdot Effect, otherwise known as the ‘Hug of Death’, be it from Slashdot itself, Hacker News, the main page of Reddit, or even 'just' a larger, more popular site such as Daring Fireball.
Automation and scale have, as in so many other areas, made things worse. The rise of automated crawlers, including search engines, making regular passes over entire sites led to a variety of ways of blocking unwanted traffic: WAFs and HTTP servers forbidding traffic with certain characteristics, or flatly banning specific IP addresses, then CIDR blocks, and eventually entire geographic regions or network operators via ASN blocks.
For traffic that was potentially wanted, but only under certain rules, something of a modus vivendi was reached between websites and automated crawlers and other programmatic users. Automated systems that wanted to make use of a website (for search indexing or other reasons) would follow the rules set out by the target site. This agreement was expressed in the robots.txt file, first informally starting in 1994 and eventually formalized in 2022 as RFC 9309. Not every automated system followed it, but the majority of the really big ones, particularly Google, did, allowing everything from enormous search engines to small hobbyist tools to interact with a site without putting so much pressure on it that it became too expensive to run.
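As an illustration (and not any particular site's actual file), the agreement amounts to a short set of per-crawler rules. GPTBot here stands in for an AI training crawler that documents honoring robots.txt, and the paths are placeholders:

```
# Illustrative robots.txt only; not any specific site's real file.
# Keep all crawlers out of expensive, uncacheable paths.
User-agent: *
Disallow: /search/
# Crawl-delay is advisory: honored by some crawlers, ignored by others.
Crawl-delay: 10

# Opt a specific AI training crawler out entirely.
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

The catch, of course, is that this only works when the crawler chooses to read and respect the file.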
This was just in time for things to get exponentially worse.
Modern Generative AI models need content for training. The major ones have an effectively insatiable demand for new training data from anywhere they can get it and, without getting diverted into the legal ramifications of that demand, they've proven willing to do 'whatever's necessary' to get that data. This includes many scrapers ignoring the existence of robots.txt and its rules. Worse, there is an enormous amount of capital flowing into AI startups and, if not the startups themselves, then into services seeking to sell to those startups. The more sites have used traditional methods to fend off unwanted scrapers, the more valuable scraper services that bypass those methods have become. The result has been vastly increased pressure on smaller sites over the past six months, with sites that have never had particularly large amounts of traffic seeing request volumes increase dramatically.
The result is what might be expected: a site that had run perfectly well for years on a small server instance, without a CDN or much more than a basic WAF in front of it, was suddenly deluged with traffic. Even where the site didn't collapse outright from exhausted compute or database resources, the bandwidth cost of serving it increased enormously, for no benefit to the site owners.
Metal Toad, in its role as an MSP, has seen this happen to more than one of our clients, and I will cover what happened and how we responded. One of the sites I can discuss in detail; for the other I will outline what happened but keep the site's name private. Each used a different method of responding to the traffic, and each response was successful in its own way.
For sites that cannot or do not want to use either Amazon Web Services or Cloudflare, I will also outline another option, Anubis, albeit one I do not have significant personal experience with at the time of writing.
RPGnet is a long-standing site for people interested in tabletop role-playing games (TTRPGs), the most famous example of which is Dungeons & Dragons. The site was first launched in 1996, almost 29 years ago as of this writing. The most active part of the site is a forum for user discussion, but there is a significant amount of other user-generated content, including game reviews and a wiki. All of this is attractive to AI scraper bots for training purposes, albeit on a smaller scale, for much the same reason that Reddit is one of the most attractive targets: it is a stream of frequently updated user activity that can be fed to models for training, particularly on recently published subjects such as new games.
RPGnet came under Metal Toad's site management within the past couple of years. It brought with it, like any site that's been around that long, a variety of underlying technologies which aren't necessarily easily replaced and which, if replaced too quickly or hastily, could break the site and drive its user traffic elsewhere, removing the reason for the site's existence. A compounding problem was that while the site itself is quite valuable for its users, it is a relatively small site and therefore the budget for upgrades and engineering work on the site is itself comparatively small. As such, bringing its existing software up to date and putting additional security measures in place to protect the site without breaking user functionality was the primary goal. This has allowed the site to continue to function, with small changes made here and there, while creating a long term roadmap for serving the existing user base and adding functionality to attract new users.
This 'move slowly and do no harm' strategy then ran into the reality of a massive traffic spike over a period of hours to days. It came from many geographically disparate locations, from many data centers, and from a variety of networks, including networks where our existing user base resides and originates its requests. This many-origin strategy on the scrapers' part was what rendered our previous block/ban tools ineffective. Like many other sites, we had gone from banning single malicious IPs to CIDRs and eventually to certain geographic and even ASN blocks, but in this case there were so many different origin IP addresses and networks that there was no feasible way to block the traffic without effectively rendering the site inaccessible for most, if not all, of the user base.
We were not in a position to add bandwidth and compute to serve the additional demand: the forum software in particular, while perfectly functional for its user base, was not built to scale horizontally, and we were capped on the origin size for the server and database. Nor was caching through a CDN an option, even if we'd been willing to eat the bandwidth cost: the bots vary their requests in order to break caching, and since the most popular part of the site is the forum, that part is relatively dynamic.
We were also in something of a hurry. The site was increasingly unusable and the user base was quite rightly unhappy with this situation. In addition, the site funds itself partly through advertising, and while the site was serving an enormous amount of traffic to the scrapers, it wasn't generating any ad revenue.
One of our potential options, moving the site behind AWS's Bot Control service, was not ideal. The site's origin is outside of AWS, and routing that traffic back into AWS would have incurred significant extra cost for a load balancer, WAF, and CloudFront as a CDN. Fortunately there was another option: Cloudflare.
(At the time I was not aware of the existence of Anubis, but in retrospect we would almost certainly have picked Cloudflare over Anubis as the time pressure and labor cost involved outweighed the do-it-yourself advantages).
Cloudflare was attractive in the moment for a variety of reasons: ease of integration, low cost, and particularly its much ballyhooed 'Super Bot Fight Mode'. That last was definitely advertising, but it was advertising that worked well enough to get us interested. We temporarily pointed the site to a static maintenance page (AWS S3 + CloudFront) and began working to migrate the site behind Cloudflare. We ran into a few speed bumps.
The first speed bump was DNS. Cloudflare essentially wants to control DNS for the sites it fronts as a CDN, something not required by AWS (though it does make certain things easier, and others possible). We had been hosting RPGnet's DNS in Route53 while we prepared to implement some of the long term changes. We immediately had to export the zone from Route53, something Amazon makes harder than it should be, correct some of the records, then import it into Cloudflare. We then had to change the glue records for the domain, and validate that nothing critical broke.
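For anyone facing the same export, a rough sketch of pulling the records out of a hosted zone with boto3 might look like the following. The zone ID is a placeholder, Route53 ALIAS records have no direct equivalent in a standard zone file and need manual attention, and the output still wants hand-checking before being imported into Cloudflare:

```python
# Rough sketch: dump a Route53 hosted zone's records so they can be
# hand-checked and re-imported elsewhere. The zone ID is a placeholder.
import boto3

ZONE_ID = "Z0000000EXAMPLE"  # hypothetical hosted zone ID

route53 = boto3.client("route53")
paginator = route53.get_paginator("list_resource_record_sets")

for page in paginator.paginate(HostedZoneId=ZONE_ID):
    for rrset in page["ResourceRecordSets"]:
        name, rtype, ttl = rrset["Name"], rrset["Type"], rrset.get("TTL", 300)
        if "AliasTarget" in rrset:
            # ALIAS records have no BIND equivalent; flag them for manual review.
            print(f"; MANUAL: {name} {rtype} ALIAS -> {rrset['AliasTarget']['DNSName']}")
            continue
        for rr in rrset.get("ResourceRecords", []):
            print(f"{name}\t{ttl}\tIN\t{rtype}\t{rr['Value']}")
```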
Once DNS had been moved, we began cutting the various sites on that origin over to Cloudflare. We quickly found that some of our previous mitigation measures prevented this from working correctly, as we'd been blocking certain ranges that weren't 'user oriented'. Adding Cloudflare's IP ranges to the allow-list for the origin fixed this. We also simplified origin access at this time, preventing everything but Cloudflare and some internal IP ranges from reaching the origin at all, and moved WAF functionality primarily into Cloudflare, making use of the WAF's integration with its anti-scraping functionality.
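Cloudflare publishes its ranges at well-known URLs, so the origin allow-list can be regenerated rather than maintained by hand. A minimal sketch follows; the internal range and the nginx-style output format are just examples, not what we actually run:

```python
# Sketch: regenerate an origin allow-list from Cloudflare's published IP ranges.
# Output here is nginx-style 'allow' directives; adapt to whatever fronts your origin.
import urllib.request

CF_LISTS = [
    "https://www.cloudflare.com/ips-v4",
    "https://www.cloudflare.com/ips-v6",
]
INTERNAL_RANGES = ["203.0.113.0/24"]  # placeholder for your own management ranges

cidrs = list(INTERNAL_RANGES)
for url in CF_LISTS:
    with urllib.request.urlopen(url) as resp:
        cidrs += [line.strip() for line in resp.read().decode().splitlines() if line.strip()]

for cidr in cidrs:
    print(f"allow {cidr};")
print("deny all;")
```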
The final server-side piece for each of the sites hosted on the origin was to enable or validate compatibility with Cloudflare's caching modes. This is where an old stack can be a problem: while the software did support generating caching headers, the forums in particular had never needed them to function. Fortunately the HTTP daemon had native (if unused) integration with the forum software, so it was mostly a matter of enabling it and then tweaking the configuration until it worked.
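One way to sanity-check the result is to spot-check what is actually coming back through the CDN. The sketch below (URLs are placeholders) just prints the Cache-Control header and Cloudflare's cf-cache-status header, which reports HIT, MISS, DYNAMIC, and so on, for a handful of pages:

```python
# Sketch: spot-check whether pages emit cache headers and whether Cloudflare
# is actually caching them (cf-cache-status: HIT/MISS/DYNAMIC/...).
import urllib.request

URLS = [  # placeholders; substitute real pages
    "https://example.com/",
    "https://example.com/forums/",
]

for url in URLS:
    req = urllib.request.Request(url, headers={"User-Agent": "cache-check/1.0"})
    with urllib.request.urlopen(req) as resp:
        cache_control = resp.headers.get("Cache-Control", "(none)")
        cf_status = resp.headers.get("cf-cache-status", "(not behind Cloudflare?)")
        print(f"{url}\n  Cache-Control: {cache_control}\n  cf-cache-status: {cf_status}")
```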
Cloudflare has multiple subscription tiers. We chose the 'Pro' plan in order to get access to 'Super Bot Fight Mode'. This was chosen for expedience: Cloudflare has a lot of options, we were in a hurry to get the traffic down and the site back online and usable, and 'Super Bot Fight Mode' came with a pre-packaged set of rules that looked like they would accomplish this. We also opted to turn on some of the other AI scraper rules, including 'Block AI Bots'. There is inevitably some overlap in these rules, and some of the 'Super Bot Fight Mode' rules (such as 'allow verified bots') may even override 'Block AI Bots'. The initial result, however, was a significant reduction in traffic, which was exactly what we were looking for.
In addition, and at the same time, we implemented some other measures which likely had at least some individual impact but overlapped many of the other rules. Ideally we would go back and tune these, but for now they are 'good enough'. They included (but aren't necessarily limited to) additional IP, network, and ASN block lists and challenge rules for certain geographies.
Finally, I would like to thank the SpaceBattles admins for providing us a copy of their current Cloudflare block list for IPs, networks, and ASNs. We implemented it and it is providing the basis for a non-trivial number of our block lists. We very much appreciate them being willing to help out in our time of need.
The results were better than 'good enough'. Our traffic dropped noticeably as a result of the block rules and the anti-scraper measures. Challenge mode allowed users in certain countries that we would otherwise have had to block to continue using the service without resorting to VPNs. And as a side effect, being forced to enable CDN caching has produced a dramatic perceived speed-up for our user community.
It isn’t perfect: there are some minor bugs that continue to manifest here and there. In addition, while the site can serve advertisements properly, effects on search engines and other services that drive traffic to the site are still not fully understood. And the site is now effectively reliant on Cloudflare to continue to function unless we want to either move it behind another anti-bot service or implement a local solution such as Anubis.
Approximately two weeks after RPGnet came under fire, one of our other clients had essentially the same thing happen. The site in question, rather than being the client's main site, is the public front end for an open source project the client maintains. Unlike RPGnet, it hosts neither a forum nor anything that could be described as rapidly changing content, and the main repository is hosted elsewhere, so its normal traffic rate is very low. What traffic there is, however, matters to the client and to the community using the software. Aside from the cost of serving the additional traffic, the increased load effectively DDoSed the site. This demonstrates that even a slowly changing site, one that should in theory be crawled a few times and then only rarely thereafter, is still going to be a target: its mere accessibility puts it on the list.
As with RPGnet, the application tech stack is old and not friendly to rapid modification. Unlike RPGnet, the site resides in Amazon Web Services and already makes use of multiple AWS infrastructure services: an application load balancer, CloudFront as a CDN, and WAF. We were able to address the problem with AWS Bot Control.
Here, initial implementation was simple: we were already making use of AWS WAF, including managed rulesets, so we simply added the Bot Control managed rule group to the existing WebACL. Mirroring some of what we'd done in Cloudflare with RPGnet, we also updated some of the IP, network, and ASN block lists. We were already using some geographic access controls, but we further turned on 'CAPTCHA' and 'Challenge' actions for certain geographic origins, experimenting with first one and then the other.
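For reference, the Bot Control managed rule group is attached to a WebACL as a rule entry along roughly these lines; the rule name, priority, and metric name below are placeholders, and in practice we made the change through the console rather than code:

```python
# Sketch of the rule entry that adds the Bot Control managed rule group to an
# existing WebACL. In practice it is appended to the ACL's Rules list via the
# console, IaC, or wafv2 get_web_acl()/update_web_acl(); values are placeholders.
bot_control_rule = {
    "Name": "AWS-AWSManagedRulesBotControlRuleSet",
    "Priority": 10,  # placeholder; must not collide with existing rules
    "Statement": {
        "ManagedRuleGroupStatement": {
            "VendorName": "AWS",
            "Name": "AWSManagedRulesBotControlRuleSet",
            "ManagedRuleGroupConfigs": [
                {
                    "AWSManagedRulesBotControlRuleSet": {
                        "InspectionLevel": "COMMON"  # 'TARGETED' is the pricier ML-driven mode
                    }
                }
            ],
        }
    },
    "OverrideAction": {"None": {}},  # let the managed rules' own actions apply
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "bot-control",
    },
}
```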
Much like RPGnet, the result was an immediate drop in traffic, and the site began working properly for users. It's worth noting that the Bot Control managed ruleset is an extra-cost item on top of running the WAF itself. Amazon offers other, security-related managed rules at no additional cost, but Bot Control costs $10 per month per WebACL. Very high traffic can also incur extra fees separate from the Bot Control ruleset, with the CAPTCHA action costing significantly more than Challenge; this led our client to choose Challenge for the various geographic rules, despite CAPTCHA having a slightly better block rate.
AWS also provides a 'Targeted' inspection level which theoretically uses machine learning to adapt on the fly to new threats. It is significantly more expensive than the 'Common' level, and we chose not to implement it. Like most things in AWS, the Bot Control managed ruleset has individual options (similar but not identical to Cloudflare's) and is versioned (unlike Cloudflare's), meaning it is incumbent on the WAF manager to test and update it as AWS updates the rules.
The end result was that implementing Bot Control significantly reduced traffic to the site. As we were already using AWS for the site infrastructure, it was also significantly faster to implement: two hours from the decision to go ahead until the measures were in place and the site had recovered to full functionality. (Fixing the Infrastructure as Code that manages the site infrastructure would take significantly longer, but as time was pressing, the initial setup was done through the console, and was 'good enough'.)
Further experimentation also led us to conclude that enabling the 'Challenge' action on its own produced a significant drop in traffic, and could be done without the full Bot Control managed ruleset. It didn't bring traffic back to pre-attack levels, but it was worth doing in its own right.
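As an illustration of that lighter-weight approach, a standalone geo-based Challenge rule, with no managed rule group involved, looks roughly like the following; the country codes, priority, and metric name are placeholders:

```python
# Sketch of a standalone WAF rule that challenges traffic from selected
# countries without the Bot Control managed rule group. Country codes are
# placeholders; pick them from your own traffic analysis.
geo_challenge_rule = {
    "Name": "challenge-selected-geos",
    "Priority": 5,  # placeholder
    "Statement": {
        "GeoMatchStatement": {
            "CountryCodes": ["AA", "ZZ"]  # placeholders, not real country codes
        }
    },
    "Action": {"Challenge": {}},  # silent browser challenge; "Captcha": {} is the costlier option
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "geo-challenge",
    },
}
```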
Whether implementing full Bot Control or just the Challenge (and potentially CAPTCHA) action, we would now recommend this mitigation for any client site already running on AWS infrastructure, particularly one with WAF in front of CloudFront or any other public entry point that can make use of WAF.
In both of the above cases we were fortunate enough to be able to use a commercial mitigation product. Others, for financial, organizational, or other reasons, may not be able to do so; and there has always been a 'do it yourself' contingent on the Internet. For anyone who falls into either category, there is Anubis. As I have not implemented it myself, I will do no more here than note its existence as an option.