
How To Block Legitimate And Illegitimate Bots from Crawling and Scraping Your Site

Martech Zone has grown in popularity in recent weeks… and with that growth, it’s also become a popular target for hackers and bots. Last week, my hosting company alerted me that my site was being hammered with what almost appeared to be a DDoS attack, but it was coming from a user agent called claudebot. It was hitting my site so hard that they had to move it to a new server, which would have cost six times more. At the time, I had no idea what that bot was or who unleashed it on my site, so my host helped me block it using an .htaccess file.

Websites are constantly visited by various types of bots, some legitimate and others malicious. These bots can consume significant server resources, slow website performance, and even scrape valuable content for competitive analysis. When a bot slows down your site, it impacts the user experience (UX) of your visitors and can severely impact search engine rankings if it’s ongoing.

As a company, it’s essential to understand how to block both legitimate and illegitimate bots to protect your website and ensure optimal performance for your human visitors.

Blocking Legitimate Bots

Legitimate bots, such as those from search engines and SEO tools, can strain your server resources if left unchecked. You may also not want an SEO tool’s bot to capture detailed information about your content and pages and surface it to your competitors within its platform.

While these bots serve a purpose, their aggressive crawling behavior can negatively impact your website’s performance. To mitigate this issue, you can use your .htaccess file to block specific bots based on their user agent strings.

How To Block Known Bots Using .htaccess

Blocking legitimate bots can help:

  1. Reduce bandwidth and resource usage
  2. Prevent content scraping
  3. Improve analytics accuracy
  4. Ensure compliance with third-party tool terms of service

Here’s a section of my .htaccess file that’s dedicated to blocking bots:

<IfModule mod_rewrite.c>
  RewriteEngine on
  RewriteBase /
  # Return 403 Forbidden for any request whose user agent matches one of these patterns
  RewriteCond %{HTTP_USER_AGENT} (Ahrefs|AhrefsBot/6\.1|AspiegelBot|Baiduspider|BLEXBot|Bytespider|claudebot|Datanyze|Kinza|LieBaoFast|Mb2345Browser|MicroMessenger|OPPO\sA33|PetalBot|SemrushBot|serpstatbot|spaziodati|YandexBot|YandexBot/3\.0|zh-CN|zh_CN) [NC]
  RewriteRule ^ - [F,L]
</IfModule>
  1. <IfModule mod_rewrite.c> and </IfModule>: These directives ensure that the enclosed rewrite rules are only processed if the Apache module mod_rewrite is available and loaded. This is a good practice to prevent errors if the module is not enabled.
  2. RewriteEngine on: This line enables the rewrite engine, allowing the use of rewrite rules.
  3. RewriteBase /: This sets the base URL for the rewrite rules. In this case, it is set to the root directory (/).
  4. RewriteCond %{HTTP_USER_AGENT} (Ahrefs|AhrefsBot/6\.1|…|zh_CN) [NC]: This line defines the condition for the rewrite rule. It uses the %{HTTP_USER_AGENT} variable to retrieve the user agent string from the request headers and checks it against the specified bot names and patterns. The patterns are enclosed in parentheses and separated by the pipe character (|), which acts as an OR operator, so the condition is met if the user agent matches any one of the listed bots. The [NC] flag makes the comparison case-insensitive.
  5. RewriteRule ^ - [F,L]: This line defines the rewrite rule that is triggered when the preceding condition is met. The ^ symbol matches the beginning of the request URL, and the - (dash) means no substitution is performed; the URL is left as-is. The [F,L] flags specify the actions taken when the rule matches: F (Forbidden) returns a 403 Forbidden response to the client, and L (Last) stops mod_rewrite from processing any further rules for the request.
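If you’d like to verify that the rules are working, you can send a request with one of the blocked user agent strings using curl (with example.com standing in for your domain) and confirm that the server responds with 403 Forbidden:

# Send a HEAD request identifying as claudebot; expect HTTP/1.1 403 Forbidden
curl -I -A "claudebot" https://example.com/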

Bot List

Here is a list of the bots that I’ve blocked, along with what I was able to learn about each.

  • AhrefsBot/6.1 and Ahrefs: Web crawlers used by Ahrefs, an SEO and website analysis tool. They crawl websites to gather data for backlink analysis, keyword research, and site audits.
  • AspiegelBot: The former name of the web crawler operated by Aspiegel SE, Huawei’s Ireland-based subsidiary; it was later renamed PetalBot (see below).
  • Baiduspider: Web crawler used by Baidu, a Chinese search engine. It indexes web pages for Baidu’s search results.
  • BLEXBot: Web crawler operated by WebMeUp, a backlink data provider. It gathers link data for SEO analysis.
  • Bytespider: Web crawler operated by ByteDance, the Chinese company behind TikTok. It is known for aggressive crawling and has been reported to ignore robots.txt.
  • claudebot: ClaudeBot, the web crawler operated by Anthropic, the AI company behind Claude; it gathers web data for its AI models. This is the bot behind the traffic surge described at the top of this article.
  • Datanyze: Web crawler used by Datanyze, a company that provides technographic data and sales intelligence.
  • Kinza: User agent of a Japanese Chromium-based web browser; when it appears in bulk crawling traffic, it is almost certainly being spoofed by spam bots.
  • LieBaoFast: User agent associated with the Chinese Liebao (Cheetah) mobile browser, frequently spoofed by scraping botnets.
  • Mb2345Browser: User agent of the Chinese 2345 Browser, likewise frequently spoofed by scraping botnets.
  • MicroMessenger: User agent for WeChat, a popular Chinese messaging and social media app developed by Tencent. Note that blocking it also blocks visitors browsing through WeChat’s in-app browser.
  • OPPO A33: The model name of a budget OPPO Android phone whose user agent string commonly appears in spam bot traffic.
  • PetalBot: Web crawler operated by Aspiegel SE, Huawei’s Ireland-based subsidiary. It crawls pages for Huawei services such as Petal Search (it was formerly identified as AspiegelBot).
  • SemrushBot: Web crawler used by Semrush, an SEO and online visibility management platform. It crawls websites to gather data for keyword research, site audits, and competitor analysis.
  • serpstatbot: Web crawler used by Serpstat, an all-in-one SEO platform. It is used for website analysis, keyword research, and competitor analysis.
  • spaziodati: Web crawler used by SpazioDati, an Italian company that provides data enrichment and semantic analysis services.
  • YandexBot/3.0 and YandexBot: Web crawlers used by Yandex, a Russian search engine and technology company. They crawl and index web pages for Yandex’s search results.
  • zh-CN and zh_CN: Not bot names but Chinese locale codes that appear in the user agent strings of many spam bots. Be aware that these patterns may also match legitimate browsers that report a Chinese locale in their user agent.

I did my best to research these, so please let me know if you see anything inaccurate. Several of the entries above are legitimate browser or device user agent strings rather than named crawlers; they made the list because high-volume scraping botnets frequently spoof them.
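It’s also worth noting that reputable crawlers such as AhrefsBot, SemrushBot, and PetalBot honor the robots.txt standard, so a gentler first step is to disallow them there; spam bots, of course, simply ignore the file. A minimal example:

# robots.txt — honored by well-behaved crawlers, ignored by spam bots
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /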

Identifying and Blocking Illegitimate Bots

Illegitimate bots, such as those used for content scraping, spamming, or malicious activities, often attempt to disguise themselves to avoid detection. They may employ simple techniques like mimicking legitimate user agents, rotating user agents, using headless browsers, or more complex techniques like distributing requests across multiple IP addresses.
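When a bot claims to be a well-known crawler, you can often verify the claim with a reverse-and-forward DNS lookup; Google, for example, documents this method for validating Googlebot. A quick check from a shell (the IP below comes from Google’s own documentation):

# Reverse lookup: a genuine Googlebot IP resolves to a googlebot.com hostname
host 66.249.66.1
# Forward lookup: that hostname should resolve back to the same IP
host crawl-66-249-66-1.googlebot.com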

To identify and block illegitimate bots, consider the following strategies:

  1. Analyze traffic patterns: Monitor your website traffic for suspicious patterns, such as high request rates from single IP addresses, unusual user agent strings, or atypical browsing behavior.
  2. Implement rate limiting: Set up rate limiting based on IP addresses or other request characteristics to prevent bots from making excessive requests and consuming server resources (see the sketch after this list).
  3. Use CAPTCHAs: Implement CAPTCHAs or other challenge-response mechanisms to verify human users and deter automated bots.
  4. Monitor and block suspicious IP ranges: Monitor your server logs and block IP ranges that consistently exhibit bot-like behavior (also shown in the sketch below).
  5. Employ server-side rendering or API-based data delivery: Make scraping more difficult by rendering content on the server-side or delivering data through APIs, rather than serving plain HTML.
  6. Regularly update bot-blocking rules: Continuously monitor and adapt your bot-blocking rules based on observed bot behavior, as illegitimate bots may evolve their techniques over time.
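As a starting point for items 2 and 4 above, here is a sketch assuming an Apache server with the third-party mod_evasive module installed. The thresholds are illustrative placeholders rather than recommendations, mod_evasive settings belong in your server configuration rather than .htaccess, and 203.0.113.0/24 is a reserved documentation range standing in for whatever range your logs implicate:

<IfModule mod_evasive20.c>
  # Rate limiting: temporarily return 403 to any IP exceeding these thresholds
  DOSHashTableSize  3097   # size of the per-child IP tracking table
  DOSPageCount      5      # max requests for the same URI per page interval
  DOSPageInterval   1      # page interval, in seconds
  DOSSiteCount      50     # max requests site-wide per site interval
  DOSSiteInterval   1      # site interval, in seconds
  DOSBlockingPeriod 60     # how long an offending IP stays blocked, in seconds
</IfModule>

# Manually block a suspicious IP range (Apache 2.4 syntax)
<RequireAll>
  Require all granted
  Require not ip 203.0.113.0/24
</RequireAll>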

Blocking both legitimate and illegitimate bots is crucial for protecting your website’s performance, resources, and content. By implementing strategic .htaccess rules and employing various bot-detection and mitigation techniques, you can effectively defend against the negative impact of bots on your website.

Remember, bot blocking is an ongoing process that requires regular monitoring and adaptation. Stay vigilant and proactive in your bot-blocking efforts to ensure a smooth and secure experience for your human visitors while safeguarding your website from the detrimental effects of bots.


