Attack of the Robots: Cleansing Your Data From Machine-Generated Traffic

Last week I got an email from an industry veteran, now working on analytics projects at a well-known agency. In it, he said, “I have a few clients who keep asking me to make sure that they’re not being overwhelmed by bots, given all the recent press about bot traffic being up to ⅓ to even ½ of web traffic.” Indeed, lately there have been a number of articles from respected publications covering the bot phenomenon, and its impact on web data, with one article reporting that up to 61.5% of all web traffic comes from bots. If more than half of your traffic comes from bots, who are neither human nor capable of becoming loyal customers, how can you trust your data at all? How can you confidently optimize your content or your marketing in this environment?

Most of the answer comes from the way the conclusions in articles like these are generated: based on Content Delivery Network (CDN) data. The purpose of a CDN is to receive requests for content from various entities (web browsers, mobile apps, bots, etc.) and quickly provide the requested content. As such, CDNs do not inherently require that the requesting entity meet technical requirements in order to receive whatever content has been requested. A CDN receives a request, serves it, and logs it. The analyses that we are seeing published are the result of these logs; the CDN looks at the raw user-agent strings and potentially other patterns across all requests and determines which requests came from bots and which did not. It is certainly possible that more than 50% of all traffic _at the CDN level_ comes from bots.
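To make the CDN-side approach concrete, here is a minimal sketch of classifying logged requests by user-agent string and computing the bot share. It is an illustration only, not how any particular CDN works: the `BOT_MARKERS` list, the `is_bot` and `bot_share` helpers, and the sample user-agent strings are all assumptions made for this example, and real classifiers rely on much larger lists plus behavioral signals.

```python
from collections import Counter

# Hypothetical markers for illustration; real bot lists are far longer
# and are combined with other signals (request rate, IP ranges, etc.).
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known bot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def bot_share(user_agents) -> float:
    """Fraction of logged requests whose user agent looks like a bot."""
    counts = Counter(is_bot(ua) for ua in user_agents)
    total = counts[True] + counts[False]
    return counts[True] / total if total else 0.0


# Every request is logged, whether or not the client can run JavaScript.
sample = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
]

print(f"Bot share of logged requests: {bot_share(sample):.0%}")
```

Because the log contains every request, this kind of count includes bots that never execute JavaScript, which is exactly the traffic a JavaScript-based analytics tool does not see.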

This is different from digital analytics tools, including Adobe Analytics, which typically (although not always) require JavaScript in order to record traffic and user behavior. The vast majority of Adobe Analytics implementations over the past several years have not recorded data for users (or bots) who cannot execute JavaScript. It was only a couple of years ago that Googlebot started executing JavaScript. Before that time, bot traffic from Google and Bing never showed up in Adobe Analytics reports. Now, those two are the top bots that we see, but there are still far more bots that have no need to execute JavaScript, and therefore are invisible to Adobe Analytics, but still visible to a CDN. Thus, and this is really the key point, CDNs will report far more bot traffic than you will ever see in Adobe Analytics.

My colleague, Bret Gundersen, recently did a small, cross-vertical study of bot traffic as a percentage of total page views over a month of data in report suites of various sizes. Here is a representative sample:

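| Bot page views | Total page views | Bot share of page views |
|---|---|---|
| 2,759,711 | 1,630,572,644 | 0.17% |
| 294,558 | 41,643,593 | 0.71% |
| 3,877,924 | 252,966,728 | 1.53% |
| 505,802 | 23,709,635 | 2.13% |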
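The percentages above can be checked with simple arithmetic. The sketch below assumes that the first figure in each group is bot page views and the second is total page views; that reading is an inference, but it is consistent with the stated percentages. The `bot_percent` helper is a name introduced for this example.

```python
# (bot page views, total page views) for each report suite in the sample.
rows = [
    (2_759_711, 1_630_572_644),
    (294_558, 41_643_593),
    (3_877_924, 252_966_728),
    (505_802, 23_709_635),
]


def bot_percent(bot_views: int, total_views: int) -> float:
    """Bot page views as a percentage of total page views, to two decimals."""
    return round(100 * bot_views / total_views, 2)


for bot_views, total_views in rows:
    print(f"{bot_views:>10,} / {total_views:>13,} = {bot_percent(bot_views, total_views):.2f}%")
```

Even the highest figure in this sample is a little over 2% of page views, far below the shares reported in the CDN-based studies.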

As Bret pointed out, “Bots are looking for content. Sites with fewer pages will see less bot traffic. The report suite with the highest bot traffic [in the table above] is a travel site that has thousands of hotels, flights, cruises, etc. When a CDN looks at bot traffic, they’re looking internet-wide, so they will see bots hit pages that people never hit, such as archived content. I’ve talked with publishers who have millions of archived pages, many of which get no visitor traffic each day.” In other words, because bots indiscriminately crawl the entire Internet, the bot traffic that CDNs report is massive: bots hit a far larger range of pages and content than humans do. Perhaps a better way to put it is that, even _if_ most bots were tracked in Adobe Analytics, bot traffic to _pages and content that actually matter to you for analysis and optimization_ would be far lower than the numbers reported in these recent CDN-based studies.

Removing known bot traffic from Adobe Analytics

Despite the fact that users of digital analytics tools needn’t be concerned about the possibility that half of their traffic comes from bots, many of you will want to exclude even those bots that do execute JavaScript, to add an extra level of data cleanliness and analysis confidence. This is easy to do. You can turn on bot filtering in the Admin Console by going to Edit Settings > General > Bot Rules, as shown below.

Within the Bot Rules screen, you have a few options:

The default bot filtering in Adobe Analytics is based on the IAB bot list, which is updated monthly and compiles its list from many sources, including CDNs and major internet properties. It includes thousands of known bots including all of your favorites: Google, Bing, Gomez, etc. This list covers the overwhelming majority of use cases and needs around bot filtration. If you want to use the IAB bot list, you can simply check the box shown above, click “Save,” and you are done.

In addition to (or in place of) the IAB bot list, you can also input or upload a list of bots to be filtered out of your data set. This list can be based on user-agent string (typically the easiest way to identify a bot at first glance) and/or IP address. For example, if you have an internally created bot that crawls your site to monitor uptime, you can enter the details for that bot in the setup tool and it will be filtered out of your data from that point forward.
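Conceptually, a custom rule of this kind is a match on the user-agent string, the IP address, or both. The sketch below illustrates that idea only; it is not Adobe's implementation or rule format. The `BotRule` class, the `AcmeUptime` user agent, and the documentation-only IP range are hypothetical, and the choice to require both conditions when both are set is an assumption made for this example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotRule:
    """Hypothetical custom bot rule: user-agent substring and/or IP network."""
    name: str
    ua_contains: Optional[str] = None  # case-insensitive substring
    network: Optional[str] = None      # CIDR notation, e.g. "192.0.2.0/24"

    def matches(self, user_agent: str, ip: str) -> bool:
        # A rule with no conditions never matches.
        if self.ua_contains is None and self.network is None:
            return False
        # Assumption: when both conditions are set, both must match.
        if self.ua_contains is not None:
            if self.ua_contains.lower() not in user_agent.lower():
                return False
        if self.network is not None:
            if ipaddress.ip_address(ip) not in ipaddress.ip_network(self.network):
                return False
        return True


# An internal uptime monitor, identified by its user agent and IP range.
uptime = BotRule("uptime-monitor", ua_contains="AcmeUptime", network="192.0.2.0/24")

print(uptime.matches("AcmeUptime/1.0", "192.0.2.15"))    # monitor from the known range
print(uptime.matches("Mozilla/5.0", "192.0.2.15"))       # ordinary browser, same range
print(uptime.matches("AcmeUptime/1.0", "198.51.100.7"))  # same agent, other range
```

Only the first hit satisfies both conditions, so only that hit would be treated as the internal monitor under these assumptions.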

In keeping with the findings referenced above, we have not seen massive drops in traffic when customers have turned on bot filtering. This is not because the filtering is not working (see the following section of this post), but rather it is because bot traffic did not represent a large percentage of measurable data to begin with.

What happens when I filter bots out of my data?

Adobe Analytics still collects and processes data from bots that execute JavaScript, but rather than having them pollute your reports, they are tucked away in their own section of the tool. By default, this is located in Site Metrics > Bots.

Bots shows the various bots that were identified, such as “gomezagent” in the screen shot below, along with traffic data for each.
Bot Pages shows the different pages that bots visited on your site.

Here is what the Bots report might look like:

Data in these reports is available from the time that you enable bot filtering. Bot traffic from before that time will not appear in these reports, and it will remain mixed in with the rest of your historical data. From the point of enablement onward, detected bots are automatically excluded from segments, calculated metrics, etc.; they are fully separated from your data set into the two reports listed above. Projects in Ad Hoc Analysis or dashboards in Report Builder will not include this bot data. If you publish an audience to the Adobe Marketing Cloud for analytics-powered targeting, it will also not include bots.

Don’t fear a robot uprising

I hope it is clear that, while robots are certainly part of the Internet ecosystem, they are nothing to be afraid of when it comes to your digital and customer analysis. The actual data in your Adobe Analytics report suites should include no more than a tiny percentage of bot traffic, and that traffic can easily be filtered out of reports and segments using Bot Rules in the Admin Console. Armed with an understanding of the cause of the massive bot numbers you may be seeing in articles around the web, and of how digital analytics measurement works, you are ready to push your analysis and marketing forward, without fear of a robot uprising wreaking havoc on your data.