Stop robots and crawlers from causing errors in your Rails application
When a Rails application can’t find a record, it raises an ActiveRecord::RecordNotFound error, which is rendered to the browser as a 404 response. 404 is the standard HTTP status code meaning ‘not found’.
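For instance, a plain show action raises ActiveRecord::RecordNotFound when no record matches the requested id (a minimal sketch; ArticlesController and Article are just placeholder names):
class ArticlesController < ApplicationController
  def show
    # find raises ActiveRecord::RecordNotFound when no record has this id,
    # which Rails renders as a 404 response in production.
    @article = Article.find(params[:id])
  end
end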
When you have an Internet-facing site, various search engines will crawl it to index your pages. As you change things, certain URLs might change or cease to exist, so search engines and other crawlers can start generating a lot of ‘not found’ errors by requesting pages that used to exist.
Instead of…
…getting a bunch of unhelpful, distracting noise in your monitoring setup, or errors in your logs, when Google (or another web crawler) hits deleted public pages…
Or…
…naively swallowing all your 404 errors like this:
class ApplicationController < ActionController::Base
  rescue_from ActiveRecord::RecordNotFound, with: :not_found

  private

  def not_found(exception)
    render file: Rails.root.join('public', '404.html'),
           layout: nil,
           status: :not_found
  end
end
Use…
…the is_crawler gem and configure it like so:
app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  include IsCrawler

  rescue_from ActiveRecord::RecordNotFound, with: :not_found

  private

  def not_found(exception)
    if Rails.env.production? && is_crawler?(request.user_agent)
      render_404
    else
      raise exception
    end
  end

  def render_404
    render file: Rails.root.join('public', '404.html'),
           layout: nil,
           status: :not_found
  end
end
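If you want a regression test for this behaviour, a request spec along these lines is one way to sketch it (assuming RSpec and a hypothetical /articles/:id route; adjust the path and user agent string to suit your app):
spec/requests/crawler_404_spec.rb
require 'rails_helper'

RSpec.describe 'Crawler 404 handling', type: :request do
  it 'renders the public 404 page when a known crawler requests a missing record' do
    # Pretend we are in production so the crawler branch is exercised.
    allow(Rails.env).to receive(:production?).and_return(true)

    get '/articles/does-not-exist',
        headers: { 'User-Agent' => 'Googlebot/2.1 (+http://www.google.com/bot.html)' }

    expect(response).to have_http_status(:not_found)
  end
end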
One of my sites gets hit by many crawlers that aren’t included by default in the gem. So I add these to the list of crawlers in an initializer:
config/initializers/is_crawler.rb
{
  apple: 'Applebot',
  ahrefs: 'AhrefsBot',
  blexbot: 'BLEXBot',
  dotbot: 'DotBot',
  mailru: 'Mail.RU_Bot',
  majestic12: 'MJ12bot',
  seznam: 'SeznamBot'
}.each do |internal_name, agent_string|
  Crawler::CUSTOM << Crawler.new(internal_name, agent_string)
end
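To sanity-check that a custom entry is picked up, you can try the predicate in a Rails console against a sample user agent string (a sketch reusing the gem’s is_crawler? helper shown above; the exact string each bot sends may differ):
# In a Rails console
include IsCrawler
is_crawler?('Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)')
# => true, once the initializer above has been loaded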
But why?
If you’re trying to make your app easier to maintain, it’s important to stay on top of your errors. You have probably wired up your application to an error monitoring tool such as Rollbar, Honeybadger, Bugsnag, or Sentry.
It is tempting to just ignore all 404 errors. However, you still want to know when real users receive 404 pages, as that might indicate something important is broken.
If you cannot distinguish between the genuine issues your visitors are having and the ‘noise’ from search engines, you cannot focus on fixing real problems.
Whether or not you’re paying for your error monitoring service, you’ll burn through your credits if it receives a large volume of unnecessary errors.
Why not?
If most of your site lives behind a sign-in, making this change may not be worth the extra effort.
Last updated on April 15th, 2018