robots_txt

Parse robots.txt files


Keywords
robots-exclusion-protocol, robots-exclusion-standard, robots-parser, robots-txt, robotstxt, ruby
License
MIT
Install
gem install robots_txt -v 0.9.2

Documentation

robots_txt

The robots_txt gem parses robots.txt files.

Features

  • Parses robots.txt (doh).
  • Lax parsing rules: it will get the most data out of almost any kind of input.
  • You can pass it a string or a file/IO object that responds to #read (see the sketch after this list).
  • Works with any kind of whitespace and newline characters.
  • Works with files with invalid UTF-8 characters.
  • Merges groups with the same user agent.
  • Crawl-delay line support.
  • Sitemap line support.
  • Host extension support.
  • Clean-param extension support.
  • Unknown extensions support via RobotsTxt#extensions.
  • Get invalid/unknown lines via RobotsTxt#invalid_lines.
  • Allows limiting input file size.
  • Allows limiting line length.
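For instance, both of these forms should work (a minimal sketch; the rules and file name are placeholders):

# Parse from a string...
robots_txt = RobotsTxt.parse("User-agent: *\nDisallow: /private\n")

# ...or from anything that responds to #read, such as a File object.
robots_txt = RobotsTxt.parse(File.open("robots.txt"))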

Example use

You decide how to obtain the robots.txt file; this gem only parses it.

response = fetch_robots_txt
robots_txt = RobotsTxt.parse(response)
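For example, fetch_robots_txt could be implemented with Ruby's standard library (a sketch only, not part of this gem; the URL is a placeholder):

require "net/http"

def fetch_robots_txt
  # Returns the response body as a String.
  Net::HTTP.get(URI("https://example.com/robots.txt"))
end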

Optionally set your user agent; the default is *.

robots_txt.user_agents # an array of user agent strings
robots_txt.user_agent = "Yourbot"

Perform checks

robots_txt.allow?("/path") # true or false
robots_txt.disallow?("/another/path")
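Putting the two together (a sketch; the sample rules and bot name are made up, and the expected results follow standard robots.txt semantics):

input = <<~ROBOTS
  User-agent: Yourbot
  Disallow: /private

  User-agent: *
  Disallow: /
ROBOTS

robots_txt = RobotsTxt.parse(input)
robots_txt.user_agent = "Yourbot"

robots_txt.allow?("/public/page")     # expected: true
robots_txt.disallow?("/private/page") # expected: true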

Various extensions are supported

robots_txt.crawl_delay   # A number or nil if not set
robots_txt.sitemaps      # An array of sitemap strings
robots_txt.extensions    # A hash of unknown extension field/values
robots_txt.invalid_lines # Array of lines that are not valid rules/instructions

robots_txt.host          # Host string or nil if not set
robots_txt.clean_params  # A set of clean param objects
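As a sketch, given a file containing Crawl-delay and Sitemap lines, the accessors return values in the shapes described above (the exact values shown are assumptions):

input = "User-agent: *\nCrawl-delay: 10\nSitemap: https://example.com/sitemap.xml\n"
robots_txt = RobotsTxt.parse(input)

robots_txt.crawl_delay # expected: 10
robots_txt.sitemaps    # expected: ["https://example.com/sitemap.xml"]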

Limits

This gem allows limiting:

  1. Input file size (default limit: 50 KB)
  2. Line length (default limit: 16,664 bytes)

Both limits are specified in bytes. Input size limiting works for all input types, whether the input is a string or a readable IO object.

Example:

file = File.open("robots.txt")
robots_txt = RobotsTxt.parse(file, file_limit: 1024, line_limit: 128)
# the usual

If a limit is reached, the input is silently truncated at that point; no warning or exception is raised.
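For example, rules past the limit are simply never seen by the parser (a sketch of the behavior described above; the exact cut-off point is an assumption):

long_input = "User-agent: *\n" + ("Disallow: /private\n" * 10_000)
robots_txt = RobotsTxt.parse(long_input, file_limit: 1024)
# Only the first 1024 bytes are parsed; everything after that is dropped silently.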

To remove the default limits, set the options to nil:

robots_txt = RobotsTxt.parse(input, file_limit: nil, line_limit: nil)

Unable to fetch robots.txt

The robots.txt RFC specifies what should happen in a couple of scenarios where the robots.txt file cannot be accessed.

This gem has two classes that help with implementing that:

  • RobotsTxt::FullAllow
  • RobotsTxt::FullDisallow

Example uses:

# Treat a missing or forbidden robots.txt (4xx) as allow-everything.
# HTTP4xx stands in for your HTTP client's error class.
def robots_txt
  response = fetch_robots_txt
  RobotsTxt.parse(response.read)
rescue HTTP4xx
  RobotsTxt::FullAllow.new
end

# Treat a server error (5xx) as disallow-everything.
def robots_txt
  response = fetch_robots_txt
  RobotsTxt.parse(response.read)
rescue HTTP5xx
  RobotsTxt::FullDisallow.new
end

Maintenance

This project is maintained actively, but on a slow schedule. Due to the author's current life obligations, it is likely that:

  • No support questions will be answered.
  • No issues or new features will be worked on.

If you want to help, you can submit a small (e.g. 5 lines of code), focused PR. PRs containing big changes or new, unasked-for features take a lot of time to review, and I often don't get to them.

Development

  • bin/rake generate to compile rexical and racc files.
  • git config diff.nodiff.command /usr/bin/true to exclude compiled *.rex.rb and *.tab.rb files from git diff in the current repo.

LICENSE

MIT

You have full permission to fork, copy, and do whatever you want with this code.