# robots_txt

A Ruby gem for parsing `robots.txt` files.
## Features

- Parses `robots.txt` (doh).
- Lax parsing rules: it will get the most data out of almost any kind of input.
- You can pass it a string or a file/IO object that responds to `#read`.
- Works with any kind of whitespace and newline characters.
- Works with files with invalid UTF-8 characters.
- Merges groups with the same user agent.
- `crawl-delay` line support.
- `sitemap` line support.
- `host` extension support.
- `clean-param` extension support.
- Unknown extensions support via `RobotsTxt#extensions`.
- Get invalid/unknown lines via `RobotsTxt#invalid_lines`.
- Allows limiting input file size.
- Allows limiting line length.
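As a quick illustration of the string/IO point above (a sketch; the inline rules are made up, and the require path is assumed to be `robots_txt`):

```ruby
require "robots_txt"

# Parse from a string...
RobotsTxt.parse("User-agent: *\nDisallow: /private\n")

# ...or from anything that responds to #read, such as a File.
RobotsTxt.parse(File.open("robots.txt"))
```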
## Example use

A user chooses how they want to obtain the `robots.txt` file:

```ruby
response = fetch_robots_txt
robots_txt = RobotsTxt.parse(response)
```
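For example, `fetch_robots_txt` above could be a plain `Net::HTTP` call (a sketch; the helper and URL are not part of the gem):

```ruby
require "net/http"

# Hypothetical fetcher: downloads robots.txt for a host and
# returns the body as a string.
def fetch_robots_txt
  Net::HTTP.get_response(URI("https://example.com/robots.txt")).body
end

robots_txt = RobotsTxt.parse(fetch_robots_txt)
```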
Optionally set your user agent; the default is `*`:

```ruby
robots_txt.user_agents # an array of user agent strings
robots_txt.user_agent = "Yourbot"
```
Perform checks:

```ruby
robots_txt.allow?("/path") # true or false
robots_txt.disallow?("/another/path")
```
Various extensions are supported:

```ruby
robots_txt.crawl_delay   # a number, or nil if not set
robots_txt.sitemaps      # an array of sitemap strings
robots_txt.extensions    # a hash of unknown extension fields/values
robots_txt.invalid_lines # an array of lines that are not valid rules/instructions
robots_txt.host          # the host string, or nil if not set
robots_txt.clean_params  # a set of clean-param objects
```
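Putting these together, a polite crawler loop might look like this (a sketch; `paths` and `fetch` are hypothetical stand-ins for your own URL list and HTTP call):

```ruby
paths = ["/path", "/another/path"]
delay = robots_txt.crawl_delay

paths.each do |path|
  next unless robots_txt.allow?(path)

  fetch(path)            # hypothetical HTTP request
  sleep(delay) if delay  # honor crawl-delay when one is set
end
```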
## Limits

This gem allows limiting:

- Input file size (default limit 50 KB)
- Line length (default limit 16,664 bytes)

Both limits are specified in bytes. Input size limiting works for all input types, whether the input is a string or a readable IO object.
Example:

```ruby
file = File.open("robots.txt")
robots_txt = RobotsTxt.parse(file, file_limit: 1024, line_limit: 128)
# the usual
```
If a limit is reached, the value is silently cut off (no warning or exception is raised).

To remove the default limits, set the options to `nil`:

```ruby
robots_txt = RobotsTxt.parse(input, file_limit: nil, line_limit: nil)
```
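To see the truncation behavior in isolation (the rule strings here are made up; only the cut-off is the point):

```ruby
# A 1 KB file_limit on a larger input: everything past the first
# 1024 bytes is dropped silently, with no exception raised.
big_input = "User-agent: *\n" + ("Disallow: /some/path\n" * 1_000)
robots_txt = RobotsTxt.parse(big_input, file_limit: 1024)
```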
## Unable to fetch robots.txt

The robots.txt RFC specifies what happens in a couple of scenarios when the `robots.txt` file cannot be accessed (see the RFC for details).

This gem has two classes that help with implementing that:
- `RobotsTxt::FullAllow`
- `RobotsTxt::FullDisallow`
Example uses:

```ruby
def robots_txt
  response = fetch_robots_txt
  RobotsTxt.parse(response.read)
rescue HTTP4xx
  RobotsTxt::FullAllow.new
end
```

```ruby
def robots_txt
  response = fetch_robots_txt
  RobotsTxt.parse(response.read)
rescue HTTP5xx
  RobotsTxt::FullDisallow.new
end
```
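The two can also be combined in one method, on the assumption that the fallback objects answer the same queries as a parsed file (a sketch; `HTTP4xx`/`HTTP5xx` stand in for your HTTP client's error classes):

```ruby
def robots_txt
  response = fetch_robots_txt
  RobotsTxt.parse(response.read)
rescue HTTP4xx
  RobotsTxt::FullAllow.new    # act as if everything is allowed
rescue HTTP5xx
  RobotsTxt::FullDisallow.new # act as if everything is disallowed
end

robots_txt.allow?("/path") # works regardless of which branch ran
```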
## Maintenance

This project is actively maintained, but on a slow schedule. Due to the author's current life obligations, it is likely that:

- No support questions will be answered.
- No issues or new features will be worked on.

If you want to help, you can submit a small (e.g. 5 lines of code) and focused PR. PRs containing big changes or new, unasked-for features require a lot of time to review, and I often don't get to those.
## Development

- `bin/rake generate` to compile the rexical and racc files.
- `git config diff.nodiff.command /usr/bin/true` to exclude the compiled `*.rex.rb` and `*.tab.rb` files from `git diff` in the current repo.
## LICENSE

You have full permission to fork, copy, and do whatever you want with this code.