robots-txt-parser
RobotsTxtParser — PHP class for parsing all the directives of robots.txt files.
RobotsTxtValidator — PHP class for checking whether a URL is allowed or disallowed according to robots.txt rules.
Try the RobotsTxtParser demo online on live domains.
Parsing is carried out according to the rules of the Google and Yandex specifications:
Latest improvements:
- Parse the Clean-param directive according to the clean-param syntax.
- Delete comments (everything following the '#' character, up to the first line break, is disregarded).
- Improved parsing of the Host directive: Host is a cross-section directive and should be attributed to the '*' user-agent; if there are multiple Host directives, search engines take the value of the first one (see the sketch after this list).
- Removed unused methods from the class, refactored the code, and corrected the visibility of class properties.
- Added more test cases, including tests covering all of the new functionality.
- Added the RobotsTxtValidator class to check whether a URL is allowed to be crawled.
- Version 2.0 significantly improves the parsing speed of RobotsTxtParser.
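The comment stripping and first-Host-wins behaviour can be seen directly in the parsed rules. A minimal sketch (the input is made up for illustration; the output shape follows the getRules() example later in this README):

$parser = new RobotsTxtParser("
User-agent: *        # applies to every crawler
Disallow: /private   # the trailing comment is removed
Host: example.com
Host: ignored.example.com
");

var_dump($parser->getRules());
// Expected (simplified): ['*' => ['disallow' => ['/private'], 'host' => 'example.com']]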
Supported Directives:
- DIRECTIVE_ALLOW = 'allow';
- DIRECTIVE_DISALLOW = 'disallow';
- DIRECTIVE_HOST = 'host';
- DIRECTIVE_SITEMAP = 'sitemap';
- DIRECTIVE_USERAGENT = 'user-agent';
- DIRECTIVE_CRAWL_DELAY = 'crawl-delay';
- DIRECTIVE_CLEAN_PARAM = 'clean-param';
- DIRECTIVE_NOINDEX = 'noindex';
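The directive names above are defined as constants on RobotsTxtParser; assuming they are publicly accessible, they can be used instead of hard-coded strings when reading the parsed rules. A small sketch:

$parser = new RobotsTxtParser("User-agent: *\nDisallow: /tmp");
$rules = $parser->getRules();

// Assumes the DIRECTIVE_* constants are public class constants.
$disallowed = $rules['*'][RobotsTxtParser::DIRECTIVE_DISALLOW] ?? [];
var_dump($disallowed); // ["/tmp"]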
Installation
Install the latest version with
$ composer require bopoda/robots-txt-parser
Run tests
Run the PHPUnit tests using the command
$ php vendor/bin/phpunit
Usage example
You can start the parser by getting the content of a robots.txt file from a website:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());
Or simply pass the contents of the file as input (i.e. when the content is already cached):
$parser = new RobotsTxtParser("
User-Agent: *
Disallow: /ajax
Disallow: /search
Clean-param: param1 /path/file.php
User-agent: Yahoo
Disallow: /
Host: example.com
Host: example2.com
");
var_dump($parser->getRules());
This will output:
array(2) {
["*"]=>
array(3) {
["disallow"]=>
array(2) {
[0]=>
string(5) "/ajax"
[1]=>
string(7) "/search"
}
["clean-param"]=>
array(1) {
[0]=>
string(21) "param1 /path/file.php"
}
["host"]=>
string(11) "example.com"
}
["yahoo"]=>
array(1) {
["disallow"]=>
array(1) {
[0]=>
string(1) "/"
}
}
}
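The result is a plain nested array keyed by lowercased user-agent, so specific rules can be read directly. Continuing the example above (a small illustrative sketch):

$rules = $parser->getRules();

// Rules are grouped by lowercased user-agent name.
$cleanParam   = $rules['*']['clean-param'] ?? [];
$yahooBlocked = $rules['yahoo']['disallow'] ?? [];

print_r($cleanParam);   // contains "param1 /path/file.php"
print_r($yahooBlocked); // contains "/"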
To validate a URL, use the RobotsTxtValidator class:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());
$url = '/';
$userAgent = 'MyAwesomeBot';
if ($validator->isUrlAllow($url, $userAgent)) {
// Crawl the site URL and do nice stuff
}
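A typical crawler loop filters candidate URLs through the validator before fetching them. A minimal sketch, where the URL list and the fetching step are placeholders:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());

$userAgent = 'MyAwesomeBot';
$candidateUrls = ['/', '/ajax/live', '/search?q=php']; // hypothetical URLs

foreach ($candidateUrls as $url) {
    if (!$validator->isUrlAllow($url, $userAgent)) {
        continue; // skip URLs disallowed for this user-agent
    }
    // Fetch and process the allowed URL here.
}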
Contribution
Feel free to create a PR in this repository. Please follow the PSR coding style.
See the list of contributors who participated in this project.
Final Notes:
Please use version 2.0+, which follows the same rules but delivers significantly better performance.