crawl-domain
Crawl to discover all paths under a given URL domain.
Example
import crawl from 'crawl-domain';
for await (const url of crawl('http://localhost:3000')) {
  console.log(url);
}
// → http://localhost:3000/
// → http://localhost:3000/b
// → http://localhost:3000/c
// → http://localhost:3000/d
Installation
crawl-domain is authored as an ES module, and therefore requires Node 12.0 or newer.
Install using NPM or Yarn:
npm install crawl-domain
yarn add crawl-domain
Usage
When called, the default export returns a Node readable stream, which can be iterated asynchronously with for await to operate on crawled links as soon as they're discovered:
import crawl from 'crawl-domain';
for await (const url of crawl('http://localhost:3000')) {
  console.log(url);
}
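Because the stream is consumed with for await, a crawl can also be cut short once enough links have been seen. The following is a minimal sketch, not part of crawl-domain: takeFirst is a hypothetical helper that collects at most limit items from any async iterable, and fakeCrawl is a stand-in generator used in place of crawl('http://localhost:3000') so the snippet runs without a server.

```javascript
// Hypothetical helper (not part of crawl-domain): collect at most `limit`
// items from any async iterable, then stop iterating.
async function takeFirst(iterable, limit) {
  const items = [];
  if (limit <= 0) return items;
  for await (const item of iterable) {
    items.push(item);
    // Leaving the loop early ends iteration of the source.
    if (items.length >= limit) break;
  }
  return items;
}

// Stand-in for crawl('http://localhost:3000'); the URLs are illustrative.
async function* fakeCrawl() {
  yield 'http://localhost:3000/';
  yield 'http://localhost:3000/b';
  yield 'http://localhost:3000/c';
}

// Logs an array holding the first two URLs yielded by the stand-in.
takeFirst(fakeCrawl(), 2).then((urls) => {
  console.log(urls);
});
```

With a real crawl, breaking out of for await on a Node Readable destroys the stream by default in current Node versions, so the remaining links should not be processed.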
If you'd prefer not to use streams, Node-style callback and promise forms are also available. Both provide an array of all discovered URLs: the callback form receives it as its second argument, and the promise form (the named export promise) resolves with it:
import crawl from 'crawl-domain';
crawl('http://localhost:3000', (error, urls) => {
  console.log(urls);
});

import { promise as crawl } from 'crawl-domain';
const urls = await crawl('http://localhost:3000');
console.log(urls);
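Both forms have an error path: the callback's first argument, or a rejected promise. The sketch below shows one way to handle each. So that it runs without a server, fakeCrawl and fakeCrawlPromise are hypothetical stand-ins that mimic the callback and promise signatures. When and how crawl-domain itself reports failures is not covered here, so the simulated error is an assumption.

```javascript
// Hypothetical stand-in with the same callback shape as crawl(rootURL, callback).
// It always fails, to exercise the error path.
function fakeCrawl(rootURL, callback) {
  setTimeout(() => callback(new Error(`Could not crawl ${rootURL}`), []), 0);
}

// Hypothetical stand-in with the same shape as the `promise` export.
function fakeCrawlPromise(rootURL) {
  return new Promise((resolve, reject) => {
    fakeCrawl(rootURL, (error, urls) => (error ? reject(error) : resolve(urls)));
  });
}

// Callback form: check `error` before using `urls`.
fakeCrawl('http://localhost:3000', (error, urls) => {
  if (error) {
    console.error('Crawl failed:', error.message);
    return;
  }
  console.log(urls);
});

// Promise form: a failure surfaces as a rejection.
fakeCrawlPromise('http://localhost:3000')
  .then((urls) => console.log(urls))
  .catch((error) => console.error('Crawl failed:', error.message));
```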
Options
crawl can optionally receive an options object as the second argument.
The following options are supported:
- concurrency: HTTP client concurrency. Defaults to 10.
- timeout: HTTP client timeout, in milliseconds. Defaults to 10000.
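To illustrate how a partial options object relates to the documented defaults, here is a minimal sketch. DEFAULT_OPTIONS and resolveOptions are hypothetical names, and the merge shown is an assumption about how defaults could be applied, not crawl-domain's actual implementation. Only the default values themselves come from the list above.

```javascript
// Default values as documented above; the constant name is hypothetical.
const DEFAULT_OPTIONS = { concurrency: 10, timeout: 10000 };

// Hypothetical helper: start from the defaults and apply whatever the caller passed.
function resolveOptions(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// concurrency is overridden to 2; timeout keeps its default of 10000.
console.log(resolveOptions({ concurrency: 2 }));

// With the real package, the same object would be passed as the second argument:
// crawl('http://localhost:3000', { concurrency: 2 });
```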
API
export function stream(
  rootURL: string,
  options?: Partial<CrawlOptions> | undefined,
  callback?: CrawlCallback | undefined
): import('stream').Readable;
export const promise: CrawlPromise;
export default stream;
export type CrawlOptions = {
  concurrency: number;
  timeout: number;
};
export type CrawlCallback = (error: Error | null, urls: string[]) => void;
export type CrawlPromise = (
  rootURL: string,
  options?: CrawlOptions | undefined
) => Promise<string[]>;
License
Copyright 2020 Andrew Duthie
Released under the MIT License. See LICENSE.md.