Sitemapper - A powerful XML sitemap parser for Node.js

📋 Overview

Sitemapper is a Node.js module that makes it easy to parse XML sitemaps. It supports single sitemaps, sitemap indexes with multiple sitemaps, and various sitemap formats including image and video sitemaps.

🚀 Installation

# Using npm
npm install sitemapper --save

# Using yarn
yarn add sitemapper

# Using pnpm
pnpm add sitemapper

🏃‍♂️ Quick Start

Module Usage

import Sitemapper from 'sitemapper';

const sitemap = new Sitemapper({
  timeout: 10000, // 10 second timeout
});

sitemap
  .fetch('https://gosla.sh/sitemap.xml')
  .then(({ sites }) => {
    console.log('Sites: ', sites);
  })
  .catch((error) => console.error(error));

CLI Usage

You can also use Sitemapper directly from the command line:

# Using npx
npx sitemapper https://gosla.sh/sitemap.xml

💻 Examples

Promise Example

import Sitemapper from 'sitemapper';

const sitemap = new Sitemapper();

sitemap
  .fetch('https://wp.seantburke.com/sitemap.xml')
  .then(({ url, sites }) => {
    console.log(`Sitemap URL: ${url}`);
    console.log(`Found ${sites.length} URLs`);
    console.log(sites);
  })
  .catch((error) => console.error(error));

Async/Await Example

import Sitemapper from 'sitemapper';

async function parseSitemap() {
  const sitemapper = new Sitemapper({
    url: 'https://www.google.com/work/sitemap.xml',
    timeout: 15000, // 15 seconds
    concurrency: 10,
  });

  try {
    const { sites } = await sitemapper.fetch();
    console.log(`Found ${sites.length} URLs in the sitemap`);
    console.log(sites);
  } catch (error) {
    console.error('Error fetching sitemap:', error);
  }
}

parseSitemap();

Advanced Example with Proxy

import Sitemapper from 'sitemapper';
import { HttpsProxyAgent } from 'hpagent';

const sitemapper = new Sitemapper({
  url: 'https://gosla.sh/sitemap.xml',
  timeout: 30000,
  concurrency: 5,
  retries: 2,
  debug: true,
  proxyAgent: new HttpsProxyAgent({
    proxy: 'http://localhost:8080',
  }),
  requestHeaders: {
    'User-Agent': 'Mozilla/5.0 (compatible; SitemapperBot/1.0)',
  },
  fields: {
    loc: true,
    lastmod: true,
    sitemap: true,
  },
});

sitemapper
  .fetch()
  .then(({ sites }) => console.log(sites))
  .catch((error) => console.error(error));

⚙️ Configuration Options

Sitemapper can be customized with the following options:

Option	Type	Default	Description
`url`	String	`undefined`	The URL of the sitemap to parse
`timeout`	Number	`15000`	Maximum timeout in milliseconds for each request
`concurrency`	Number	`10`	Maximum number of concurrent requests when crawling multiple sitemaps
`retries`	Number	`0`	Number of retry attempts for failed requests
`debug`	Boolean	`false`	Enable debug logging
`rejectUnauthorized`	Boolean	`true`	Reject invalid SSL certificates (like self-signed or expired)
`requestHeaders`	Object	`{}`	Additional HTTP headers to include with requests
`lastmod`	Number	`undefined`	Only return URLs whose `lastmod` timestamp is newer than this value
`proxyAgent`	HttpProxyAgent | HttpsProxyAgent	`undefined`	Agent instance from the `hpagent` package, used for proxy support
`exclusions`	Array<RegExp>	`[]`	Array of regex patterns to exclude URLs from results
`fields`	Object	`undefined`	Specify which fields to include in the results (see below)
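
To make the `exclusions` option more concrete, here is a minimal sketch of the filtering it describes. It is not Sitemapper's internal implementation: the helper name `applyExclusions`, the `example.com` URLs, and the rule that a URL is dropped when any pattern matches are assumptions made for illustration.

```javascript
// Illustrative only: mimics the documented effect of `exclusions`,
// not Sitemapper's internal implementation.
// Assumption: a URL is dropped when ANY pattern matches it.
function applyExclusions(urls, exclusions = []) {
  return urls.filter(
    (url) =>
      !exclusions.some((pattern) => {
        pattern.lastIndex = 0; // keep /g and /y patterns stateless across calls
        return pattern.test(url);
      })
  );
}

// Hypothetical URLs, used only for this example.
const urls = [
  'https://example.com/blog/post-1',
  'https://example.com/v1/legacy',
  'https://example.com/tag/news',
];

const kept = applyExclusions(urls, [/\/v1\//, /\/tag\//]);
console.log(kept); // only the /blog/post-1 URL remains
```

With the library itself, the same patterns would be passed as `exclusions: [/\/v1\//, /\/tag\//]` in the constructor options.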

Available Fields

Important: When using the fields option, the return format changes from an array of URL strings to an array of objects containing your selected fields.

For the fields option, specify which fields to include by setting them to true:

Field	Description
`loc`	URL location of the page
`sitemap`	URL of the sitemap containing this URL (useful for sitemap indexes)
`lastmod`	Date of last modification
`changefreq`	How frequently the page is likely to change
`priority`	Priority of this URL relative to other URLs
`image:loc`	URL location of the image (for image sitemaps)
`image:title`	Title of the image (for image sitemaps)
`image:caption`	Caption of the image (for image sitemaps)
`video:title`	Title of the video (for video sitemaps)
`video:description`	Description of the video (for video sitemaps)
`video:thumbnail_loc`	Thumbnail URL of the video (for video sitemaps)

Example Default Output (without fields)

// Returns an array of URL strings
[
  'https://wp.seantburke.com/?p=234',
  'https://wp.seantburke.com/?p=231',
  'https://wp.seantburke.com/?p=185',
];

Example Output with Fields

// Returns an array of objects
[
  {
    loc: 'https://wp.seantburke.com/?p=234',
    lastmod: '2015-07-03T02:05:55+00:00',
    priority: 0.8,
  },
  {
    loc: 'https://wp.seantburke.com/?p=231',
    lastmod: '2015-07-03T01:47:29+00:00',
    priority: 0.8,
  },
];
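
Since the return shape depends on whether `fields` is set, downstream code may need to handle both. The following is a small sketch of post-processing helpers. `toUrls` and `modifiedSince` are hypothetical names, not part of Sitemapper, and dropping entries without a parseable `lastmod` is an assumption of this example.

```javascript
// Hypothetical helpers (not part of Sitemapper) for post-processing `sites`.

// Accepts either return shape: plain URL strings (no `fields`) or
// objects with a `loc` property (with `fields.loc` enabled).
function toUrls(sites) {
  return sites
    .map((site) => (typeof site === 'string' ? site : site.loc))
    .filter((url) => typeof url === 'string');
}

// Keeps entries whose `lastmod` parses to a time after `cutoff`.
// Entries without a parseable `lastmod` are dropped (an assumption of this sketch).
function modifiedSince(sites, cutoff) {
  const cutoffMs = cutoff.getTime();
  return sites.filter((site) => {
    const ms = Date.parse(site.lastmod);
    return !Number.isNaN(ms) && ms > cutoffMs;
  });
}

// Sample data taken from the output shown above.
const sites = [
  { loc: 'https://wp.seantburke.com/?p=234', lastmod: '2015-07-03T02:05:55+00:00', priority: 0.8 },
  { loc: 'https://wp.seantburke.com/?p=231', lastmod: '2015-07-03T01:47:29+00:00', priority: 0.8 },
];

const recent = modifiedSince(sites, new Date('2015-07-03T02:00:00Z'));
console.log(toUrls(recent)); // [ 'https://wp.seantburke.com/?p=234' ]
```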

🧩 CLI Usage

Sitemapper includes a simple CLI tool for basic sitemap parsing directly from the command line:

npx sitemapper <sitemap-url>

Example

npx sitemapper https://gosla.sh/sitemap.xml

Output

The CLI will display the sitemap URL and list all URLs found in the sitemap:

Sitemap URL: https://gosla.sh/sitemap.xml

Found URLs:
1. https://gosla.sh/page1
2. https://gosla.sh/page2
3. https://gosla.sh/page3
...

CLI Options

Currently, the CLI supports the --timeout parameter to set the request timeout in milliseconds:

npx sitemapper https://gosla.sh/sitemap.xml --timeout=5000

Note: The CLI implementation is basic and does not yet support all options available in the JavaScript API. More advanced features like fields filtering, concurrency control, and different output formats require using the JavaScript API directly.
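
As one way to get other output formats through the JavaScript API, the sketch below formats a `sites` array as plain lines, a numbered list, or JSON. `formatSites` and its format names are hypothetical and not part of Sitemapper or its CLI. In a real script, `sites` would come from `await sitemapper.fetch()`.

```javascript
// Hypothetical helper (not part of Sitemapper or its CLI).
function formatSites(sites, format = 'lines') {
  // Support both return shapes: URL strings or objects with `loc`.
  const urls = sites.map((site) => (typeof site === 'string' ? site : site.loc));
  switch (format) {
    case 'json':
      return JSON.stringify(sites, null, 2);
    case 'numbered': // mirrors the numbered list the CLI prints
      return urls.map((url, i) => `${i + 1}. ${url}`).join('\n');
    case 'lines':
      return urls.join('\n');
    default:
      throw new Error(`Unsupported format: ${format}`);
  }
}

// Placeholder URLs in the style of the CLI example above.
const sample = ['https://gosla.sh/page1', 'https://gosla.sh/page2'];
console.log(formatSites(sample, 'numbered'));
```

The returned string can be written to a file or redirected from the shell as needed.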

🤝 Contributing

Contributions from experienced engineers are highly valued. When contributing, please consider:

Guidelines

- Maintain backward compatibility where possible
- Consider performance implications, particularly for large sitemaps
- Add TypeScript types
- Add tests for your change
- Update documentation and examples
- Check for typos
- Code should pass the ESLint, Prettier, spell-check, and TypeScript checks
- Try not to bloat the main dependencies with new packages; dev dependencies are fine
- If adding packages, run npm install with the latest npm version to update package-lock.json

Pull Request Process

- PRs should be focused on a single concern or feature
- Include sufficient context in the PR description
- Reference any relevant issues
- Run npm test locally to verify that your changes pass the tests
  - Tests occasionally fail because they reference real-world sitemaps; if that happens, try running them again.
- PRs do not run GitHub Actions by default; @seantomburke needs to trigger them manually

For substantial changes, consider opening an issue for discussion before implementation.