scrape-html-web

Extract content from a static HTML website.

Supports ESM and CJS. Requires Node >=16.

Unlike Puppeteer, for example, installing Scrape HTML Web does not download a copy of Chromium. This keeps the library fast and light.

Access to every website is not guaranteed; it may depend on the permissions the site grants.

The library is asynchronous.

Note: the library relies on two dependencies:

axios - retrieves the web page
cheerio - scrapes the content based on the selectors you provide

Installation

To use Scrape HTML Web in your project, run:

npm install scrape-html-web
or
yarn add scrape-html-web

Usage

import { scrapeHtmlWeb } from "scrape-html-web";
//or
const { scrapeHtmlWeb } = require("scrape-html-web");

//example
const options = {
  url: "https://nodejs.org/en/blog/",
  bypassCors: true, // avoids CORS errors when running in ESM
  mainSelector: ".blog-index",
  childrenSelector: [
    { key: "date", selector: "time", type: "text" },
    // by default, the first option taken into consideration is attr
    { key: "version", selector: "a", type: "text" },
    { key: "link", selector: "a", attr: "href" },
  ],
};

(async () => {
  const data = await scrapeHtmlWeb(options);
  console.log(data);
})();
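
Because access to a site is not guaranteed, it can help to wrap the call so that a failed request does not crash the script. The sketch below assumes that scrapeHtmlWeb rejects its promise when the page cannot be fetched or parsed, which this documentation does not state. safeScrape is a hypothetical helper, not part of the library; the scrape function is passed in as an argument so the wrapper does not depend on how the library is imported.

```javascript
// Hypothetical helper: runs a scrape function and never throws.
// Assumption: the scrape function returns a promise that rejects on failure.
async function safeScrape(scrapeFn, options) {
  try {
    const data = await scrapeFn(options);
    return { ok: true, data };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, error: message };
  }
}

// Usage, with scrapeHtmlWeb imported as shown above:
// const result = await safeScrape(scrapeHtmlWeb, options);
// if (!result.ok) console.error("scrape failed:", result.error);
```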

Response

//Example response

[
  {
    date: "04 Nov",
    version: "Node v18.12.1 (LTS)",
    link: "/en/blog/release/v18.12.1/",
  },
  {
    date: "04 Nov",
    version: "Node v19.0.1 (Current)",
    link: "/en/blog/release/v19.0.1/",
  },
  // ...
  {
    date: "11 Jan",
    version: "Node v17.3.1 (Current)",
    link: "/en/blog/release/v17.3.1/",
  },
  {
    date: "11 Jan",
    version: "Node v12.22.9 (LTS)",
    link: "/en/blog/release/v12.22.9/",
  },
];
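
The link values in the response are relative paths. If absolute URLs are needed, one possible post-processing step is to resolve each link against the page URL with the standard URL class. absolutizeLinks is a hypothetical helper name, and the sample array simply repeats one entry from the example response.

```javascript
// Hypothetical post-processing step: turn relative "link" values into absolute URLs.
function absolutizeLinks(items, baseUrl) {
  return items.map((item) => ({
    ...item,
    link: new URL(item.link, baseUrl).href,
  }));
}

const sample = [
  {
    date: "04 Nov",
    version: "Node v18.12.1 (LTS)",
    link: "/en/blog/release/v18.12.1/",
  },
];

console.log(absolutizeLinks(sample, "https://nodejs.org/en/blog/"));
// link becomes "https://nodejs.org/en/blog/release/v18.12.1/"
```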

options

url - URL of the site to scrape (required)
bypassCors - URL used to bypass CORS errors in ESM
mainSelector - the main selector where scraping starts (required)
list - whether to iterate through the list of elements under mainSelector; default is true (not required)
childrenSelector - an array of parameters that define the object you expect to receive (required)
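
The required and optional fields above can be checked before calling the library. The following is only a sketch: checkOptions is a hypothetical helper, and the library itself may validate its input differently or not at all. It treats url, mainSelector and childrenSelector as required and fills in the documented default for list.

```javascript
// Hypothetical pre-flight check; not part of scrape-html-web.
function checkOptions(options) {
  const required = ["url", "mainSelector", "childrenSelector"];
  const missing = required.filter((name) => options[name] === undefined);
  if (missing.length > 0) {
    throw new Error("Missing required option(s): " + missing.join(", "));
  }
  if (!Array.isArray(options.childrenSelector)) {
    throw new TypeError("childrenSelector must be an array");
  }
  // list is optional; the documented default is true.
  return { ...options, list: options.list ?? true };
}

const checked = checkOptions({
  url: "https://nodejs.org/en/blog/",
  mainSelector: ".blog-index",
  childrenSelector: [{ key: "date", selector: "time", type: "text" }],
});
console.log(checked.list); // true
```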

url

const options = {
  url: "https://nodejs.org/en/blog/", // URL from which you want to extract the data
  ...
};

bypassCors

const options = {
  bypassCors: {
    customURI: "https://api.allorigins.win/get?url=",
    paramExstract: "contents",
  }, // bypass cors error in ESM
  ...
};

// or, to use the default URL:
const options = {
  bypassCors: true,
  ...
};

You can use the default URL or a custom one.

If you pass true, the default URL is used: https://api.allorigins.win/get?url=

You can also pass an object describing a custom URL, with the following parameters:

- customURI: the custom URL (required)

- paramExstract: the property to extract from the response returned by the call (not required)
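
One way to picture the two forms of bypassCors is as a normalization step followed by a URL build. The sketch below is an assumption about the behaviour, not the library's code: resolveBypass, buildRequestUrl and extractBody are hypothetical names, percent-encoding the target URL before appending it is a guess, and using "contents" as the extraction property for the default URL is inferred from the custom example above.

```javascript
const DEFAULT_PROXY = "https://api.allorigins.win/get?url=";

// Normalize bypassCors: true -> default proxy, object -> custom proxy, falsy -> none.
function resolveBypass(bypassCors) {
  if (!bypassCors) return null;
  if (bypassCors === true) {
    // Assumption: the default proxy wraps the page in a "contents" property.
    return { customURI: DEFAULT_PROXY, paramExstract: "contents" };
  }
  return {
    customURI: bypassCors.customURI,
    paramExstract: bypassCors.paramExstract,
  };
}

// Build the URL that is actually requested. Encoding the target is an assumption.
function buildRequestUrl(url, bypassCors) {
  const proxy = resolveBypass(bypassCors);
  return proxy ? proxy.customURI + encodeURIComponent(url) : url;
}

// Pull the HTML out of the proxy response; without paramExstract the body is used as-is.
function extractBody(responseData, bypassCors) {
  const proxy = resolveBypass(bypassCors);
  return proxy && proxy.paramExstract
    ? responseData[proxy.paramExstract]
    : responseData;
}

console.log(buildRequestUrl("https://nodejs.org/en/blog/", true));
```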

mainSelector

const options = {
   ...
   mainSelector: ".blog-index", // the parent selector where you want to start from
   ...
};

Example of the HTML to extract:

<ul class="blog-index">
    <li>
      <time datetime="2022-11-04T22:34:29+0000">04 Nov</time>
      <a href="/en/blog/release/v18.12.1/">Node v18.12.1 (LTS)</a>
    </li>
    <li>
      <time datetime="2022-11-04T18:05:19+0000">04 Nov</time>
      <a href="/en/blog/release/v19.0.1/">Node v19.0.1 (Current)</a>
    </li>
</ul>

list

const options = {
  ...
  list: true, // or false
  // if false, it loops only once over the parent element
  // if true, it loops through all elements below the parent element
  ...
};

childrenSelector

const options = {
  ...
  childrenSelector: [
    { key: "date", selector: "time", type: "text" },
    { key: "version", selector: "a", type: "text" },
    { key: "link", selector: "a", attr: "href" },
  ],
};

key: the name of the key in the resulting object (required)
selector: the selector searched for in the HTML contained by the parent (required)
attr: the attribute whose value you want to get (not required)

Some of the more common attributes are: [ className, tagName, id, href, title, rel, src, style ]
type: the type of value to obtain (not required, default: "text")

Possible values: [ text, html ]

optional
replace: replaces text or html inside a selector. It accepts either a RegExp or a custom function (not required)
canBeEmpty: allows the value of an element to be left blank; false by default (not required)

Example: { key: "title", selector: ".title", type: "text", canBeEmpty: true } returns {title: ''} if the text in the selector is empty.

replace

const options = {
  url: "https://nodejs.org/en/blog/",
  mainSelector: ".blog-index",
  childrenSelector: [
    {
      key: "date",
      selector: "time",
      type: "text",
      replace: (text) => text + " 2022",
      /* A custom function that appends the text
      " 2022" to the date taken from the selector */
    },
    {
      key: "version",
      selector: "a",
      type: "html",
      replace: /[{()}]/g,
      /* A regex that removes the parentheses
      (and curly braces) from the html */
    },
    {
      key: "link",
      selector: "a",
      attr: "href",
    },
  ],
};

(async () => {
  const data = await scrapeHtmlWeb(options);

  console.log("example 2 :", data);
})();
//Example response

[
  {
    date: "04 Nov 2022",
    version: '<a href="/en/blog/release/v18.12.1/">Node v18.12.1 LTS</a>',
    link: "/en/blog/release/v18.12.1/",
  },
  {
    date: "04 Nov 2022",
    version: '<a href="/en/blog/release/v19.0.1/">Node v19.0.1 Current</a>',
    link: "/en/blog/release/v19.0.1/",
  },
  // ...
  {
    date: "11 Jan 2022",
    version: '<a href="/en/blog/release/v17.3.1/">Node v17.3.1 Current</a>',
    link: "/en/blog/release/v17.3.1/",
  },
  {
    date: "11 Jan 2022",
    version: '<a href="/en/blog/release/v12.22.9/">Node v12.22.9 LTS</a>',
    link: "/en/blog/release/v12.22.9/",
  },
];
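
The two forms of replace can be pictured as a small dispatch on the type of the option. This is a sketch of the idea, not the library's implementation: applyReplace is a hypothetical name, and treating a RegExp as "remove every match" is an assumption based on the example response above, where the parentheses disappear.

```javascript
// Hypothetical helper: apply a `replace` option to an extracted value.
function applyReplace(value, replace) {
  // A function receives the extracted value and returns the new one.
  if (typeof replace === "function") return replace(value);
  // Assumption: a RegExp removes every match from the value.
  if (replace instanceof RegExp) return value.replace(replace, "");
  // No replace option: leave the value untouched.
  return value;
}

console.log(applyReplace("04 Nov", (text) => text + " 2022"));
// prints "04 Nov 2022"
console.log(applyReplace("Node v18.12.1 (LTS)", /[{()}]/g));
// prints "Node v18.12.1 LTS"
```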

❤️ Support

If you make any profit from this or you just want to encourage me, you can offer me a coffee and I'll try to accommodate you.

Please note: 🙏

This library was created for educational purposes. It is not intended for taking information that you are not authorized to access.

scrape-html-web
Release 2.3.4
