manuasir/node-iceberg


NodeJS tree-based crawler

License: GPL-3.0

Language: JavaScript

Keywords: crawler, es6, es8, javascript, nodejs, npm, tree-structure



node-iceberg

A lightweight Node.js tree-based scraper/crawler. No more callbacks! Just async/await.

Installation

Download and install via the npm package manager (stable):

npm install node-iceberg --save

Or clone fresh code directly from git:

git clone https://github.com/manuasir/node-iceberg.git
cd node-iceberg
npm install

Usage

This package extracts filtered DOM elements from URLs through customized iterators. It works mainly in two ways:

  • Scraper mode: extracts elements using customized selectors.
// Example: download all links from a Blogspot URL. Use inside an 'async' function.
const Iceberg = require('node-iceberg')
const getThisDomains = ['mediafire', 'mega', 'adf.ly'] // domains whose links we want
const conf = {
  // Iterator: the element that leads to the next URL to process in Blogspot
  iteratorElement: {
    element: 'a',
    cssClass: 'blog-pager-older-link' // optional
  },
  // Desired data to extract from the DOM. Example: download links
  selector: {
    element: 'a',
    attrib: 'href',
    values: getThisDomains // optional
  }
}
// Maximum depth to explore: max number of blog pages
const maxLevelDepth = 10
const scraper = new Iceberg('http://someblog.blogspot.com')
const results = await scraper.start(maxLevelDepth, conf)

Or load a predefined filter:

const scraper = new Iceberg('http://someblog.blogspot.com')
const conf = scraper.service('blogspot')
const results = await scraper.start(maxLevelDepth, conf)
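
Note that start() returns a promise, so the await calls above must run inside an 'async' function. A minimal wrapper, as a sketch (the example URL and depth are placeholders):

const Iceberg = require('node-iceberg')

async function downloadLinks () {
  const scraper = new Iceberg('http://someblog.blogspot.com')
  const conf = scraper.service('blogspot')
  const maxLevelDepth = 10
  return scraper.start(maxLevelDepth, conf)
}

downloadLinks()
  .then((results) => console.log(results))
  .catch((err) => console.error(err))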

Some websites can be paginated directly through URL parameters, e.g. http://url?page=1,2,3... In that case, pass only the configuration object and set the maximum page inside it:

// Example: get insecure cameras from insecam.org
const Iceberg = require('node-iceberg')
const url = 'http://insecam.org'
const conf = {
  iteratorElement: {
    url: url,
    iterator: '?page=',
    maxPage: 5 // maximum page to explore
  },
  selector: { // elements we want to get
    element: 'img',
    attrib: 'src'
  }
}
const scraper = new Iceberg(url)
scraper.start(conf).then((results) => { console.log(results) }).catch((err) => { throw err })
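
For reference, the configuration above implies a fixed list of page URLs. A sketch of how they are formed, assuming the crawler appends the iterator string and the page number to the base URL (an assumption about the internals, not documented API):

const url = 'http://insecam.org'
const iterator = '?page='
const maxPage = 5
// Builds 'http://insecam.org?page=1' through 'http://insecam.org?page=5'
const pages = Array.from({ length: maxPage }, (_, i) => `${url}${iterator}${i + 1}`)
console.log(pages)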
  • Crawler mode: explores ALL links from a URL until the depth threshold is reached and builds a tree from the crawled data. Already-explored paths are included only once.
// Warning! This can become a very costly task.
const Iceberg = require('node-iceberg')
const crawler = new Iceberg('http://reddit.com')
const conf = crawler.service('crawler')
const maxLevelDepth = 2
crawler.start(maxLevelDepth, conf).then((results) => { console.log(results) }).catch((err) => { throw err })
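
Crawler mode returns a tree, so consuming the result usually means walking it. The node shape below ({ url, children }) is a hypothetical illustration, not the package's documented format; a generic depth-first walk might look like this:

// Hypothetical node shape: { url, children }; not documented by node-iceberg
function walk (node, visit) {
  visit(node.url)
  for (const child of node.children || []) {
    walk(child, visit)
  }
}
// Usage: walk(results, (url) => console.log(url))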

Test

If you installed this package from npm, it has already been tested. Otherwise, you can run the test suite yourself:

npm test
