bontan

Utility scripts for scraping


Keywords
scraping, html, css
License
WTFPL
Install
npm install bontan@1.0.0

Documentation

 _                 _              
| |__   ___  _ __ | |_ __ _ _ __  
| '_ \ / _ \| '_ \| __/ _` | '_ \ 
| |_) | (_) | | | | || (_| | | | |
|_.__/ \___/|_| |_|\__\__,_|_| |_|

Bontan is a simple scraper with specialized behavior for some sites (like Wikipedia) and smart fallbacks for others. This repository also includes other scraping utilities.

  • bontan: Attempt to scrape all useful text content on the page.
  • kinkan: Like bontan, but as a summary: typically the page title, the first image, and the first paragraph. ("kinkan" is Japanese for "kumquat".)
  • sudachi: Simple utility to scrape using CSS selectors.
  • mikan: A simple clone of import.io. Prints each item in the biggest list on the page.

Examples

First install it:

npm install -g bontan

bontan

For a full text version of the page:

bontan 'https://en.wikipedia.org/wiki/Pomelo'

Each image in the page is represented in the output by its src attribute.
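
To make the output shape concrete, here is a minimal sketch of "text plus image sources" extraction. It is written in Python with only the standard library and is not bontan's actual code; the names TextWithImages and full_text are made up for illustration, and skipping script and style content is an assumption.

```python
from html.parser import HTMLParser


class TextWithImages(HTMLParser):
    """Collect visible text and image sources in document order."""

    SKIPPED = {"script", "style"}  # assumption: never readable content

    def __init__(self):
        super().__init__()
        self.lines = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "img" and not self.skip_depth:
            src = dict(attrs).get("src")
            if src:
                self.lines.append(src)  # an image is reduced to its src

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.lines.append(text)


def full_text(html):
    """Return one output line per text node or image (hypothetical helper)."""
    parser = TextWithImages()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Each text node becomes its own line here, so inline markup such as bold text splits a sentence; a real tool would presumably group text by block element.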

kinkan

Like bontan, but with less output: typically the page title, the first image, and the first p element.

kinkan 'https://en.wikipedia.org/wiki/Pomelo'

Pomelo - Wikipedia, the free encyclopedia
/wikipedia/commons/thumb/1/1c/Citrus_grandis_-_Honey_White.jpg/220px-Citrus_grandis_-_Honey_White.jpg
Citrus maxima (or Citrus grandis), (Common names: shaddick,[1] pomelo, pummelo, pommelo, pamplemousse, or shaddok) is a natural (non-hybrid) citrus fruit, with the appearance of a big grapefruit, native to South and Southeast Asia.
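
The three-line output above (title, first image, first paragraph) can be approximated with a short sketch. This is Python with the standard library only, not kinkan's own implementation; Summary and summarize are hypothetical names, and skipping empty p elements is an assumption about what a sensible summary would do.

```python
from html.parser import HTMLParser


class Summary(HTMLParser):
    """Keep the page title, the first img src, and the first non-empty p."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.image = None
        self.paragraph = None
        self._capturing = None  # "title" or "p" while inside that element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            if not self.image:
                self.image = dict(attrs).get("src")
        elif self._capturing is None and (
            (tag == "title" and not self.title)
            or (tag == "p" and not self.paragraph)
        ):
            self._capturing = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capturing:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.paragraph = text  # stays falsy if the p was empty
            self._capturing = None


def summarize(html):
    """Return the parts that were found, in output order (hypothetical helper)."""
    parser = Summary()
    parser.feed(html)
    parser.close()
    return [part for part in (parser.title, parser.image, parser.paragraph) if part]
```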

sudachi

Grabs elements by CSS selector. It parses the page with a virtual DOM (domino) rather than a browser, which makes it comparatively fast but means it cannot see content generated by JavaScript after page load. For example, to get all the h3 elements:

sudachi 'https://en.wikipedia.org/wiki/Pomelo' h3

Possible non-hybrid pomelos[edit]
Hybrids[edit]
Personal tools
Namespaces
etc.

You can pass -r to return innerHTML instead of textContent.

sudachi -r 'https://en.wikipedia.org/wiki/Pomelo' h3

<span class="mw-headline" id="Possible_non-hybrid_pomelos">Possible non-hybrid pomelos</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Pomelo&amp;action=edit&amp;section=5" title="Edit section: Possible non-hybrid pomelos">edit</a><span class="mw-editsection-bracket">]</span></span>
<span class="mw-headline" id="Hybrids">Hybrids</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Pomelo&amp;action=edit&amp;section=6" title="Edit section: Hybrids">edit</a><span class="mw-editsection-bracket">]</span></span>
Personal tools
Namespaces
etc.
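
The difference between the two modes can be sketched as follows. This is a Python illustration using only the standard library, not sudachi's code (which uses domino); it supports only a bare tag name such as h3 rather than full CSS selectors, and TagSelector and select are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser


class TagSelector(HTMLParser):
    """Collect every outermost element with a given tag name."""

    def __init__(self, tag, raw=False):
        super().__init__()
        self.tag = tag
        self.raw = raw      # True mimics -r: keep the inner markup
        self.depth = 0      # nesting depth inside a matching element
        self.parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth and self.raw:
            self.parts.append(self.get_starttag_text())  # tag as written
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self.depth and self.raw:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []
                return
        if self.raw:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            # Text arrives decoded; re-escape it when emitting markup.
            self.parts.append(escape(data, quote=False) if self.raw else data)


def select(html, tag, raw=False):
    """Return text (default) or inner markup (raw=True) per match."""
    parser = TagSelector(tag, raw)
    parser.feed(html)
    parser.close()
    return parser.results
```

With raw=False the nested span and a elements contribute only their text, which is why the first listing shows "Hybrids[edit]" on one line.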

mikan

This attempts to replicate some of the magic of import.io using a simple trick: usually, the most interesting list on a page is the longest one. Here's what happens when you point it at Stack Overflow:

mikan 'http://stackoverflow.com'
   0 votes   1 answer   4 views     Unique index not working  ruby-on-rails unique-constraint database-indexes   answered 1 min ago Thong Kuah 1,960   
   0 votes   0 answers   2 views     fprintf giving me a blank .txt file in MATLAB  matlab   asked 1 min ago physicist82 1   
   0 votes   0 answers   11 views     Node.js / Inheritance of variables and modules  javascript node.js inheritance   modified 1 min ago MiddleWare 138   
   5 votes   2 answers   30 views     Global Events in Angular 2  angular2   modified 1 min ago pixelbits 14.5k   
   0 votes   1 answer   5 views     Running “mvn test site” giving [ERROR] Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.3:site (default-site) on project  maven selenium xslt   modified 1 min ago Tunaki 29k   
 etc.

Or Hacker News:

mikan 'http://news.ycombinator.com'
 1.      The Trouble with the TPP, Day 5: Rights Holders “Shall” vs. Users “May” (michaelgeist.ca)
 46 points by walterbell 3 hours ago  | discuss

 2.      Tesla Model S can now park itself (techcrunch.com)
 151 points by prostoalex 6 hours ago  | 89 comments

 3.      Nvidia GPUs can break Chrome's incognito mode (charliehorse55.wordpress.com)
 374 points by charliehorse55 11 hours ago  | 120 comments

 etc.
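
The "longest list" trick can be sketched in a few lines. This is a Python illustration with the standard library only, not mikan's actual code; it assumes "biggest list" means the element with the most direct child elements, assumes well-nested markup, and uses hypothetical names (Node, TreeBuilder, biggest_list).

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and text strings


class TreeBuilder(HTMLParser):
    """Build a rough element tree; enough to count children."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:  # void elements never get an end tag
            self.current = node

    def handle_startendtag(self, tag, attrs):
        self.current.children.append(Node(tag, self.current))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def element_children(node):
    return [c for c in node.children if isinstance(c, Node)]


def text_of(node):
    # Join pieces with spaces so adjacent table cells do not run together.
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(" ".join(parts).split())


def walk(node):
    yield node
    for child in element_children(node):
        yield from walk(child)


def biggest_list(html):
    """Return the text of each item in the element with the most children."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(walk(builder.root), key=lambda n: len(element_children(n)))
    return [text_of(item) for item in element_children(best)]
```

Counting direct children rather than looking for ul or ol is what lets the same heuristic pick up table rows (Hacker News) and repeated div blocks (Stack Overflow) in this sketch.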

License

WTFPL, do as you please.

-POLM