Give the URL to scrape and some CSS selectors. Get an RSS::Rss instance in return.


Keywords
atom-feed, extract, feed, feed-configs, html, html2rss, json, rss, rss-aggregator, rss-bridge, rss-builder, rss-feed, rss-feed-scraper, rss-generator, ruby, scrape, scraper, scraping, scraping-websites, yahoo-pipes
License
MIT
Install
gem install html2rss -v 0.9.0

Documentation

Request HTML from a URL and transform it to a Ruby RSS 2.0 object.

Are you searching for a ready-to-use "website to RSS" solution? Check out html2rss-web!

Each website needs a feed config which contains the URL to scrape and CSS selectors to extract the required information (like title, URL, ...). This gem provides extractors (e.g. extract the information from an HTML attribute) and chainable post processors to make information retrieval even easier.

Installation

Add this line to your application's Gemfile: gem 'html2rss'
Then execute: bundle

require 'html2rss'

rss = Html2rss.feed(
  channel: { title: 'StackOverflow: Hot Network Questions', url: 'https://stackoverflow.com/questions' },
  selectors: {
    items: { selector: '#hot-network-questions > ul > li' },
    title: { selector: 'a' },
    link: { selector: 'a', extractor: 'href' }
  }
)

puts rss.to_s

Usage with a YAML config file

Create a YAML config file. Find an example at spec/config.test.yml.

Html2rss.feed_from_yaml_config(File.join(['spec', 'config.test.yml']), 'nuxt-releases') returns an RSS::Rss object.
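
To make the second argument more concrete, here is a minimal sketch of how a feed name such as 'nuxt-releases' could map to an entry in the YAML file. It uses only Ruby's standard library and does not call html2rss. The top-level `feeds` key, the example URL, and the `.release` selector are assumptions for illustration; check spec/config.test.yml for the actual layout.

```ruby
require 'yaml'

# Assumed layout: feed configs are keyed by name under a top-level `feeds` key.
# The URL and the `.release` selector are hypothetical placeholders.
CONFIG = <<~YAML
  feeds:
    nuxt-releases:
      channel:
        url: https://example.com/releases
        title: "Nuxt releases"
      selectors:
        items:
          selector: ".release"
        title:
          selector: "a"
        link:
          selector: "a"
          extractor: "href"
YAML

config = YAML.safe_load(CONFIG)

# Look up a single feed config by its name, as the second argument would.
feed_config = config.fetch('feeds').fetch('nuxt-releases')

puts feed_config.dig('channel', 'title')
puts feed_config.dig('selectors', 'link', 'extractor')
```

The `channel` and `selectors` keys mirror the hash passed to Html2rss.feed in the example above.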

Too complicated? See html2rss-configs for ready-made feed configs!

Scraping JSON

Since 0.5.0 it is possible to scrape and process JSON.

Adding json: true to the channel config will convert the JSON response to XML.

Feed config:

channel:
  url: https://example.com
  title: "Example with JSON"
  json: true
# ...

Imagine this HTTP response:

{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}

It will be converted to:

<html>
  <data>
    <datum>
      <title>Headline</title>
      <url>https://example.com</url>
    </datum>
  </data>
</html>

Your items selector would be data > datum, and the item's link selector would be url.

Under the hood, html2rss uses ActiveSupport's Hash#to_xml core extension for the JSON-to-XML conversion.
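
To illustrate the shape of that conversion without pulling in ActiveSupport, here is a rough, hand-rolled sketch using only the standard library. It is not the gem's implementation: `to_xml_sketch` and the `SINGULARS` lookup are hypothetical stand-ins for ActiveSupport's inflection-based behaviour.

```ruby
require 'json'

# Hypothetical stand-in for real inflection: maps an array's tag to its item tag.
SINGULARS = { 'data' => 'datum' }.freeze
ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Recursively turn parsed JSON into nested XML elements.
def to_xml_sketch(value, tag)
  case value
  when Hash
    inner = value.map { |key, child| to_xml_sketch(child, key.to_s) }.join
    "<#{tag}>#{inner}</#{tag}>"
  when Array
    # Each array element is wrapped in the singular form of the array's tag.
    item_tag = SINGULARS.fetch(tag, tag)
    inner = value.map { |child| to_xml_sketch(child, item_tag) }.join
    "<#{tag}>#{inner}</#{tag}>"
  else
    "<#{tag}>#{value.to_s.gsub(/[&<>]/, ESCAPES)}</#{tag}>"
  end
end

json = '{ "data": [{ "title": "Headline", "url": "https://example.com" }] }'
xml = to_xml_sketch(JSON.parse(json), 'html')

puts xml
# prints <html><data><datum><title>Headline</title><url>https://example.com</url></datum></data></html>
```

The nesting matches the converted document shown above, which is why data > datum selects the items and url selects each item's link.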

Set any HTTP header in the request

You can add any HTTP headers to the request to the channel URL, for example to send Cookie or Authorization information or to overwrite the User-Agent.

channel:
  url: https://example.com
  title: "Example with http headers"
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
# ...

The headers provided by the channel will be merged into the global headers.
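
The merge can be pictured with plain Ruby hashes. This is a sketch, not the gem's code: the global header values are hypothetical, and it assumes a channel header wins when both sides define the same key (consistent with being able to overwrite the User-Agent).

```ruby
# Hypothetical global defaults; the gem's real defaults may differ.
global_headers = {
  'User-Agent' => 'default-agent',
  'Accept' => '*/*'
}

# Headers taken from the channel config.
channel_headers = {
  'User-Agent' => 'html2rss-request',
  'X-Something' => 'Foobar'
}

# Assumption: on a key conflict, the channel's value takes precedence.
request_headers = global_headers.merge(channel_headers)

request_headers.each { |name, value| puts "#{name}: #{value}" }
# prints:
# User-Agent: html2rss-request
# Accept: */*
# X-Something: Foobar
```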

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/gildesmarais/html2rss.

Releasing a new version

  1. git pull
  2. increase version in lib/html2rss/version.rb
  3. bundle
  4. commit the changes
  5. git tag v....
  6. standard-changelog -f
  7. git add CHANGELOG.md && git commit --amend
  8. git tag v.... -f
  9. git push && git push --tags