html2rss


Searching for a ready-to-use app that serves generated feeds via HTTP? Head over to html2rss-web!

This Ruby gem builds RSS 2.0 feeds from a feed config.

The feed config contains the URL to scrape and the CSS selectors used to extract information (such as the title, URL, ...); from these, html2rss builds your RSS feed. Extractors and chainable post processors make extracting, processing and sanitizing information a breeze. Scraping JSON responses and setting HTTP request headers are supported, too.

Installation

🤩 Like it? Star it! ⭐️
  1. Add this line to your application's Gemfile: gem 'html2rss'
  2. Then execute: bundle
  3. In your code: require 'html2rss'

😍 Love it? Feel free to donate. Thank you! 💓

Building a feed config

Here's a minimal working example:

require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss

A feed config consists of a channel and a selectors Hash. The contents of both hashes are explained below.
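
Because a feed config is plain data, the same structure can also be kept as YAML and loaded into the Hash shape used above. The following is a minimal sketch using only Ruby's standard library; the variable names are illustrative, and the gem itself is not called because that would need network access.

```ruby
require 'yaml'

# The minimal example from above, written as YAML (illustrative only).
yaml = <<~YAML
  channel:
    url: https://stackoverflow.com/questions
  selectors:
    items:
      selector: '#hot-network-questions > ul > li'
    title:
      selector: a
    link:
      selector: a
      extractor: href
YAML

# symbolize_names yields the Symbol keys used in the Ruby example.
feed_config = YAML.safe_load(yaml, symbolize_names: true)

puts feed_config.keys.inspect
puts feed_config[:selectors][:link].inspect

# With the gem installed, these keys correspond to the channel: and
# selectors: arguments of the Html2rss.feed call shown above.
```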

Looks too complicated? See html2rss-configs for ready-made feed configs!

The channel

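| attribute   | required? | type    | default        | remark                                      |
| ----------- | --------- | ------- | -------------- | ------------------------------------------- |
| url         | required  | String  |                |                                             |
| title       | optional  | String  | auto-generated |                                             |
| description | optional  | String  | auto-generated |                                             |
| ttl         | optional  | Integer | 360            | TTL in minutes                              |
| time_zone   | optional  | String  | 'UTC'          | TimeZone name                               |
| language    | optional  | String  | 'en'           | Language code                               |
| author      | optional  | String  |                | Format: email (Name)                        |
| headers     | optional  | Hash    | {}             | Set HTTP request headers. See notes below.  |
| json        | optional  | Boolean | false          | Handle JSON response. See notes below.      |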
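
To make the defaults concrete, here is a small sketch of how a channel Hash could be completed with the values listed above. CHANNEL_DEFAULTS and channel_with_defaults are hypothetical names for illustration; this is not the gem's internal code.

```ruby
# Defaults as listed for the channel attributes above. title and description
# are auto-generated by the gem and therefore not modelled here.
CHANNEL_DEFAULTS = {
  ttl: 360, # minutes
  time_zone: 'UTC',
  language: 'en',
  headers: {},
  json: false
}.freeze

# Hypothetical helper: insists on the required url and fills in defaults.
def channel_with_defaults(channel)
  raise ArgumentError, 'channel url is required' unless channel[:url].is_a?(String)

  CHANNEL_DEFAULTS.merge(channel)
end

channel = channel_with_defaults(url: 'https://example.com', ttl: 60)
puts channel[:ttl]      # the explicitly configured value is kept
puts channel[:language] # falls back to the default
```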

The selectors

You must provide an items selector hash containing a CSS selector. items needs to return a collection of HTML tags. The other selectors are scoped to each tag in that collection.

To build a valid RSS 2.0 item each item has to have at least a title or a description.
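
The rule can be expressed as a one-line check. valid_item? is a hypothetical helper operating on a plain Hash of extracted values; treating blank strings as missing is an assumption of this sketch.

```ruby
# Hypothetical check mirroring the rule above: an item needs a non-empty
# title or a non-empty description.
def valid_item?(item)
  %i[title description].any? { |key| !item[key].to_s.strip.empty? }
end

puts valid_item?(title: 'Headline')                # only a title
puts valid_item?(description: '<p>Some text</p>')  # only a description
puts valid_item?(link: 'https://example.com')      # neither
```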

Your selectors can contain arbitrary selector names, but only these will make it into the RSS feed:

| RSS 2.0 tag | name in html2rss | remark                    |
| ----------- | ---------------- | ------------------------- |
| title       | title            |                           |
| description | description      | Supports HTML.            |
| link        | link             | A URL.                    |
| author      | author           |                           |
| category    | categories       | See notes below.          |
| enclosure   | enclosure        | See notes below.          |
| pubDate     | update           | An instance of Time.      |
| guid        | guid             | Generated from the title. |
| comments    | comments         | A URL.                    |
| source      | source           | Not yet supported.        |

The selector hash

Your selector hash can have these attributes:

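| name         | value                                                     |
| ------------ | --------------------------------------------------------- |
| selector     | The CSS selector to select the tag with the information.  |
| extractor    | Name of the extractor. See notes below.                   |
| post_process | A hash or array of hashes. See notes below.               |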

Reverse ordering of items

The items selector hash can have an order attribute. If its value is reverse, the order of the items in the RSS feed is reversed.

See a YAML feed config example
channel:
  # ... omitted
selectors:
  items:
    selector: 'ul > li'
    order: 'reverse'
  # ... omitted

Using extractors

Extractors help with extracting the information from the selected HTML tag.

  • The default extractor is text, which returns the tag's inner text.
  • The html extractor returns the tag's outer HTML.
  • The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
  • The attribute extractor returns the value of that tag's attribute.
  • The static extractor returns the configured static value (it doesn't extract anything).
  • See the file list of extractors.

Extractors can require additional attributes on the selector hash.
👉 Read their docs for usage examples.

See a Ruby example
Html2rss.feed(
  channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)

See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: 'a'
    extractor: 'href'
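
What "corrects relative ones to absolute ones" means for the href extractor can be illustrated with Ruby's standard URI library. This is only a sketch; resolving against the channel URL is an assumption here, and the sample href values are made up.

```ruby
require 'uri'

channel_url = 'https://example.com/questions'

# Relative and absolute href values as they might appear in scraped markup.
hrefs = ['/questions/42', 'tagged/ruby', 'https://other.example/post']

# URI.join resolves each href against the base URL; absolute URLs pass through.
absolute = hrefs.map { |href| URI.join(channel_url, href).to_s }
puts absolute
```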

Using post processors

Extracted information can be further manipulated with post processors.

| name             | description                                                                      |
| ---------------- | -------------------------------------------------------------------------------- |
| gsub             | Allows global substitution operations on Strings (Regexp or simple pattern).     |
| html_to_markdown | Converts HTML to Markdown, using reverse_markdown.                               |
| markdown_to_html | Converts Markdown to HTML, using kramdown.                                       |
| parse_time       | Parses a String containing a time in a time zone.                                |
| parse_uri        | Parses a String as URL.                                                          |
| sanitize_html    | Strips unsafe and unneeded HTML and adds security related attributes.            |
| substring        | Cuts a part off of a String, starting at a position.                             |
| template         | Based on a template, it creates a new String filled with other selectors values. |

⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️

👉 Read their docs for usage examples.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    description: {
      selector: '.content', post_process: { name: 'sanitize_html' }
    }
  }
)

See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  description:
    selector: '.content'
    post_process:
      - name: sanitize_html

Chaining post processors

Pass an array to post_process to chain the post processors.

YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}

          Price: %{price}
      - name: markdown_to_html

Note the use of | for a multi-line String in YAML.
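
The idea of chaining, where each step receives the previous step's result in array order, can be sketched in plain Ruby. This is not the gem's implementation: the sample values are made up, the first lambda mimics the template step with Kernel#format, and the second is a simple stand-in for a further post processor.

```ruby
# Values extracted for one item (made-up sample data). :self stands for the
# value extracted by the selector that is being post-processed.
item = { :self => 'Blue Widget', :price => '9.99 EUR' }

template = "# %{self}\n\nPrice: %{price}"

# Each step receives the previous step's result, in array order.
steps = [
  ->(_value) { format(template, item) },     # fill the template
  ->(value) { value.gsub('EUR', 'euro') }    # stand-in for a second processor
]

description = steps.reduce(item[:self]) { |value, step| step.call(value) }
puts description
```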

Adding <category> tags to an item

The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)

See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
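
As an illustration of how the listed selector names map to tags, the sketch below collects the values of the named selectors from one item. The sample data is made up, skipping missing values is an assumption, and XML escaping is left out for brevity.

```ruby
# Extracted values for one item (made-up sample data).
item = { title: 'Dune', genre: 'Science Fiction', branch: 'Novel' }
categories = %i[genre branch]

# One <category> per listed selector name; nil values are skipped here.
category_tags = item.values_at(*categories).compact.map do |value|
  "<category>#{value}</category>"
end
puts category_tags
```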

Adding an <enclosure> tag to an item

An enclosure can be any file, e.g. an image, audio or video.

The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.

Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:

  1. The content-type is guessed from the file extension of the URL.
  2. If the content-type guessing fails, it will default to application/octet-stream.
  3. The content-length will always be undetermined and thus stated as 0 bytes.

Read the RSS 2.0 spec for further information on enclosing content.
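
The three trade-offs can be sketched as follows. CONTENT_TYPES and enclosure_attributes are hypothetical names, and the tiny lookup table stands in for whatever MIME database is actually used.

```ruby
require 'uri'

# Deliberately tiny lookup table (an assumption for this sketch).
CONTENT_TYPES = {
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png',
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4'
}.freeze

# Hypothetical helper returning the three <enclosure> attributes.
def enclosure_attributes(channel_url, extracted_url)
  # Relative URLs are resolved against the channel's URL.
  url = URI.join(channel_url, extracted_url)
  # Guess the type from the file extension, with the documented fallback.
  type = CONTENT_TYPES.fetch(File.extname(url.path).downcase, 'application/octet-stream')

  # The length is not determined and therefore stated as 0.
  { url: url.to_s, type: type, length: 0 }
end

p enclosure_attributes('https://example.com/podcast', '/media/episode-1.mp3')
p enclosure_attributes('https://example.com/podcast', 'https://cdn.example.com/file')
```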

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
  }
)

See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src"

Scraping and handling JSON responses

Although this gem is called html2rss, it's possible to scrape and process JSON.

Adding json: true to the channel config will convert the JSON response to XML.

See a Ruby example
Html2rss.feed(
  channel: {
    url: 'https://example.com', json: true
  },
  selectors: {} # ... omitted
)

See a YAML feed config example
channel:
  url: https://example.com
  json: true
selectors:
  # ... omitted

See example of a converted JSON object

This JSON object:

{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}

converts to:

<hash>
  <data>
    <datum>
      <title>Headline</title>
      <url>https://example.com</url>
    </datum>
  </data>
</hash>

Your items selector would be data > datum, the item's link selector would be url.

Find further information in ActiveSupport's Hash.to_xml documentation.
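
To see which part of the JSON each selector ends up addressing, the sketch below walks the same object with Ruby's standard JSON library. It does not perform the XML conversion; it only mirrors the path that data > datum and url describe.

```ruby
require 'json'

json = '{ "data": [{ "title": "Headline", "url": "https://example.com" }] }'

# 'data > datum' corresponds to every element of the "data" array;
# 'title' and 'url' then pick fields within one element.
items = JSON.parse(json).fetch('data').map do |datum|
  { title: datum['title'], link: datum['url'] }
end
p items
```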

See example of a converted JSON array

This JSON array:

[{ "title": "Headline", "url": "https://example.com" }]

converts to:

<objects>
  <object>
    <title>Headline</title>
    <url>https://example.com</url>
  </object>
</objects>

Your items selector would be objects > object, the item's link selector would be url.

Find further information in ActiveSupport's Array.to_xml documentation.

Set any HTTP header in the request

You can add any HTTP headers to the request to the channel URL. Use this, for example, to send Cookie or Authorization information or to spoof the User-Agent.

See a Ruby example
Html2rss.feed(
  channel: {
    url: 'https://example.com',
    headers: {
      "User-Agent": "html2rss-request",
      "X-Something": "Foobar",
      "Authorization": "Token deadbea7",
      "Cookie": "monster=MeWantCookie"
    }
  },
  selectors: {}
)

See a YAML feed config example
channel:
  url: https://example.com
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
selectors:
  # ...

The headers provided by the channel are merged into the global headers.
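
In plain Ruby, such a merge could look like the sketch below. That a channel header replaces a global header of the same name is an assumption of this sketch; the header values are made up.

```ruby
global_headers = { 'User-Agent' => 'Mozilla/5.0', 'Accept' => 'text/html' }
channel_headers = { 'User-Agent' => 'html2rss-request', 'Cookie' => 'monster=MeWantCookie' }

# Hash#merge keeps global entries and lets channel entries win on conflict.
request_headers = global_headers.merge(channel_headers)
p request_headers
```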

Usage with a YAML config file

This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!

First, create your YAML file, e.g. called feeds.yml. This file will contain your global config and feed configs.

Example:

headers:
  'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:

Your feed configs go below feeds. Everything else is part of the global config.

Build your feeds like this:

require 'html2rss'

myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')

Find a full example of a feeds.yml at spec/config.test.yml.
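
The layout rule (feed configs below feeds, everything else global) can be illustrated with Ruby's standard YAML library. This sketch only demonstrates the file structure with made-up values; how Html2rss.feed_from_yaml_config reads the file internally is not shown here.

```ruby
require 'yaml'

# A small feeds.yml, inlined as a String for the sake of the example.
feeds_yml = <<~YAML
  headers:
    'User-Agent': 'html2rss-request'
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: 'ul > li'
        title:
          selector: a
YAML

config = YAML.safe_load(feeds_yml)

# Feed configs by name, and everything else as the global config.
feeds = config.fetch('feeds')
global_config = config.reject { |key, _value| key == 'feeds' }

puts feeds.keys.inspect
puts global_config.inspect
```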

Gotchas and tips & tricks

  • Check that the channel URL does not redirect to a mobile page with a different markup structure.
  • Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
  • Fiddling with curl and pup to find the selectors seems efficient (curl URL | pup).
  • CSS selectors are quite versatile; here's an overview.

Development

After checking out the repository, run bin/setup to install dependencies. Then, run bundle exec rspec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

Releasing a new version

  1. git pull
  2. increase version in lib/html2rss/version.rb
  3. bundle
  4. git add Gemfile.lock lib/html2rss/version.rb
  5. VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')
  6. git commit -m "chore: release $VERSION"
  7. git tag v$VERSION
  8. standard-changelog -f
  9. git add CHANGELOG.md && git commit --amend
  10. git tag v$VERSION -f
  11. git push && git push --tags

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/gildesmarais/html2rss.