This repository has been archived by the owner on Nov 5, 2018. It is now read-only.

The next version of Prism

mwunsch edited this page Apr 30, 2013 · 2 revisions

These are my notes on building out the next version of Prism.

Currently on v0.1.0, Prism is in dire need of an update. Part of what's held me back in writing this update is my inability to articulate what it is I want Prism to be.

First and foremost, Prism is a set of Ruby libraries for describing a declarative parser for extracting Plain Ruby Objects out of HTML (and, to a lesser extent, XML).

The Prism DSL

The primary use-case of such a library is to be able to quickly write parsers that can extract meaningful semantic data out of HTML documents, such as Microformats and Schema.org.

I am unhappy with the current parser, defined in both the Prism module (https://github.com/mwunsch/prism/blob/master/lib/prism.rb) and the Prism::POSH class (https://github.com/mwunsch/prism/blob/master/lib/prism/posh.rb).

I think the first thing to do is define a better parsing DSL that can find certain elements in the document, transform them into an AST of some sort, and visit that tree to transform it into a meaningful Ruby object.

Take, for example, a simple HTML time element (https://developer.mozilla.org/en-US/docs/HTML/HTML_Elements/time).

This element is intended for presenting dates and times in a machine-readable format.

We should be able to easily declare a parser that extracts the machine-readable format and transforms it into a Ruby Time or DateTime object. Given some HTML:

<p>The concert took place on <time datetime="2001-05-15 19:00">May 15</time>.</p>

We should be able to write a simple Prism::Time parser, especially given that the time element's parsing algorithm is well defined: http://www.whatwg.org/specs/web-apps/current-work/#the-time-element

How can we write this parser in a declarative, approachable fashion? In plain English: search the document for a time element. For each element found, first look for a datetime attribute; otherwise, look inside the element's text content and apply the algorithm.

Here's what that declaration DSL might look like, using parts of the Nokogiri::XML::Node API:

has_many :time do |el|
  whatWgDateTimeAlgorithm(el['datetime'] || el.content)
end

We can then surmise that calling this object's #time method returns an Enumerable containing the return value of whatWgDateTimeAlgorithm applied to the found nodes.
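
These notes leave whatWgDateTimeAlgorithm undefined. As a rough illustration only, a stand-in could look like the sketch below. The name parse_datetime_value is hypothetical, the method recognizes just two of the formats the WHATWG spec allows, and treating the value as UTC is an assumption made here because these values carry no time zone.

```ruby
# Hypothetical stand-in for whatWgDateTimeAlgorithm; NOT the full WHATWG
# algorithm. Recognizes only "YYYY-MM-DD" and "YYYY-MM-DD HH:MM" (space or
# "T" separator) and assumes UTC, since these values carry no time zone.
def parse_datetime_value(value)
  case value.to_s.strip
  when /\A(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2})\z/
    Time.utc($1.to_i, $2.to_i, $3.to_i, $4.to_i, $5.to_i)
  when /\A(\d{4})-(\d{2})-(\d{2})\z/
    Time.utc($1.to_i, $2.to_i, $3.to_i)
  end
rescue ArgumentError
  nil # out-of-range fields such as month 13
end
```

For the concert example, the datetime attribute "2001-05-15 19:00" maps to a Time, while the fallback text content "May 15" matches neither pattern and yields nil.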

What if we want the object's method to have a name other than "time"? We should be explicit about the selector we want:

has_many(:published_times, "time[itemprop='pubdate']") {|el| algo(el['datetime'] || el.content) }

The #published_times method would return an Enumerable containing the return value of algo applied only to those elements that match the selector.

In other words, if the second "selector" argument is absent, we assume the method name is the name of the element we are searching for.

Failure and Validation

What happens if el is absent? Or if #algo returns nil? This is likely acceptable for some algorithms, but not always. In those cases, we should use a different method to define parsers that need to be fulfilled:

has_many!(:published_times, "time[itemprop='pubdate']") {|el| algo(el['datetime'] || el.content) }

By using the has_many! method, with a bang, we indicate that if el is nil or the return value of algo is nil, we should raise an Exception.
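
One way the bang variant might behave is sketched below. Everything named here is an assumption for illustration: PrismSketch, the MissingElement error class (a StandardError subclass rather than a bare Exception), and the StubNode/StubDocument structs, which stand in for a Nokogiri document and match on element name only so the example runs without the gem.

```ruby
# Sketch only: PrismSketch, MissingElement, StubNode, and StubDocument are
# hypothetical names. The stubs stand in for a Nokogiri document and match
# on element name only.
module PrismSketch
  class MissingElement < StandardError; end

  # Like has_many, but raises if nothing matches or a value comes back nil.
  def has_many!(name, selector = nil, &block)
    selector ||= name.to_s
    define_method(name) do
      nodes = html.css(selector)
      raise MissingElement, "no element matches #{selector.inspect}" if nodes.empty?
      nodes.map do |node|
        value = block ? block.call(node) : node.content.strip
        raise MissingElement, "nil value for #{selector.inspect}" if value.nil?
        value
      end
    end
  end
end

StubNode = Struct.new(:name, :content)
StubDocument = Struct.new(:nodes) do
  def css(selector)
    nodes.select { |node| node.name == selector }
  end
end

class Article
  extend PrismSketch
  attr_reader :html

  def initialize(html)
    @html = html
  end

  has_many!(:time) { |el| el.content.strip }
  has_many!(:section)
end
```

Calling #time on an Article whose document contains a time element returns its values, while calling #section on a document with no section raises PrismSketch::MissingElement.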

Nested elements

Ideally, a block either returns a Plain Ruby Object representing the "value", or returns an object that itself includes the Parser module, forming a tree. Given an element like a "section", where nested sections are likely, we want to explicitly say we are not interested in descendant elements matching the same selector.

Given an HTML document:

<!-- http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#outlines -->
<body>
 <h1>Apples</h1>
 <p>Apples are fruit.</p>
 <section>
  <h1>Taste</h1>
  <p>They taste lovely.</p>
  <section>
   <h1>Sweet</h1>
   <p>Red apples are sweeter than green ones.</p>
  </section>
 </section>
 <section>
  <h1>Color</h1>
  <p>Apples come in various colors.</p>
 </section>
</body>

And the Prism code:

has_many(:sections, "section") {|el| el.at_css('h1').content }

#sections would return ["Taste", "Sweet", "Color"]. We only want the top level sections (["Taste", "Color"]), not the descendants. How do we declare this in an intuitive way?

In pure Nokogiri:

html.css("section").map {|el| el.at_css('h1').content }
#=> ["Taste", "Sweet", "Color"]

html.css("section").reject {|el| !el.ancestors('section').size.zero? }.map {|el| el.at_css('h1').content }
#=> ["Taste", "Color"]

Maybe we have a method on the Parser called reject_descendants? Still unsure of the best way to do this without it being hokey. How do we dip down into Nokogiri when we need to without it being awkward? Since we're already exposing Nokogiri elements anyway, maybe it's fine to do something like this:

has_many(:sections, "section") do |el, selector|
  next unless el.ancestors(selector).size.zero?
  el.at_css('h1').content
end

This means the has_many implementation looks something like this:

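def has_many(name, selector = nil)
  # With no explicit selector, the method name doubles as the element name.
  selector ||= name.to_s
  # Placeholder for whatever defines the reader method (e.g. define_method).
  some_metaprogramming_magic(name) do
    html.css(selector).map do |node|
      # Without a block, fall back to the element's text content.
      block_given? ? yield(node, selector) : node.content.strip
    end.compact # drop nils from blocks that skipped a node with `next`
  end
end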

Feels okay to me.

DSL Summary

The suggestion moving forward is to simplify Prism's parsing module. The main methods are has_many and has_one, which take a name, an optional selector, and an optional block. Each declaration is dynamically transformed into a method that returns either an Enumerable of the block's return values (has_many) or the block's return value for the first matched element (has_one). If a block is not provided, the return value is assumed to be the content of the element.

From these two methods we can build many more, such as the has_many! bang methods for validation.
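
A minimal sketch of how the two core methods could sit side by side, assuming define_method does the dynamic work. DslSketch, FakeNode, and FakeDocument are hypothetical names; the fakes stand in for a Nokogiri document and match on element name only, so the example runs without the gem.

```ruby
# Sketch only. DslSketch, FakeNode, and FakeDocument are illustrative names;
# the fakes replace a Nokogiri document and match on element name only.
module DslSketch
  DEFAULT_EXTRACT = proc { |node| node.content.strip }

  # Defines #name returning one value per matching element (nils dropped).
  def has_many(name, selector = nil, &block)
    selector ||= name.to_s
    extract = block || DEFAULT_EXTRACT
    define_method(name) do
      html.css(selector).map { |node| extract.call(node, selector) }.compact
    end
  end

  # Defines #name returning the value for the first match, or nil.
  def has_one(name, selector = nil, &block)
    selector ||= name.to_s
    extract = block || DEFAULT_EXTRACT
    define_method(name) do
      node = html.css(selector).first
      node && extract.call(node, selector)
    end
  end
end

FakeNode = Struct.new(:name, :content)
FakeDocument = Struct.new(:nodes) do
  def css(selector)
    nodes.select { |node| node.name == selector }
  end
end

class Page
  extend DslSketch
  attr_reader :html

  def initialize(html)
    @html = html
  end

  has_one :title
  has_many(:headings, "h1") { |el| el.content.upcase }
end
```

Both methods share the same selector default and the same fallback to the element's content, which suggests the bang variants could be layered on top by wrapping the generated method rather than duplicating it.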

The parser should be simplified into a single module, and should contain methods that make it easy to compose trees of values in a monad kind of way.

How to parse

There are two approaches to parsing: a DOM parser, where the document tree is loaded into memory, or a SAX parser, where the document is treated as a stream of events. Nokogiri provides a mechanism for listening only to the events we care about with Nokogiri::XML::SAX.

SAX parsing is likely faster, but in order to use the declarative DSL we have to manage the stack of events ourselves. Paul Dix's Sax Machine does this quite well, and provides a fantastic declarative API. But I think being able to use CSS selectors to seek elements is crucial for developers to quickly grok Prism, especially since Prism's primary mode of operation is in the scary wilderness of HTML. This means that in addition to managing a stack of events to build a tree, we need to be able to parse a CSS selector and transform it in such a way that we can tell the SAX parser to listen to events that match this selector. This is possible, as Nokogiri has a built-in CSS parser. The trick is to take the data we have in the SAX parser (start_element gives us an element name and an Array of String pairs representing the element attributes) and match it up to the AST we can get out of the CSS parser, which looks like this:

Nokogiri::CSS.parse("time[itemprop='pubdate']").map {|node| node.to_a }
#=> [[:CONDITIONAL_SELECTOR, [:ELEMENT_NAME, ["time"]], [:ATTRIBUTE_CONDITION, [:ELEMENT_NAME, ["itemprop"]], [:equal], ["'pubdate'"]]]]
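
To make the matching step concrete, here is a rough sketch that checks a start_element event against the AST above. The AST literal is copied from that output rather than produced by calling Nokogiri, so other Nokogiri versions may emit a different shape. The names PUBDATE_AST and ast_matches? are hypothetical, and only the three node types that appear above are handled.

```ruby
# AST copied from the Nokogiri::CSS.parse output shown above; treat its shape
# as an assumption. PUBDATE_AST and ast_matches? are hypothetical names.
PUBDATE_AST = [:CONDITIONAL_SELECTOR,
               [:ELEMENT_NAME, ["time"]],
               [:ATTRIBUTE_CONDITION,
                [:ELEMENT_NAME, ["itemprop"]], [:equal], ["'pubdate'"]]].freeze

# Returns true when a SAX start_element event (element name plus an Array of
# [attribute, value] pairs) satisfies the selector AST.
def ast_matches?(ast, name, attrs)
  type, *children = ast
  case type
  when :ELEMENT_NAME
    children.first.first == name
  when :ATTRIBUTE_CONDITION
    attr_node, operator, value = children
    raise NotImplementedError, "operator #{operator.inspect}" unless operator == [:equal]
    # The parser keeps the quotes from the selector, so strip them.
    expected = value.first.gsub(/\A['"]|['"]\z/, "")
    attrs.any? { |key, val| key == attr_node[1].first && val == expected }
  when :CONDITIONAL_SELECTOR
    children.all? { |child| ast_matches?(child, name, attrs) }
  else
    raise NotImplementedError, "unsupported selector node #{type.inspect}"
  end
end
```

This covers a single element in isolation. Descendant and child combinators would additionally need the stack of currently open elements, which is the part described above as not trivial.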

This seems doable, though not trivial. My current feeling is that this work is worthwhile for the community as a whole. SAX parsing is awesome. I think, for now, Prism will continue to operate on the DOM, until we can build out the proper stack handling and traversal for SAX.

The Tools

With the Prism DSL, we should be able to do things like quickly describe parsers for Microformats, Microformats 2, Schema.org, and various syndication formats (RSS, Atom, hAtom), and finally implement an HTML5 section outlining algorithm.

Prism will come fully stocked with popular Microformat, Schema, and RSS classes. These parsers will parse HTML/XML into objects with methods to transform into vCard, iCal, Atom (hAtom -> Atom), etc.

The CLI

Finally, Prism will come with a new CLI built on Thor that can operate on URLs and standard streams.