Skip to content

Ruby MediaWiki API and Page content parser for Headlines, Blocks, Elements, ListItems, and Links.

License

Notifications You must be signed in to change notification settings

dblommesteijn/wiki-api

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Wiki::Api

Build Status Code Climate

Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes for Page and Headline parsing. You're able to iterate through these headlines, and access data accordingly.

NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). Major components: Page, Headline, Block, ListItem, and Link are wrappers for easy data access, however it's still possible to retreive the raw HTML within these objects.

Requests to the MediaWiki API use the following URI structure:

http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"

Dependencies

  • nokogiri

Installation

Add this line to your application's Gemfile (bundler):

gem 'wiki-api', git: "git://github.com/dblommesteijn/wiki-api.git"

And then execute:

$ bundle

Or install it yourself (RubyGems):

$ gem install wiki-api

Or try it from this repository (local) in a console:

$ bin/console

Setup

Define a configuration for your connection (initialize script), this example uses wiktionary.org. NOTE: it can connect to both HTTP and HTTPS MediaWikis (however you'll get a 302 response from MediaWiki)

Setup default configuration (initialize script)

Wiki::Api::Connect.config = { uri: 'https://en.wiktionary.org' }

Running tests

$ rake test

Usage

Query a Page and Headline

Requesting headlines from a given page.

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# the root headline equals the pagename
puts page.root_headline.name
# iterate next level of headlines
page.root_headline.headlines.each do |headline_name, headline|
  # printing headline name (PageHeadline)
  puts headline.name
end

Getting headlines for a given name.

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# lookup headline by name (underscore and case are ignored)
headline = page.root_headline.headline('editing wiktionary').first
# printing headline name (PageHeadline)
puts headline.name
# get the type of nested headline (html h1,2,3,4 etc.)
puts headline.type

Basic Page structure

page = Wiki::Api::Page.new(name: 'Wiktionary:Welcome,_newcomers')
# iterate PageHeadline objects
page.root_headline.headlines.each do |headline_name, headline|
  # exposing nokogiri internal elements
  elements = headline.elements.flatten
  elements.each do |element|
    # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
    puts element.class
  end

  # string representation of all nested text
  block.to_texts
  # iterate PageListItem objects
  block.list_items.each do |list_item|
    # string representation of nested text
    list_item.to_text
    # iterate PageLink objects
    list_item.links.each do |link|
      # check part: 'iterate PageLink objects'
    end
  end

  # iterate PageLink objects
  headline.block.links.each do |link|
    # absolute URI object
    link.uri
    # html link
    link.html
    # link name
    link.title
    # string representation of nested text
    link.to_text
  end
end

Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)

This is a example of querying wikipedia.org on the page: "Ruby_on_rails", and printing the References headline links for each list item.

# setting a target config
Wiki::Api::Connect.config = { uri: 'https://en.wikipedia.org' }

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails')

# get headlines with name Reference (there can be multiple headlines with the same name!)
headlines = page.root_headline.headline('References')

# iterate headlines
headlines.each do |headline|
  # iterate list items on the given headline
  headline.block.list_items.each do |list_item|
    # print the uri of all links
    puts list_item.links.map(&:uri)
  end
end

This is the same example as the one above, except for setting a global config to direct the requests to a given URI.

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

# get headlines with name Reference (there can be multiple headlines with the same name!)
headlines = page.root_headline.headline('References')

# iterate headlines
headlines.each do |headline|
  # iterate list items on the given headline
  headline.block.list_items.each do |list_item|
    # print the uri of all links
    puts list_item.links.map(&:uri)
  end
end

Example searching headlines

This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')

# NOTE: the following are all valid headline names:
# request headline (by literal name)
headlines = page.root_headline.headline('Philosophy_and_design')
puts headlines.map(&:name)
# request headline (by downcase name)
headlines = page.root_headline.headline('philosophy_and_design')
puts headlines.map(&:name)
# request headline (by human name)
headlines = page.root_headline.headline('philosophy and design')
puts headlines.map(&:name)

# NOTE2: headlines are matched on headline.start_with?(requested_headline)
# because of start_with? compare this should work as well!
headlines = page.root_headline.headline('philosophy')
puts headlines.map(&:name)

Example searching headlines in depth

Recursive search on all nested headlines, including in depth searches.

# querying the page
page = Wiki::Api::Page.new(name: 'Ruby_on_Rails', uri: 'https://en.wikipedia.org')
# get root
root_headline = page.root_headline
# lookup 'ramework structure' on current level
headline = root_headline.headline_in_depth('framework structure').first
puts headline.name
# NOTE: lookup of nested headlines does not work with the headline function (because 'Framework_structure' is nested within 'Technical_overview')
headline = root_headline.headline('framework structure').first
# depth can be limited adding the depth parameter
# NOTE: the example below will return nil, 'Framework_structure' is nested beyond depth = 0!
depth = 0
headline = root_headline.headline_in_depth('framework structure', depth).first
# increasing depth search will show the requested headline
depth = 5
headline = root_headline.headline_in_depth('framework structure', depth).first
puts headline.name

About

Ruby MediaWiki API and Page content parser for Headlines, Blocks, Elements, ListItems, and Links.

Resources

License

Stars

Watchers

Forks

Packages

No packages published