Releases: DigitalNZ/supplejack_common
v2.7.0
Add new Parser DSL pre_process_block
This optional block lets you manipulate the response data from your harvest source before it is handed on to the rest of the parser as normal. It can be used for any kind of pre-processing or data clean-up, but it was initially designed to rationalise verbose feeds that mention the same item multiple times, so that only the latest mention of each item is harvested.
JSON example
pre_process_block do |rest_client_response|
  # Parse the RestClient::Response body (expected to be a JSON array of items)
  items = JSON.parse(rest_client_response.body)
  # Sort newest first, so that uniq keeps only the latest mention of each item
  items = items.sort do |item_a, item_b|
    # 'updated_at' specifies the date to sort on
    Date.parse(item_b['updated_at']) <=> Date.parse(item_a['updated_at'])
  end
  # 'audio_id' specifies the unique item ID to rationalise with
  items = items.uniq { |item| item['audio_id'] }
  # Convert back to JSON
  json = items.to_json
  # Return a new RestClient::Response with the mutated JSON
  RestClient::Response.create(json, rest_client_response.net_http_res, rest_client_response.request)
end
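The sort-then-uniq step in the JSON example can be tried outside a parser with plain Ruby. The sketch below is a minimal illustration under assumed sample data: the records and their values are hypothetical, and only the field names 'audio_id' and 'updated_at' come from the example above. It uses Time.parse rather than Date.parse so that two mentions on the same day are still ordered by time of day; that is a swapped-in choice, not something these release notes specify.

```ruby
require 'json'
require 'time'

# Hypothetical feed body: the item with audio_id 1 is mentioned twice.
body = <<~JSON
  [
    { "audio_id": 1, "title": "old",  "updated_at": "2019-01-01T09:00:00Z" },
    { "audio_id": 2, "title": "only", "updated_at": "2019-01-02T09:00:00Z" },
    { "audio_id": 1, "title": "new",  "updated_at": "2019-01-03T09:00:00Z" }
  ]
JSON

items = JSON.parse(body)

# Order newest first, then keep the first (newest) mention of each audio_id.
latest = items
         .sort_by { |item| Time.parse(item['updated_at']) }
         .reverse
         .uniq { |item| item['audio_id'] }

puts latest.to_json
```

Array#uniq keeps the first element it sees for each key, which is why the data has to be ordered newest first before it is called.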
XML example
pre_process_block do |rest_client_response|
  # Convert RestClient::Response to Nokogiri Document
  doc = Nokogiri::XML(rest_client_response.body) { |config| config.options = Nokogiri::XML::ParseOptions::NOBLANKS }
  # Select node that contains all items
  items_node = doc.at_xpath('//dnz-export')
  # Sorting by the "date" field
  sorted = items_node.children.sort_by do |item|
    item.children.find { |child| child.name == 'date' }.text
  end.reverse!
  # uniq will keep only the latest mention of each item based on the unique ID of that item (specified in "key")
  uniq = sorted.uniq do |item|
    item.children.find { |child| child.name == 'key' }.text
  end
  # Replace all children with new values
  items_node.children.remove
  uniq.each { |n| items_node << n }
  # Return a new rest response
  RestClient::Response.create(doc.to_xml, rest_client_response.net_http_res, rest_client_response.request)
end
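The XML example compares the "date" values as plain strings. That matches chronological order only when the feed uses a fixed-width, most-significant-first format such as ISO 8601; this is an assumption about the source data, not something these release notes state. A minimal sketch with made-up date values, using only the Ruby standard library:

```ruby
require 'date'

# Hypothetical date strings as they might appear in <date> elements.
iso_dates = %w[2019-03-01 2018-12-31 2019-01-15]
dmy_dates = %w[01/03/2019 31/12/2018 15/01/2019]

# ISO 8601 strings: string order and chronological order agree.
iso_by_string = iso_dates.sort
iso_by_date   = iso_dates.sort_by { |d| Date.iso8601(d) }

# Day-first strings: string order is not chronological, so parse before sorting.
dmy_by_string = dmy_dates.sort
dmy_by_date   = dmy_dates.sort_by { |d| Date.strptime(d, '%d/%m/%Y') }

puts iso_by_string == iso_by_date # true
puts dmy_by_string == dmy_by_date # false
```

If a feed's dates are not in such a format, parsing the text inside the sort_by block keeps the ordering correct.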
Include AWS S3 SDK Gem
Includes the AWS S3 SDK Gem as a dependency.
Allow Harvesting via a Proxy
This release adds support for passing proxy <url> in your Parser Script.
Scroll Harvest
Added support for harvesting from an Elasticsearch Scroll API endpoint.
Check out the docs for how to use it:
http://digitalnz.github.io/supplejack/manager/parser-dsl-domain-specific-language.html
Fix for 404 breaking the harvest
Fixed an issue where a harvest would break if it hit a 404. The harvest now continues.
Pagination Bug Fix
Altered the conditional expression in the PaginatedCollection service to use less than (<) rather than less than or equal to (<=).
v2.3.0
Rubocop for code quality
Better code management
XML Tokenised pagination
This change enables parsing of XML and OAI APIs that use tokenised pagination.
Tokenised pagination for all types is now enabled with type: 'token'. The old behaviour of setting tokenised: true has been removed.
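As a rough illustration of what tokenised pagination means for a harvester, the sketch below follows a continuation token from page to page until the source stops returning one. It is plain Ruby with a hypothetical in-memory PAGES table and a hypothetical fetch_page helper standing in for an HTTP API; it does not show Supplejack's internals or its pagination options.

```ruby
# Hypothetical API responses keyed by token; nil stands for the first request.
PAGES = {
  nil   => { records: [1, 2], next_token: 'abc' },
  'abc' => { records: [3, 4], next_token: 'def' },
  'def' => { records: [5],    next_token: nil }
}.freeze

# Stand-in for an HTTP request that passes the token as a parameter.
def fetch_page(token)
  PAGES.fetch(token)
end

# Collect records page by page, stopping when no token is returned.
def harvest_all
  records = []
  token = nil
  loop do
    page = fetch_page(token)
    records.concat(page[:records])
    token = page[:next_token]
    break if token.nil?
  end
  records
end

all_records = harvest_all
puts all_records.inspect
```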