Note: Traject::Marc4JReader is for JRuby only.
Traject::Marc4JReader is a reader for the traject ETL system that allows the use of marc4j as a reader when dealing with MARC binary or MARC-XML files. It is of no use outside of traject run under JRuby.
It leverages marc-marc4j, which is a paper-thin wrapper around the Marc4J .jar that is shipped with it.
The output of the reader is a vanilla ruby-marc object. You can hang onto the original marc4j java object with the marc4j_reader.keep_marc4j setting.
The biggest reason to use this reader is faster MARC/MARC-XML parsing and generation than the vanilla marc gem can provide. Another is needing to do something wacky with the marc4j internal structure (such as feeding it to legacy java code you have lying around).
In general, the marc4j library will parse marc21 (binary) and MARC-XML roughly twice as fast as the pure-ruby library. While MARC parsing tends to not be a huge part of the workload in a traject run, you'll almost certainly see performance gains.
Traject prior to 3.0 included this as a dependency on JRuby, and defaulted to using it.
In Traject 3.0+, you need to manually add this gem and configure traject to use it.
If you are using bundler and a Gemfile, add gem "traject-marc4j_reader", "~> 1.0" to your Gemfile. Otherwise, just gem install traject-marc4j_reader.
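If the same Gemfile is shared between JRuby and other Rubies, Bundler's platforms option can restrict the gem to JRuby. A minimal sketch, assuming a project that already depends on traject (the source line and the traject version constraint are illustrative assumptions, not taken from this README):

```ruby
# Gemfile (sketch; the traject constraint is an assumption, adjust to your project)
source "https://rubygems.org"

gem "traject", "~> 3.0"

# Only install the marc4j reader when bundling under JRuby.
gem "traject-marc4j_reader", "~> 1.0", platforms: :jruby
```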
Then, in your traject config file:
# Instead of require in config file, you could use the `-r` traject
# command-line option.
require 'traject/marc4j_reader'
settings do
  provide "reader_class_name", "Traject::Marc4JReader"

  # Recommend marc4j_reader.permissive true unless you have reason not to.
  # true was default provided by core traject gem in Traject pre-3.0, but isn't
  # anymore in traject 3.0 -- so set to true explicitly to maintain behavior
  #
  # Only relevant for binary MARC source data.
  provide "marc4j_reader.permissive", true
end
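Because the reader only works under JRuby, a config shared across Ruby engines may want to guard the setting. The following is a rough sketch, not something this gem provides: reader_settings_for is a hypothetical helper, and falling back to Traject::MarcReader (traject's pure-ruby reader) on other engines is an assumption about what you would want.

```ruby
# Sketch: choose reader settings based on the running Ruby engine.
# RUBY_ENGINE is a standard constant ("jruby" under JRuby, "ruby" under MRI).
# reader_settings_for is a hypothetical helper, not part of traject.
def reader_settings_for(engine = RUBY_ENGINE)
  if engine == "jruby"
    {
      "reader_class_name"        => "Traject::Marc4JReader",
      "marc4j_reader.permissive" => true
    }
  else
    # Assumption: fall back to traject's pure-ruby reader elsewhere.
    { "reader_class_name" => "Traject::MarcReader" }
  end
end

# In a traject config file, the result could be applied like this:
#   require 'traject/marc4j_reader' if RUBY_ENGINE == "jruby"
#   reader_settings = reader_settings_for
#   settings do
#     reader_settings.each { |key, value| provide key, value }
#   end
```

The helper takes the engine name as an argument so the choice can be checked without actually running under JRuby.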
For more about the traject settings object, see the traject settings documentation.
Note that the standard Marc4JReader always converts to UTF-8, so output will always reflect that conversion.
- marc4j.jar_dir: Path to a directory containing Marc4J jar file to use. All .jar's in dir will be loaded. If unset, uses marc4j.jar bundled with traject.
- marc4j_reader.permissive: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default false, but recommend true for most uses.
- marc4j_reader.source_encoding: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC".
- marc4j_reader.keep_marc4j: After translating the marc4j record into a normal ruby-marc object, provides access to the former via record#original_marc4j.
- 'marc4j_reader.class': Set to eg 'MarcStreamReader' to use that more strict Marc4J reader class, instead of the default Marc4J MarcPermissiveStreamReader.
A simple example that reads in via marc4j and outputs to the newline-delimited-json writer.
Use would be:
traject -c id_title.rb my_marc_file.mrc
# File id_title.rb
require 'traject'
require 'traject/marc4j_reader'
require 'traject/json_writer'
require 'traject/macros/marc21_semantics'
extend Traject::Macros::Marc21Semantics
settings do
  provide "reader_class_name", "Traject::Marc4JReader"
  provide "marc4j_reader.keep_marc4j", true
  provide "writer_class_name", "Traject::JsonWriter"
  provide "output_file", "ids_and_titles.ndj"
end
to_field "id", extract_marc("001", :first => true)
to_field "title", extract_marc_filing_version('245abdefghknp', :include_original => true)
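The example config sets marc4j_reader.keep_marc4j but never touches the Java object. As a rough sketch, a field could be filled from the original marc4j record as shown here. The field name "marc4j_id" is made up for illustration, and getControlNumber is assumed to be available on the marc4j record object; this only works under JRuby with keep_marc4j enabled.

```ruby
# Possible addition to id_title.rb (sketch; requires JRuby and
# "marc4j_reader.keep_marc4j" set to true).
to_field "marc4j_id" do |record, accumulator|
  # original_marc4j is the accessor described for marc4j_reader.keep_marc4j.
  marc4j_record = record.original_marc4j
  # getControlNumber is an assumption about the marc4j Record API;
  # it is expected to return the value of the 001 field.
  accumulator << marc4j_record.getControlNumber if marc4j_record
end
```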
- Fork it ( https://github.com/[my-github-username]/traject_marc4j_reader/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request