Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enrichment attributes are being saved to the wrong record #69

Merged
merged 11 commits into from
Jun 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }
ruby '3.2.2'

# Bundle edge Rails instead: gem 'rails', github: 'rails/rails', branch: 'main'
gem 'rails', '~> 7.1.3.2'
gem 'rails', '~> 7.1.3.3'

# The original asset pipeline for Rails [https://github.com/rails/sprockets-rails]
gem 'sprockets-rails'
Expand All @@ -22,6 +22,8 @@ gem 'vite_rails'

gem 'rack-mini-profiler'

gem 'jaro_winkler'

# handles pagination
gem 'kaminari'

Expand Down
114 changes: 59 additions & 55 deletions Gemfile.lock
Original file line number Diff line number Diff line change
Expand Up @@ -13,73 +13,73 @@ GIT
GEM
remote: https://rubygems.org/
specs:
actioncable (7.1.3.2)
actionpack (= 7.1.3.2)
activesupport (= 7.1.3.2)
actioncable (7.1.3.4)
actionpack (= 7.1.3.4)
activesupport (= 7.1.3.4)
nio4r (~> 2.0)
websocket-driver (>= 0.6.1)
zeitwerk (~> 2.6)
actionmailbox (7.1.3.2)
actionpack (= 7.1.3.2)
activejob (= 7.1.3.2)
activerecord (= 7.1.3.2)
activestorage (= 7.1.3.2)
activesupport (= 7.1.3.2)
actionmailbox (7.1.3.4)
actionpack (= 7.1.3.4)
activejob (= 7.1.3.4)
activerecord (= 7.1.3.4)
activestorage (= 7.1.3.4)
activesupport (= 7.1.3.4)
mail (>= 2.7.1)
net-imap
net-pop
net-smtp
actionmailer (7.1.3.2)
actionpack (= 7.1.3.2)
actionview (= 7.1.3.2)
activejob (= 7.1.3.2)
activesupport (= 7.1.3.2)
actionmailer (7.1.3.4)
actionpack (= 7.1.3.4)
actionview (= 7.1.3.4)
activejob (= 7.1.3.4)
activesupport (= 7.1.3.4)
mail (~> 2.5, >= 2.5.4)
net-imap
net-pop
net-smtp
rails-dom-testing (~> 2.2)
actionpack (7.1.3.2)
actionview (= 7.1.3.2)
activesupport (= 7.1.3.2)
actionpack (7.1.3.4)
actionview (= 7.1.3.4)
activesupport (= 7.1.3.4)
nokogiri (>= 1.8.5)
racc
rack (>= 2.2.4)
rack-session (>= 1.0.1)
rack-test (>= 0.6.3)
rails-dom-testing (~> 2.2)
rails-html-sanitizer (~> 1.6)
actiontext (7.1.3.2)
actionpack (= 7.1.3.2)
activerecord (= 7.1.3.2)
activestorage (= 7.1.3.2)
activesupport (= 7.1.3.2)
actiontext (7.1.3.4)
actionpack (= 7.1.3.4)
activerecord (= 7.1.3.4)
activestorage (= 7.1.3.4)
activesupport (= 7.1.3.4)
globalid (>= 0.6.0)
nokogiri (>= 1.8.5)
actionview (7.1.3.2)
activesupport (= 7.1.3.2)
actionview (7.1.3.4)
activesupport (= 7.1.3.4)
builder (~> 3.1)
erubi (~> 1.11)
rails-dom-testing (~> 2.2)
rails-html-sanitizer (~> 1.6)
activejob (7.1.3.2)
activesupport (= 7.1.3.2)
activejob (7.1.3.4)
activesupport (= 7.1.3.4)
globalid (>= 0.3.6)
activemodel (7.1.3.2)
activesupport (= 7.1.3.2)
activerecord (7.1.3.2)
activemodel (= 7.1.3.2)
activesupport (= 7.1.3.2)
activemodel (7.1.3.4)
activesupport (= 7.1.3.4)
activerecord (7.1.3.4)
activemodel (= 7.1.3.4)
activesupport (= 7.1.3.4)
timeout (>= 0.4.0)
activerecord-nulldb-adapter (1.0.1)
activerecord (>= 5.2.0, < 7.2)
activestorage (7.1.3.2)
actionpack (= 7.1.3.2)
activejob (= 7.1.3.2)
activerecord (= 7.1.3.2)
activesupport (= 7.1.3.2)
activestorage (7.1.3.4)
actionpack (= 7.1.3.4)
activejob (= 7.1.3.4)
activerecord (= 7.1.3.4)
activesupport (= 7.1.3.4)
marcel (~> 1.0)
activesupport (7.1.3.2)
activesupport (7.1.3.4)
base64
bigdecimal
concurrent-ruby (~> 1.0, >= 1.0.2)
Expand Down Expand Up @@ -203,6 +203,7 @@ GEM
irb (1.13.1)
rdoc (>= 4.0.0)
reline (>= 0.4.2)
jaro_winkler (1.5.6)
json (2.7.2)
jsonpath (1.1.5)
multi_json
Expand Down Expand Up @@ -298,30 +299,30 @@ GEM
rackup (2.1.0)
rack (>= 3)
webrick (~> 1.8)
rails (7.1.3.2)
actioncable (= 7.1.3.2)
actionmailbox (= 7.1.3.2)
actionmailer (= 7.1.3.2)
actionpack (= 7.1.3.2)
actiontext (= 7.1.3.2)
actionview (= 7.1.3.2)
activejob (= 7.1.3.2)
activemodel (= 7.1.3.2)
activerecord (= 7.1.3.2)
activestorage (= 7.1.3.2)
activesupport (= 7.1.3.2)
rails (7.1.3.4)
actioncable (= 7.1.3.4)
actionmailbox (= 7.1.3.4)
actionmailer (= 7.1.3.4)
actionpack (= 7.1.3.4)
actiontext (= 7.1.3.4)
actionview (= 7.1.3.4)
activejob (= 7.1.3.4)
activemodel (= 7.1.3.4)
activerecord (= 7.1.3.4)
activestorage (= 7.1.3.4)
activesupport (= 7.1.3.4)
bundler (>= 1.15.0)
railties (= 7.1.3.2)
railties (= 7.1.3.4)
rails-dom-testing (2.2.0)
activesupport (>= 5.0.0)
minitest
nokogiri (>= 1.6)
rails-html-sanitizer (1.6.0)
loofah (~> 2.21)
nokogiri (~> 1.14)
railties (7.1.3.2)
actionpack (= 7.1.3.2)
activesupport (= 7.1.3.2)
railties (7.1.3.4)
actionpack (= 7.1.3.4)
activesupport (= 7.1.3.4)
irb
rackup (>= 1.0.0)
rake (>= 12.2)
Expand All @@ -345,7 +346,8 @@ GEM
mime-types (>= 1.16, < 4.0)
netrc (~> 0.8)
retriable (3.1.2)
rexml (3.2.6)
rexml (3.2.8)
strscan (>= 3.0.9)
rotp (6.3.0)
rqrcode (2.2.0)
chunky_png (~> 1.0)
Expand Down Expand Up @@ -434,6 +436,7 @@ GEM
activesupport (>= 5.2)
sprockets (>= 3.0.0)
stringio (3.1.0)
strscan (3.1.0)
thor (1.3.1)
timeout (0.4.1)
tzinfo (2.0.6)
Expand Down Expand Up @@ -486,6 +489,7 @@ DEPENDENCIES
faraday (~> 2.9)
faraday-follow_redirects
foreman
jaro_winkler
jsonpath
kaminari
letter_opener
Expand All @@ -494,7 +498,7 @@ DEPENDENCIES
pry-byebug
puma (~> 6.0)
rack-mini-profiler
rails (~> 7.1.3.2)
rails (~> 7.1.3.3)
retriable
rqrcode
rspec-rails
Expand Down
2 changes: 1 addition & 1 deletion app/controllers/extraction_definitions_controller.rb
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ def extraction_definition_params
safe_params = params.require(:extraction_definition).permit(
:pipeline_id, :name, :format, :base_url, :throttle, :page, :per_page,
:total_selector, :kind, :destination_id, :source_id, :enrichment_url, :paginated, :split, :split_selector,
:extract_text_from_file
:extract_text_from_file, :fragment_source_id, :fragment_key
)
merge_last_edited_by(safe_params)
end
Expand Down
2 changes: 1 addition & 1 deletion app/models/pipeline_job.rb
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
class PipelineJob < ApplicationRecord
include Job

serialize :harvest_definitions_to_run, Array
serialize :harvest_definitions_to_run, type: Array

belongs_to :pipeline
belongs_to :extraction_job, optional: true
Expand Down
2 changes: 1 addition & 1 deletion app/models/schedule.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ class Schedule < ApplicationRecord
belongs_to :destination
has_many :pipeline_jobs, dependent: :nullify

serialize :harvest_definitions_to_run, Array
serialize :harvest_definitions_to_run, type: Array

validates :name, presence: true, uniqueness: true
validates :frequency, presence: true
Expand Down
7 changes: 6 additions & 1 deletion app/supplejack/extraction/documents.rb
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,13 @@ def in_bounds?(current_page)
current_page.in?(1..documents_filepath.length)
end

# The enrichments rely on the files being ordered by page number
# so that the index [2005] gives back page 2005 etc.
# If the pages and indexes do not match up, records will be enriched with data that is not meant for them
def documents_filepath
@documents_filepath ||= Dir.glob("#{@folder}/*.json")
@documents_filepath ||= Dir.glob("#{@folder}/*.json").sort_by do |page|
page.match(/__(?<record_id>.+)__(?<page>.+).json/)[:page].to_i
end
end
end
end
12 changes: 12 additions & 0 deletions app/supplejack/extraction/enrichment_extraction.rb
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,14 @@ def valid?
private

def url
return fragment_url if @extraction_definition.fragment_source_id.present?

@request.url(@record)
end

def params
return {} if @extraction_definition.fragment_source_id.present?

@request.query_parameters(@record)
end

Expand All @@ -36,5 +40,13 @@ def file_path
page_str = format('%09d', @page)[-9..]
"#{@extraction_folder}/#{name_str}__#{@record['id']}__#{page_str}.json"
end

def fragment_url
url = @record['fragments'].find do |fragment|
fragment['source_id'] == @extraction_definition.fragment_source_id
end[@extraction_definition.fragment_key]

[*url].first
end
end
end
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,32 @@
<%= form.select :format, options, {}, class: 'form-select' %>
</div>

<div class="col-4">
<%= form.label :format, class: 'form-label' do %>
Fragment Source ID
<span
data-bs-toggle="tooltip"
data-bs-title="The Source ID of the fragment that contains the complete URL to use for the enrichment. Note that if you put a value in this field query and slug of the request will be ignored."></i>
</span>
<% end %>
</div>
<div class="col-8">
<%= form.text_field :fragment_source_id, class: 'form-control' %>
</div>

<div class="col-4">
<%= form.label :format, class: 'form-label' do %>
Fragment Key
<span
data-bs-toggle="tooltip"
data-bs-title="The key of the fragment that contains the complete URL to use for the enrichment."></i>
</span>
<% end %>
</div>
<div class="col-8">
<%= form.text_field :fragment_key, class: 'form-control' %>
</div>

<div class="col-4">
<%= form.label :destination_id, class: 'form-label' do %>
Preview harvested records from
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# frozen_string_literal: true

class AddSourceIdAndFragmentKeyToExtractionDefinition < ActiveRecord::Migration[7.0]
def change
add_column :extraction_definitions, :fragment_source_id, :string
add_column :extraction_definitions, :fragment_key, :string
end
end
4 changes: 3 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions spec/sidekiq/text_extraction_worker_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
describe "#perform" do
context 'when the PDF extraction is not part of a harvest' do
before do
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example.json")
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example__1234__01.json")
end

it 'converts a PDF into raw text' do
Expand All @@ -36,7 +36,7 @@
it 'names the new files following as it was originally' do
TextExtractionWorker.new.perform(extraction_job.id)

expect(File.exist?("#{extraction_job.extraction_folder}/example.json")).to eq(true)
expect(File.exist?("#{extraction_job.extraction_folder}/example__1234__01.json")).to eq(true)
end

it 'does not enqueue Transformation Workers' do
Expand All @@ -48,7 +48,7 @@

context 'when the PDF extraction is part of a harvest' do
before do
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example.json")
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example__1234__01.json")
end

let!(:harvest_report) { create(:harvest_report, pipeline_job:, harvest_job:) }
Expand Down Expand Up @@ -77,7 +77,7 @@

context 'when the PDF extraction is part of an enrichment' do
before do
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example__1234__.json")
FileUtils.cp("#{Rails.root}/spec/support/example.pdf", "#{extraction_job.extraction_folder}/example__1234__01.json")
end

let(:destination) { create(:destination) }
Expand Down
Loading
Loading