Full LENZ ingest #63

Closed
wants to merge 33 commits into from
33 commits
49021ad
add PDF as an available option for the extraction definition
richardmatthewsdev Mar 5, 2024
7b79fd0
feat(pdf_extraction): Add the ability to extract text from a PDF
richardmatthewsdev Mar 5, 2024
6910597
(refactor) rename the PDF extraction worker to a text extraction work…
richardmatthewsdev Mar 5, 2024
07c5b5e
feat(extract_from_pdf): Initial Working for extracting text from a PD…
richardmatthewsdev Mar 5, 2024
a22bf83
Add new fields to the extraction definition to determine if a file ne…
richardmatthewsdev Mar 6, 2024
f11fc4d
Remove the credentials file from Git
richardmatthewsdev Mar 6, 2024
b2f8163
Update the harvester to be able to display and transform text extract…
richardmatthewsdev Mar 6, 2024
a4fe99c
Add Text from file extraction to enrichments
richardmatthewsdev Mar 6, 2024
9e2ddab
feat(pdf_extraction): Allow the File Extraction to run as part of an …
richardmatthewsdev Mar 6, 2024
ea20ebf
ci: Remove mirror to Gitlab workflow
eoin-boost Mar 6, 2024
17e84ea
ERB lint and rubocop
richardmatthewsdev Mar 6, 2024
42ce3b9
Merge branch 'rm/extract-data-from-documents' of github.com:boost/pco…
richardmatthewsdev Mar 6, 2024
a630943
Bundle audit update
richardmatthewsdev Mar 6, 2024
cf5aeed
Update brakeman
richardmatthewsdev Mar 6, 2024
034ec7a
Rubocop and erb lint
richardmatthewsdev Mar 7, 2024
6e0bfd8
Remove needing the credentials file
richardmatthewsdev Mar 7, 2024
8b3d978
feat(extract_pdf): Do not display the preview for extractions that ne…
richardmatthewsdev Mar 7, 2024
674a2ea
refactor(extract_pdf): Reduce duplication in transformed record class
richardmatthewsdev Mar 7, 2024
36981a6
Prettier
richardmatthewsdev Mar 7, 2024
3c30fca
refactor(extract_pdf): Write binary files
richardmatthewsdev Mar 7, 2024
72ee9ca
Update dockerfile
richardmatthewsdev Mar 7, 2024
3244a56
Add the secret key as an argument to the harvester pipeline
richardmatthewsdev Mar 7, 2024
5fd76d5
rename secret key to secret key base
richardmatthewsdev Mar 7, 2024
535a41f
Correct typo
richardmatthewsdev Mar 7, 2024
04f47db
Fix issue with secret key base naming
richardmatthewsdev Mar 7, 2024
a9bec5c
fix(pdf_extraction): Improve disabled checking for enrichment and har…
richardmatthewsdev Mar 7, 2024
bb1c864
Add missing question mark to completed method
richardmatthewsdev Mar 7, 2024
6840a66
Stop the extraction show page from crashing while the text extraction…
richardmatthewsdev Mar 11, 2024
19ef18c
Dont mark the extraction as complete when it is not
richardmatthewsdev Mar 11, 2024
9e0d3ab
Fix for files that need to be extracted saying that they are complete…
richardmatthewsdev Mar 11, 2024
c74922f
Fix failing spec and rubocop
richardmatthewsdev Mar 11, 2024
41c7cb6
Merge pull request #2 from boost/rm/extract-data-from-documents
richardmatthewsdev Mar 11, 2024
8d6dce4
Add fields required for S3 extraction to the extraction definition
richardmatthewsdev Apr 10, 2024
17 changes: 0 additions & 17 deletions .github/workflows/mirror_to_gitlab.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -36,6 +36,7 @@

# Ignore master key for decrypting credentials and more.
/config/master.key
+/config/credentials.yml.enc

/app/assets/builds/*
!/app/assets/builds/.keep
4 changes: 3 additions & 1 deletion Dockerfile
@@ -3,7 +3,7 @@ FROM ruby:3.2.2-alpine3.18
WORKDIR /app

ARG BUILD_PACKAGES="build-base curl-dev git"
-ARG DEV_PACKAGES="bash mysql-client mariadb-dev yaml-dev zlib-dev nodejs yarn libxml2 libxml2-dev libxslt libxslt-dev gmp-dev"
+ARG DEV_PACKAGES="bash mysql-client mariadb-dev yaml-dev zlib-dev nodejs yarn libxml2 libxml2-dev libxslt libxslt-dev gmp-dev openjdk8-jre"
ARG RUBY_PACKAGES="tzdata"

WORKDIR /app
@@ -34,6 +34,8 @@ ARG RAILS_ENV="production"
ENV RAILS_ENV=$RAILS_ENV
ARG RAILS_MASTER_KEY
ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY
+ARG RAILS_SECRET_KEY_BASE
+ENV SECRET_KEY_BASE=$RAILS_SECRET_KEY_BASE

RUN bundle exec rails assets:precompile

1 change: 1 addition & 0 deletions Gemfile
@@ -42,6 +42,7 @@ gem 'faraday-follow_redirects'
gem 'jsonpath'
gem 'nokogiri'
gem 'sidekiq'
+gem 'yomu'

# transformation related
gem 'webmock'
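
Review note: the `yomu` gem wraps Apache Tika, which is why the Dockerfile change above adds a Java runtime. The following is a minimal sketch of how text extraction from a downloaded file might look, not the code from this pull request. The helper name `extract_text_from_file`, the `YOMU_EXTRACTOR` constant, and the injectable `extractor:` argument are assumptions made for illustration; only `Yomu.new(path).text` and `File.binwrite` are real library calls.

```ruby
# Hypothetical helper, not code from this pull request.
# The extractor is injectable so the Yomu/Java dependency stays optional.
YOMU_EXTRACTOR = lambda do |path|
  require 'yomu' # loaded lazily; needs the yomu gem and a Java runtime
  Yomu.new(path).text
end

# Writes the raw bytes to disk in binary mode, then returns the extracted text.
def extract_text_from_file(bytes, path, extractor: YOMU_EXTRACTOR)
  File.binwrite(path, bytes) # binary-safe write, no newline or encoding translation
  extractor.call(path).to_s.strip
end
```

Writing with `File.binwrite` keeps PDF bytes intact, which matches the intent of the "Write binary files" commit. Passing a stub lambda as `extractor:` lets a spec exercise the helper without Java installed.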
150 changes: 76 additions & 74 deletions Gemfile.lock
@@ -13,77 +13,77 @@ GIT
GEM
remote: https://rubygems.org/
specs:
actioncable (7.0.7.2)
actionpack (= 7.0.7.2)
activesupport (= 7.0.7.2)
actioncable (7.0.8.1)
actionpack (= 7.0.8.1)
activesupport (= 7.0.8.1)
nio4r (~> 2.0)
websocket-driver (>= 0.6.1)
actionmailbox (7.0.7.2)
actionpack (= 7.0.7.2)
activejob (= 7.0.7.2)
activerecord (= 7.0.7.2)
activestorage (= 7.0.7.2)
activesupport (= 7.0.7.2)
actionmailbox (7.0.8.1)
actionpack (= 7.0.8.1)
activejob (= 7.0.8.1)
activerecord (= 7.0.8.1)
activestorage (= 7.0.8.1)
activesupport (= 7.0.8.1)
mail (>= 2.7.1)
net-imap
net-pop
net-smtp
actionmailer (7.0.7.2)
actionpack (= 7.0.7.2)
actionview (= 7.0.7.2)
activejob (= 7.0.7.2)
activesupport (= 7.0.7.2)
actionmailer (7.0.8.1)
actionpack (= 7.0.8.1)
actionview (= 7.0.8.1)
activejob (= 7.0.8.1)
activesupport (= 7.0.8.1)
mail (~> 2.5, >= 2.5.4)
net-imap
net-pop
net-smtp
rails-dom-testing (~> 2.0)
actionpack (7.0.7.2)
actionview (= 7.0.7.2)
activesupport (= 7.0.7.2)
actionpack (7.0.8.1)
actionview (= 7.0.8.1)
activesupport (= 7.0.8.1)
rack (~> 2.0, >= 2.2.4)
rack-test (>= 0.6.3)
rails-dom-testing (~> 2.0)
rails-html-sanitizer (~> 1.0, >= 1.2.0)
actiontext (7.0.7.2)
actionpack (= 7.0.7.2)
activerecord (= 7.0.7.2)
activestorage (= 7.0.7.2)
activesupport (= 7.0.7.2)
actiontext (7.0.8.1)
actionpack (= 7.0.8.1)
activerecord (= 7.0.8.1)
activestorage (= 7.0.8.1)
activesupport (= 7.0.8.1)
globalid (>= 0.6.0)
nokogiri (>= 1.8.5)
actionview (7.0.7.2)
activesupport (= 7.0.7.2)
actionview (7.0.8.1)
activesupport (= 7.0.8.1)
builder (~> 3.1)
erubi (~> 1.4)
rails-dom-testing (~> 2.0)
rails-html-sanitizer (~> 1.1, >= 1.2.0)
activejob (7.0.7.2)
activesupport (= 7.0.7.2)
activejob (7.0.8.1)
activesupport (= 7.0.8.1)
globalid (>= 0.3.6)
activemodel (7.0.7.2)
activesupport (= 7.0.7.2)
activerecord (7.0.7.2)
activemodel (= 7.0.7.2)
activesupport (= 7.0.7.2)
activemodel (7.0.8.1)
activesupport (= 7.0.8.1)
activerecord (7.0.8.1)
activemodel (= 7.0.8.1)
activesupport (= 7.0.8.1)
activerecord-nulldb-adapter (0.9.0)
activerecord (>= 5.2.0, < 7.1)
activestorage (7.0.7.2)
actionpack (= 7.0.7.2)
activejob (= 7.0.7.2)
activerecord (= 7.0.7.2)
activesupport (= 7.0.7.2)
activestorage (7.0.8.1)
actionpack (= 7.0.8.1)
activejob (= 7.0.8.1)
activerecord (= 7.0.8.1)
activesupport (= 7.0.8.1)
marcel (~> 1.0)
mini_mime (>= 1.1.0)
activesupport (7.0.7.2)
activesupport (7.0.8.1)
concurrent-ruby (~> 1.0, >= 1.0.2)
i18n (>= 1.6, < 2)
minitest (>= 5.1)
tzinfo (~> 2.0)
addressable (2.8.4)
public_suffix (>= 2.0.2, < 6.0)
ast (2.4.2)
bcrypt (3.1.18)
bcrypt (3.1.20)
better_html (2.0.2)
actionview (>= 6.0)
activesupport (>= 6.0)
@@ -108,16 +108,16 @@ GEM
chunky_png (1.4.0)
coderay (1.1.3)
colorize (1.1.0)
concurrent-ruby (1.2.2)
concurrent-ruby (1.2.3)
connection_pool (2.4.1)
crack (0.4.5)
rexml
crass (1.0.6)
date (3.3.3)
date (3.3.4)
debug (1.7.2)
irb (>= 1.5.0)
reline (>= 0.3.1)
devise (4.9.2)
devise (4.9.3)
bcrypt (~> 3.0)
orm_adapter (~> 0.1)
railties (>= 4.1.0)
@@ -128,7 +128,7 @@ GEM
devise (~> 4.0)
railties (~> 7.0)
rotp (~> 6.0)
devise_invitable (2.0.8)
devise_invitable (2.0.9)
actionmailer (>= 5.0)
devise (>= 4.6)
diff-lcs (1.5.0)
@@ -171,7 +171,7 @@ GEM
fugit (1.8.1)
et-orbi (~> 1, >= 1.2.7)
raabro (~> 1.4)
globalid (1.2.0)
globalid (1.2.1)
activesupport (>= 6.1)
hashdiff (1.0.1)
http (5.1.1)
@@ -183,7 +183,7 @@ GEM
http-cookie (1.0.5)
domain_name (~> 0.5)
http-form_data (2.3.0)
i18n (1.14.1)
i18n (1.14.4)
concurrent-ruby (~> 1.0)
io-console (0.6.0)
irb (1.6.4)
@@ -210,32 +210,32 @@ GEM
llhttp-ffi (0.4.0)
ffi-compiler (~> 1.0)
rake (~> 13.0)
loofah (2.21.3)
loofah (2.22.0)
crass (~> 1.0.2)
nokogiri (>= 1.12.0)
mail (2.8.1)
mini_mime (>= 0.1.1)
net-imap
net-pop
net-smtp
marcel (1.0.2)
marcel (1.0.4)
matrix (0.4.2)
method_source (1.0.0)
mime-types (3.4.1)
mime-types-data (~> 3.2015)
mime-types-data (3.2023.0218.1)
mini_mime (1.1.5)
minitest (5.19.0)
minitest (5.22.2)
multi_json (1.15.0)
mysql2 (0.5.5)
net-imap (0.3.7)
mysql2 (0.5.6)
net-imap (0.4.10)
date
net-protocol
net-pop (0.1.2)
net-protocol
net-protocol (0.2.1)
net-protocol (0.2.2)
timeout
net-smtp (0.3.3)
net-smtp (0.4.0.1)
net-protocol
netrc (0.11.0)
nio4r (2.7.0)
@@ -261,49 +261,49 @@ GEM
nio4r (~> 2.0)
raabro (1.4.0)
racc (1.7.3)
rack (2.2.8)
rack (2.2.8.1)
rack-mini-profiler (3.3.0)
rack (>= 1.2.0)
rack-proxy (0.7.6)
rack
rack-test (2.1.0)
rack (>= 1.3)
rails (7.0.7.2)
actioncable (= 7.0.7.2)
actionmailbox (= 7.0.7.2)
actionmailer (= 7.0.7.2)
actionpack (= 7.0.7.2)
actiontext (= 7.0.7.2)
actionview (= 7.0.7.2)
activejob (= 7.0.7.2)
activemodel (= 7.0.7.2)
activerecord (= 7.0.7.2)
activestorage (= 7.0.7.2)
activesupport (= 7.0.7.2)
rails (7.0.8.1)
actioncable (= 7.0.8.1)
actionmailbox (= 7.0.8.1)
actionmailer (= 7.0.8.1)
actionpack (= 7.0.8.1)
actiontext (= 7.0.8.1)
actionview (= 7.0.8.1)
activejob (= 7.0.8.1)
activemodel (= 7.0.8.1)
activerecord (= 7.0.8.1)
activestorage (= 7.0.8.1)
activesupport (= 7.0.8.1)
bundler (>= 1.15.0)
railties (= 7.0.7.2)
railties (= 7.0.8.1)
rails-dom-testing (2.2.0)
activesupport (>= 5.0.0)
minitest
nokogiri (>= 1.6)
rails-html-sanitizer (1.6.0)
loofah (~> 2.21)
nokogiri (~> 1.14)
railties (7.0.7.2)
actionpack (= 7.0.7.2)
activesupport (= 7.0.7.2)
railties (7.0.8.1)
actionpack (= 7.0.8.1)
activesupport (= 7.0.8.1)
method_source
rake (>= 12.2)
thor (~> 1.0)
zeitwerk (~> 2.5)
rainbow (3.1.1)
rake (13.0.6)
rake (13.1.0)
redis-client (0.17.0)
connection_pool
regexp_parser (2.8.0)
reline (0.3.3)
io-console (~> 0.5)
responders (3.1.0)
responders (3.1.1)
actionpack (>= 5.2)
railties (>= 5.2)
rest-client (2.1.0)
@@ -394,8 +394,8 @@ GEM
actionpack (>= 5.2)
activesupport (>= 5.2)
sprockets (>= 3.0.0)
thor (1.2.2)
timeout (0.4.0)
thor (1.3.1)
timeout (0.4.1)
tzinfo (2.0.6)
concurrent-ruby (~> 1.0)
unf (0.1.4)
@@ -425,8 +425,9 @@ GEM
websocket-extensions (0.1.5)
xpath (3.2.0)
nokogiri (~> 1.8)
yard (0.9.34)
zeitwerk (2.6.11)
yard (0.9.36)
yomu (0.1.5)
zeitwerk (2.6.13)

PLATFORMS
aarch64-linux-musl
@@ -475,6 +476,7 @@ DEPENDENCIES
webdrivers
webmock
yard
yomu

RUBY VERSION
ruby 3.2.2p53
3 changes: 2 additions & 1 deletion app/controllers/extraction_definitions_controller.rb
@@ -98,7 +98,8 @@ def find_destinations
def extraction_definition_params
safe_params = params.require(:extraction_definition).permit(
:pipeline_id, :name, :format, :base_url, :throttle, :page, :per_page,
-:total_selector, :kind, :destination_id, :source_id, :enrichment_url, :paginated, :split, :split_selector
+:total_selector, :kind, :destination_id, :source_id, :enrichment_url, :paginated, :split, :split_selector,
+:extract_text_from_file
)
merge_last_edited_by(safe_params)
end
11 changes: 6 additions & 5 deletions app/frontend/js/apps/ExtractionApp/components/HeaderActions.jsx
@@ -60,11 +60,12 @@ const HeaderActions = () => {

return createPortal(
<>
-{!appDetails.extractionDefinition.split && (
-<button className="btn btn-success me-2" onClick={handlePreviewClick}>
-<i className="bi bi-play" aria-hidden="true"></i> Preview
-</button>
-)}
+{!appDetails.extractionDefinition.split &&
+!appDetails.extractionDefinition.extract_text_from_file && (
+<button className="btn btn-success me-2" onClick={handlePreviewClick}>
+<i className="bi bi-play" aria-hidden="true"></i> Preview
+</button>
+)}

{appDetails.extractionDefinition.split && <RunSample />}

1 change: 1 addition & 0 deletions app/models/extraction_definition.rb
@@ -36,6 +36,7 @@ class ExtractionDefinition < ApplicationRecord

validates :name, uniqueness: true
validates :split_selector, presence: true, if: :split?
+validates :s3_bucket, presence: true, if: :s3?

validates :throttle, numericality: { only_integer: true, greater_than_or_equal_to: 0, less_than_or_equal_to: 60_000 }
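
Review note: the new `s3_bucket` rule is a conditional presence validation, so it only fires when `s3?` returns true. The plain-Ruby sketch below mimics that behaviour without ActiveModel so it can be run on its own. `DefinitionSketch`, the `source_type` attribute, and the body of `s3?` are assumptions for illustration; the real `s3?` predicate is not shown in this diff.

```ruby
# Plain-Ruby stand-in for the conditional validation; not the Rails model.
DefinitionSketch = Struct.new(:source_type, :s3_bucket, keyword_init: true) do
  # Assumed predicate: the real s3? is defined elsewhere in the application.
  def s3?
    source_type == 's3'
  end

  # Collects messages the way `validates ..., presence: true, if: :s3?` would.
  def error_messages
    messages = []
    messages << "s3_bucket can't be blank" if s3? && s3_bucket.to_s.strip.empty?
    messages
  end

  def valid?
    error_messages.empty?
  end
end
```

A definition that is not S3-backed stays valid with an empty bucket, while an S3-backed one without a bucket is rejected.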

1 change: 1 addition & 0 deletions app/models/harvest_report.rb
@@ -115,6 +115,7 @@ def statuses
def idle_offset
return 0 if extraction_end_time.blank?
return @idle_offset if @idle_offset.present?
+return 0 if transformation_start_time.blank? || extraction_end_time.blank?

@idle_offset = transformation_start_time - extraction_end_time
@idle_offset = 0 if @idle_offset.negative?
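
Review note: the added guard protects the subtraction from a missing `transformation_start_time`. As a hedged sketch of the same logic in isolation (plain Ruby, no memoization, `nil?` standing in for Rails' `blank?`), with the hypothetical name `idle_offset_between`:

```ruby
# Standalone sketch of the idle-offset calculation; not the model method itself.
def idle_offset_between(extraction_end_time, transformation_start_time)
  # Either timestamp may be missing while a harvest is still running.
  return 0 if extraction_end_time.nil? || transformation_start_time.nil?

  offset = transformation_start_time - extraction_end_time # seconds as a Float
  offset.negative? ? 0 : offset
end
```

Checking both timestamps in a single guard before any other work would also make the existing `extraction_end_time.blank?` check at the top of `idle_offset` redundant.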