Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#3869 - Federal Restrictions - Unzip ZIP files coming from SFTP #4163

Merged

Conversation

dheepak-aot
Copy link
Collaborator

@dheepak-aot dheepak-aot commented Dec 20, 2024

Federal Restrictions - Unzip ZIP files coming from SFTP

  • Adjusted the existing regex to match the federal restrictions file with .zip and .ZIP format.
  • Used a library adm-zip to extract the .zip file.
  • Updated the sftp-integration-base to use encoding null while reading only the compressed file to avoid data corruption.
  • For the federal restrictions integration, the first file inside the downloaded compressed archive is processed assuming there will always be only one file in the .zip archive.

Technical Investigations and performance findings

APPROACH 1

Based on documentation and also testing, the nodejs in built library Zlib supports archiving and extraction of only gunzip (.gz) files.
It does not support the same operations on a .zip files.
Extracting .zip with Zlib Gunzip (Doesn't support)

image

image

Extracting .gz with Zlib Gunzip (Works Perfectly)

image

APPROACH 2 - Third party library

https://github.com/cthackers/adm-zip

Tested code(Not the final code)

image

It also provides non blocking method to read data. (getDataAsync)

It works perfectly.

image

Tested the upload with 139MB file with around 140,000 records.
Time taken by the lib to read the file is 666ms
image

@dheepak-aot dheepak-aot self-assigned this Dec 20, 2024
@dheepak-aot dheepak-aot marked this pull request as ready for review December 20, 2024 20:30
const fileSearch = new RegExp(
`^${this.esdcConfig.environmentCode}CSLS\\.PBC\\.RESTR\\.LIST\\.D[\\w]*\\.[\\d]*$`,
`^${this.esdcConfig.environmentCode}CSLS\\.PBC\\.RESTR\\.LIST\\.D[\\w]*\\.[\\d]*\\.(zip|ZIP)$`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not enforcing the zip extension would allow the process to download the file either way, compressed or not.
I do not see an AC requesting either one or not and I do not see a reason to have it restricted.
Not a blocker.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not enforcing the .zip will also lead to not enforcing the end of file name by removing the existing $. I see the benefit of it working either way, but should e remove the $. ?

Let me know if I am missing something here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we talked, IMO, we do need to enforce the precise ending of the string but I am ok either way.

readStreamOptions: { encoding: FILE_DEFAULT_ENCODING },
});
let fileContent: string | NodeJS.WritableStream | Buffer;
const fileExtension = path.parse(remoteFilePath).ext.toLocaleLowerCase();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to use toLocaleLowerCase instead of the regular toLowerCase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a mistake 😂 intention was toLowerCase

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, the below was used in the past.

path.extname(file.originalname).toLowerCase()

Comment on lines 177 to 179
fileContent = await client.get(remoteFilePath, undefined, {
readStreamOptions: { encoding: FILE_DEFAULT_ENCODING },
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the compressed get at the line 167 the below can be cast.

 // Read all the file content and create a buffer with 'ascii' encoding.
fileContent = (await client.get(remoteFilePath, undefined, {
  readStreamOptions: { encoding: FILE_DEFAULT_ENCODING },
})) as string;

This would allow the fileContent to be declared as string (which really is) instead of let fileContent: string | NodeJS.WritableStream | Buffer; which can be misleading.

const zipFile = new AdmZip(compressedFileBuffer);
const [firstExtractedFile] = zipFile.getEntries();
if (!firstExtractedFile) {
throw new Error("No files found in zip file");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the period.


/**
* Reads the first extracted file from a compressed archive file.
* @param compressedFileBuffer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing parameter comment.

Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, please take a look at the comments.

@guru-aot guru-aot self-requested a review December 20, 2024 23:02
const compressedFileContent = (await client.get(
remoteFilePath,
undefined,
{ readStreamOptions: { encoding: null } },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link

Backend Unit Tests Coverage Report

Totals Coverage
Statements: 22.11% ( 3784 / 17118 )
Methods: 10.11% ( 219 / 2166 )
Lines: 25.5% ( 3279 / 12859 )
Branches: 13.66% ( 286 / 2093 )

Copy link

E2E Workflow Workers Coverage Report

Totals Coverage
Statements: 65.43% ( 583 / 891 )
Methods: 59.26% ( 64 / 108 )
Lines: 68.54% ( 464 / 677 )
Branches: 51.89% ( 55 / 106 )

Copy link

E2E Queue Consumers Coverage Report

Totals Coverage
Statements: 82.7% ( 1291 / 1561 )
Methods: 70.05% ( 131 / 187 )
Lines: 85.67% ( 1082 / 1263 )
Branches: 70.27% ( 78 / 111 )

Copy link

E2E SIMS API Coverage Report

Totals Coverage
Statements: 67.4% ( 5922 / 8787 )
Methods: 65.09% ( 729 / 1120 )
Lines: 71.33% ( 4647 / 6515 )
Branches: 47.4% ( 546 / 1152 )

Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great research, great work, looks good 👍

Copy link
Collaborator

@guru-aot guru-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work @dheepak-aot

@dheepak-aot dheepak-aot added this pull request to the merge queue Dec 21, 2024
Merged via the queue into main with commit 054aa9e Dec 21, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants