Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about generating a digest value during sftp transfer #598

Open
drimmeer opened this issue Oct 1, 2024 · 18 comments
Open

Question about generating a digest value during sftp transfer #598

drimmeer opened this issue Oct 1, 2024 · 18 comments

Comments

@drimmeer
Copy link

drimmeer commented Oct 1, 2024

Please advice on how to generate a digest (specifically, the Sha256 value) of the file being transferred via this sftp tool?
The counterpart in Java is like this:

		MessageDigest messageDigestSha256=MessageDigest.getInstance( "SHA-256");
			while ((bytesRead = inStream.read(dataBytes)) != -1) {
				outStream.write(dataBytes, 0, bytesRead);
				messageDigestSha256.update(dataBytes, 0, bytesRead); // generate digest
			}
		String EncodedDigestSha256 = Hex.encodeHexString(messageDigestSha256.digest());
...

I know that to generate the Sha256 of a local file, the Go code would be like this:

			file, err := os.Open("file.txt")
			hash := sha256.New()
			_, err := io.Copy(hash, file)
			hashBytes := hash.Sum(nil)

But how to get that during the sftp transfer?

BTW, currently I wrote code like below to do file transfer (error handling being removed just to show the major logic):

	fDestination, err := sftpClient.Create(destNameRemote)
	fSource, err := os.Open(srcNameLocal)
	_, err = io.Copy(fDestination, fSource)

or

	fSource, err := sftpClient.Open(srcNameRemote)
	fDestination, err := os.Create(destNameLocal)
	_, err = io.Copy(fDestination, fSource)

Thank you in advance.

@lespea
Copy link

lespea commented Oct 1, 2024

@puellanivis
Copy link
Collaborator

While using wrapio.NewHashReader() would definitely make it easy enough to get both a hash and the file at the same time, it would however break any ReadFrom or WriteTo that we implement, which is usually how we can ensure the best performance, rather than reading individual blocks and then waiting for the response. If you’re using WithConcurrentReads(true) with v1 of the API, or if you’re using v2 of the API, then you will see significantly reduced performance with a wrapped IO than you would get by downloading the file locally, and then generating the SHA hash after it has arrived.

@lespea
Copy link

lespea commented Oct 2, 2024

Good to know, thanks for the tip!

@drimmeer
Copy link
Author

drimmeer commented Oct 2, 2024

Thank you Cassondra very much for the tips.

We are indeed relying on the WithConcurrentReads(true) for the required performance, cause we might transfer hundreds of GBs, or even TBs.

However, we also need a way to get both a hash and the file at the same time, because the source is not a physical file, but a pipe, while the destination is a remote file. So there is no way to generate the SHA out of a local file, but only a remote file, which will double the transfer data then!

Do you have any suggestion on how we could overcome this? Is it possible to wrap the ReadFrom() or WriteTo() that you implement to get hash out of them?

@drimmeer
Copy link
Author

drimmeer commented Oct 2, 2024

Is it possible to do in this way while performance of your code won't be affected?

fSource, err := os.Open(pipeNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)

s := sha256.New()
hr := NewHashReader(s, fSource)

io.Copy(fDestination, hr)
// Use the Sum()s which in this case we'll just print it out.
fmt.Println(hex.EncodeToString(s.Sum(nil)))

@drimmeer
Copy link
Author

drimmeer commented Oct 3, 2024

I tested the above code and compared it with other approach, and confirmed that it has significantly reduced performance.
Here is the benchmark:

  1. To transfer a 900M file (not a pipe) from z/OS to Linux, the above code took 14~16 minutes;
  2. While without any sha256 hash calculation, it took only 1 minute;
  3. Then I tried calculating the hash at the beginning, it took only 1 minute for transfer, and 1~6 seconds for hash calculation)

But as I said, we cannot do like #3 above, because in our case the local file is a pipe.
What can I do then?

@puellanivis
Copy link
Collaborator

I think, yeah, you should be able to wrap the ReadFrom or WriteTo functions on your own to perform a double write, once to a hash, then to the underlying writer.

Hm, you might also be able to use a NewHashWriter to wrap the local file operation, and still get the WriteTo functionality from this package?

@drimmeer
Copy link
Author

drimmeer commented Oct 3, 2024

Thanks Cassondra for the confirmation.
But as seen from my testing above, using the NewHashReader significantly reduced the performance. NewHashReader is also a wrapper to ReadFrom(), am I not right? What could be different if I do the wrapping by my own code?

Or you meant using NewHashWriter won't affect the performance as using NewHashReader does?

@puellanivis
Copy link
Collaborator

Yeah, I meant using NewHashWriter as that would allow our WriteTo to still work.

@drimmeer
Copy link
Author

drimmeer commented Oct 4, 2024

Hi Cassondra,
I tried wrapping the writer by my own but it also was very slow. Here is how I did it:

I created a wrapper like this:

// sha256_writer.go implements the io.Writer interface.
type sha256_writer struct {
	sha256_calculator hash.Hash
	origin_writer     io.Writer
}

// Write implements the io.Writer interface.
func (this_writer *sha256_writer) Write(write_out_buffer []byte) (int, error) {
	this_writer.sha256_calculator.Write(write_out_buffer)
	return this_writer.origin_writer.Write(write_out_buffer)
}

func NewSha256Writer(writer_to_be_wrap io.Writer) *sha256_writer {
	return &sha256_writer{sha256_calculator: sha256.New(), origin_writer: writer_to_be_wrap}
}

func (this_writer *sha256_writer) GetHashSum() []byte {
	return this_writer.sha256_calculator.Sum(nil)
}

Then I used it like this:

	fSource, err := os.Open(srcNameLocal)
	fDestination, err := sftpClient.Create(destNameRemote)
	hashWriter := NewSha256Writer(fDestination)
	_, err = io.Copy( hashWriter, fSource)
	log.Println("File sent with sha256 checksum: " + hex.EncodeToString(hashWriter.GetHashSum()))

But it still took 12 minutes to transfer that 900M file, while it took only 1 minute without the wrapper...

@drimmeer
Copy link
Author

drimmeer commented Oct 4, 2024

Using the NewHashWriter like this is even slower, taking 14 minutes to tranfer.


	fSource, err := os.Open(srcNameLocal)
	fDestination, err := sftpClient.Create(destNameRemote)
	sha256 := sha256.New()
	hashWriter := wrapio.NewHashWriter(sha256, fDestination)
	_, err = io.Copy( hashWriter, fSource)
	log.Println("File sent with sha256 checksum: " + hex.EncodeToString(sha256.Sum(nil)))

@drimmeer
Copy link
Author

drimmeer commented Oct 6, 2024

Good news!
I finally got what 'WriteTo' mean by you, and wrapped my reader around os.File instead of io.Reader. That way it worked perfectly! The performance was not dropped anymore!

Thank you for the help!

@puellanivis
Copy link
Collaborator

So, yeah, either way the io.Copy goes, you want to ensure that you’re wrapping the other reader/writer from our sftp.File object. This will ensure the sftp implementation of either WriteTo or ReadFrom will be used.

@lespea
Copy link

lespea commented Nov 17, 2024

I played around with this some more and I actually got the best performance with this wrapper, the key being to impl the Size() int64 method.

type HashReader struct {
	size int64
	r    io.ReadCloser
	h    hash.Hash
}

func NewHashReader(r io.ReadCloser, size int64, h hash.Hash) HashReader {
	return HashReader{size, r, h}
}

func NewHashReaderFile(path string, h hash.Hash) (HashReader, error) {
	fh, err := os.Open(path)
	if err != nil {
		return HashReader{}, err
	}

	if stat, err := fh.Stat(); err != nil {
		return HashReader{}, err
	} else {
		return NewHashReader(fh, stat.Size(), h), nil
	}
}

func (ha HashReader) Read(p []byte) (int, error) {
	n, err := ha.r.Read(p)
	if n > 0 {
		_, _ = ha.h.Write(p[:n])
	}
	return n, err
}

func (ha HashReader) Hash() []byte {
	return ha.h.Sum(nil)
}

func (ha HashReader) HashStr() string {
	return hex.EncodeToString(ha.Hash())
}

func (ha HashReader) Size() int64 {
	return ha.size
}

func (ha HashReader) Close() error {
	return ha.r.Close()
}

Then just make sure you use fh.ReadFrom(wrapper) where fh is an sftp.File

@lespea
Copy link

lespea commented Nov 17, 2024

Also do not wrap the os.File object with bufio that also tanks performance.

@puellanivis
Copy link
Collaborator

I strongly recommend making your HashReader a pointer. Since it contains interfaces, it’s functionality is already non-value-like, and mixed value-like and reference-like behavior can lead to very unexpected and difficult to debug bugs.

You also might want to use the WriteTo or ReadFrom calls on the sftp.File directly, which will ensure you get the SFTP implementation, and not the other-side implementation. (os.File’s WriteTo is faster than a naïve read to buffer write from buffer, but it doesn’t put multiple reads/writes in flight to the SFTP file.)

@lespea
Copy link

lespea commented Nov 17, 2024

I actually started with it as a pointer but making it a non-pointer lead to a consistent 30-40 mb/s increase in transfer speed (which surprised me I only tried it on a whim). And yeah it was necessary to use ReadFrom to get the best speed.

@puellanivis
Copy link
Collaborator

Huh… interesting. That’s something I wouldn’t have expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants