Export cache image and app image in parallel #1167

Closed
wants to merge 2 commits into from

@ESWZY ESWZY commented Jul 30, 2023

In our scenario, both the app image and the cache image need to be exported, but the lifecycle exports them serially: the cache image export only starts after the app image export has finished. By our measurements, exporting the app image takes about as long as exporting the cache image. However, we don't need to wait for the cache image; as soon as the app image is exported, we can proceed to the next steps (distribution and deployment).

So we tried to parallelize this step (this PR) and compared it with serial export. We tested with several projects, pushing the app images and cache images to the same self-hosted registry.

  • Java (app image is 202.361MB, cache image is 157.525MB, with one shared layer of 107.648MB):

    • Before: total 18.34s, app 8.96s, cache 9.38s
    • After: total 14.70s, app 11.42s, cache 13.93s
    • app image: 0+1.103MB+15.153MB+107.648MB+49.953MB+0+0+28.502MB
    • cache image: 9.411MB+40.465MB+107.648MB
  • Go (app image is 114.273MB, cache image is 175.833MB, no shared layers):

    • Before: total 16.57s, app 5.92s, cache 10.65s
    • After: total 12.02s, app 7.31s, cache 11.48s
    • app image: 0+1MB+25.72MB+8.993MB+49.953MB+0+0+28.502MB
    • cache image: 70.87MB+104.964MB

This gives some improvement: although the individual export times of the app image and the cache image increased noticeably, the total time decreased (by about 20% for Java and 27% for Go). The effect should be more pronounced when exporting to different registries or with more bandwidth.

My question is: if the app image and cache image contain the same layers, will this approach no longer reuse layers when pushing to the registry? Or should we detect / specify whether to use a parallel push strategy? Thx.

@ESWZY ESWZY requested a review from a team as a code owner July 30, 2023 05:29
@ESWZY ESWZY force-pushed the parallel-export branch from ba3365c to edfdf73 Compare July 30, 2023 05:30
Contributor

@jabrown85 jabrown85 left a comment

I quite like this and was also thinking about proposing it recently. The only risk I see is that logs for both internal steps will mix, but that hardly seems worth worrying about.

To answer your earlier question: layer reuse won't be affected. The cache is an image with just a single layer, and not the same layers that the run image uses. If I'm understanding your question correctly 😄

I'm +1 on this, but I want to wait for @natalieparellano @joe-kimmel-vmw to weigh in as well. Unsure if we would need/want to guard this by platform version or any other mechanism. My vote would be no, we don't guarantee logs or their order as part of any API.

@dlion
Member

dlion commented Jul 31, 2023

Thanks for your contribution @ESWZY !
Would be awesome if you could add some unit tests to validate that your changes work as expected.
You can find utilities to test async code in the testhelpers.go file, specifically the helper function Eventually.

@joe-kimmel-vmw
Contributor

LGTM; I'm +1 on using defer as Jesse suggested.

@ESWZY
Author

ESWZY commented Aug 1, 2023

Thanks for your comment! It is reasonable to add tests for the parallelized export process. But I found that most of the existing tests either test the two steps func (e *Exporter) Export and func (e *Exporter) Cache separately, or run a Docker container to test the entire export process.

The modified parallel logic sits in between, so should I reuse the existing test logic? I found a test case here that seems similar, but I'm not sure whether I should create a new test case based on it or just modify this one:

when("cache image case", func() {
	it("is created", func() {
		cacheImageName := exportTest.RegRepoName("some-cache-image-" + h.RandString(10))
		exportFlags := []string{"-cache-image", cacheImageName}
		if api.MustParse(platformAPI).LessThan("0.7") {
			exportFlags = append(exportFlags, "-run-image", exportRegFixtures.ReadOnlyRunImage)
		}
		exportArgs := append([]string{ctrPath(exporterPath)}, exportFlags...)
		exportedImageName = exportTest.RegRepoName("some-exported-image-" + h.RandString(10))
		exportArgs = append(exportArgs, exportedImageName)
		output := h.DockerRun(t,
			exportImage,
			h.WithFlags(
				"--env", "CNB_PLATFORM_API="+platformAPI,
				"--env", "CNB_REGISTRY_AUTH="+exportRegAuthConfig,
				"--network", exportRegNetwork,
			),
			h.WithArgs(exportArgs...),
		)
		h.AssertStringContains(t, output, "Saving "+exportedImageName)
		h.Run(t, exec.Command("docker", "pull", exportedImageName))
		assertImageOSAndArchAndCreatedAt(t, exportedImageName, exportTest, imgutil.NormalizedDateTime)

		// Add these lines to detect whether the export of the two is successful
		h.Run(t, exec.Command("docker", "pull", cacheImageName))
		h.Run(t, exec.Command("docker", "pull", exportedImageName))
	})
})

@ESWZY ESWZY force-pushed the parallel-export branch from edfdf73 to eccad2d Compare August 1, 2023 09:01
@dlion
Member

dlion commented Aug 1, 2023

should I reuse the existing testing logic? I found some test cases here that I feel are similar, but not sure if I should create a new test case based on this, or just reuse this case and modify it.

Since you are just using goroutines, this is not a big change in terms of logic, so it would be fine to just validate that those steps complete successfully as expected. I guess we can modify the existing tests to assert that everything went well, thanks 😄

@natalieparellano natalieparellano added this to the lifecycle 0.18.0 milestone Aug 3, 2023
@natalieparellano
Member

@ESWZY is this ready for a re-review?

@ESWZY
Author

ESWZY commented Aug 14, 2023

@natalieparellano @dlion Sorry for not replying for so long. I was trying to design a test case for the parallel export process, but had no idea how to do that.

The original test cases already cover exporting a single app image, a single cache image, and both of them. These test cases already go some way toward showing that the parallel export process works correctly. Should we add some more complex or more intensive export tests? Or just use the original test cases without any modification? Thx.

Member

@natalieparellano natalieparellano left a comment

Looks good to me! @jabrown85 would you like to have another look?

@ESWZY ESWZY force-pushed the parallel-export branch 6 times, most recently from 688c9bf to 75fd8f0 Compare August 15, 2023 06:18
@natalieparellano
Member

I've been thinking about this more, and I think we need to check that layers with launch = true and cache = true are cached successfully when this is done in parallel. I fear the cacher may be relying on the app exporter to create these layers.

@natalieparellano natalieparellano self-requested a review August 16, 2023 20:56
@natalieparellano
Member

natalieparellano commented Aug 17, 2023

Looking into the code a little bit, I think we'll want to put a mutex around e.LayerFactory.DirLayer so that we avoid re-creating the same layer.tar in parallel when the exporter (while exporting) and the exporter (while caching) are processing the same layer.

Edit: we won't want to lock all of e.LayerFactory.DirLayer, just the processing of a particular fsLayer.Identifier()

@natalieparellano
Member

we'll want to put a mutex around e.LayerFactory.DirLayer

I think we can make Factory.tarHashes a sync.Map and use LoadOrStore when we access the map here:

if sha, ok := f.tarHashes[tarPath]; ok {

@ESWZY
Author

ESWZY commented Aug 18, 2023

Good idea! I changed to use sync.Map for storage. But I'm not sure how to use LoadOrStore. 👀

@kritkasahni-google
Contributor

kritkasahni-google commented Aug 21, 2023

@ESWZY @natalieparellano @jabrown85 I am thinking about this and do we really need this change?
I am investigating build performance improvements for Google Cloud Functions and I really need access to the app image as soon as it's ready. If this change adds a few seconds to the app image export, could we hold off on it?

Instead, I have this proposal (and this could be optional on platform side based on some input from platform indicating that platform wants/expects this behavior) -

If lifecycle/exporter could write some status like "APP_IMAGE_READY" (in the context of the corresponding build/execution) to CNB_PLATFORM_DIR as soon as the app image is ready[1] (without waiting for the cache image export), our platform could relay it to interested services, which could then immediately deploy the app image without waiting for the cache image export. ... [1] https://github.com/buildpacks/lifecycle/blob/main/cmd/lifecycle/exporter.go#L227

@ESWZY Would implementing this proposal help your use case as well?

@jabrown85
Contributor

@kritkasahni-google I like the idea of messaging with the platform in an async fashion. Maybe something like EXPORTED_APP_IMAGE_REF that contains the registry.blah/repo/blah@digest would be useful if we like the file-based pattern. A similar EXPORTED_CACHE_IMAGE_REF could be written for parity reasons. I guess the platform would watch for the files to be written by lifecycle and treat them as events for kicking off platform-specific actions?

Essentially platform hooks. We could formalize a hook concept if we thought that was better. Similar to git hooks, where we execute CNB_PLATFORM_DIR/{hook} if it exists.

$CNB_PLATFORM_DIR/hooks/pre-{stage}
$CNB_PLATFORM_DIR/hooks/post-{stage}
$CNB_PLATFORM_DIR/hooks/exporter-image-exported
$CNB_PLATFORM_DIR/hooks/exporter-cache-exported
etc.

What do you think would be best for your platform @kritkasahni-google? lifecycle written files or executables?

@natalieparellano
Member

natalieparellano commented Aug 21, 2023

Maybe something like EXPORTED_APP_IMAGE_REF that contains the registry.blah/repo/blah@digest would useful

Another option would be to look for the presence of <layers>/report.toml, which is currently written when the app image is done, but before we start processing the cache image

@ESWZY
Author

ESWZY commented Aug 22, 2023

@kritkasahni-google Actually, our team also needs to control the export behavior. As you can see in the PR description, there is an overall improvement of a few seconds, but the app export itself does slow down by a few seconds. If the image registry has a trigger mechanism, then deployment steps can be triggered from it. For scenarios that this does not cover, we need this kind of concurrent export.

So I also agree that the platform should be able to specify whether parallelism is enabled. Is such a design feasible? @natalieparellano @joe-kimmel-vmw

@natalieparellano
Member

@ESWZY we could add that! It would require an RFC: https://github.com/buildpacks/rfcs#rfc-process

Is that something you'd be willing to contribute? We could guide you through the process if that would be helpful.

@kritkasahni-google
Contributor

kritkasahni-google commented Aug 22, 2023

@ESWZY Makes sense that parallelism could be enabled/disabled based on input from the platform. If we could keep this behind a flag, I would be interested in trying out whether and how parallelism helps us; depending on the perf gains, we could then easily disable it using that flag.

@ESWZY
Author

ESWZY commented Aug 23, 2023

@ESWZY we could add that! It would require an RFC: https://github.com/buildpacks/rfcs#rfc-process

Is that something you'd be willing to contribute? We could guide you through the process if that would be helpful.

I would like to! Let me read RFC 0004 first. 🥰

@kritkasahni-google
Contributor

kritkasahni-google commented Aug 23, 2023

@jabrown85 Let me get back to you in a bit about this - I am also discussing with our platform folks atm about this.

@kritkasahni-google
Contributor

RFC for exporting images in parallel buildpacks/rfcs#291

@jabrown85 Our platform team has tabled this for now - it is hard to say when it will get prioritized. But I would be interested in making this change provided there are more users asking for this. Let me open a new issue to track this, what do you think?

@natalieparellano
Member

Blocking on buildpacks/rfcs#291

@ESWZY ESWZY force-pushed the parallel-export branch 2 times, most recently from 16502eb to c3d65c4 Compare September 21, 2023 17:35
@kritkasahni-google
Contributor

kritkasahni-google commented Oct 9, 2023

Opened #1215 to track messaging with platform in async fashion

Member

@natalieparellano natalieparellano left a comment

@ESWZY thank you for pushing this forward. I added a couple of comments. Would you be willing to make the spec PR also? This could go out in Platform 0.13.

Comment on lines +79 to +83
	if c.ParallelExport {
		if c.CacheImageRef == "" {
			cmd.DefaultLogger.Warn("parallel export has been enabled, but it has not taken effect because cache image (-cache-image) has not been specified.")
		}
	}
Member

Could we put this validation in the platform package? Somewhere in ResolveInputs? Our eventual aim is to move all such validations there. That would also have the nice side effect of printing the warning when cmd/lifecycle/exporter is invoked with this configuration.

}
if sha, ok := f.tarHashes[tarPath]; ok {
f.Logger.Debugf("Reusing tarball for layer %q with SHA: %s\n", id, sha)
if sha, ok := f.tarHashes.Load(tarPath); ok {
Member

I've been thinking about it, and I think we could still have a race condition if this load returns !ok. We could end up processing the same tar path in parallel - exporter and cacher each reading all the bits in the layer before one of them stores the result. What about something like:

const processing = "processing"

func (f *Factory) writeLayer(id, createdBy string, addEntries func(tw *archive.NormalizingTarWriter) error) (layer Layer, err error) {
	tarPath := filepath.Join(f.ArtifactsDir, escape(id)+".tar")
	var (
		tries int
		sha any
		loaded bool
	)
	for {
		sha, loaded = f.tarHashes.LoadOrStore(tarPath, processing)
		if loaded {
			shaString := sha.(string)
			if shaString == processing {
				// another goroutine is processing this layer, wait and try again
				time.Sleep(time.Duration(tries) * 500 * time.Millisecond)
				tries++
				continue
			}
			f.Logger.Debugf("Reusing tarball for layer %q with SHA: %s\n", id, shaString)
			return Layer{
				ID:      id,
				TarPath: tarPath,
				Digest:  shaString,
				History: v1.History{CreatedBy: createdBy},
			}, nil
		}
		break
	}
	// function continues...

We could debate about the manner of the backoff but this seems preferable to potentially reading all the bits twice, especially for large layers which tend to take a lot of time anyway. @jabrown85 do you have any thoughts?

Contributor

@jabrown85 wdyt?
I think that even if backing off and retrying this way is not faster than processing the same layer again, it at least won't be slower. I will implement exponential backoff and retry. @natalieparellano @jabrown85 is that okay?

Contributor

On second thought, do you think "exponential" backoff could make it slower in some cases? Ideally parallelism should make things faster, and "exponential" backoff might be counter-productive in the worst case. I think a fixed delay would be better here, say 500ms or even 1 sec. wdyt?

Contributor

I think starting with what @natalieparellano suggested seems reasonable - we can always circle back and adjust any delay timings as we get feedback and real-world cases.

RunImageRef: runImageID,
RunImageForExport: runImageForExport,
WorkingImage: appImage,
})
Member

Could we move encoding.WriteTOML(e.ReportPath, &report) into this func as well? That would allow report.toml to be written before the cache has finished, which would allow platforms to use the presence of this file as a signal that the image is ready.

Contributor

That makes sense @natalieparellano - working on this change. One quick question: when parallel export is enabled and the goroutine exporting the app image fails, should we cancel the goroutine exporting the cache image or wait for it to complete? wdyt?

@kritkasahni-google
Contributor

@ESWZY if you are preoccupied with something else, I could take this change forward if that's okay with you and everyone? I am interested in trying out this change with our buildpacks as well.

@ESWZY
Author

ESWZY commented Nov 1, 2023

@kritkasahni-google That's okay, thank you for your contribution.

cc @natalieparellano

@kritkasahni-google
Contributor

Spec PR buildpacks/spec#380

@natalieparellano
Member

Superseded by #1247. Thanks again for your work @ESWZY!
