Export cache image and app image in parallel #1167
Conversation
I quite like this and was also thinking about proposing it recently. The only risk I see is that the logs for the two internal steps will interleave, but that hardly seems worth worrying about.

To answer your question earlier: layer reuse won't be affected. The cache is an image with just a single layer, not the same layers that the run image uses. If I'm understanding your question correctly 😄

I'm +1 on this, but I want to wait for @natalieparellano @joe-kimmel-vmw to weigh in as well. Unsure if we would need/want to guard this by platform version or any other mechanism. My vote would be no: we don't guarantee logs or their order as part of any API.
Thanks for your contribution @ESWZY !

LGTM; I'm +1 on using …
Thanks for your comment! It is reasonable to add tests for the parallelized export process. But I found that most of the existing tests exercise the two steps of … separately, while the modified parallel logic sits in between them. Should I reuse the existing testing logic? I found some test cases here that feel similar, but I'm not sure whether I should create a new test case based on them, or just reuse this case and modify it:

lifecycle/acceptance/exporter_test.go, lines 294 to 318 in f8b3419

```go
// Add these lines to verify that both exports succeed
h.Run(t, exec.Command("docker", "pull", cacheImageName))
h.Run(t, exec.Command("docker", "pull", exportedImageName))
```
Since you are just using goroutines, this is not a big change in terms of logic, so it would be fine to just validate that those steps complete successfully as expected. I guess we can modify the existing tests, asserting that everything went well. Thanks 😄
@ESWZY is this ready for a re-review?
@natalieparellano @dlion Sorry for not replying for so long. I was trying to design a test case for the parallel export process, but had no idea how to do it. The original test cases already cover exporting a single app image, a single cache image, and both together; these cases already partly prove that the parallel export process works normally. Should we add some more complex, or more intensive, export tests? Or just use the original test cases without any modification? Thx.
Looks good to me! @jabrown85 would you like to have another look?
I've been thinking about this more and I think we need to check that …
Looking into the code a little bit, I think we'll want to put a mutex around …

Edit: we won't want to lock all of …
I think we can make … (line 45 in df9ae90)
Good idea! I changed to use …
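Since the diff later in this thread shows `f.tarHashes.Load(tarPath)`, the change was presumably from a plain map to a `sync.Map`. A minimal sketch of that swap, with the `Factory` struct trimmed to assumed fields and a hypothetical helper for illustration:

```go
package layers

import "sync"

// Factory is trimmed to the fields this sketch needs; the real struct has more.
type Factory struct {
	ArtifactsDir string
	// Previously something like: tarHashes map[string]string, which is not
	// safe once the app-image and cache-image exports write concurrently.
	tarHashes sync.Map // tar path -> layer SHA
}

// cachedSHA shows the map-indexing replacement: sync.Map.Load returns
// (any, bool), so the value must be type-asserted back to string.
func (f *Factory) cachedSHA(tarPath string) (string, bool) {
	if v, ok := f.tarHashes.Load(tarPath); ok {
		return v.(string), true
	}
	return "", false
}
```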
@ESWZY @natalieparellano @jabrown85 I am thinking about this: do we really need this change? Instead, I have this proposal (and it could be optional on the platform side, based on some input indicating that the platform wants/expects this behavior): if lifecycle/exporter could write some status like "APP_IMAGE_READY" (in the context of the corresponding build/execution) to CNB_PLATFORM_DIR as soon as the app image is ready [1], without waiting for the cache image export, our platform could relay that to interested services, which could then immediately deploy the app image and wouldn't need to wait for the cache image export.

[1] https://github.com/buildpacks/lifecycle/blob/main/cmd/lifecycle/exporter.go#L227

@ESWZY Would implementing this proposal help your use case as well?
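A rough sketch of what writing such a marker could look like. The file name `APP_IMAGE_READY` comes from the comment above; the helper and its signature are illustrative, not lifecycle API:

```go
package main

import (
	"os"
	"path/filepath"
)

// writeAppImageReady drops a status file into the platform directory as soon
// as the app image export finishes, so a platform watching that directory can
// start deploying without waiting for the cache image export to complete.
func writeAppImageReady(platformDir, appImageDigest string) error {
	path := filepath.Join(platformDir, "APP_IMAGE_READY")
	// Writing the digest lets the platform know exactly which image is ready.
	return os.WriteFile(path, []byte(appImageDigest+"\n"), 0o644)
}
```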
@kritkasahni-google I like the idea of messaging with the platform in an async fashion. Maybe something like … Essentially platform hooks. We could formalize a hook concept if we thought that was better. Similar to git hooks, where we execute …

What do you think would be best for your platform @kritkasahni-google? lifecycle-written files or executables?
Another option would be to look for the presence of …
@kritkasahni-google Actually, our team also needs to control the export behavior. As you can see in the PR description, there is a few seconds' improvement overall, but the app image export itself does slow down by a few seconds. If there were a trigger mechanism in the image registry, the deployment steps could be executed as soon as the app image lands; for scenarios that aren't covered by that, we need this kind of concurrent export. So I also agree that there should be a platform option to specify whether parallelism is enabled. Is such a design feasible? @natalieparellano @joe-kimmel-vmw
@ESWZY we could add that! It would require an RFC: https://github.com/buildpacks/rfcs#rfc-process. Is that something you'd be willing to contribute? We could guide you through the process if that would be helpful.
@ESWZY Makes sense that parallelism could be enabled/disabled based on input from the platform. If we could keep this behind a flag, I would be interested in trying out whether and how parallelism helps us; depending on the perf gains (or lack thereof), we could then easily disable it using that flag.
I would like to! Let me read RFC 0004 first. 🥰
@jabrown85 Let me get back to you in a bit about this; I am also discussing it with our platform folks at the moment.
@jabrown85 Our platform team has tabled this for now; it is hard to say when it will get prioritized. But I would be interested in making this change provided more users ask for it. Let me open a new issue to track this. What do you think?
Blocking on buildpacks/rfcs#291
Opened #1215 to track messaging with the platform in an async fashion.
@ESWZY thank you for pushing this forward. I added a couple of comments. Would you be willing to make the spec PR also? This could go out in Platform 0.13.
```go
if c.ParallelExport {
	if c.CacheImageRef == "" {
		cmd.DefaultLogger.Warn("parallel export has been enabled, but it has not taken effect because cache image (-cache-image) has not been specified.")
	}
}
```
Could we put this validation in the platform package? Somewhere in ResolveInputs? Our eventual aim is to move all such validations there. That would also have the nice side effect of printing the warning when cmd/lifecycle/exporter is invoked with this configuration.
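A hedged sketch of what hoisting the check into input resolution could look like; `ResolveInputs` is the real entry point named above, but the trimmed struct and helper below are assumptions for illustration:

```go
package platform

// LifecycleInputs is trimmed to the two fields this validation reads.
type LifecycleInputs struct {
	ParallelExport bool
	CacheImageRef  string
}

// warnParallelExportNoop would run during input resolution so that every
// entrypoint (creator and exporter alike) prints the same warning.
func warnParallelExportNoop(i *LifecycleInputs, warn func(msg string)) {
	if i.ParallelExport && i.CacheImageRef == "" {
		warn("parallel export has been enabled, but it has not taken effect because cache image (-cache-image) has not been specified.")
	}
}
```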
```diff
 }
-	if sha, ok := f.tarHashes[tarPath]; ok {
-		f.Logger.Debugf("Reusing tarball for layer %q with SHA: %s\n", id, sha)
+	if sha, ok := f.tarHashes.Load(tarPath); ok {
```
I've been thinking about it, and I think we could still have a race condition if this load returns !ok. We could end up processing the same tar path in parallel: exporter and cacher each reading all the bits in the layer before one of them stores the result. What about something like:
```go
const processing = "processing"

func (f *Factory) writeLayer(id, createdBy string, addEntries func(tw *archive.NormalizingTarWriter) error) (layer Layer, err error) {
	tarPath := filepath.Join(f.ArtifactsDir, escape(id)+".tar")
	var (
		tries  int
		sha    any
		loaded bool
	)
	for {
		sha, loaded = f.tarHashes.LoadOrStore(tarPath, processing)
		if loaded {
			shaString := sha.(string)
			if shaString == processing {
				// another goroutine is processing this layer, wait and try again
				time.Sleep(time.Duration(tries) * 500 * time.Millisecond)
				tries++
				continue
			}
			f.Logger.Debugf("Reusing tarball for layer %q with SHA: %s\n", id, shaString)
			return Layer{
				ID:      id,
				TarPath: tarPath,
				Digest:  shaString,
				History: v1.History{CreatedBy: createdBy},
			}, nil
		}
		break
	}
	// function continues...
```
We could debate the manner of the backoff, but this seems preferable to potentially reading all the bits twice, especially for large layers, which tend to take a lot of time anyway. @jabrown85 do you have any thoughts?
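One edge case worth noting with the sentinel approach: if `writeLayer` fails after storing `processing`, the waiting goroutines would retry forever. A sketch of a guard against that, assuming the `sync.Map`-backed `tarHashes`; the helper and its signature are hypothetical:

```go
// writeLayerGuarded assumes the caller has just won the LoadOrStore race with
// the "processing" sentinel. If writing the tarball fails, the sentinel is
// deleted so other goroutines don't sleep forever on a layer that never comes.
func (f *Factory) writeLayerGuarded(tarPath string, write func() (string, error)) (sha string, err error) {
	defer func() {
		if err != nil {
			f.tarHashes.Delete(tarPath) // let another goroutine retry this layer
		}
	}()
	if sha, err = write(); err != nil {
		return "", err
	}
	f.tarHashes.Store(tarPath, sha) // publish the real digest on success
	return sha, nil
}
```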
@jabrown85 wdyt?
I think retrying with backoff this way will be at least no slower than processing the same layer again, and possibly faster. I will implement exponential backoff and retry. @natalieparellano @jabrown85 is that okay?
On second thought, do you think "exponential" backoff could make it slower in some cases? Ideally we need parallelism to make this faster, and exponential backoff might be counter-productive in the worst case; I think a fixed delay would be better here. How about 500ms, or even 1 second? wdyt?
I think starting with what @natalieparellano suggested seems reasonable; we can always circle back and adjust any delay timings as we get feedback and real-world cases.
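For concreteness, the fixed-delay variant discussed above would only change the sleep line in the earlier sketch; 500ms is the value floated above, not a measured optimum:

```go
// Fixed delay between polls of the "processing" sentinel,
// instead of scaling the wait with the retry count.
time.Sleep(500 * time.Millisecond)
```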
```go
	RunImageRef:       runImageID,
	RunImageForExport: runImageForExport,
	WorkingImage:      appImage,
})
```
Could we move encoding.WriteTOML(e.ReportPath, &report) into this func as well? That would allow report.toml to be written before the cache has finished, which would allow platforms to use the presence of this file as a signal that the image is ready.
That makes sense @natalieparellano; working on this change. One quick question: when parallel export is enabled and the goroutine exporting the app image fails, should we cancel the goroutine exporting the cache image, or wait for it to complete? wdyt?
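On the cancel-or-wait question, one common shape for this in Go is errgroup.WithContext, where the first failure cancels the shared context and a context-aware cache export can stop early. A sketch, with the export funcs as stand-ins rather than the actual exporter wiring:

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// exportBoth runs the two exports in parallel. If exportApp fails, ctx is
// canceled, and an exportCache that checks ctx can bail out early; Wait still
// returns only after both goroutines have finished.
func exportBoth(ctx context.Context, exportApp, exportCache func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return exportApp(ctx) })
	g.Go(func() error { return exportCache(ctx) })
	return g.Wait() // first non-nil error wins
}
```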
@ESWZY if you are preoccupied with something else, I could take this change forward, if that's okay with you and everyone. I am interested in trying out this change with our buildpacks as well.
@kritkasahni-google That's okay, thank you for your contribution.
Spec PR: buildpacks/spec#380
In our scenario, the app image and the cache image need to be exported at the same time, but this process is serial in the lifecycle, which means that after the app image is exported, we have to wait for the cache image to be exported as well. From our measurements, the time to export the app image is about the same as the time to export the cache image, but we don't actually need to wait for the cache image: once the app image is exported, we can proceed to the next steps (distribution and deployment).

So we tried parallelizing this step (this PR) and compared it with serial exporting. We used several projects for testing and pushed app images and cache images to the same self-hosted registry.
Java (app image 202.361 MB, cache image 157.525 MB, one shared layer of 107.648 MB):

Go (app image 114.273 MB, cache image 175.833 MB, no shared layers):
We do get an improvement here: although the export time of each individual image increases, the total time decreases. The effect should be more pronounced when exporting to different registries or with faster bandwidth.
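For readers skimming the thread, the change itself boils down to running the two export steps concurrently rather than sequentially; a minimal sketch of the shape, with stand-in functions rather than the actual lifecycle code:

```go
package main

import "sync"

// exportInParallel runs the app-image and cache-image export steps at the
// same time, so total time is roughly the max of the two instead of the sum.
func exportInParallel(exportApp, exportCache func() error) (appErr, cacheErr error) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); appErr = exportApp() }()
	go func() { defer wg.Done(); cacheErr = exportCache() }()
	wg.Wait()
	return appErr, cacheErr
}
```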
But my confusion is: if the app image and cache image contain the same layers, will this method no longer reuse layers when pushing to the registry? Or should we detect / specify whether to use a parallel pushing strategy? Thx.