[Umbrella] RayClusterStatusConditions Tests Review #2645

Closed
rueian opened this issue Dec 13, 2024 · 4 comments

@rueian (Contributor) commented Dec 13, 2024

Continuing from #2562: We have 5 status conditions on the RayCluster CR, gated by the RayClusterStatusConditions feature gate in v1.2.0. Now we want to make them enabled by default in v1.3.0. This issue is an umbrella for reviewing the tests for these status conditions (a short sketch of how the conditions can be read follows the list below).

  • HeadPodReady
  • RayClusterProvisioned
  • ReplicaFailure
  • RayClusterSuspending
  • RayClusterSuspended
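
For reference, these conditions are standard metav1.Condition entries on .status.conditions, so tests and downstream controllers can read them with the apimachinery helpers. A minimal sketch, assuming a hypothetical helper and a freshly fetched RayCluster (this is illustration only, not existing KubeRay code):

package example

import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// isClusterUsable is a hypothetical helper showing how the conditions can be read
// with the standard apimachinery helpers once the operator has populated them.
func isClusterUsable(rayCluster *rayv1.RayCluster) bool {
    headReady := meta.IsStatusConditionTrue(rayCluster.Status.Conditions, string(rayv1.HeadPodReady))
    provisioned := meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned), metav1.ConditionTrue)
    suspended := meta.FindStatusCondition(rayCluster.Status.Conditions, string(rayv1.RayClusterSuspended)) // nil if the condition was never set
    return headReady && provisioned && (suspended == nil || suspended.Status == metav1.ConditionFalse)
}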

HeadPodReady and RayClusterProvisioned

In the envtest

It("Should handle HeadPodReady and RayClusterProvisioned conditions correctly", func(ctx SpecContext) {

  1. We make sure that the HeadPodReady condition is set when the Head Pod is ready.

    By("Update the head pod to Running and Ready")
    headPod.Status.Phase = corev1.PodRunning
    headPod.Status.Conditions = []corev1.PodCondition{
    {
    Type: corev1.PodReady,
    Status: corev1.ConditionTrue,
    },
    }
    Expect(k8sClient.Status().Update(ctx, &headPod)).Should(Succeed())
    By("Check RayCluster HeadPodReady condition is true")
    // The head pod is ready, so HeadPodReady condition should be added and set to True.
    Eventually(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.HeadPodReady), metav1.ConditionTrue)
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())

  2. We make sure that the RayClusterProvisioned condition is NOT set while the worker Pods are not ready, and is set after they become ready.

    By("Check RayCluster RayClusterProvisioned condition is false")
    // But the worker pod is not ready yet, RayClusterProvisioned condition should be false.
    Consistently(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionFalse(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned))
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())
    By("Update the worker pod to Running")
    workerPod.Status.Phase = corev1.PodRunning
    workerPod.Status.Conditions = []corev1.PodCondition{
    {
    Type: corev1.PodReady,
    Status: corev1.ConditionTrue,
    },
    }
    Expect(k8sClient.Status().Update(ctx, &workerPod)).Should(Succeed())
    By("Check RayCluster RayClusterProvisioned condition is true")
    // All Ray Pods are ready for the first time, RayClusterProvisioned condition should be added and set to True.
    Eventually(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned), metav1.ConditionTrue)
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())

  3. We make sure that the RayClusterProvisioned condition remains set after the worker Pods become not ready.

    By("Update the worker pod to NotReady")
    workerPod.Status.Conditions = []corev1.PodCondition{
    {
    Type: corev1.PodReady,
    Status: corev1.ConditionFalse,
    },
    }
    Expect(k8sClient.Status().Update(ctx, &workerPod)).Should(Succeed())
    By("Check RayCluster RayClusterProvisioned condition is true")
    // The worker pod fails readiness, but since RayClusterProvisioned focuses solely on whether all Ray Pods are ready for the first time,
    // RayClusterProvisioned condition should still be True.
    Consistently(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned), metav1.ConditionTrue)
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())

  4. We make sure that the RayClusterProvisioned condition remains set, but the HeadPodReady condition becomes false, after the head Pod becomes not ready.

    By("Update the head pod to NotReady")
    headPod.Status.Conditions = []corev1.PodCondition{
    {
    Type: corev1.PodReady,
    Status: corev1.ConditionFalse,
    },
    }
    Expect(k8sClient.Status().Update(ctx, &headPod)).Should(Succeed())
    By("Check RayCluster HeadPodReady condition is false")
    // The head pod is not ready, so HeadPodReady should be false.
    Eventually(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.HeadPodReady), metav1.ConditionFalse)
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())
    By("Check RayCluster RayClusterProvisioned condition is still true")
    // The head pod also fails readiness, RayClusterProvisioned condition not changed.
    Eventually(
    func() bool {
    if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
    return false
    }
    return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned), metav1.ConditionTrue)
    },
    time.Second*3, time.Millisecond*500).Should(BeTrue())

ReplicaFailure

In the envtest, we make sure the ReplicaFailure condition is set when Pods can't be created.

It("Should handle RayClusterReplicaFailure condition correctly", func(ctx SpecContext) {
namespace := "default"
rayCluster := rayClusterTemplate("raycluster-status-conditions-enabled-invalid", namespace)
rayCluster.Spec.HeadGroupSpec.Template.Spec.Containers[0].ImagePullPolicy = "!invalid!"
By("Create an invalid RayCluster custom resource")
err := k8sClient.Create(ctx, rayCluster)
Expect(err).NotTo(HaveOccurred(), "Failed to create RayCluster")
Eventually(
getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster),
time.Second*3, time.Millisecond*500).Should(BeNil(), "Should be able to see RayCluster: %v", rayCluster.Name)
By("Check RayCluster RayClusterReplicaFailure condition is true")
Eventually(
func() bool {
if err := getResourceFunc(ctx, client.ObjectKey{Name: rayCluster.Name, Namespace: namespace}, rayCluster)(); err != nil {
return false
}
return meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterReplicaFailure), metav1.ConditionTrue)
},
time.Second*3, time.Millisecond*500).Should(BeTrue())
})

In the unit test, we make sure the calculateStatus function sets the RayClusterReplicaFailure condition based on the reconcileErr.

// Test reconcilePodsErr with the feature gate enabled
newInstance, err = r.calculateStatus(ctx, testRayCluster, errors.Join(utils.ErrFailedCreateHeadPod, errors.New("invalid")))
assert.Nil(t, err)
assert.True(t, meta.IsStatusConditionPresentAndEqual(newInstance.Status.Conditions, string(rayv1.RayClusterReplicaFailure), metav1.ConditionTrue))

RayClusterSuspending

In the envtest

Describe("Suspend RayCluster atomically with Condition", Ordered, func() {

  1. We make sure the RayClusterSuspending condition is set when .Spec.Suspend is true.
    It("Should turn on the RayClusterSuspending if we set `.Spec.Suspend` back to true", func() {
    // suspend a Raycluster.
    err := updateRayClusterSuspendField(ctx, rayCluster, true)
    Expect(err).NotTo(HaveOccurred(), "Failed to update RayCluster")
    Eventually(findRayClusterSuspendStatus, time.Second*3, time.Millisecond*500).
    WithArguments(ctx, rayCluster).Should(Equal(rayv1.RayClusterSuspending))
    })
  2. We make sure the RayClusterSuspending condition remains set when .Spec.Suspend is turned back to false before the suspension finishes.
    It("Should keep RayClusterSuspending consistently if we set `.Spec.Suspend` back to false", func() {
        err := updateRayClusterSuspendField(ctx, rayCluster, false)
        Expect(err).NotTo(HaveOccurred(), "Failed to update RayCluster")
        Consistently(findRayClusterSuspendStatus, time.Second*3, time.Millisecond*500).
            WithArguments(ctx, rayCluster).Should(Equal(rayv1.RayClusterSuspending))
    })

RayClusterSuspended

  1. We make sure the RayClusterSuspended condition is set and RayClusterProvisioned is unset when all Pods are gone.
    It("RayCluster's .status.state should be updated to 'suspended' shortly after all Pods are terminated", func() {
    Eventually(
    getClusterState(ctx, namespace, rayCluster.Name),
    time.Second*3, time.Millisecond*500).Should(Equal(rayv1.Suspended))
    if !withConditionDisabled {
    Eventually(findRayClusterSuspendStatus, time.Second*3, time.Millisecond*500).
    WithArguments(ctx, rayCluster).Should(Equal(rayv1.RayClusterSuspended))
    Expect(meta.IsStatusConditionTrue(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned))).To(BeFalse())
    // rayCluster.Status.Head.PodName will be cleared.
    // rayCluster.Status.Head.PodIP will also be cleared, but we don't test it here since we don't have IPs in tests.
    Expect(rayCluster.Status.Head.PodName).To(BeEmpty())
    }
    })
  2. We make sure the RayClusterSuspended condition is removed after the RayCluster is resumed.
    It("RayCluster's .status.state should be updated back to 'ready' after being un-suspended", func() {
        Eventually(
            getClusterState(ctx, namespace, rayCluster.Name),
            time.Second*3, time.Millisecond*500).Should(Equal(rayv1.Ready))
        if !withConditionDisabled {
            Eventually(findRayClusterSuspendStatus, time.Second*3, time.Millisecond*500).
                WithArguments(ctx, rayCluster).Should(BeEmpty())
            Expect(meta.IsStatusConditionTrue(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned))).To(BeTrue())
            // rayCluster.Status.Head.PodName should have a value now.
            // rayCluster.Status.Head.PodIP should also have a value now, but we don't test it here since we don't have IPs in tests.
            Expect(rayCluster.Status.Head.PodName).NotTo(BeEmpty())
        }
    })
@andrewsykim (Collaborator)

Do we have any e2e tests for this feature? I'm guessing we never added any, since our e2e tests don't make it easy to enable feature gates. But this should be simpler now that the feature gate is always enabled.

@rueian (Contributor, Author) commented Dec 17, 2024

We don't have explicit e2e tests for this feature yet. We do have implicit coverage via the RayService e2e tests, since we check the new status condition in rayservice_controller.go:

if features.Enabled(features.RayClusterStatusConditions) {
    if !meta.IsStatusConditionTrue(rayClusterInstance.Status.Conditions, string(rayv1.HeadPodReady)) {
        logger.Info("The head Pod is not ready, requeue the resource event to avoid redundant custom resource status updates.")
        return false, nil
    }
} else {

To test them explicitly, how about we add some assertions on these condition values in sampleyaml/raycluster_test.go? A rough sketch is below.
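
For illustration only, this is roughly what such an assertion might look like; it does not use the actual sampleyaml helper API, and the fetch callback, function name, and timeout values are hypothetical placeholders:

package example

import (
    "testing"
    "time"

    . "github.com/onsi/gomega"
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// assertClusterConditions is a hypothetical sketch of an e2e assertion on the new
// status conditions; fetch stands in for whatever helper the e2e suite uses to
// re-read the RayCluster under test.
func assertClusterConditions(t *testing.T, fetch func() (*rayv1.RayCluster, error)) {
    g := NewWithT(t)
    g.Eventually(func(g Gomega) {
        rayCluster, err := fetch()
        g.Expect(err).NotTo(HaveOccurred())
        g.Expect(meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.HeadPodReady), metav1.ConditionTrue)).To(BeTrue())
        g.Expect(meta.IsStatusConditionPresentAndEqual(rayCluster.Status.Conditions, string(rayv1.RayClusterProvisioned), metav1.ConditionTrue)).To(BeTrue())
    }, 3*time.Minute, time.Second).Should(Succeed())
}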

rueian changed the title from "[Umbrella][WIP] RayClusterStatusConditions Tests Review" to "[Umbrella] RayClusterStatusConditions Tests Review" on Dec 18, 2024
@kevin85421 (Member)

@rueian, do we need any additional tests after #2661 has been merged?

@rueian (Contributor, Author) commented Dec 18, 2024

I think we could add some tests for the RayClusterSuspending and RayClusterSuspended conditions; #2661 doesn't include them. But testing RayClusterSuspending in the e2e sample-YAML tests might be flaky unless we find an easy way to inject a custom finalizer to prevent deletion (a rough sketch of the idea is below).
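
For illustration, a minimal sketch of the finalizer idea; the finalizer name, headPod, k8sClient, and the surrounding Ginkgo context are assumptions, not existing test code (controllerutil is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil):

// Minimal sketch: pin a Ray Pod with a test-only finalizer so that Pod deletion
// stalls and the cluster observably stays in RayClusterSuspending until the
// finalizer is removed. The finalizer name, headPod, and k8sClient are assumed.
const testOnlyFinalizer = "kuberay.test/block-deletion" // hypothetical finalizer name

// Block deletion of the head Pod before suspending the RayCluster.
controllerutil.AddFinalizer(&headPod, testOnlyFinalizer)
Expect(k8sClient.Update(ctx, &headPod)).To(Succeed())

// ... set .spec.suspend = true here and assert the RayClusterSuspending condition stays set ...

// Unblock deletion so the suspension can finish and RayClusterSuspended is set.
controllerutil.RemoveFinalizer(&headPod, testOnlyFinalizer)
Expect(k8sClient.Update(ctx, &headPod)).To(Succeed())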
