[Multi-Job][3/N] Distinguish cross_silo msg with the job name #172

Merged
NKcqx merged 17 commits into main on Aug 15, 2023

Conversation

NKcqx
Collaborator

@NKcqx NKcqx commented Aug 10, 2023

In a multi-job scenario, Job 1 may send messages to a proxy actor that belongs to Job 2.
For example, suppose two jobs, [Job 1] and [Job 2], both involve the parties 'alice' and 'bob'. A possible sequence is:

  1. [Alice] The driver submits Job 1, sends messages to bob's receiver proxy, and waits for a response.
  2. [Bob] For some reason, bob fails to start the receiver proxy, and the driver exits.
  3. [Bob] The driver submits Job 2, whose receiver proxy listens on the same address.
  4. [Bob] The receiver proxy receives alice's message, which belongs to Job 1, and crashes.

Therefore, each cross-silo message needs to carry a job_name so the receiver can tell jobs apart. With this change, the receiver proxy in step 4 ignores the message instead of crashing.
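
The scenario above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in this PR: `Message`, `ReceiverProxy`, and `on_message` are hypothetical names, and the real proxy is a gRPC service rather than a dataclass.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    """Hypothetical stand-in for the cross-silo request message."""
    job_name: str
    payload: bytes


@dataclass
class ReceiverProxy:
    """Hypothetical receiver proxy that is bound to exactly one job."""
    job_name: str
    accepted: List[bytes] = field(default_factory=list)

    def on_message(self, msg: Message) -> str:
        # Reject anything tagged with a different job instead of processing it.
        if msg.job_name != self.job_name:
            return (f"JobName mis-match, expected {self.job_name}, "
                    f"got {msg.job_name}.")
        self.accepted.append(msg.payload)
        return "OK"


# Bob's proxy was started by Job 2, but a late message from Alice's Job 1 arrives.
bob_proxy = ReceiverProxy(job_name="job2")
stale = bob_proxy.on_message(Message(job_name="job1", payload=b"x"))
fresh = bob_proxy.on_message(Message(job_name="job2", payload=b"y"))
```

Only the Job 2 payload ends up in `bob_proxy.accepted`; the Job 1 message is answered with a non-"OK" result and otherwise ignored.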

@NKcqx NKcqx added the enhancement and p0 labels Aug 10, 2023
@NKcqx NKcqx added this to the release0.1.1 milestone Aug 10, 2023
@NKcqx NKcqx requested review from fengsp and a team August 10, 2023 08:40
paer added 3 commits August 10, 2023 16:44
def wrap_kv_key(job_name, key):
    """Add a prefix to the key to avoid conflict with other jobs.
    """
    if (isinstance(key, bytes)):
Collaborator

Let us just restrict the type to str only?

Collaborator Author

@NKcqx NKcqx Aug 11, 2023

But we can't restrict the KV get and put to only use the str type key.

Do you mean changing all of the current KV accesses to use str keys only?

Collaborator

I meant:

def wrap_kv_key(job_name, key):
    """Add a prefix to the key to avoid conflict with other jobs.
    """
    assert isinstance(key, str)
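
A complete version of that suggestion might look like the sketch below. The "#" separator and the assertion message are assumptions made for illustration; the thread does not show how the PR actually builds the prefixed key.

```python
def wrap_kv_key(job_name: str, key: str) -> str:
    """Add a prefix to the key to avoid conflict with other jobs."""
    # Restrict keys to str, as suggested in the review.
    assert isinstance(key, str), f"KV key must be str, got {type(key).__name__}"
    # Assumption: "#" is used as the separator purely for illustration.
    return f"{job_name}#{key}"


# The same logical key maps to different KV entries for different jobs.
key_job1 = wrap_kv_key("job1", "cluster_config")
key_job2 = wrap_kv_key("job2", "cluster_config")
```

With a str-only contract, callers that currently pass bytes keys would need to decode them before calling the helper.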

fed/api.py (outdated, resolved)
@@ -48,23 +48,25 @@ def cross_silo_comm_config_dict(self) -> Dict:
 _job_config = None


-def get_cluster_config():
+def get_cluster_config(job_name: str = None):
Collaborator

Hmm, so should this method be renamed to get_job_config or get_config?

Collaborator Author

Both "cluster_config" and "job_config" are part of the JOB, see #156. So I think it's fine to retrieve a "cluster config" by "job name" though it's definitely hard to understand 🤣

I think in the near future, we can merge these two configs.

Collaborator

OK. Then let us leave a TODO comment on that target?

Collaborator Author

Sure

fed/grpc/pb3/__init__.py (resolved)

    async def SendData(self, request, context):
        job_name = request.job_name
        if job_name != self._job_name:
            return fed_pb2.SendDataResponse(result="ERROR")
Collaborator

Should the error message be more detailed?

tests/test_transport_proxy.py (outdated, resolved)
tests/multi-jobs/test_job_msg_ignore.py (outdated, resolved)
@NKcqx NKcqx changed the title from "[Multi-Job][2/N] Distinguish cross_silo msg with the job name" to "[Multi-Job][3/N] Distinguish cross_silo msg with the job name" Aug 11, 2023
NKcqx and others added 3 commits August 11, 2023 14:24
Comment on lines +289 to +290
            return fed_pb2.SendDataResponse(
                result=f"JobName mis-match, expected {self._job_name}, got {job_name}.")
Collaborator Author

A non-"OK" response indicates that

  1. the message was sent successfully, and
  2. an error occurred in the receiver party, i.e. it is a cross-silo error.

Therefore, the handling that follows should belong to the cross-silo error-handling mechanism.
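
On the sender side, that distinction could be expressed roughly as follows. This is a hedged sketch, not the project's actual error-handling code: `CrossSiloError` and `check_send_result` are hypothetical names introduced here.

```python
class CrossSiloError(Exception):
    """Hypothetical error type for failures reported by the receiving party."""


def check_send_result(result: str) -> None:
    """Raise if the receiver answered with anything other than "OK"."""
    if result != "OK":
        # The RPC itself succeeded; the receiver rejected or failed on the message,
        # so this is treated as a cross-silo error rather than a transport failure.
        raise CrossSiloError(f"Receiver reported an error: {result}")
```

A transport-level failure (for example, the RPC never completing) would be handled separately, since in that case no response string exists to inspect.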

Collaborator

@jovany-wang jovany-wang left a comment

LGTM

@@ -46,7 +46,10 @@ def __init__(
        submit_ray_task_func,
        options={},
    ) -> None:
        self._party = fed_config.get_cluster_config().current_party
        # Note(NKcqx): FedCallHolder will only be created in driver process, where
        # the GlobalContext must have been initialized.
Collaborator

Good note.

@NKcqx NKcqx merged commit f0ccfb3 into main Aug 15, 2023