I'm unable to run the TorchRec DLRM reference implementation using the provided Dockerfile and requirements.txt. I'm on the latest revision of the master branch. Full command and output:
```
> cp Dockerfile Dockerfile.torchx
> torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-05 13:26:15 INFO Tracker configurations: {}
torchx 2024-08-05 13:26:15 INFO Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 13:26:15 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 13:26:15 INFO Workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` resolved to filesystem path `/proj/java-gpu/training/recommendation_v2/torchrec_dlrm`
torchx 2024-08-05 13:26:16 INFO Building workspace docker image (this may take a while)...
torchx 2024-08-05 13:26:16 INFO Step 1/7 : ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
torchx 2024-08-05 13:26:16 INFO Step 2/7 : FROM ${FROM_IMAGE_NAME}
torchx 2024-08-05 13:26:16 INFO ---> 71eb2d092138
torchx 2024-08-05 13:26:16 INFO Step 3/7 : RUN apt-get -y update && apt-get -y install git
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 45eded198de2
torchx 2024-08-05 13:26:16 INFO Step 4/7 : WORKDIR /workspace/torchrec_dlrm
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 1b41a30dcd79
torchx 2024-08-05 13:26:16 INFO Step 5/7 : COPY . .
torchx 2024-08-05 13:26:16 INFO ---> ae30b5f5e5a1
torchx 2024-08-05 13:26:16 INFO Step 6/7 : RUN pip install --no-cache-dir -r requirements.txt
torchx 2024-08-05 13:26:16 INFO ---> Running in 3ef0c644fc38
...
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 3ef0c644fc38
torchx 2024-08-05 13:27:02 INFO ---> addfe3ce01cb
torchx 2024-08-05 13:27:02 INFO Step 7/7 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-08-05 13:27:02 INFO ---> Running in 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-08-05 13:27:02 INFO Successfully built 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO Built new image `sha256:861ee2a4e5d33dca93d9fe8847feccd4028d2e27c8f281654307aeec203452bd` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` for role[0]=dlrm_main.
local_docker://torchx/dlrm_main-sbz7tbpcb2sqvd
torchx 2024-08-05 13:27:03 INFO Waiting for the app to finish...
dlrm_main/0 WARNING:torch.distributed.run:
dlrm_main/0 *****************************************
dlrm_main/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dlrm_main/0 *****************************************
dlrm_main/0 [0]:
dlrm_main/0 [0]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [0]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [0]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [0]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [0]:
dlrm_main/0 [0]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [0]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [0]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [0]:
dlrm_main/0 [0]:Traceback (most recent call last):
  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [0]:    import torchmetrics as metrics
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics import functional # noqa: E402
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.checks import check_forward_full_state_property # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [0]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [0]:    if not _module_available(package):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [0]:    module = import_module(module_names[0])
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [0]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [0]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [0]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from .faster_rcnn import *
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [0]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [0]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [0]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [0]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [0]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:
dlrm_main/0 [1]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [1]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [1]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [1]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [1]:
dlrm_main/0 [1]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [1]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [1]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [1]:
dlrm_main/0 [1]:Traceback (most recent call last):
  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [1]:    import torchmetrics as metrics
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics import functional # noqa: E402
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.checks import check_forward_full_state_property # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [1]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [1]:    if not _module_available(package):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [1]:    module = import_module(module_names[0])
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [1]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [1]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [1]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from .faster_rcnn import *
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [1]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [1]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [1]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [1]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [1]:    main(sys.argv[1:])
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 813, in main
dlrm_main/0 [1]:    plan = planner.collective_plan(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/planner/planners.py", line 177, in collective_plan
dlrm_main/0 [1]:    return invoke_on_rank_and_broadcast_result(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/collective_utils.py", line 58, in invoke_on_rank_and_broadcast_result
dlrm_main/0 [1]:    dist.broadcast_object_list(object_list, rank, group=pg)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2106, in broadcast_object_list
dlrm_main/0 [1]:    object_list[i] = _tensor_to_object(obj_view, obj_size)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in _tensor_to_object
dlrm_main/0 [1]:    buf = tensor.numpy().tobytes()[:tensor_size]
dlrm_main/0 [1]:RuntimeError: Numpy is not available
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [0]:    main(sys.argv[1:])
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 817, in main
dlrm_main/0 [0]:    model = DistributedModelParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 232, in __init__
dlrm_main/0 [0]:    self.init_data_parallel()
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 266, in init_data_parallel
dlrm_main/0 [0]:    self._data_parallel_wrapper.wrap(self, self._env, self.device)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 97, in wrap
dlrm_main/0 [0]:    DistributedDataParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
dlrm_main/0 [0]:    _verify_param_shape_across_processes(self.process_group, parameters)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
dlrm_main/0 [0]:    return dist._verify_params_across_processes(process_group, tensors, logger)
dlrm_main/0 [0]:RuntimeError: [/opt/conda/conda-bld/pytorch_1670525552843/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.20.0.2]:54499
dlrm_main/0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25) of binary: /opt/conda/bin/python
dlrm_main/0 [0]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [1]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [0]:{'adagrad': False,
dlrm_main/0 [0]: 'allow_tf32': False,
dlrm_main/0 [0]: 'batch_size': 32,
dlrm_main/0 [0]: 'collect_multi_hot_freqs_stats': False,
dlrm_main/0 [0]: 'dataset_name': 'criteo_1t',
dlrm_main/0 [0]: 'dcn_low_rank_dim': 512,
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0   File "/opt/conda/bin/torchrun", line 33, in <module>
dlrm_main/0     sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
dlrm_main/0 [0]: 'dcn_num_layers': 3,
dlrm_main/0 [0]: 'dense_arch_layer_sizes': [512, 256, 64],
dlrm_main/0 [0]: 'drop_last_training_batch': False,
dlrm_main/0 [0]: 'embedding_dim': 64,
dlrm_main/0 [0]: 'epochs': 1,
dlrm_main/0 [0]: 'evaluate_on_epoch_end': False,
dlrm_main/0 [0]: 'evaluate_on_training_end': False,
dlrm_main/0 [0]: 'in_memory_binary_criteo_path': None,
dlrm_main/0 [0]: 'interaction_branch1_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_branch2_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_type': <InteractionType.ORIGINAL: 'original'>,
dlrm_main/0 [0]: 'learning_rate': 15.0,
dlrm_main/0 [0]: 'limit_test_batches': None,
dlrm_main/0 [0]: 'limit_train_batches': None,
dlrm_main/0 [0]: 'limit_val_batches': None,
dlrm_main/0 [0]: 'lr_decay_start': 0,
dlrm_main/0 [0]: 'lr_decay_steps': 0,
dlrm_main/0 [0]: 'lr_warmup_steps': 0,
dlrm_main/0 [0]: 'mmap_mode': False,
dlrm_main/0 [0]: 'multi_hot_distribution_type': None,
dlrm_main/0 [0]: 'multi_hot_sizes': None,
dlrm_main/0 [0]: 'num_embeddings': 100000,
dlrm_main/0 [0]: 'num_embeddings_per_feature': None,
dlrm_main/0 [0]: 'over_arch_layer_sizes': [512, 512, 256, 1],
dlrm_main/0 [0]: 'pin_memory': False,
dlrm_main/0 [0]: 'print_lr': False,
dlrm_main/0 [0]: 'print_progress': False,
dlrm_main/0 [0]: 'print_sharding_plan': False,
dlrm_main/0 [0]: 'seed': None,
dlrm_main/0 [0]: 'shuffle_batches': False,
dlrm_main/0 [0]: 'shuffle_training_set': False,
dlrm_main/0 [0]: 'synthetic_multi_hot_criteo_path': None,
dlrm_main/0 [0]: 'test_batch_size': None,
dlrm_main/0 [0]: 'validation_auroc': None,
dlrm_main/0 [0]: 'validation_freq_within_epoch': None}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "dlrm_dcnv2", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 7}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 11}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 15}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 19}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 23}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 64, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 705}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 709}}
dlrm_main/0     return f(*args, **kwargs)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "seed", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 713}}
dlrm_main/0     run(args)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
dlrm_main/0     elastic_launch(
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
dlrm_main/0     return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
dlrm_main/0     raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 [1]:
dlrm_main/0   time : 2024-08-05_20:27:09
dlrm_main/0   host : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank : 1 (local_rank: 1)
dlrm_main/0   exitcode : 1 (pid: 26)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0   time : 2024-08-05_20:27:09
dlrm_main/0   host : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank : 0 (local_rank: 0)
dlrm_main/0   exitcode : 1 (pid: 25)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2024-08-05 13:27:10 INFO Job finished: FAILED
torchx 2024-08-05 13:27:10 ERROR AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles:
  - replicas:
    - hostname: dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
      id: 0
      role: dlrm_main
      state: !!python/object/apply:torchx.specs.api.AppState
      - 5
      structured_error_msg: <NONE>
    role: dlrm_main
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: null
```
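From the traceback, this looks like a NumPy version mismatch rather than a problem in dlrm_main.py itself: the base image `pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime` ships torch/torchvision wheels built against NumPy 1.x, but the `pip install --no-cache-dir -r requirements.txt` step pulls in NumPy 2.0.1, so `tensor.numpy()` later fails with `RuntimeError: Numpy is not available`. A minimal workaround sketch (not a verified fix, and assuming nothing in requirements.txt actually needs NumPy 2) is to pin NumPy below 2.0 in the workspace Dockerfile before rebuilding:

```sh
# Workaround sketch: keep NumPy on 1.x inside the image so the torch 1.13.1 /
# torchvision wheels (compiled against NumPy 1.x) can initialize it.
cp Dockerfile Dockerfile.torchx
echo 'RUN pip install --no-cache-dir "numpy<2"' >> Dockerfile.torchx
torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
```

Adding a `numpy<2` constraint directly to requirements.txt (assuming NumPy is not already pinned there) should have the same effect; longer term it probably needs to be pinned in the repo or the base image moved to a torch build that supports NumPy 2.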
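Separately, both ranks also print `libcuda.so.1: cannot open shared object file: No such file or directory`, which suggests the container started by the `local_docker` scheduler cannot see the host NVIDIA driver at all (consistent with the run ending up on the gloo backend). A quick sanity check, assuming the NVIDIA Container Toolkit is installed on the host, is to probe the same base image directly:

```sh
# Hypothetical sanity check: if this prints False or cannot find libcuda.so.1,
# containers on this host are not getting GPU access, independent of torchx.
docker run --rm --gpus all pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
  python -c "import torch; print(torch.cuda.is_available())"
```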