
Unable to evaluate the results #9

Open
glofru opened this issue Aug 26, 2023 · 0 comments

Hello,

I am trying to run these models to evaluate the results, but runtime errors prevent me from doing so.

The best "result" I could get is with this Dockerfile (at the root of the project):

FROM nvidia/cuda:11.4.3-cudnn8-devel-ubuntu18.04

ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

# Install system dependencies
RUN apt-get update && \
    apt-get install -y \
    git \
    wget \
    python3-pip \
    python3-dev \
    python3-opencv \
    python3-six

# Upgrade pip
RUN python3 -m pip install --upgrade pip

RUN pip3 install setuptools openmim

# Install PyTorch and torchvision
RUN pip3 install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html
RUN python3 -m pip install h5py albumentations tensorboardX gdown scipy

RUN python3 -m mim install mmcv

WORKDIR /

RUN wget http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/nyu_depth_v2_labeled.mat -O nyu_depth_v2_labeled.mat

RUN git clone https://github.com/vinvino02/GLPDepth.git --depth 1

RUN mv GLPDepth/code/utils/logging.py GLPDepth/code/utils/glp_depth_logging.py


# Set the working directory
WORKDIR /app


RUN python3 ../GLPDepth/code/utils/extract_official_train_test_set_from_mat.py ../nyu_depth_v2_labeled.mat ../GLPDepth/datasets/splits.mat ./data/nyu_depth_v2/official_splits/


# RUN ln -s data ait/data


COPY requirements.txt requirements.txt

RUN python3 -m pip install -r requirements.txt

COPY . .

RUN rm -rf .git

I built the image with:

sudo docker build -t mde . -f Dockerfile

And ran it with:

sudo docker run --name mde-test --gpus all --ipc=host -it --rm -v $(pwd):/app mde

Finally, I ran the evaluation command, for example:

cd ait
python3 -m torch.distributed.launch --nproc_per_node=1 code/train.py configs/swinv2b_480reso_parallel_depthonly.py  --cfg-options model.task_heads.depth.vae_cfg.pretrained=../models/vqvae_depth_2bp.pt --eval ../models/ait_depth_swinv2b_parallel.pth

This launches inference, which runs through all 654 samples, after which the process is killed without an informative error:

eval task depth
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 654/654, 2.5 task/s, elapsed: 262s, ETA:     0s
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 34) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
===================================================
code/train.py FAILED
---------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-26_03:01:18
  host      : f50427e7ad50
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 34)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 34
===================================================
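For reference, the -9 exit code in the log is the negated signal number, which matches the "Signal 9 (SIGKILL)" line. SIGKILL cannot be caught by the process, which is why no Python traceback from code/train.py appears. A minimal sketch of decoding such exit codes follows. `describe_exit_code` is a hypothetical helper written for illustration. It is not part of torch or of this repository.

```python
import signal


def describe_exit_code(exit_code):
    """Return a readable description of a child-process exit code.

    Hypothetical helper for illustration only.
    """
    if exit_code < 0:
        # By convention, a negative code means the child was terminated
        # by the signal whose number is the absolute value of the code.
        try:
            name = signal.Signals(-exit_code).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal {} ({})".format(-exit_code, name)
    if exit_code == 0:
        return "exited normally"
    return "exited with status {}".format(exit_code)


print(describe_exit_code(-9))  # prints "killed by signal 9 (SIGKILL)"
```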

Are the authors able to provide the versions of all the software they are using? In particular:

  • Linux version and distribution
  • CUDA version
  • Python version
  • Package versions (some are missing from the requirements file)
  • Any other relevant information
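To make the comparison easier, the items above could be collected with a short script like the sketch below and pasted into a reply. `collect_versions` and the package list are my own assumptions, not something from this repository. The only torch-specific attribute it reads is `torch.version.cuda`, which reports the CUDA version the installed torch build was compiled against.

```python
import importlib
import platform
import sys


def collect_versions(package_names):
    """Collect OS, Python, and package versions into a dict.

    Hypothetical helper for illustration; the package names are assumptions.
    """
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = "not installed"
            continue
        # Most packages expose __version__, but not all of them do.
        info[name] = getattr(module, "__version__", "unknown")
    # If torch was imported above, also record the CUDA version of its build.
    torch = sys.modules.get("torch")
    if torch is not None:
        version_module = getattr(torch, "version", None)
        info["cuda (torch build)"] = getattr(version_module, "cuda", None)
    return info


if __name__ == "__main__":
    packages = ["torch", "torchvision", "mmcv", "numpy"]
    for key, value in collect_versions(packages).items():
        print("{}: {}".format(key, value))
```

The script avoids `importlib.metadata` so that it also runs on the Python 3.6 interpreter shown in the traceback.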

Thanks.
