Training phase sanity check fails by loading "../../data/imagenet/val" as an image #22

Open
wd255 opened this issue Jul 13, 2024 · 0 comments

wd255 commented Jul 13, 2024

Original issue: #17. I am posting a new issue because the old one was marked completed and I cannot reopen it.

Thanks @ShiFengyuan1999 for replying, but I don't think the problem is solved. After replacing the path as instructed, I still get a similar error: it is raised when PIL tries to load "/val" as an image, but "/val" is actually a directory. I think the code is trying to read whatever sits directly under the imagenet directory path as an image. But with the directory structure you suggested

imagenet
└── train/
    ├── n01440764
        ├── n01440764_10026.JPEG
        ├── n01440764_10027.JPEG
        ├── ...
    ├── n01443537
    ├── ...
└── val/
    ├── ...

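That is not the case, since the entries directly under imagenet are the train/ and val/ directories rather than images. Am I structuring the folder the wrong way?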
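To make the suspected failure mode concrete, here is a minimal sketch using only the standard library. It assumes the loader lists the entries directly under the imagenet directory and opens each one as a binary file (as far as I know, PIL's Image.open does this for a path argument). The helper names `build_fake_imagenet`, `flat_listing`, `recursive_images`, and `try_open` are hypothetical and are not taken from Open-MAGVIT2.

```python
import os
import tempfile
from pathlib import Path


def build_fake_imagenet(root: Path) -> None:
    """Create a tiny stand-in for the layout above (placeholder bytes, not real images)."""
    class_dir = root / "train" / "n01440764"
    class_dir.mkdir(parents=True)
    (class_dir / "n01440764_10026.JPEG").write_bytes(b"placeholder")
    (root / "val").mkdir()


def flat_listing(root: Path) -> list[Path]:
    """Assumed buggy behaviour: treat every entry directly under root as an image."""
    return sorted(root / name for name in os.listdir(root))


def recursive_images(root: Path) -> list[Path]:
    """Alternative: descend into subdirectories and keep only regular .JPEG files."""
    return sorted(p for p in root.rglob("*.JPEG") if p.is_file())


def try_open(path: Path) -> str:
    """Open path in binary mode; return "ok" or the name of the OSError raised."""
    try:
        with open(path, "rb"):
            return "ok"
    except OSError as exc:  # IsADirectoryError on Linux for a directory
        return type(exc).__name__


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "imagenet"
        build_fake_imagenet(root)
        for p in flat_listing(root):
            print("flat     ", p.name, "->", try_open(p))
        for p in recursive_images(root):
            print("recursive", p.name, "->", try_open(p))
```

If the real loader behaves like `flat_listing`, then pointing it at imagenet (or at a split directory that contains class folders) would hit a directory on the first item, which matches the error I see.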


I followed the README.md but got the following error:

  File "/home/duanwei/ML/Open-MAGVIT2/main.py", line 93, in <module>
    main()
  File "/home/duanwei/ML/Open-MAGVIT2/main.py", line 87, in main
    cli = LightningCLI(
          ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 388, in __init__
    self._run_subcommand(self.subcommand)
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 679, in _run_subcommand
    fn(**fn_kwargs)
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/root/anaconda3/lib/python3.11/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
IsADirectoryError: Caught IsADirectoryError in DataLoader worker process 0.

It seems the failure comes from the sanity check call (self._run_sanity_check()) in this part of Lightning's trainer.py:
def _run_stage(self) -> Optional[Union[_PREDICT_OUTPUT, _EVALUATE_OUTPUT]]:
    # wait for all to join if on distributed
    self.strategy.barrier("run-stage")

    zero_grad_kwargs = {} if _TORCH_GREATER_EQUAL_2_0 else {"set_to_none": True}
    self.lightning_module.zero_grad(**zero_grad_kwargs)

    if self.evaluating:
        return self._evaluation_loop.run()
    if self.predicting:
        return self.predict_loop.run()
    if self.training:
        with isolate_rng():
            # self._run_sanity_check()
            pass  # keeps the with-block valid while the call above is commented out
        with torch.autograd.set_detect_anomaly(self._detect_anomaly):
            self.fit_loop.run()
        return None
    raise RuntimeError(f"Unexpected state {self.state}")

I think it's caused by a wrong data path setup.

For now I have put the original ILSVRC directory from the Kaggle version of ImageNet under "../../data/" and renamed it to "imagenet", so that it has the structure specified by README.md. Is this setting correct?
imagenet
└── train/
    ├── n01440764
        ├── n01440764_10026.JPEG
        ├── n01440764_10027.JPEG
        ├── ...
    ├── n01443537
    ├── ...
└── val/
    ├── ...
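A quick way to check a local copy against the layout above is a small script like the following. This is only a sketch under my own assumptions: `check_layout` is a hypothetical helper, it only verifies that train/ and val/ exist and that every entry under train/ is a class directory containing .JPEG files, and it says nothing about what Open-MAGVIT2 itself expects inside val/, since the README elides that part.

```python
import tempfile
from pathlib import Path


def check_layout(root: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # Both split directories should exist directly under the dataset root.
    for split in ("train", "val"):
        if not (root / split).is_dir():
            problems.append(f"missing directory: {split}/")
    # Every entry under train/ should be a class directory with .JPEG files.
    train = root / "train"
    if train.is_dir():
        for entry in sorted(train.iterdir()):
            if not entry.is_dir():
                problems.append(f"unexpected file under train/: {entry.name}")
            elif not any(entry.glob("*.JPEG")):
                problems.append(f"no .JPEG files in train/{entry.name}/")
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory; replace with the real dataset path to check it.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "imagenet"
        (root / "train" / "n01440764").mkdir(parents=True)
        for problem in check_layout(root):
            print(problem)
```

Running it on the real "../../data/imagenet" path and getting no output would at least rule out the most obvious layout mistakes on my side.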
