
Unable to Replicate Depth Estimation Results on NYUv2 Depth Dataset (Including After Applying Eigen Crop) #484

Open
erikjagnandan opened this issue Nov 22, 2024 · 0 comments

erikjagnandan commented Nov 22, 2024

I am trying to run depth estimation with the small backbone size and DPT decoder on the NYUv2 Depth Dataset, using the general framework presented in the depth_estimation.ipynb notebook. I downloaded the NYUv2 dataset from Kaggle at https://www.kaggle.com/datasets/soumikrakshit/nyu-depth-v2?resource=download and ran depth estimation on the ~50k training images by adding the code below to the end of depth_estimation.ipynb.

At first, I was simply computing RMSE on the entire image (without Eigen crop or min/max thresholding) and got an RMSE of 0.433. After reading issue #227, I looked into the pre_eval and evaluate methods in the Monocular-Depth-Estimation-Toolbox repository and added min/max thresholding (with min threshold 1e-3 and max threshold 10, the default values in that repository), as well as the Eigen crop over the same pixel range [45:471, 41:601] used there. Even after making this change, the RMSE is 0.410, still far from the 0.356 reported in the paper. For clarity, I calculate the RMSE by computing the MSE for each image separately, averaging the per-image MSEs, and then taking the square root of that average, which is (to my understanding) the correct implementation of RMSE.
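To make that aggregation explicit, here is a minimal sketch of the two conventions I am aware of (the numbers are purely illustrative, and `per_image_mse` stands in for the `mse` tensor computed in my loop below). I am using the first one; if the paper or the toolbox instead averages per-image RMSEs, that alone would change the reported number:

```python
import torch

# Illustrative per-image MSEs (stand-in for the `mse` tensor computed in the loop below).
per_image_mse = torch.tensor([0.10, 0.25, 0.16])

# Convention A (what I am doing): average the per-image MSEs, then take the square root.
rmse_a = torch.sqrt(per_image_mse.mean())

# Convention B: take the square root per image first, then average the per-image RMSEs.
rmse_b = torch.sqrt(per_image_mse).mean()

print(rmse_a.item(), rmse_b.item())  # the two values generally differ
```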

I understand that I am running on training data, whereas the performance reported in the paper is for the validation data, but the disparity should not be this severe (~15% increase in RMSE), especially as validation performance should, in general, be no better than training performance.

Now that I have incorporated the Eigen crop and min/max thresholding and the results still do not match, is there any step used in the paper that I have left out here? From looking into the Monocular-Depth-Estimation-Toolbox repository, it appears that I have performed all of the steps included there. Alternatively, is there some simple way that I could import code from the Monocular-Depth-Estimation-Toolbox repository and use it to evaluate the DINOv2 depth estimator? From looking at their README, it seems that this would be quite nontrivial, as DINOv2 is not listed as a supported backbone.

Some notes about my code:

`data_list` is a list containing one element per training sample. Each element is a two-element list in which the first element is the name of the folder (e.g. `basement_0001a_out`) and the second element is the index of the sample within that folder.

I multiply by 10.0 when loading the ground truth depth map since the ground truth depth maps are provided at 1/10th scale. When I load the ground truth depth maps as is, without the multiplication, the depth predictions are on average almost exactly 10 times the ground truth depth; after multiplying by 10, the histograms of depth predictions and ground truth depths line up accurately. A short sketch of both of these assumptions is included after these notes.
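For concreteness, here is a minimal sketch of the data layout and rescaling I am assuming (the folder name, index, and path are illustrative placeholders, not values from my actual run):

```python
import numpy as np
from PIL import Image

# Illustrative layout: [folder name, zero-based index within that folder].
data_list = [["basement_0001a_out", 0], ["basement_0001a_out", 1]]
data_directory = "/path/to/nyu_data"  # placeholder path

# Loading one ground truth depth map: the PNG is read, scaled to [0, 1] by
# dividing by 255, and multiplied by 10.0 to recover depth in metres.
folder, idx = data_list[0]
gt_raw = np.array(Image.open(data_directory + "/" + folder + "/" + str(idx + 1) + ".png")).astype(float) / 255.0
gt_depth = 10.0 * gt_raw
print(gt_raw.max(), gt_depth.max())
```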

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# data_directory, data_list, transform, and model are defined earlier in the notebook.

mse = torch.zeros(len(data_list))
print_increment = 100

for i in range(len(data_list)):
    selected_dataset, selected_image = data_list[i]

    # Ground truth PNGs are provided at 1/10th scale, so rescale to metres.
    ground_truth_depth = 10.0 * torch.tensor(
        np.array(Image.open(data_directory + "/" + selected_dataset + "/" +
                            str(selected_image + 1) + ".png")).astype(float) / 255.0,
        dtype=torch.float32)

    image = Image.open(data_directory + "/" + selected_dataset + "/" +
                       str(selected_image + 1) + ".jpg")

    rescaled_image = image.resize((image.width, image.height))  # resize to original dimensions
    transformed_image = transform(rescaled_image)
    batch = transformed_image.unsqueeze(0).cuda()  # make a batch of one image

    with torch.inference_mode():
        result = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

    eigen_crop = True
    if eigen_crop:
        # Min/max depth thresholding plus the Eigen crop used by the
        # Monocular-Depth-Estimation-Toolbox for NYUv2.
        min_threshold = 1e-3
        max_threshold = 10
        valid_mask = (ground_truth_depth > min_threshold) & (ground_truth_depth < max_threshold)
        eigen_mask = torch.zeros(ground_truth_depth.shape, dtype=torch.bool)
        eigen_mask[45:471, 41:601] = True
        eval_mask = torch.logical_and(valid_mask, eigen_mask)
        mse[i] = F.mse_loss(result.cpu()[eval_mask], ground_truth_depth[eval_mask]).item()
    else:
        mse[i] = F.mse_loss(result.cpu(), ground_truth_depth).item()

    if i % print_increment == print_increment - 1:
        # Report the running MSE over the last `print_increment` images
        # (slice ends at i + 1 so the current image is included).
        print("Images " + str(i - print_increment + 2) + "-" + str(i + 1) + ": MSE = " +
              str(mse[i - print_increment + 1:i + 1].mean().item()))

print("Avg MSE = " + str(mse.mean().item()))
print("Avg RMSE = " + str(torch.sqrt(mse.mean()).item()))
```
