"DataLoader worker exited unexpectedly" #22
Comments
Adding
This happens at image 264, just as it did when |
Wow, I've never seen this before. Have you watched your RAM increase during this process? Does it top out your memory? Definitely need better support for resuming index. I am planning to convert the database dictionary to an SQLite file, maybe that will help |
Oh interesting question! I logged the memory usage over time and got this: https://i.imgur.com/ldJIlUb.png X axis is seconds. Looks like each image uses a different amount of memory, and some take a lot. The one at 1400 seconds or so is big but the system can handle it, and then the one at the end is trying to use ~38 gig of memory(?) and the OOM killer takes it out. |
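For anyone wanting to reproduce that kind of measurement from inside the indexing loop, here is a minimal sketch using only the Python standard library. The names `peak_rss_raw` and `run_with_memory_log` are made up for illustration and are not part of memery. It is Unix-only, and it records the peak resident memory of the main process only, so memory held by DataLoader worker processes would not show up in it.

```python
import resource
import time


def peak_rss_raw():
    """Peak resident set size of this process so far.

    The unit is platform dependent: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_with_memory_log(items, process, log=print):
    """Call process(item) for each item and log peak memory after each one.

    Returns a list of (index, seconds_elapsed, peak_rss) tuples so the
    numbers can be plotted afterwards.
    """
    start = time.monotonic()
    samples = []
    for i, item in enumerate(items):
        process(item)
        sample = (i, round(time.monotonic() - start, 3), peak_rss_raw())
        samples.append(sample)
        log(f"item={i} t={sample[1]}s peak_rss={sample[2]}")
    return samples
```

Because the value is a running peak, a sudden jump between two consecutive samples points at the item (or batch) processed in between.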
I'm afraid that it's probably batch 264 rather than a specific image. The
DataLoader is a PyTorch feature I'm specifically using for batch
processing, so it's doing your 60,000 images in only 500 batches or
whatever.
It absolutely should log the current working file if it's going to crash
though, you're right.
…On Tue, Aug 31, 2021, 8:44 AM Rob Miles ***@***.***> wrote:
Good question! I logged the memory usage over time and got this:
https://i.imgur.com/ldJIlUb.png
X axis is seconds. Looks like each image uses a different amount of
memory, and some take a lot. The one at 1400 seconds or so is big but the
system can handle it, and then the one at the end is trying to use ~38 gig
of memory(?) and the OOM killer takes it out.
That's image 264. So either that image is... very large, or it's broken in
a way that makes memery leak memory like crazy. Either way I'd like memery
to tell me the path of the file it's working on, so I can figure this image
out.
|
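To make the batch-versus-image point concrete, here is a rough sketch of how a batch number could be mapped back to file paths, and how each batch could be logged before it is processed. It assumes the files are read in a fixed, unshuffled order and that batch numbers start at zero. The function names and the batch size are hypothetical and not taken from memery.

```python
def files_in_batch(paths, batch_size, batch_index):
    """Return the file paths that fall into the given zero-based batch."""
    start = batch_index * batch_size
    return paths[start:start + batch_size]


def iter_batches_logged(paths, batch_size, log=print):
    """Yield batches of paths, logging the first and last file of each one.

    Logging happens before the batch is handed to the caller, so the last
    log line names the batch that was in flight if the process is killed.
    """
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        log(f"batch {start // batch_size}: {batch[0]} .. {batch[-1]}")
        yield batch
```

With a batch size of, say, 128, `files_in_batch(paths, 128, 264)` would list the candidate files for the batch that crashed, which narrows the search considerably.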
Ah, that makes sense. Does that mean I could maybe fix it in this instance by reducing the batch size? |
That could work!
Still doesn't explain why that one batch uses so much memory but could get
you through the bottleneck for now.
I think I hard-coded the batch size and number of workers, which should
actually be flexible based on hardware 😬
|
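As a sketch of what "flexible based on hardware" might look like, the helper below derives a batch size and worker count from the CPU count and a memory budget. It is only a rough heuristic under stated assumptions: `pick_loader_settings`, the per-image memory estimate, the 50% memory fraction, and the cap of 128 are all invented for illustration and are not memery's actual values.

```python
import os


def pick_loader_settings(available_bytes, est_bytes_per_image, cpu_count=None,
                         max_batch_size=128, memory_fraction=0.5):
    """Guess batch_size and num_workers from the hardware (heuristic only)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Leave one core for the main process.
    num_workers = max(0, cpu_count - 1)
    # Only spend a fraction of the available memory on in-flight batches.
    budget = int(available_bytes * memory_fraction)
    # Assume each worker, plus the main process, holds one batch at a time.
    per_process_budget = budget // (num_workers + 1)
    batch_size = per_process_budget // est_bytes_per_image
    batch_size = max(1, min(max_batch_size, batch_size))
    return {"batch_size": batch_size, "num_workers": num_workers}
```

The returned dictionary uses the same keyword names as PyTorch's `torch.utils.data.DataLoader`, so it could be passed along as `DataLoader(dataset, **settings)`. Lowering `memory_fraction` or `max_batch_size` is the knob to turn when a single batch is large enough to trigger the OOM killer.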
Related to #13, in the sense that this issue is made worse by the indexing process not being resumable.
When indexing a large directory of various types of files (with 69834 images), I get this error:
I'm guessing that the DataLoader process is being killed by the Linux OOM killer? I have no idea what I can do about that though.
Let me know if there's any other information that would help