-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intermittent crash in LLVM when used from OSL (Open Shading Language) in interactive use case (seemingly due to insufficient 32 bit relocation offsets?) #65641
Comments
Messing with the relocations is just going to corrupt your code. You need to fix the code generation to choose instructions that have operands with the right width. I think JIT compilation tries to default to the large code model, which should avoid relocations like this, but maybe you're somehow setting the wrong code model, or maybe there's some code in LLVM that isn't checking the current code model correctly. |
@llvm/issue-subscribers-orcjit |
Looked briefly for what might be generating that particular relocation, and mostly what comes up is stuff related to exception-handling. So maybe not fixable by just making the fields wider; I think those datastructures actually require a 32-bit offset. So maybe you need to rearrange memory so the relevant bits are actually within 2GB. Or maybe if you don't need exception-handling, you can disable it. |
@ZapAndersson your hack replaces a relocation that writes a 32-bit value with one that writes a 64-bit value, so it'll be trashing the following 4 bytes everywhere. If I think the canonical Right Answer here is for LLVM to use JITLink (the new JIT linker) by default for COFF/x86-64, and for OSL to use a slab-based allocator that reserves a 2Gb chunk of address space up front. If it's a long-running shader session OSL would need to either free unused JIT'd code to stay under the 2Gb limit, or distribute JIT'd code amongst multiple JITDylibs so that each stays under its respective 2Gb limit. Disabling or stripping out exception tables (if that's the only place this relocation comes up) is also an option, but seems more brittle to me. @sunho Do you happen to remember where |
@ZapAndersson We should also check out the large code model option as @efriedma-quic suggested. If it's not the default then switching to it may be an easy fix for now. |
Note that IMAGE_REL_AMD64_ADDR32NB relocation is only used for seh frame (i.e. for stack walking) and debugging. If you don't use the windows exception or debug info, you can just disable those relocations on the RuntimeDyLD side as hack. (only IMAGE_REL_AMD64_ADDR32NB though, disabling other relocation will indeed break the code) If you want to be correct about these, you do need to use JITLink which does guarantee the offsets to be within 32 bit. I think hacking on codegen side is incorrect even as a hack since It will modify more bytes than intended. Just comment out the line you are seeing in the first screenshot. |
Thank you guys! Yeah I need to be clear, I have absolutely no idea what I am doing here ;) So can I have a bit of an "explain like I'm five" explanation of how to a) "disable exceptions" To me, though, it doesn't seem like the code itself is large (it kinda isn't) but when other code in other threads are allocating huge chunks of memory (because it's rendering Godzilla stomping a skyscraper with gazillions of polygons) we happen to get allocations that are far apart? But I have no idea whatsoever how LLVM's memory handling works, there's the cryptic (to me) comment in the code about the "The MemoryManager can make sure this is always true by forcing the memory layout to be: CodeSection < ReadOnlySection < ReadWriteSection", which sadly is greek to me. To be clear, I didn't write this code, I know pretty much less than zero about this code, I'm just running into this bug, been stabbing at it with a glowy stick for a workaround, and need help from anyone who actually knows sometihing, so any gem of wisdom is HIGHLY APPRECIATED, and THANK YOU ALL for any insight you can give me. |
So are you saying that if we con't need debugging or stack walking (as far as I know, we don't) this message is pointless and won't actually break anything, so I can just comment out the fatal error itself and be on my merry way, or what are you actually telling me? (Total LLVM Noob warning, sorry). Again, thank you for all the answers, they are just a hair over my LLVM understanding (which is zero) |
I think I figured out where (inside OSL) the code model is picked, there is currently a commented out line that says https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1442
...and I see no other place in OSL where it references the "CodeModel". I will poke at this with a blunt stick and see if it changes anything. |
No worries. You're into a pretty obscure area of the stack here. Many LLVM engineers would be asking the same questions if they hit this. :) I just tried using the large code model*, but it didn't seem to help:
Yep -- if In general we do want this metadata to be usable, so from our perspective the proper general case fix is two-part: (1) LLVM uses JITLink for COFF/x86-64, and (2) Clients use the |
Large code model isn't a thing on Windows: the COFF spec has 4-byte fixed-width fields that restrict the size of the resulting image. So the exception-handling metadata also has 4-byte fixed-width fields; see https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64 . |
@efriedma-quic Thanks -- it's great to get confirmation of that. Do you think it'd be worth adding a warning / error when Either way this supports the long term plan I outlined above to move to JITLink and require slab allocation. |
Thank you, that make me feel at least slightly less like an idiot..... slightly .... :) So tell me more about these memory managers, you can see the way OSL sets this up in this function https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1430 and it is just doing a new MemoryManager(m_llvm_jitmm))); which I presume is just instantiating some default memory manager(!?!?) Should this code look... different... somehow? I appreciate all the handholding, I truly do. Cavamen with a blunt stick here.... /Z |
Sure thing. In both RuntimeDyld (the older JIT linker that MCJIT uses), and JITLink (the new JIT linker) memory is allocated via abstract memory manager interfaces. The interfaces differ between the two, but at a high level they're both responsible for reserving memory for JIT'd code and data, transporting the contents of linker memory into the address spaces that they reserved (this can include transporting it across process boundaries), applying memory protections, and eventually freeing the memory when it's no longer needed. Looks like OSL has a custom memory manager that forwards to an LLVM manager under the hood, but with a twist: /// MemoryManager - Create a shell that passes on requests
/// to a real LLVMMemoryManager underneath, but can be retained after the
/// dummy is destroyed. Also, we don't pass along any deallocations.
class LLVM_Util::MemoryManager final : public LLVMMemoryManager
This code will need to change in the future when OSL switches to JITLink (which OSL will want to do at some point because we're going to deprecate MCJIT in the not too distant future). It's not clear to me whether OSL will need a custom allocator, or whether the new off the shelf allocators will work. It depends on why OSL wants to retain that memory. Do you know anyone who might know why it's doing this?
No worries at all! |
With my limited understanding, peeking at this memory manager code, it seems that "all" the OSL memory manager is doing is wrapping an existing llvm::SectionMemoryManager object and holding on to it, such that many threads can use the same memory manager, and once all the threads are done, the memory manager can go out of scope and clean up. Or, well, that's my approximation of an understanding of what it is supposedly doing. Is this... bad? |
Okay, here's what I've figured out so far. OSL uses a memory manager, that is held by rendering threads per-thread-info stuff. And this memory manager is kept around until the last rendering thread dies. Except... we use TBB for rendering. TBB actually has a set of worker threads that are always in flight. So those threads never die. So the memory manager ends up being kept around forever. That wouldn't be a big deal, in the normal case. Except I also see this: Soo, if the wrapper doesn't pass on deallocations, it will end up allocating forever, eventually breaking the 2Gb barrier and triggering this fatal error problem....!? Maybe? |
It's hard to tell without looking at the failure case (you could add an allocation counter to the allocator to test your theory). It seems like a very good explanation though. The other possibility is that you get unlucky with an OS Either way a switch to ORC + JITLink + SlabAllocation will fix this: It supports thread-safe deallocation of each object file / module added to the JIT. If you want to consider a switch the next question is what platforms OSL needs to run on (OS + architecture). I can let you know whether ORC supports them all, and suggest next steps for moving over. |
In my particular case (for 3ds Max) it's Windows x64 but other apps of ours (like the Arnold renderer) are Linux x64 and Mac (64 bit) as well. I'll bring all this up at the OSL Technical Steering Comittee meeting that is later today, se if we can figure out what is best. |
Sounds good. JITLink is already mature for x86-64 on Linux, macOS and Windows, and for arm64 on Linux and macOS. My first impression is that it's worth testing out a port. |
We also use some black voodoo to turn it into PTX for running on NVidia GPU's (way above my paygrade). I think the potentially bigger issue is what LLVM versions we can accept. Since OSL is something integrated into other apps, often OSL needs to match the LLVM version used with one used by the APP in some cases. (Or so I'm told, I could be wrong) |
Ahh. Yeah -- that might be a bigger problem. Please let me know what you find out. |
Want to join in on the OSL Technical Steering Comittee meeting, we'll be talking about this there? It's in about 50 minutes from me posting this message, on Zoom. :) No pressure :) but you would be talking directly to the people who actually understand something, instead of me :) Reminder: OSL TSC meeting ( every other week ) When: Where: Organizer: Chris Kulla [email protected] Description: |
Hi Zap, Any idea when the JIT part would be discussed? I'm flat out at the moment and can't spare an hour, but could probably drop in for 20 mins. |
Actually I've got a 2:30 meeting anyway -- So I'll come for the first 30. :) |
Oh I expect us to start with that |
@ZapAndersson Looks like I need to be an authorized participant. Could you ask them to invite me? I'm [email protected]. |
thank you so much for this!! |
No worries. :) |
I don't think this is related to the crash, but why do we keep saying 2GB when the code is comparing against a |
@ThiagoIze |
We see similar overflows in signed 32-bit relocations -- that may have led to the typo in the original code. |
I reported this bug first as an OSL bug here: AcademySoftwareFoundation/OpenShadingLanguage#1712 but it seems the "bug" is in LLVM itself.
TL;DR - in a heavey interactive session when compiling OSL shaders while rendering high-resolution images, we sometimes, super-intermittently, seemingly randomly, end up crashing (= fatal error aborting) here:
Basically, whatever assumptions is being made of memory blocks being within 2Gb of each other, do not hold (probably cause other threads have been allocating multi-gigabyte chunks in the meantime).
This of course explains the intermittent behavior of the bug, and that it only happens while heavy things are happening concurrently in other threads (which has made it quite difficult to track down.)
Not understanding much of LLVM, but realizing that 32 bit offsets is highly likely not to be enough, as a test, I made an ugly hackaround and replaced the use of IMAGE_REL_AMD64_ADDR32NB relocations (which seems to be the 32 bit offsets) with IMAGE_REL_AMD64_ADDR64 (which looks to my eye as full 64 bit offsets?) and the problem goes away!!
But being an utter LLVM coding noob, I sincerely doubt this is the "right solution".
For the "long explanation", see the issue in the OSL GitHub linked above, including my "godawful hackaround".
What is the "right solution" here, or are we doing something fundamentally wrong somewhere?
The text was updated successfully, but these errors were encountered: