-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: regressions introduced by latest main #831
Comments
I ran another set of tests based on 1b8656c and it seems the issue was already present at this commit, i will come back further |
Is the write-after-free issue no longer reproducible? I can debug that one while you try to find when this one happened. |
So far (on 79a8dcf) I haven't been able to repro. I'm not sure what conditions lead to it happening; perhaps because of parallel threads during test execution |
Strange. That commit just replaces an assert with logic to handle that assert. Since the test used to pass it means that the assert wasn't triggered, therefore entering the branch that generates the exact same instructions as the previous commit. There should be no difference in the generated code between before and after this commit. |
🤔 let me retry, perhaps a cache issue. in the meantime, I added repro guidelines based on latest main. It's a bit tedious because i need to update several deps repos and might have lost myself in the process 😓 |
I think the issue was already there in the commit that updates to LLVM 19... Either that or I don't know how to run your tests. I had to comment the lines that disable the stack tests in I also tried running |
I didn't have the issue in branch
Oh, sorry, I committed this file 🤦
Native is running in the blockifier dependency, which on the branch above is fixed to 79a8dcf, so even if you update the runtime locally, the executor will be compiled based on the Native version used in the "sequencer" dependency. |
I changed that dependency too, otherwise it wouldn't compile. Both blockifier and your crate pointed to my local folder. |
Ok - I re-ran on c866fbf, it works, I must've forgotten to clear the cached executor. |
I'm surprised, I don't see any compilation error. I edited the skipfile for it to work properly |
Here's the LLDB trace for the write-after-free issue when running: Here, we don't see cairo native being active: it's because the initial call is made to a Cairo0 account, thus is executed by the regular CairoVM. The call flow is I guess it fails in the relayer step?
|
LLDB trace when running:
-> In this one, we don't even reach the step where a tx is sent 🤔 Could there be side effects due to multithreading and the execution of another test in parallel? |
You're right. I implemented an if statement backwards. Could you try with #832? |
Fixed 👍 |
What else was pending?
|
Unfortunately it still fails our CI... I think it consumes too much memory There's no explicit shutdown message so I believe we're just hitting a resource limit |
The max memory usage hit in a run previously (before the snapshot-drop fixes) was ~13Gb, now it's 25Gb which might be too big for our runner. So we probably will need to upgrade it, or find a way to decrease memory consumption in Native mode |
Wait, don't upgrade yet. I know for a fact we still have some memory leaks. I'm working on a piece of code that'll help me find them. I'll let you know once I've fixed the ones I've found. |
thank you very much 🙏 |
Could you try with #833? That one fixes all the memory leaks I could find running all our JIT tests. |
hm actually I think there is a very specific issue that happens in specific test cases that end up running forever: CI took 6hrs to run and ended up failing due to time limits (instead of finishing in ~50mins). I'll investigate this manually later today https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11222856617/job/31196253181?pr=1011 |
I just hit an issue when running locally:
will add instructions to repro from scratch (if I manage to) |
^ so actually the above happens if after I increase the default stack size with |
I'm having difficulties running that command. The compiler complains about
|
@tcoratger can you help here? |
@azteca1998 i'm unable to repro your issue from a fresh clone. I have:
can you try a I also synced the repo with latest cairo-native main, do a edit: after syncing with latest cairo-native main, i can't deserialize my v1 classes anymore to run the test anymore - investigating what changed |
Yeah probably some versioning conflict (maybe between revm and reth which use revm as an underlying dependency) because I'm sure that the conversion mentioned in your error message is available in reth here https://github.com/paradigmxyz/reth/blob/75b7172cf77eb4fd65fe1a6924f75066fb09fcd1/crates/primitives-traits/src/account.rs#L165 |
I'm not sure it'll be fixed, but could you try with the latest main? We just merged a fix for a bug that may or may not be related to this issue (it's difficult to tell). |
Nothing changed :( I am still stuck with this issue:
This used to be fine with 16Mib before iirc Did #833 significantly increase stack usage ? |
It shouldn't have changed the stack memory usage. It could be that it now takes more for it to crash, therefore needs a bit more stack. I'll keep investigating then, thanks for checking it. |
Hello, I've finally managed to fix it. It took some modifications in blockifier to be able to compile and run it but it ran. It seems the issue (or the part that was causing the segfault) was related to the bug I fixed the other day, but I forgot to fix it in the syscall handler. @edg-l noticed and fixed it, so right now it doesn't segfault anymore, but still doesn't pass:
|
@azteca1998 is your also: i don't think you're using the latest branch version (can you git pull?) in the latest branch you would see the reason why the executor cannot be created, and you wouldnt need blockifier changes I believe (rev edit: bumped native latest main, ef-test rev is: |
Just want to comment that native main doesn't yet have this fix in #853 so you should try with that |
ah, ok i will probably wait for it to be on main as i'm focused mainly on something else atm |
I thought it was already merged, but it seems that wasn't the case. Anyways, one of the PRs merged today has fixed the segfault (it isn't failing anymore in main). |
Checking today on 84ceaa3, no improvements I'm wondering if it's a side effect from increasing the RUST_MIN_STACK size to a high number? |
What do you mean? It still segfaults for you? |
still getting a stack overflow with default / low stack sizes, or a |
Strange, did you |
Closing in to continue in #870 |
main (commit hash to confirm again) introduced a few regressions.
clone https://github.com/kkrt-labs/ef-tests
ensure correct branch
git checkout dev/bump-native
make
make setup-kakarot
cargo test --test tests stStack --no-fail-fast --features "v1,native,ci" -- --nocapture
^ this will run the full suite for the stStack family. The first issue is also reproducible when running a specific test, with
e.g.
cargo test test_sar_2_xor_255_256_d0g0v0_Cancun --features v1,native -- --nocapture
The execution is no longer the same. What used to pass, now fails with:
^ This happens for practically all tests and highlights a difference somewhere during execution. Probably added by one of the latest commits un the last PR
^ This happens when running multiple tests at one - but i don't know exactly which one triggers this
The text was updated successfully, but these errors were encountered: