Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: invalid memory reference #870

Open
enitrat opened this issue Oct 20, 2024 · 23 comments
Open

bug: invalid memory reference #870

enitrat opened this issue Oct 20, 2024 · 23 comments
Labels
bug Something isn't working

Comments

@enitrat
Copy link

enitrat commented Oct 20, 2024

Where are you using Cairo Native
Kakarot EF-Test suite. Running the 20k ethereum tests using blockifier + cairo native + kakarot zkevm

Version
last main

Describe the bug

* thread #2, name = 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f5c6bf0)
  * frame #0: 0x00000001473975a0 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13513 + 36
    frame #1: 0x000000014736798c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13487 + 13692
    frame #2: 0x0000000147331a8c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13450 + 23400
    frame #3: 0x0000000147a1f600 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13635 + 40296b
...
...
    frame #103: 0x00000001472fc398 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13425 + 4520
    frame #104: 0x00000001472c4248 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13402 + 71760
    frame #105: 0x00000001472a284c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13382 + 14772
    frame #106: 0x0000000147291e88 2170791277312818364151319127822404238116644699931524383412296455457210549381`_mlir_ciface_f22 + 5372
    frame #107: 0x000000010076d66c stStaticCall-a00b9a1f5b0b2270`invoke_trampoline + 108
    frame #108: 0x00000001004a3dd8 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::native::utils::run_native_executor::hd53e73f315696524(native_executor=0x00006000021c10f0, function_id=0x000000012c823040, call=<unavailable>, syscall_handler=<unavailable>) at utils.rs:45:28
    frame #109: 0x000000010046598c stStaticCall-a00b9a1f5b0b2270`blockifier::execution::native::entry_point_execution::execute_entry_point_call::h0820f83788dc89ed(call=CallEntryPoint @ 0x000000016ffdfc38, contract_class=NativeContractClassV1 @ 0x000000016ffdef38, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffdf6d8, resources=0x000000016fff0538, context=0x000000016ffef588) at entry_point_execution.rs:33:18
    frame #110: 0x000000010059f768 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::execution_utils::execute_entry_point_call::hba155b0ecfc46347(call=CallEntryPoint @ 0x000000016ffe0938, contract_class=ContractClass @ 0x000000016ffdf7d0, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffe0158, resources=0x000000016fff0538, context=0x000000016ffef588) at execution_utils.rs:78:19
    frame #111: 0x000000010049e5d0 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::entry_point::CallEntryPoint::execute::h8763f8070c7f5197(self=CallEntryPoint @ 0x000000016ffe19a0, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffe0a08, resources=0x000000016fff0538, context=0x000000016ffef588) at entry_point.rs:103:9

To Reproduce

Clone https://github.com/kkrt-labs/ef-tests
Checkout dev/bump-native
make && make setup-kakarot
Make sure CAIRO_NATIVE_RUNTIME_LIBRARY is a defined environment variable
run
cargo test test_static_ABAcalls0_d1g0v0_Cancun --features v1,native -- --nocapture
this should error with
(signal: 11, SIGSEGV: invalid memory reference)

The backtrace is from lldb:
lldb path/to/test
platform settings -w /path/to/ef-tests/crates/ef-testing
run

Desktop (please complete the following information):

  • OS: MacOS,
  • Architecture arm64
@edg-l edg-l added the bug Something isn't working label Oct 21, 2024
@edg-l
Copy link
Member

edg-l commented Oct 21, 2024

I tried to build this on linux x86_64 but i get a build error

error[E0412]: cannot find type `integer_t` in crate `libc`
   --> crates/ef-testing/src/models/result.rs:120:36
    |
120 |             / mem::size_of::<libc::integer_t>() as libc::c_uint;
    |                                    ^^^^^^^^^ help: a type alias with a similar name exists: `intptr_t`
    |
   ::: /home/edgar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libc-0.2.159/src/unix/mod.rs:21:1
    |
21  | pub type intptr_t = isize;
    | ----------------- similarly named type alias `intptr_t` defined here

warning: unused imports: `ACCOUNT_CONTRACT_CLASS`, `KAKAROT_CLASS`, and `UNINITIALIZED_ACCOUNT_CLASS`
  --> crates/ef-testing/src/evm_sequencer/sequencer/mod.rs:12:9
   |
12 |         ACCOUNT_CONTRACT_CLASS, ACCOUNT_CONTRACT_CLASS_HASH, BLOCK_GAS_LIMIT,
   |         ^^^^^^^^^^^^^^^^^^^^^^
13 |         ETH_FEE_TOKEN_ADDRESS, FEE_TOKEN_CLASS, FEE_TOKEN_CLASS_HASH, KAKAROT_ADDRESS,
14 |         KAKAROT_CLASS, KAKAROT_CLASS_HASH, KAKAROT_OWNER_ADDRESS, OPENZEPPELIN_ACCOUNT_CLASS,
   |         ^^^^^^^^^^^^^
15 |         OPENZEPPELIN_ACCOUNT_CLASS_HASH, RELAYER_ADDRESS, RELAYER_BALANCE, RELAYER_VERIFYING_KEY,
16 |         STRK_FEE_TOKEN_ADDRESS, UNINITIALIZED_ACCOUNT_CLASS, UNINITIALIZED_ACCOUNT_CLASS_HASH,
   |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(unused_imports)]` on by default

warning: unused import: `std::ptr`
  --> crates/ef-testing/src/models/result.rs:82:5
   |
82 | use std::ptr;
   |     ^^^^^^^^

For more information about this error, try `rustc --explain E0412`.
warning: `ef-testing` (lib) generated 2 warnings
error: could not compile `ef-testing` (lib) due to 1 previous error; 2 warnings emitted
warning: build failed, waiting for other jobs to finish...

@enitrat
Copy link
Author

enitrat commented Oct 21, 2024

I tried to build this on linux x86_64 but i get a build error

Must be a platform-specific thing:

can you comment line 151 and the definition of foo line 108? I used this to get memory usage logs after each test to identify families that caused memory leaks.

@azteca1998
Copy link
Collaborator

The make steup-kakarot step is failing for me. Is this expected?

ef-tests$ make setup-kakarot
rm -rf build/common
mkdir -p build/common
rm -rf build/v0
mkdir -p build/v0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8651  100  8651    0     0  25596      0 --:--:-- --:--:-- --:--:-- 25670
unzip -o kakarot-build.zip -d build/v0
Archive:  kakarot-build.zip
  inflating: build/v0/contracts_MockContractUpgradeableV0.compiled_contract_class.json
  inflating: build/v0/contracts_UninitializedAccount.contract_class.json
  inflating: build/v0/contracts_AccountContract.contract_class.json
  inflating: build/v0/contracts_UninitializedAccount.compiled_contract_class.json
  inflating: build/v0/contracts.starknet_artifacts.json
  inflating: build/v0/contracts_KakarotCore.compiled_contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV1.contract_class.json
  inflating: build/v0/contracts_ERC20.compiled_contract_class.json
  inflating: build/v0/contracts_ERC20.contract_class.json
  inflating: build/v0/contracts_Cairo1HelpersFixture.compiled_contract_class.json
  inflating: build/v0/contracts_AccountContract.compiled_contract_class.json
  inflating: build/v0/contracts_KakarotCore.contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV0.contract_class.json
  inflating: build/v0/contracts_Cairo1Helpers.contract_class.json
  inflating: build/v0/contracts.sierra.json
  inflating: build/v0/contracts_Cairo1Helpers.compiled_contract_class.json
  inflating: build/v0/contracts_Cairo1HelpersFixture.contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV1.compiled_contract_class.json
mv build/v0/build/* build/v0
mv: rename build/v0/build/* to build/v0/*: No such file or directory
make: *** [setup-kakarot-v0] Error 1

This seems to cause the precompiles to not be available when I try to test it:

    Finished `test` profile [unoptimized + debuginfo] target(s) in 18.76s
     Running tests/stStaticCall.rs (target/debug/deps/stStaticCall-022d7fbb57a5b4fd)

running 1 test
thread 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun' panicked at crates/ef-testing/src/evm_sequencer/constants.rs:66:120:
Failed to load precompiles contract class: No such file or directory (os error 2)

Location:
    crates/ef-testing/src/evm_sequencer/constants.rs:22:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun ... FAILED

failures:

failures:
    stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 477 filtered out; finished in 0.10s

error: test failed, to rerun pass `-p ef-testing --test stStaticCall`

@enitrat
Copy link
Author

enitrat commented Oct 25, 2024

Something broke with our new release yesterday. Will check and update

@enitrat
Copy link
Author

enitrat commented Oct 31, 2024

Hi @azteca1998 @edg-l you should be able to repro again. I updated native to ab47832

I also hid the ram usage prints under a macos target, so that you should be able to run it on other targets fine.

@azteca1998
Copy link
Collaborator

Hello, sorry for the delay.

We've been working on various stuff in native which made it so that ef-tests no longer compiles using the main branch of the sequencer and native's main branches. Could you please update it? Some of the recent changes may be involved in the bug.

@enitrat
Copy link
Author

enitrat commented Nov 20, 2024

Hi @azteca1998 i updated the ef-test repo to commit 28caad05e4a13f935acff83ea144df0b1bc24e2e with native 5e60089. The issue still occurs

@azteca1998
Copy link
Collaborator

Ok, I think I found the problem. It's the same as #816.

If I run the test with lldb and compare stack pointers I get the following:

Stack pointer (first frame): 0x00007ffff7a1da80
Stack pointer (crash frame): 0x00007ffff780bce0
Crash pointer              : 0x00007ffff780d138

The first number is the value of rsp (stack pointer) when the program started. The second is its value just before the crash. The last number is already outside the allocated stack size, therefore crashing the process.

If I run the test with a larger stack (1GiB in my test, probably works with way less stack) the test seems to run without problems.
You can make the tests run with a larger stack by using this environment variable: RUST_MIN_STACK=1073741824.

@enitrat
Copy link
Author

enitrat commented Nov 21, 2024

Ok, great, at the time I opeoned the issue increasing the stack size did not have any effect, which no longer seems the be the case. However, the current run is taking more than 2hours, up from 50mins before.

https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11947735661/job/33304216235?pr=1011

I will need to check why

@azteca1998
Copy link
Collaborator

Could it be that the test that crashed was run at about the 50 minutes mark, therefore stopping all other tests, while now all tests have to be run, taking more than 2 hours?

@enitrat
Copy link
Author

enitrat commented Nov 21, 2024

No, previously all 20k tests were passing in ~50mins (20mins compilation, 30mins runtime). I added some more debug info to pinpoint problem is exactly, will update you when I have results

@enitrat
Copy link
Author

enitrat commented Nov 21, 2024

2024-11-21T13:08:33.3640356Z Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' completed in 2942.0 seconds
2024-11-21T13:08:33.3641902Z test stReturnDataTest::stReturnDataTest::test_modexp_modsize0_returndatasize_d4g0v0_Cancun ... ok

Some tests take 45 mins to 1.5hrs to complete a single, easy test. Do you have any idea how to debug why this is happening? I'm not sure what to look for.

test_modexp_modsize0_returndatasize_d4g0v0_Cancun

That specific instance has a lot of felt252 dict operations because it's very unoptimized. I'll add some prints directly in the cairo code to see what happens in realtime, but that's a first thing we could be looking at.

@azteca1998
Copy link
Collaborator

azteca1998 commented Nov 21, 2024

Not sure if that's the case, especially since it didn't happen earlier, but dictionaries are implemented in Rust within the runtime. When running the JIT, the runtime used is the dependency of cairo-native, which for debug builds is built using the debug profile. Could you try running the test in release mode?

When running in AOT, the runtime used is the static library. If the tests are running in AOT, could you check if the linked library (the path in the env var) points to a release build?

@enitrat
Copy link
Author

enitrat commented Nov 21, 2024

JIT is new right? I think i'm running in AOT because the integration in Native uses AotExecutor.
What do you think I should be using for this specific usecase ?

I built the runtime running make runtime which builds in release mode, so it's a release build I believe.

@azteca1998
Copy link
Collaborator

azteca1998 commented Nov 21, 2024

JIT is the original executor, AOT came later, but I think you're probably using AOT.
If the runtime uses the release build then I'm not sure what could be causing the performance issues.

What do you think I should be using for this specific usecase ?

Both are fine. Internally, their interface is the same, so whatever works on one should work on the other. The sequencer is using AOT.

@enitrat
Copy link
Author

enitrat commented Nov 21, 2024

...
Start: Executing Modexp Precompile
Current test duration: 803.0s
...
WARNING: Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' has been running for over 120 seconds
Current duration: 1516.0s
End: MPNAT::from_big_endian operation
Start: modpow operation
End: modpow operation
End: Executing Modexp Precompile

There seems to be an issue with the operations performed in MPNAT::from_big_endian, executed twice, taking ~ 10 mins so 5mins each.

Here's a link to the function https://github.com/kkrt-labs/kakarot-ssj/blob/c2e4ccbfcd8d6b8d79668f34daab97080d6769f9/crates/utils/src/crypto/modexp/mpnat.cairo#L38-L106

It involves a lot of dict operations. Do you want me to add more granularity?

@azteca1998
Copy link
Collaborator

Can you provide the bytes argument? This way we'll be able to replicate it exactly.

@enitrat
Copy link
Author

enitrat commented Nov 22, 2024

Can you provide the bytes argument? This way we'll be able to replicate it exactly.

Hm so actually in that specific case I checked and we're sending 999188 bytes to the function. It's quite a lot, I can't really send it over to you.

edit: I opened old logs of using native to find out that this test actually used to pass when using native ae17dd3 🤔 so perhaps the big number of bytes is not that much of an issue and it should pass regardless?

see https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11810716802/job/32903209209 from PR kkrt-labs/kakarot-ssj#1021 (where the version of native is ae17dd3): download the logs, open 2_ef-tests _ ef-tests.txt, search for modexp_modsize0_returndatasize_d4g0v0_Cancun and you will see it passed

@enitrat
Copy link
Author

enitrat commented Nov 22, 2024

Just re-ran on my computer the stack using cairo-native ae17dd3

WARNING: Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' has been running for over 120 seconds
Current duration: 834.0s
... some additional operations ..

 INFO ef_testing::models::result: stReturnDataTest::modexp_modsize0_returndatasize_d4g0v0_Cancun passed: TransactionResources { starknet_resources: StarknetResources { calldata_length: 155, state_changes_for_fee: StateChangesCount { n_storage_updates: 3, n_class_hash_updates: 0, n_compiled_class_hash_updates: 0, n_modified_contracts: 3 }, message_cost_info: MessageL1CostInfo { l2_to_l1_payload_lengths: [], message_segment_length: 0 }, l1_handler_payload_size: None, n_events: 1, signature_length: 2, code_size: 0, total_event_keys: 1, total_event_data_size: 3 }, vm_resources: ExecutionResources { n_steps: 7527, n_memory_holes: 53, builtin_instance_counter: {ecdsa: 1, range_check: 117, pedersen: 173} }, n_reverted_steps: 0 }
Current memory usage: Ok(4963549184) bytes

Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' completed in 865.0 seconds
test stReturnDataTest::test_modexp_modsize0_returndatasize_d4g0v0_Cancun ... ok

So that's like ~ 10 seconds for the actual test run. Compared to infinite time now 🤔

To summarize:

The test takes around 10 mins to run, and other tests also have extremely degraded performance.

Let me know how I could help you more.

Regarding the bytes passed to the function:

bytes.len(): bytes.len(): 999188
bytes slice 128: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bytes slice last 128: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I think it's an array of size 999188 whose first element is 1 and the rest is 0.

@azteca1998
Copy link
Collaborator

azteca1998 commented Nov 22, 2024

It seems that between the previous Cairo native version and the current one we've added proper memory management (now we deallocate stuff properly). That may be part of the performance issue. Here are the commits I suspect will have more impact in the program's performance:

79a8dcf - Fix clone and drop implementations for enums, structs and snapshots.
0b73f1c - Fix `array_get`'s drop implementation.
cff7a11 - Update to 2.8.4, add release docs, version alpha.3
82c25b3 - Add malloc tracing and fix more memory leaks.

If you're going to test them, please keep in mind it may start crashing again since there have been some bug fixes after those commits. I recommend testing using a test that was known to pass but still has had a performance regression.

@enitrat
Copy link
Author

enitrat commented Nov 22, 2024

Would there be another way to check what's taking so long? testing each one of these commits manually sounds tedious because I have to re-adapt the sequencer impl every time

@azteca1998
Copy link
Collaborator

Yes, we're working (since about an hour ago) in a tool to profile programs a bit better. I'm not sure if it'll be useful at first because we're more interested in the time taken by cairo vs the syscall handler, but it should be adaptable to this use case later on.

@enitrat
Copy link
Author

enitrat commented Dec 6, 2024

Hello,

I updated our branches and runner to use the starkware-libs/sequencer integration, and the release v0.2.4 of Cairo Native; this has not solved the issue unfortunately. I reverted our CI back to using the CairoVM for execution.

Due to a priority shift I will no longer be trying to debug this, however, the pending work has been cleared so that we can re-start these investigations in a future date by simply running our test with the 'native' feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants