I have searched the issues of this repository and believe that this is not a duplicate.
Ⅰ. Issue Description
At 21:34:16.908 the global transaction had already been completed by one thread, yet at 21:34:16.910 another thread was still rolling back one of its branches, so the rollback clearly ran in parallel. The log also shows that branch 6882078649837270974 was rolled back twice. The transaction was long-running (about 4 minutes), so the scheduled task, which retries any transaction that has stayed in the rollbacking state for more than 2 minutes 10 seconds, picked it up at exactly the moment the rollback decision itself was being processed. That is how the concurrent rollback arose. Under Raft, this concurrency meant the corresponding GlobalSession had already been deleted when a synchronization message for a BranchSession operation (RELEASE_BRANCH_SESSION_LOCK) was sent afterwards, which caused the NPE (NullPointerException):
2024-11-12 21:34:16.911 ERROR --- [JRaft-FSMCaller-Disruptor-0] [org.apache.seata.server.cluster.raft.RaftStateMachine]
[onExecuteRaft] []: Message synchronization failure: Cannot invoke "org.apache.seata.server.session.GlobalSession.getBranch(long)" because "globalSession" is null, msgType: RELEASE_BRANCH_SESSION_LOCK
==>
java.lang.NullPointerException: Cannot invoke "org.apache.seata.server.session.GlobalSession.getBranch(long)" because "globalSession" is null
at org.apache.seata.server.cluster.raft.execute.lock.BranchReleaseLockExecute.execute(BranchReleaseLockExecute.java:35)
at org.apache.seata.server.cluster.raft.execute.lock.BranchReleaseLockExecute.execute(BranchReleaseLockExecute.java:28)
at org.apache.seata.server.cluster.raft.RaftStateMachine.onExecuteRaft(RaftStateMachine.java:333)
at org.apache.seata.server.cluster.raft.RaftStateMachine.onApply(RaftStateMachine.java:174)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.base/java.lang.Thread.run(Thread.java:1583)
<==
#7005 fixed the Raft NPE caused by this concurrency, but the underlying problem, that a two-phase retry and the decision can run at the same time, has not been addressed yet.
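For illustration, the failure in the trace above comes down to looking up a branch on a global session that no longer exists. Below is a minimal sketch of a null-tolerant release. It uses a hypothetical stand-in class (`SessionRegistrySketch`, with made-up method names) rather than Seata's real `GlobalSession` and `BranchReleaseLockExecute`, and it is not necessarily how #7005 implemented the fix.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a session store; these are NOT Seata's classes.
class SessionRegistrySketch {
    // xid -> ids of branches that still hold locks
    private final Map<String, Set<Long>> branchesByXid = new ConcurrentHashMap<>();

    void addBranch(String xid, long branchId) {
        branchesByXid.computeIfAbsent(xid, k -> ConcurrentHashMap.newKeySet()).add(branchId);
    }

    // Models the first rollback finishing and deleting the global session.
    void removeGlobalSession(String xid) {
        branchesByXid.remove(xid);
    }

    /**
     * Releases the lock of one branch.
     * Returns false instead of throwing when the global session (or the branch)
     * was already removed by a concurrent rollback.
     */
    boolean releaseBranchLock(String xid, long branchId) {
        Set<Long> branches = branchesByXid.get(xid);
        if (branches == null) {
            // Global session already gone: treat the late message as a no-op.
            return false;
        }
        return branches.remove(branchId);
    }
}
```

A guard like this only makes the late message harmless. It does not stop the second rollback from running, which is the part that is still open.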
Original Plan:
Add local locks: This resolves the issue in modes where storage and computation are integrated (such as the Raft and file store modes), but introducing pessimistic locking for such a low-probability event adds unnecessary performance overhead. It also remains ineffective in DB and Redis modes.
Add dynamic transaction compensation time: The deadtime for long transactions (the interval after which a transaction stuck in the rollbacking or committing state is handed to the asynchronous compensation task) is 2 minutes and 10 seconds by default. The @GlobalTransactional annotation could be used to specify the deadtime per transaction and so avoid the concurrency. (There is currently a globally configurable server.retryDeadThreshold, but its granularity is insufficient.) The drawback is that, in DB storage mode, this requires adding new table columns, so users must change the table schema and then update the server.
Consensus algorithm (Raft + DB/Redis and other storage modes): Use Raft together with storage modes such as DB/Redis to detect whether the node that made the decision is offline. The compensation task should only compensate a transaction in the "rollbacking" state when the server that its xid belongs to is offline. If that server is online, no compensation is needed: if an exception occurs during synchronization, the transaction changes its status and does not remain in "rollbacking". So while the server for the xid is alive, a "rollbacking" transaction can only be one that is still running on that live node.
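As a rough sketch of the three options above, the checks could look like the following. Everything here is hypothetical: `RetryGuardSketch`, its method names, and the `onlineServers` set are invented for illustration and are not part of Seata. It also assumes the xid has the `ip:port:transactionId` form seen in the logs, and it combines the deadtime check with the online check, which is an assumption of this sketch rather than something the proposal states.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names are part of Seata.
class RetryGuardSketch {
    // Default compensation threshold mentioned above: 2 minutes 10 seconds.
    static final Duration DEFAULT_DEAD_TIME = Duration.ofSeconds(130);

    // Option 1: xids currently being rolled back by a thread in this JVM.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Try-lock style local guard: runs the rollback only if no other local
     * thread is already handling this xid. Has no effect across servers,
     * which is why it does not help in DB or Redis modes.
     */
    boolean rollbackExclusively(String xid, Runnable rollback) {
        if (!inFlight.add(xid)) {
            return false; // another local thread owns this xid right now
        }
        try {
            rollback.run();
            return true;
        } finally {
            inFlight.remove(xid);
        }
    }

    /** Option 2: a per-transaction deadtime overrides the global default when present. */
    static boolean pastDeadTime(Duration inRollbacking, Duration perTransactionDeadTime) {
        Duration limit = perTransactionDeadTime != null ? perTransactionDeadTime : DEFAULT_DEAD_TIME;
        return inRollbacking.compareTo(limit) > 0;
    }

    /** Assumes an xid of the form ip:port:transactionId, as in the logs. */
    static String ownerOf(String xid) {
        return xid.substring(0, xid.lastIndexOf(':'));
    }

    /** Option 3: compensate only when the server that owns the xid is offline. */
    static boolean shouldCompensate(String xid, Duration inRollbacking,
                                    Duration perTransactionDeadTime, Set<String> onlineServers) {
        return pastDeadTime(inRollbacking, perTransactionDeadTime)
                && !onlineServers.contains(ownerOf(xid));
    }
}
```

With the transaction from this issue, a rollback that has been pending for 4 minutes passes the default 130-second check, but `shouldCompensate` still returns false as long as `193.193.193.37:8097` is in the online set, so the scheduled task would leave it to the node that is already deciding it.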
Feel free to propose more solutions and discuss them with the community.
The server log shows branch 6882078649837270974 being rolled back twice, by two different threads:
2024-11-12 21:34:16.892 INFO --- [SyncProcessing_1_1] [org.apache.seata.server.coordinator.DefaultCore] [lambda$doGlobalRollback$3] [193.193.193.37:8097:6882078649837270973]: Rollback branch transaction successfully, xid = 193.193.193.37:8097:6882078649837270973 branchId = 6882078649837270974
2024-11-12 21:34:16.908 INFO --- [SyncProcessing_1_1] [org.apache.seata.server.coordinator.DefaultCore] [doGlobalRollback] [193.193.193.37:8097:6882078649837270973]: Rollback global transaction successfully, xid = 193.193.193.37:8097:6882078649837270973.
2024-11-12 21:34:16.910 INFO --- [ServerHandlerThread_1_19_500] [org.apache.seata.server.coordinator.DefaultCore] [lambda$doGlobalRollback$3] [193.193.193.37:8097:6882078649837270973]: Rollback branch transaction successfully, xid = 193.193.193.37:8097:6882078649837270973 branchId = 6882078649837270974
Ⅱ. Describe what happened
If there is an exception, please attach the exception trace:
Ⅲ. Describe what you expected to happen
Ⅳ. How to reproduce it (as minimally and precisely as possible)
Minimal yet complete reproducer code (or URL to code):
Ⅴ. Anything else we need to know?
Ⅵ. Environment: