Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Debug #7628

Open
wants to merge 120 commits into
base: master
Choose a base branch
from
Open

Debug #7628

wants to merge 120 commits into from

Conversation

bkchr
Copy link
Member

@bkchr bkchr commented Aug 15, 2023

No description provided.

rphmeier and others added 30 commits February 21, 2022 17:46
* initial stab at candidate_context

* fmt

* docs & more TODOs

* some cleanups

* reframe as inclusion_emulator

* documentations yes

* update types

* add constraint modifications

* watermark

* produce modifications

* v2 primitives: re-export all v1 for consistency

* vstaging primitives

* emulator constraints: handle code upgrades

* produce outbound HRMP modifications

* stack.

* method for applying modifications

* method just for sanity-checking modifications

* fragments produce modifications, not prospectives

* make linear

* add some TODOs

* remove stacking; handle code upgrades

* take `fragment` private

* reintroduce stacking.

* fragment constructor

* add TODO

* allow validating fragments against future constraints

* docs

* relay-parent number and min code size checks

* check code upgrade restriction

* check max hrmp per candidate

* fmt

* remove GoAhead logic because it wasn't helpful

* docs on code upgrade failure

* test stacking

* test modifications against constraints

* fmt

* test fragments

* descending or duplicate test

* fmt

* remove unused imports in vstaging

* wrong primitives

* spellcheck
* inclusion: utility for allowed relay-parents

* inclusion: use prev number instead of prev hash

* track most recent context of paras

* inclusion: accept previous relay-parents

* update dmp  advancement rule for async backing

* fmt

* add a comment about validation outputs

* clean up a couple of TODOs

* weights

* fix weights

* fmt

* Resolve dmp todo

* Restore inclusion tests

* Restore paras_inherent tests

* MostRecentContext test

* Benchmark for new paras dispatchable

* Prepare check_validation_outputs for upgrade

* cargo run --quiet --profile=production  --features=runtime-benchmarks -- benchmark --chain=kusama-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/kusama/src/weights/runtime_parachains_paras.rs

* cargo run --quiet --profile=production  --features=runtime-benchmarks -- benchmark --chain=westend-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/westend/src/weights/runtime_parachains_paras.rs

* cargo run --quiet --profile=production  --features=runtime-benchmarks -- benchmark --chain=polkadot-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/polkadot/src/weights/runtime_parachains_paras.rs

* cargo run --quiet --profile=production  --features=runtime-benchmarks -- benchmark --chain=rococo-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/rococo/src/weights/runtime_parachains_paras.rs

* Implementers guide changes

* More tests for allowed relay parents

* Add a github issue link

* Compute group index based on relay parent

* Storage migration

* Move allowed parents tracker to shared

* Compile error

* Get group assigned to core at the next block

* Test group assignment

* fmt

* Error instead of panic

* Update guide

* Extend doc-comment

* Update runtime/parachains/src/shared.rs

Co-authored-by: Robert Habermeier <[email protected]>

Co-authored-by: Chris Sosnin <[email protected]>
Co-authored-by: Parity Bot <[email protected]>
Co-authored-by: Chris Sosnin <[email protected]>
* docs and skeleton

* subsystem skeleton

* main loop

* fragment tree basics & fmt

* begin fragment trees & view

* flesh out more of view update logic

* further flesh out update logic

* some refcount functions for fragment trees

* add fatal/non-fatal errors

* use non-fatal results

* clear up some TODOs

* ideal format for scheduling info

* add a bunch of TODOs

* some more fluff

* extract fragment graph to submodule

* begin fragment graph API

* trees, not graphs

* improve docs

* scope and constructor for trees

* add some test TODOs

* limit max ancestors and store constraints

* constructor

* constraints: fix bug in HRMP watermarks

* fragment tree population logic

* set::retain

* extract population logic

* implement add_and_populate

* fmt

* add some TODOs in tests

* implement child-selection

* strip out old stuff based on wrong assumptions

* use fatality

* implement pruning

* remove unused ancestor constraints

* fragment tree instantiation

* remove outdated comment

* add message/request types and skeleton for handling

* fmt

* implement handle_candidate_seconded

* candidate storage: handle backed

* implement handle_candidate_backed

* implement answer_get_backable_candidate

* remove async where not needed

* implement fetch_ancestry

* add logic for run_iteration

* add some docs

* remove global allow(unused), fix warnings

* make spellcheck happy (despite English)

* fmt

* bump Cargo.lock

* replace tracing with gum

* introduce PopulateFrom trait

* implement GetHypotheticalDepths

* revise docs slightly

* first fragment tree scope test

* more scope tests

* test add_candidate

* fmt

* test retain

* refactor test code

* test populate is recursive

* test contiguity of depth 0 is maintained

* add_and_populate tests

* cycle tests

* remove PopulateFrom trait

* fmt

* test hypothetical depths (non-recursive)

* have CandidateSeconded return membership

* tree membership requests

* Add a ProspectiveParachainsSubsystem struct

* add a staging API for base constraints

* add a `From` impl

* add runtime API for staging_validity_constraints

* implement fetch_base_constraints

* implement `fetch_upcoming_paras`

* remove reconstruction of candidate receipt; no obvious usecase

* fmt

* export message to broader module

* remove last TODO

* correctly export

* fix compilation and add GetMinimumRelayParent request

* make provisioner into a real subsystem with proper mesage bounds

* fmt

* fix ChannelsOut in overseer test

* fix overseer tests

* fix again

* fmt
* BEGIN ASYNC candidate-backing CHANGES

* rename & document modes

* answer prospective validation data requests

* GetMinimumRelayParents request is now plural

* implement an implicit view utility for backing subsystems

* implicit-view: get allowed relay parents

* refactorings and improvements to implicit view

* add some TODOs for tests

* split implicit view updates into 2 functions

* backing: define State to prepare for functional refactor

* add some docs

* backing: implement bones of new leaf activation logic

* backing: create per-relay-parent-states

* use new handle_active_leaves_update

* begin extracting logic from CandidateBackingJob

* mostly extract statement import from job logic

* handle statement imports outside of job logic

* do some TODO planning for prospective parachains integration

* finish rewriting backing subsystem in functional style

* add prospective parachains mode to relay parent entries

* fmt

* add a RejectedByProspectiveParachains error

* notify prospective parachains of seconded and backed candidates

* always validate candidates exhaustively in backing.

* return persisted_validation_data from validation

* handle rejections by prospective parachains

* implement seconding sanity check

* invoke validate_and_second

* Alter statement table to allow multiple seconded messages per validator

* refactor backing to have statements carry PVD

* clean up all warnings

* Add tests for implicit view

* Improve doc comments

* Prospective parachains mode based on Runtime API version

* Add a TODO

* Rework seconding_sanity_check

* Iterate over responses

* Update backing tests

* collator-protocol: load PVD from runtime

* Fix validator side tests

* Update statement-distribution to fetch PVD

* Fix statement-distribution tests

* Backing tests with prospective paras #1

* fix per_relay_parent pruning in backing

* Test multiple leaves

* Test seconding sanity check

* Import statement order

Before creating an entry in `PerCandidateState` map
wait for the approval from the prospective parachains

* Add a test for correct state updates

* Second multiple candidates per relay parent test

* Add backing tests with prospective paras

* Second more than one test without prospective paras

* Add a test for prospective para blocks

* Update malus

* typos

Co-authored-by: Chris Sosnin <[email protected]>
* Provisioner changes for async backing

* Select candidates based on prospective paras mode

* Revert naming

* Update tests

* Update TODO comment

* review
* Provisioner changes for async backing

* Select candidates based on prospective paras mode

* Revert naming

* Update tests

* Update TODO comment

* review
…o handle versioned packets (#5991)

* BEGIN STATEMENT DISTRIBUTION WORK

create a vstaging network protocol which is the same as v1

* mostly make network bridge amenable to vstaging

* network-bridge: fully adapt to vstaging

* add some TODOs for tests

* fix fallout in bitfield-distribution

* bitfield distribution tests + TODOs

* fix fallout in gossip-support

* collator-protocol: fix message fallout

* collator-protocol: load PVD from runtime

* add TODO for vstaging tests

* make things compile

* set used network protocol version using a feature

* fmt

* get approval-distribution building

* fix approval-distribution tests

* spellcheck

* nits

* approval distribution net protocol test

* bitfield distribution net protocol test

* Revert "collator-protocol: fix message fallout"

This reverts commit 07cc887.

* Network bridge tests

Co-authored-by: Chris Sosnin <[email protected]>
* remove max_pov_size requirement from prospective pvd request

* fmt
* add compatibility type to v2 statement distribution message

* warning cleanup

* handle compatibility layer for v2

* clean up an unimplemented!() block

* circulate statements based on version

* extract legacy v1 code into separate module

* remove unimplemented

* clean up naming of from_requester/responder

* remove TODOs

* have backing share seconded statements with PVD

* fmt

* fix warning

* Quick fix unused warning for not yet implemented/used staging messages.

* Fix network bridge test

* Fix wrong merge.

We now have 23 subsystems (network bridge split + prospective
parachains)

Co-authored-by: Robert Klotzner <[email protected]>
* Fix backing tests

* Fix warnings.
* Draft collator side changes

* Start working on collations management

* Handle peer's view change

* Versioning on advertising

* Versioned collation fetching request

* Handle versioned messages

* Improve docs for collation requests

* Add spans

* Add request receiver to overseer

* Fix collator side tests

* Extract relay parent mode to lib

* Validator side draft

* Add more checks for advertisement

* Request pvd based on async backing mode

* review

* Validator side improvements

* Make old tests green

* More fixes

* Collator side tests draft

* Send collation test

* fmt

* Collator side network protocol versioning

* cleanup

* merge artifacts

* Validator side net protocol versioning

* Remove fragment tree membership request

* Resolve todo

* Collator side core state test

* Improve net protocol compatibility

* Validator side tests

* more improvements

* style fixes

* downgrade log

* Track implicit assignments

* Limit the number of seconded candidates per para

* Add a sanity check

* Handle fetched candidate

* fix tests

* Retry fetch

* Guard against dequeueing while already fetching

* Reintegrate connection management

* Timeout on advertisements

* fmt

* spellcheck

* update tests after merge
rphmeier and others added 20 commits July 13, 2023 17:52
* shared: fix acquire_info

* backwards-compat test for prospective parachains

* same relay parent is allowed
* return candidates hash from prospective parachains

* update provisioner

* update tests

* guide changes

* send a single message to backing

* fix test
* revert to old `handle_new_activations` logic

* gate sending messages on scheduled cores to max_depth >= 2

* fmt

* 2->1
* fix a bug in backing

* add some more logs

* prospective parachains: take ancestry only up to session bounds

* add test
Signed-off-by: Andrei Sandu <[email protected]>
* update metric name

* update some metric names

* avoid cycles when creating fake candidates

* make undying collator more friendly to malformed parents

* fix a bug in malus
@bkchr bkchr changed the base branch from rh-async-backing-feature to master August 15, 2023 20:00
@bkchr bkchr requested review from chevdor and a team as code owners August 15, 2023 20:00
@paritytech-ci paritytech-ci requested review from a team August 15, 2023 20:00
@bkchr bkchr changed the base branch from master to rh-async-backing-feature August 15, 2023 20:03
@rphmeier
Copy link
Contributor

rphmeier commented Aug 15, 2023

what I actually think is happening is probably not an issue with Cumulus, but rather that
a) the test somehow ends up building a relay chain where one of the candidates is bad
b) in master, Cumulus can still build upon it because Malus was using a valid but previous head-data
c) in the feature branch, Cumulus cannot build upon it because Malus is creating invalid head-data for the bad candidates

This would be confirmed by the following log pattern in this PR:

TRACE "producing candidate"
No error "Could not decode the head data" (it may decode properly but not be valid)
No INFO "starting collation"

i.e. taking the None branch after check_block_status here.

https://github.com/paritytech/cumulus/blob/7f0880627bedd413305741efac9573a306d273a1/client/collator/src/lib.rs#L89-L104

(of course we encountered a heisenbug here which is that the test passed after adding logging to the collator to confirm this)

while we do see the log
cumulus-collator: [Parachain] Skipping candidate production, because block is unknown. block_hash=0xabfa0028d067680b6864816fba225254bd32d04f409858eb9fb500ec47f01711 which confirms the theory, somehow the collator started later than normal and this caused the test to pass.

@rphmeier
Copy link
Contributor

cc @ordian

@rphmeier
Copy link
Contributor

The same behavior is observable, for instance in #7627

ERROR tokio-runtime-worker test_parachain_undying_collator: Unable to build on top of HeadData { number: 3, parent_hash: [184, 32, 130, 93, 163, 77, 113, 94, 100, 34, 123, 240, 238, 206, 140, 39, 135, 199, 113, 49, 166, 226, 101, 221, 105, 174, 75, 150, 223, 41, 31, 184], post_state: [125, 243, 151, 142, 253, 68, 243, 89, 136, 140, 235, 169, 139, 174, 0, 42, 22, 216, 157, 126, 42, 223, 45, 83, 144, 103, 102, 193, 237, 233, 244, 212] }: StateMismatch  

Base automatically changed from rh-async-backing-feature to master August 18, 2023 16:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants