Collect height and state data from chain #96
Changes to the monitor package:

```diff
@@ -44,6 +44,24 @@ type (
 		Total() (int64, error)
 	}
 
+	HeightFetcher interface {
+		FetchHeight() []HeightData
+	}
+
+	StateFetcher interface {
+		FetchState(height uint32) []StateData
+	}
+
+	HeightData struct {
+		Host  string
+		Value uint32
+	}
+
+	StateData struct {
+		Host  string
+		Value string
+	}
+
 	Node struct {
 		ID      uint64
 		Address string
@@ -75,6 +93,8 @@ type (
 		sideBlFetcher BalanceFetcher
 		mainBlFetcher BalanceFetcher
 		cnrFetcher    ContainerFetcher
+		heightFetcher HeightFetcher
+		stateFetcher  StateFetcher
 	}
 
 	Args struct {
@@ -90,6 +110,8 @@ type (
 		SideBlFetcher BalanceFetcher
 		MainBlFetcher BalanceFetcher
 		CnrFetcher    ContainerFetcher
+		HeightFetcher HeightFetcher
+		StateFetcher  StateFetcher
 	}
 )
@@ -110,6 +132,8 @@ func New(p Args) *Monitor {
 		sideBlFetcher: p.SideBlFetcher,
 		mainBlFetcher: p.MainBlFetcher,
 		cnrFetcher:    p.CnrFetcher,
+		heightFetcher: p.HeightFetcher,
+		stateFetcher:  p.StateFetcher,
 	}
 }
@@ -130,6 +154,8 @@ func (m *Monitor) Start(ctx context.Context) {
 	prometheus.MustRegister(alphabetMainDivergence)
 	prometheus.MustRegister(alphabetSideDivergence)
 	prometheus.MustRegister(containersNumber)
+	prometheus.MustRegister(chainHeight)
+	prometheus.MustRegister(chainState)
 
 	go func() {
 		err := m.metricsServer.ListenAndServe()
@@ -195,6 +221,9 @@ func (m *Monitor) Job(ctx context.Context) {
 
 	m.processContainersNumber()
 
+	minHeight := m.processChainHeight()
+	m.processChainState(minHeight)
+
```
Review thread on the `m.processChainState(minHeight)` call:

> What thoughts are behind getting the minimal height's state?

> Nodes are not synchronized perfectly; we want to compare their states, so some common index should be used for that.

> Well, neo-go's understanding of "state" may differ from mine, OK. What if some node got stuck at some height forever?

> I think exactly this situation is what we want to detect. In this case, alerting should notify us about different states in the block or different heights on the nodes.

> In this case its height shouldn't be considered for the purpose of […]

> But it won't be an error, I guess, just a node that always responds with the same height.

> Having the wrong constant value may be a good point in this case. According to the newly added statistics, the state metric will get stuck at the same value for all nodes (because the minimum is a constant). That is a moment when something is already going wrong. Is a statistic about height from the other nodes useful at that moment? I'm asking because I'm not familiar with these parts of the system. On the other hand, the height metric will show the differences and alert about it.

> I asked because I think the current state (a node has height N, and the state at that node is X) should be shown in the metrics. Any logic about a state at the […]

> We need something we can compare, hence all of this logic. But nodes aren't perfectly synchronized, so exporting just the latest state is a problem. Maybe we could export some set of values, but that would mean a set of requests as well. Let's add an issue to improve this logic; we can add some filters for stale nodes.
```diff
 	select {
 	case <-time.After(m.sleep):
 		// sleep for some time before next prometheus update
@@ -498,3 +527,33 @@ func (m *Monitor) processContainersNumber() {
 
 	containersNumber.Set(float64(total))
 }
+
+func (m *Monitor) processChainHeight() uint32 {
+	var minHeight uint32
+	heightData := m.heightFetcher.FetchHeight()
+
+	for _, d := range heightData {
+		chainHeight.WithLabelValues(d.Host).Set(float64(d.Value))
```
Review thread on the `chainHeight.WithLabelValues` line:

> Why reset states but not reset the heights?

> ```
> neofs_net_monitor_chain_height{host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
> neofs_net_monitor_chain_height{host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06
> neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
> neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06
> ```
>
> The state metric is labeled with a host and a hash, and the hash is different every time. That means the Prometheus data grows on every request; eventually the application will be terminated.

> But what if one host is down? It could even be down forever while the monitor instance has not been restarted. Does that mean the metrics will still show this host's height, and that height will be constant (and not a true height, because there is no host and no height at all)?

> You are right. If the host is down, its data will be represented by a constant value; on the monitoring dashboard it will be a straight line. In some ways the situations are equivalent: the host has the wrong height.
```diff
+
+		if minHeight == 0 || d.Value < minHeight {
+			minHeight = d.Value
+		}
+	}
+
+	return minHeight
+}
+
+func (m *Monitor) processChainState(height uint32) {
+	if height == 0 {
+		return
+	}
+
+	stateData := m.stateFetcher.FetchState(height)
+	chainState.Reset()
+
+	h := float64(height)
+
+	for _, d := range stateData {
+		chainState.WithLabelValues(d.Host, d.Value).Set(h)
+	}
+}
```
New file, package `multinodepool` (@@ -0,0 +1,127 @@):

```go
package multinodepool

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/nspcc-dev/neo-go/pkg/rpcclient"
	"github.com/nspcc-dev/neofs-net-monitor/pkg/monitor"
)

// Pool collects data from each node.
type Pool struct {
	endpoints   []string
	dialTimeout time.Duration
	clients     []*rpcclient.Client
}

func NewPool(endpoints []string, dialTimeout time.Duration) *Pool {
	return &Pool{
		endpoints:   endpoints,
		dialTimeout: dialTimeout,
		clients:     make([]*rpcclient.Client, len(endpoints)),
	}
}

func (c *Pool) Dial(ctx context.Context) error {
	opts := rpcclient.Options{DialTimeout: c.dialTimeout}

	for i, ep := range c.endpoints {
		neoClient, err := neoGoClient(ctx, ep, opts)
		if err != nil {
			return fmt.Errorf("neoGoClient: %w", err)
		}

		c.clients[i] = neoClient
	}

	return nil
}

func (c *Pool) FetchHeight() []monitor.HeightData {
	var (
		heights    []monitor.HeightData
		wg         sync.WaitGroup
		heightChan = make(chan monitor.HeightData, len(c.clients))
	)

	for _, cl := range c.clients {
		wg.Add(1)

		go func(cl *rpcclient.Client) {
			defer wg.Done()

			stHeight, err := cl.GetStateHeight()
			if err != nil {
				log.Printf("GetStateHeight for %s: %v", cl.Endpoint(), err)
				return
			}

			heightChan <- monitor.HeightData{
				Host:  cl.Endpoint(),
				Value: stHeight.Local,
			}
		}(cl)
	}

	go func() {
		wg.Wait()
		close(heightChan)
	}()

	for height := range heightChan {
		heights = append(heights, height)
	}

	return heights
}
```
```go
func (c *Pool) FetchState(height uint32) []monitor.StateData {
	var (
		states    []monitor.StateData
		wg        sync.WaitGroup
		stateChan = make(chan monitor.StateData, len(c.clients))
	)

	for _, cl := range c.clients {
		wg.Add(1)

		go func(cl *rpcclient.Client) {
			defer wg.Done()

			stHeight, err := cl.GetStateRootByHeight(height)
			if err != nil {
				log.Printf("GetStateRootByHeight for %s: %v", cl.Endpoint(), err)
				return
			}

			stateChan <- monitor.StateData{
				Host:  cl.Endpoint(),
				Value: stHeight.Hash().String(),
			}
		}(cl)
	}

	go func() {
		wg.Wait()
		close(stateChan)
	}()

	for state := range stateChan {
		states = append(states, state)
	}

	return states
}

func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) {
	cli, err := rpcclient.New(ctx, endpoint, opts)
```
Review thread on the `rpcclient.New(ctx, endpoint, opts)` line:

> This context had better be time-limited. But in order to make use of that, you need to create a client for every set of requests (#3026). While it's suboptimal, it's still a good enough choice for the purpose; we query nodes every N seconds.

> Context is still not limited.

> The context is limited here when dialing.
```go
	if err != nil {
		return nil, fmt.Errorf("can't create neo-go client: %w", err)
	}

	return cli, nil
}
```
Review thread on the state metric:

> How do we know which block number this state corresponds to?

> The value of this metric is the block number.

> Not sure @532910 will be excited about it, but we can try.

> Anyway, we should set some float/int value as the metric's value. It can be `1`, with the block number moved to the tags, or we can use the block number as the metric value. I consider the block number a good choice for the value, but I don't mind changing it if required.