Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect height and state data from chain #96

Merged
merged 1 commit into from
Nov 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions cmd/neofs-net-monitor/monitor.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (
"github.com/nspcc-dev/neofs-net-monitor/pkg/monitor"
"github.com/nspcc-dev/neofs-net-monitor/pkg/morphchain"
"github.com/nspcc-dev/neofs-net-monitor/pkg/morphchain/contracts"
"github.com/nspcc-dev/neofs-net-monitor/pkg/multinodepool"
"github.com/nspcc-dev/neofs-net-monitor/pkg/pool"
"github.com/spf13/viper"
"go.uber.org/zap"
Expand Down Expand Up @@ -123,6 +124,11 @@ func New(ctx context.Context, cfg *viper.Viper) (*monitor.Monitor, error) {
logger.Info("neofs contract ignored")
}

mnPool := multinodepool.NewPool(sideChainEndpoints, cfg.GetDuration(cfgMetricsInterval))
if err = mnPool.Dial(ctx); err != nil {
return nil, fmt.Errorf("multinodepool: %w", err)
}

return monitor.New(monitor.Args{
Balance: balance,
Proxy: proxy,
Expand All @@ -136,5 +142,7 @@ func New(ctx context.Context, cfg *viper.Viper) (*monitor.Monitor, error) {
SideBlFetcher: sideBalanceFetcher,
MainBlFetcher: mainBalanceFetcher,
CnrFetcher: cnrFetcher,
HeightFetcher: mnPool,
StateFetcher: mnPool,
}), nil
}
22 changes: 22 additions & 0 deletions pkg/monitor/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -167,4 +167,26 @@ var (
Help: "Number of available containers",
},
)

chainHeight = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Subsystem: "neofs_net_monitor",
Name: "chain_height",
Help: "Chain height in blocks",
},
[]string{
"host",
},
)

chainState = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Subsystem: "neofs_net_monitor",
Name: "chain_state",
Help: "Chain state hash in specific height",
},
[]string{
"host", "hash",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know which block number this state corresponds to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of this metric is a block number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.464418e+06
neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.464418e+06

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure @532910 will be excited about it, but we can try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, we should set some float/int value for the metric value. It can be 1, and we move the block number to the tags or use this number as a metric value. I consider block number as a value is a good choice, but I don't mind changing it, if required

},
)
)
59 changes: 59 additions & 0 deletions pkg/monitor/monitor.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,24 @@ type (
Total() (int64, error)
}

HeightFetcher interface {
FetchHeight() []HeightData
}

StateFetcher interface {
FetchState(height uint32) []StateData
}

HeightData struct {
Host string
Value uint32
}

StateData struct {
Host string
Value string
}

Node struct {
ID uint64
Address string
Expand Down Expand Up @@ -75,6 +93,8 @@ type (
sideBlFetcher BalanceFetcher
mainBlFetcher BalanceFetcher
cnrFetcher ContainerFetcher
heightFetcher HeightFetcher
stateFetcher StateFetcher
}

Args struct {
Expand All @@ -90,6 +110,8 @@ type (
SideBlFetcher BalanceFetcher
MainBlFetcher BalanceFetcher
CnrFetcher ContainerFetcher
HeightFetcher HeightFetcher
StateFetcher StateFetcher
}
)

Expand All @@ -110,6 +132,8 @@ func New(p Args) *Monitor {
sideBlFetcher: p.SideBlFetcher,
mainBlFetcher: p.MainBlFetcher,
cnrFetcher: p.CnrFetcher,
heightFetcher: p.HeightFetcher,
stateFetcher: p.StateFetcher,
}
}

Expand All @@ -130,6 +154,8 @@ func (m *Monitor) Start(ctx context.Context) {
prometheus.MustRegister(alphabetMainDivergence)
prometheus.MustRegister(alphabetSideDivergence)
prometheus.MustRegister(containersNumber)
prometheus.MustRegister(chainHeight)
prometheus.MustRegister(chainState)

go func() {
err := m.metricsServer.ListenAndServe()
Expand Down Expand Up @@ -195,6 +221,9 @@ func (m *Monitor) Job(ctx context.Context) {

m.processContainersNumber()

minHeight := m.processChainHeight()
m.processChainState(minHeight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what thoughts are behind getting the minimal height's state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nodes are not synchronized perfectly, we want to compare their states, some common index should be used for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, neo-go's understanding of "state" may differ from mine, ok

what if some node got stacked at some height forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exactly this situation we want to detect. In this case, alerting should notify us about different states in the block or different heights on the nodes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case its height shouldn't be considered for the purpose of minHeight calculation. But it can be the source of data for alerts (this difference in height).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skip if get an error.

but it wont be an error i guess. just a node that always responds with the same height

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the wrong constant value may be a good point in this case. According to the newly added statistics, the state metric will stuck for all nodes in the same value (because minimal is a constant). This is a moment when something is already going wrong. Is a statistic about height from other nodes useful at this moment? I'm asking because not familiar with these moments of the system.

On the other hand, height metric will have differences and alert about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i asked, cause i think that the current state (a node has N height, and the state at the node is X) should be shown in the metrics. any logic about a state at the minHeight is confusing, IMO. but i am not a person who looks at metrics (especially neo-go metrics). @roman-khimov is the right person to make a decision here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need something we can compare, therefore all of this logic. But nodes aren't perfectly synchornized, so exporting just the latest is a problem. Maybe we can export some set of values but that would mean a set of requests as well. Let's add an issue to improve this logic, we can add some filters for stale nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


select {
case <-time.After(m.sleep):
// sleep for some time before next prometheus update
Expand Down Expand Up @@ -498,3 +527,33 @@ func (m *Monitor) processContainersNumber() {

containersNumber.Set(float64(total))
}

func (m *Monitor) processChainHeight() uint32 {
var minHeight uint32
heightData := m.heightFetcher.FetchHeight()

for _, d := range heightData {
chainHeight.WithLabelValues(d.Host).Set(float64(d.Value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why reset states but not reset the heights?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neofs_net_monitor_chain_height{host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_height{host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

The state contains a host and a hash and this hash every time is different. This fact leads us to the situation when Prometheus data grows for each request. In this case, eventually, the application will be terminated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what if one host is down? for example, it can be down even forever but a monitor instance has not been restarted since. does it mean that the metrics will still have this host's height? and its height will be constant (and that will be not a true height cause there is no host and no height at all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. If the host is down, then the data will be represented with the constant value. For the monitoring dashboard, it will be a straight line. In some ways, the situations are equivalent - the host has the wrong height.
There is question to the dashboard owner and viewer - should we reset data or left old values


if minHeight == 0 || d.Value < minHeight {
minHeight = d.Value
}
}

return minHeight
}

func (m *Monitor) processChainState(height uint32) {
if height == 0 {
return
}

stateData := m.stateFetcher.FetchState(height)
chainState.Reset()

h := float64(height)

for _, d := range stateData {
chainState.WithLabelValues(d.Host, d.Value).Set(h)
}
}
127 changes: 127 additions & 0 deletions pkg/multinodepool/pool.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
package multinodepool

import (
"context"
"fmt"
"log"
"sync"
"time"

"github.com/nspcc-dev/neo-go/pkg/rpcclient"
"github.com/nspcc-dev/neofs-net-monitor/pkg/monitor"
)

// Pool collects data from each node.
type Pool struct {
endpoints []string
dialTimeout time.Duration
clients []*rpcclient.Client
}

func NewPool(endpoints []string, dialTimeout time.Duration) *Pool {
return &Pool{
endpoints: endpoints,
dialTimeout: dialTimeout,
clients: make([]*rpcclient.Client, len(endpoints)),
}
}

func (c *Pool) Dial(ctx context.Context) error {
opts := rpcclient.Options{DialTimeout: c.dialTimeout}

for i, ep := range c.endpoints {
neoClient, err := neoGoClient(ctx, ep, opts)
if err != nil {
return fmt.Errorf("neoGoClient: %w", err)
}

c.clients[i] = neoClient
}

return nil
}

func (c *Pool) FetchHeight() []monitor.HeightData {
roman-khimov marked this conversation as resolved.
Show resolved Hide resolved
var (
heights []monitor.HeightData
wg sync.WaitGroup
heightChan = make(chan monitor.HeightData, len(c.clients))
)

for _, cl := range c.clients {
wg.Add(1)

go func(cl *rpcclient.Client) {
defer wg.Done()

stHeight, err := cl.GetStateHeight()
if err != nil {
log.Printf("GetStateHeight for %s: %v", cl.Endpoint(), err)
return
}

heightChan <- monitor.HeightData{
Host: cl.Endpoint(),
Value: stHeight.Local,
}
}(cl)
}

go func() {
wg.Wait()
close(heightChan)
}()

for height := range heightChan {
heights = append(heights, height)
}

return heights
}

func (c *Pool) FetchState(height uint32) []monitor.StateData {
var (
states []monitor.StateData
wg sync.WaitGroup
stateChan = make(chan monitor.StateData, len(c.clients))
)

for _, cl := range c.clients {
wg.Add(1)

go func(cl *rpcclient.Client) {
defer wg.Done()

stHeight, err := cl.GetStateRootByHeight(height)
if err != nil {
log.Printf("GetStateRootByHeight for %s: %v", cl.Endpoint(), err)
return
}

stateChan <- monitor.StateData{
Host: cl.Endpoint(),
Value: stHeight.Hash().String(),
}
}(cl)
}

go func() {
wg.Wait()
close(stateChan)
}()

for state := range stateChan {
states = append(states, state)
}

return states
}

func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) {
cli, err := rpcclient.New(ctx, endpoint, opts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This context better be time-limited. But in order to make use of it you need to create a client for every set of requests (#3026). While it's suboptimal, it's still a good enough choice for the purpose, we query nodes every N seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context is still not limited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context is limited here when Dialing

if err != nil {
return nil, fmt.Errorf("can't create neo-go client: %w", err)
}

return cli, nil
}
Loading