AEP-014: Exponential decay in metrics influence on the score

Context

For each node, a global score is computed from the metrics. The global score is computed as a value between 0 and 1, and is rounded to a percentage when displayed.

This score is used to advertise the reliability of a node to users of the network and for the distribution of rewards during the bootstrap phase of the network.

The score of a node is based on the reference period, currently the previous two weeks of metrics, and is published every four hours.

Computing the score relies on the use of percentiles:

  • the 25th percentile is the value below which 25% of measurements fall (75% are higher). This represents “good” measurements.
  • the 95th percentile is the highest value after discarding the worst 5% of measurements. This represents the “worst acceptable” measurements and can be related to an SLA.
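As an illustration of the two percentiles (the response-time values below are hypothetical, not real node metrics):

```python
import numpy as np

# Hypothetical response times for one node, in seconds
measurements = np.array([0.8, 0.9, 1.0, 1.1, 1.2, 1.5, 2.0, 3.5, 9.0, 30.0])

p25 = np.percentile(measurements, 25)  # "good": 25% of measurements are below this
p95 = np.percentile(measurements, 95)  # "worst acceptable": only 5% are above this
print(p25, p95)  # → 1.025 20.55
```

Note how the 95th percentile is dominated by the few slow outliers, which is why it maps naturally to an SLA-style "worst acceptable" figure.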

Situation

An operational issue can bring a node down for multiple hours or days. When the period of downtime is longer than 5% of the reference period, the score of the node falls to zero.

The score of the node remains at zero as long as the duration of all missing or bad measurements exceeds 5% of the reference period.

Problem

  1. After fixing the problems that caused a downtime, a node operator has no way to tell whether the corrective action will eventually restore the score.

  2. The duration of the reference period is arbitrary and makes it impossible to distinguish reliable nodes from nodes that were unreliable before the start of the reference period.

Proposal

1. Exponential decay

Apply an exponential decay to the weight of each measurement in the current score of the node.

In case of an incident, older metrics would then have a lower impact on the score than newer ones.
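A minimal sketch of how such a decay could weight measurements (the function names, the half-life parameter, and its value are illustrative assumptions, not the actual implementation):

```python
import numpy as np

def decay_weights(ages_hours, half_life_hours=84.0):
    """Weight of each measurement given its age in hours.

    half_life_hours is a hypothetical steepness knob: a measurement that
    old counts half as much as a fresh one (84 h is a quarter of the
    two-week reference period).
    """
    return 0.5 ** (np.asarray(ages_hours, dtype=float) / half_life_hours)

def decayed_score(values, ages_hours):
    """Exponentially weighted mean of per-measurement scores in [0, 1]."""
    w = decay_weights(ages_hours)
    return float(np.sum(w * np.asarray(values, dtype=float)) / np.sum(w))

# A failed measurement (0.0) from a week ago drags the score down less
# than the same failure from one hour ago:
old_incident = decayed_score([0.0, 1.0, 1.0], [168.0, 24.0, 1.0])
new_incident = decayed_score([1.0, 1.0, 0.0], [168.0, 24.0, 1.0])
print(old_incident > new_incident)  # → True
```

With this shape, the score of a node automatically rises toward its maximum as an incident ages, without any change to the underlying metrics.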

2. No cutoff

Adjust the scoring formulas to always produce a score, even a very low one, instead of cutting off to zero as soon as bad or missing metrics exceed 5% of the reference period.
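One possible smooth replacement for the hard cutoff is sketched below (the exponential shape and the steepness value `k` are assumptions for illustration, not the proposed formula):

```python
import math

def availability_factor(bad_fraction, k=20.0):
    """Map the share of missing/bad measurements to a factor in (0, 1].

    Instead of dropping to zero above 5% of bad metrics, the factor
    decreases smoothly; k is a hypothetical steepness parameter chosen
    so that 5% bad metrics yields exp(-1), roughly 0.37.
    """
    return math.exp(-k * bad_fraction)

print(availability_factor(0.00))  # → 1.0
print(availability_factor(0.05))  # ~0.37 instead of the current hard zero
print(availability_factor(0.20))  # ~0.02: very low, but still visible
```

With such a shape, a node that was down for a day of the two-week period still shows a low but non-zero score, which then recovers as the outage ages out of the weighting.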


Combined, these two measures should help the score of a node increase over time after an incident until it reaches its maximum value.

Node operators and users would therefore have an indicator that the reliability of the node is increasing over time and expected to go back to normal.


I was wondering if we could add a property to play with the steepness as well, so we can easily adjust it in the future if needed.

import numpy as np
import matplotlib.pyplot as plt

# Define the exponential decay function
def exponential_decay(t, k):
    return 100 * np.exp(-k * t)

# Time variable from 0 to 10
t = np.linspace(0, 10, 500)

# Example values for k
k_values = [0.2, 0.5, 1.0, 2.0]

# Plotting the exponential decay for different values of k
plt.figure(figsize=(10, 6))

for k in k_values:
    plt.plot(t, exponential_decay(t, k), label=f'k={k}')

plt.title('Exponential Decay with Different Steepness')
plt.xlabel('Time (t)')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()

Thank you for addressing this.

I personally like suggestion 1 (Exponential decay), including the thought of configurable steepness.

As a possible alternative to 2 (No cutoff), I’d like to suggest repurposing the traffic light shown on the accounts page. If I get it correctly, as of today, the traffic light interpretation is:
green = earning rewards
red = not earning rewards

A possible alternative would be
green = At the last measuring point, the node was okay. The score will improve if the node remains like this.
red = At the last measuring point, the node was in a failure state. The score will go down if the node remains like this.

This usage of the traffic light would always provide immediate feedback if the node is okay with its current config or not - without need to observe the score over a longer period of time to draw conclusions from it.

If the node earns rewards or not shall be depending on a specified score threshold.

Regardless of what changes may be made to the calculations, I think having the actual score displayed would be incredibly helpful, rather than just zeroing it out as can happen now.

For the calculation, would it be possible to essentially invert the current method of calculating a score so it’s counting down from 100% rather than up from 0%? This assumes everything looks okay and there aren’t any obvious issues with the setup and metrics. I feel like this could eliminate that initial ramp-up period that can be days or weeks with no score/rewards.

Thanks for your feedback!

I was wondering if we could add a property to play with the steepness as well, so we can easily adjust it in the future if needed.

Calibration of the steepness and of the exponent function will be required; however, the implementation must be delegated as much as possible to the database where the data resides.

green = At the last measuring point, the node was okay. The score will improve if the node remains like this.

This suggestion to display the trend instead of the value is interesting, and it would be worth adding this information to the UI.

I have a few remarks on this topic however:

  • a. The score itself is still relevant, however, as a node receives no reward while its score is low, no matter whether the trend is good or bad.
  • b. I would not want a node to appear bad if it is under short maintenance (ex: a few minutes to reboot after a kernel security update).
  • c. Metrics can be quite noisy, so looking only at the latest metrics message would create too much variation to be useful.
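To illustrate remark c, a trend indicator could compare two short averaging windows instead of the single latest metrics message. This is only a sketch; the window size and threshold are illustrative assumptions:

```python
import numpy as np

def trend_indicator(scores, window=6, threshold=0.01):
    """Compare the mean of the most recent `window` scores with the mean
    of the `window` before it. A window of 6 four-hourly publications
    (one day) is a hypothetical choice to smooth out metric noise.
    """
    scores = np.asarray(scores, dtype=float)
    if len(scores) < 2 * window:
        return "flat"  # not enough history to call a trend
    recent = scores[-window:].mean()
    previous = scores[-2 * window:-window].mean()
    if recent > previous + threshold:
        return "up"
    if recent < previous - threshold:
        return "down"
    return "flat"

print(trend_indicator([0.2] * 6 + [0.8] * 6))  # → up
```

Averaging over a day of publications hides a few-minute reboot (remark b) while still reacting within hours to a real recovery or degradation.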

For the calculation, would it be possible to essentially invert the current method of calculating a score so it’s counting down from 100% rather than up from 0%?

I assume there is a misunderstanding about how metrics work. The metrics do not start counting from 0% or 100%.

When a node endpoint does not respond for a specific metric, the value is recorded as null. However, when no measurement is made on the node, no metric is written and the score is not impacted.
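The distinction between a null value and an absent measurement can be sketched like this (the data layout is a hypothetical simplification, not the actual database schema):

```python
# Hypothetical measurement rows for one node. None models a null value
# (the endpoint did not respond); a period with no row at all means no
# measurement was attempted, so it does not influence the score.
rows = [0.9, None, 0.8, 0.85]

attempted = len(rows)                    # measurements that were made
bad = sum(1 for v in rows if v is None)  # nulls count against the node
bad_fraction = bad / attempted
print(bad_fraction)  # → 0.25
```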

Is this clearer?