Aleph.im Scoring Update - November 2024

TL;DR: Aleph.im is updating its scoring algorithm to provide better feedback to users and node operators.

Context

Aleph.im is a decentralized cloud platform that operates through a network of independent servers, or nodes, which offer cloud computing resources to its users.

To ensure high-quality service, it’s crucial that all nodes perform reliably, undergo regular updates and maintenance, and maintain optimal uptime.

The reliability and performance of nodes within the aleph.im network are assessed using the following principles:

  1. Metrics: Measurements of the performance and reliability of the nodes distributed worldwide.
  2. Scores: Global performance and availability indicators computed from the metrics.
  3. Rewards: Incentives that use the scores to encourage the community to provide well-performing nodes.

See the documentation on Node Reliability for details.

Current situation

Scores are reproducibly generated from the metrics and span a scale from 0 to 100%.

Currently, a node’s score is calculated using the previous two weeks’ worth of observation data. The bottom 5% of metrics are excluded to minimize noise and accommodate server maintenance.

This approach has been working well so far, but has two main drawbacks:

  1. After resolving issues on a node, node operators cannot immediately determine if their corrective actions will enhance their scores. They must wait up to two weeks for their scores to recover.
  2. The reference period is arbitrarily set. Users cannot differentiate nodes that have been working well all along from those that were unreliable before the observation window began.

Metrics update

In October, an update improved metric reliability by decreasing the number of concurrent measurements and expanding measurement locations across different continents. This change significantly reduced noise and opened up new possibilities for scoring methods.

New scoring

A new approach to scoring was discussed earlier this year in the form of AEP-014: Exponential decay in metrics influence on the score.

This approach has been refined and implemented using a Geometric distribution, and by moving most of the logic inside SQL queries that anyone can run directly in the database of a Core Channel Node.

Formula

The new scoring is computed as follows:

  1. A multiplier is computed for every hour in the past, based on a combination of two geometric distributions.

    $$\mathrm{geometric\_pmf}(p, x) = (1 - p) \cdot p^{x - 1}$$

    $$\mathrm{multiplier} = \mathrm{geometric\_pmf}(p_1, \mathrm{hours\_difference}) \cdot m_1 + \mathrm{geometric\_pmf}(p_2, \mathrm{hours\_difference}) \cdot m_2$$

    Here, $p_1$ is adjusted to emphasize recent metrics, while $p_2$ is tuned to favour older metrics.

    Meanwhile, $m_1$ and $m_2$ serve as proportional multipliers to ensure the total remains within the range $[0, 1]$.

  2. For every hour, a partial score is computed from the metrics measured during that hour. When multiple measurements are present, the 67th percentile is used (the worst third is ignored). The partial scores are multiplied together, and fractional exponents remove the bias introduced by the multiplication. When the version of the software running that hour was invalid, the partial score is set to zero.

  3. The multiplier and the partial score are multiplied for every hour of the past years, and the results are summed.

    $$\mathrm{score} = \sum_{h=-1}^{\mathrm{history}} \mathrm{multiplier}(h) \cdot \mathrm{partial\_score}(h) \cdot \mathrm{version\_valid} \cdot \mathrm{tuning}$$

    Here, $\mathrm{tuning}$ is a constant chosen such that most nodes have a score between 80% and 100%.
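The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the production implementation (which, as noted above, lives mostly in SQL queries). The parameter values `P1`, `P2`, `M1`, `M2` and `TUNING`, the latency-based metric model, and the helper `latency_to_quality` are all hypothetical. The sketch also assumes that `hours_difference` starts at 1 for the most recent hour, and that the "fractional exponents" amount to a geometric mean.

```python
import math

# Hypothetical parameters, chosen only for illustration (not the production values).
P1, M1 = 0.95, 0.5    # fast decay: emphasizes recent hours
P2, M2 = 0.999, 0.5   # slow decay: keeps older hours relevant
TUNING = 1.0          # placeholder for the tuning constant


def geometric_pmf(p: float, x: int) -> float:
    """(1 - p) * p^(x - 1), as written in the formula above."""
    return (1 - p) * p ** (x - 1)


def multiplier(hours_difference: int) -> float:
    """Step 1: weight of the hour lying `hours_difference` hours in the past."""
    return (geometric_pmf(P1, hours_difference) * M1
            + geometric_pmf(P2, hours_difference) * M2)


def percentile(values: list[float], q: float) -> float:
    """q-th percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def latency_to_quality(latency: float, worst: float = 2.0) -> float:
    """Hypothetical normalization: 0 s maps to 1.0, `worst` seconds or more to 0.0."""
    return min(1.0, max(0.0, (worst - latency) / worst))


def partial_score(latencies_by_metric: dict[str, list[float]], version_valid: bool) -> float:
    """Step 2: geometric mean of the per-metric qualities measured in one hour."""
    if not version_valid:
        return 0.0
    qualities = [latency_to_quality(percentile(values, 67))
                 for values in latencies_by_metric.values() if values]
    if not qualities:
        return 0.0  # assumption: an hour without measurements contributes nothing
    return math.prod(q ** (1 / len(qualities)) for q in qualities)


def score(hourly_partial_scores: list[float]) -> float:
    """Step 3: weighted sum; index 0 is the most recent hour (hours_difference = 1)."""
    total = sum(multiplier(h) * s
                for h, s in enumerate(hourly_partial_scores, start=1))
    return min(1.0, total * TUNING)  # assumption: scores are capped at 100%
```

With these illustrative parameters, a 24-hour outage during the most recent day (`[0.0] * 24 + [1.0] * 312`) lowers the score more than the same outage two weeks ago (`[1.0] * 312 + [0.0] * 24`), because the multiplier decreases with age. The absolute values produced by this sketch are not meaningful, since the real parameters are not given here.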

Impact

The value of the geometric distribution decreases over time but never reaches zero. The multiplier in the formula above follows this trend:

The most recent metrics have the highest impact on the score, and nodes recovering from downtime see their score increase rapidly.

Since the multiplier never reaches zero, the highest scores are assigned to long-running nodes that have operated well with the least downtime.

The yellow vertical bar corresponds to two weeks (336 hours). The theoretical minimum period for a new node to reach a score of at least 80%, the minimum required to receive rewards, is close to those two weeks, and observations from real nodes mostly fall around the two-week mark.
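As a rough check on the two-week figure, consider a hypothetical new node whose partial score is 1 and whose version is valid for each of its first $n$ hours. Assuming the hour differences run from 1 to $n$, the geometric series has a closed form:

$$\sum_{x=1}^{n} (1 - p) \cdot p^{x - 1} = 1 - p^{n}$$

so the score of such a node after $n$ hours would be:

$$\mathrm{score}(n) = \mathrm{tuning} \cdot \left( m_1 \cdot (1 - p_1^{n}) + m_2 \cdot (1 - p_2^{n}) \right)$$

Under these assumptions, the theoretical minimum period is the smallest $n$ for which this expression reaches 80%. Its exact value depends on the parameters, which are not given here.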

Availability

The new scores are now published for community feedback on the channel aleph-scoring, with the type aleph-scoring-scores-beta, by the address 0x4Ec8b55e73F5f32118a90B8FD555706bD5dd42e7.

https://explorer.aleph.im/address/ETH/0x4Ec8b55e73F5f32118a90B8FD555706bD5dd42e7
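The published scores can presumably also be retrieved programmatically. The sketch below builds a query URL from the channel, type and address given above. It assumes that the scores are published as POST messages and that the API server exposes the `/api/v0/posts.json` endpoint with `types`, `channels`, `addresses` and `pagination` parameters. The host `api2.aleph.im` and the function names are likewise assumptions, so check them against the API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a public API host; any Core Channel Node API should serve the same data.
API_SERVER = "https://api2.aleph.im"
SCORING_ADDRESS = "0x4Ec8b55e73F5f32118a90B8FD555706bD5dd42e7"


def beta_scores_url(api_server: str = API_SERVER, count: int = 1) -> str:
    """Build the query URL for the most recent beta score messages."""
    query = urlencode({
        "types": "aleph-scoring-scores-beta",  # message type named above
        "channels": "aleph-scoring",           # channel named above
        "addresses": SCORING_ADDRESS,          # publisher address named above
        "pagination": count,                   # assumed page-size parameter
    })
    return f"{api_server}/api/v0/posts.json?{query}"


def fetch_beta_scores(count: int = 1) -> dict:
    """Download and decode the response; its layout is not assumed here."""
    with urlopen(beta_scores_url(count=count), timeout=30) as response:
        return json.load(response)
```

Calling `beta_scores_url()` only builds the URL, while `fetch_beta_scores()` performs the network request and returns the decoded JSON for inspection.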

Account page copy

A preview of the account page with the new scores is available here:

https://beta.account.aleph.im/

or using the archive:
https://bafybeievifbf7jzmopngeirnyfnn7yf55riawqhrz52i4kxttuwxbepruu.ipfs.aleph.sh/

HTTP API

We are developing an HTTP API for Core Channel Nodes to provide quick score estimations, eliminating the need to wait for published updates.

Next steps and deployment

We invite feedback from users and node operators over the next ten days. Based on input received, we plan to deploy this new scoring system across the network starting December 17, 2024.