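Questions before reading

When Dubbo services call each other, suppose one provider instance runs on a badly performing machine (long GC pauses, a full disk, and so on). How can we lower the probability that consumers call that provider?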

I. Background

Among Dubbo's load-balancing strategies, the weight-based one points to an approach: adjust the weight of each provider.

provider	weight
A	10
B	20
C	20
D	30

+-----------------------------------------------------------------------------------+
|          |                    |                    |                              |
+-----------------------------------------------------------------------------------+
0          10                   30                   50                             80

|-----A----|---------B----------|----------C---------|---------------D--------------|


---------------------15

-------------------------------------------37

-----------------------------------------------------------54

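The figure above has four regions whose lengths are the weights of A, B, C and D. Use random.nextInt(10 + 20 + 20 + 30) to pick one of 80 numbers (0 to 79) at random, then determine which region that number falls in. For example, if the random number is 37, it falls in region C, so Invoker C is selected; 15 falls in region B, and 54 falls in region D.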
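The selection described above can be sketched in a few lines of Java. This is a minimal illustration, not Dubbo's actual implementation; the class name WeightedRandomPicker and its methods are hypothetical and only reproduce the A/B/C/D example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper illustrating weight-based random selection.
class WeightedRandomPicker {

    // Maps an offset in [0, totalWeight) to the provider whose region contains it.
    static String pickByOffset(Map<String, Integer> weights, int offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must not be negative");
        }
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            // Subtract each region's length until the offset drops below zero.
            offset -= entry.getValue();
            if (offset < 0) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("offset is outside the total weight");
    }

    // Draws a random offset over the sum of all weights and resolves it to a provider.
    static String pick(Map<String, Integer> weights, Random random) {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        return pickByOffset(weights, random.nextInt(total));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so regions are laid out as A, B, C, D.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("A", 10);
        weights.put("B", 20);
        weights.put("C", 20);
        weights.put("D", 30);

        System.out.println(pickByOffset(weights, 15)); // B
        System.out.println(pickByOffset(weights, 37)); // C
        System.out.println(pickByOffset(weights, 54)); // D
        System.out.println(pick(weights, new Random())); // varies per run
    }
}
```

Lowering a provider's weight shrinks its region, so it is chosen less often without being removed from rotation entirely.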

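II. Outline of the analysis

(1) How do we recognize that a provider is performing poorly?
(2) For a poorly performing service, at what point in time do we consider it recovered?
(3) How do we assign weights to providers dynamically?
(4) How do we extend Dubbo's load-balancing strategy?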

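III. Detailed analysis

1. Identifying poor performance

a. Service call timeout
If a consumer's call to a provider exceeds the configured timeout, an error like the following is usually thrown:

org.apache.dubbo.rpc.RpcException: Invoke remote method timeout. method: <methodName>, provider: <providerAddress>, cause: Waiting server-side response timeout by <timeout>ms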

b. Connection timeout
If a consumer times out while trying to connect to a provider, an error like the following may be thrown:

org.apache.dubbo.rpc.RpcException: Failed to invoke the method <methodName> in the service <serviceName>. Tried <number> times of the providers [<providerAddress>] (1/1) from the registry <registryAddress> on the consumer <consumerAddress> using the dubbo version <version>. Last error is: Invoke remote method timeout. method: <methodName>, provider: <providerAddress>, cause: Waiting server-side response timeout by <timeout>ms

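c. Registry timeout
If a consumer times out while fetching provider information from the registry, an error like the following may be thrown: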

org.apache.dubbo.rpc.RpcException: Failed to invoke the method <methodName> in the service <serviceName>. No provider available for the service <serviceName> from the registry <registryAddress> on the consumer <consumerAddress> using the dubbo version <version>. Please check if the providers have been started and registered.

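In summary: failures when calling a provider, whether they come from connecting to it or from slow responses, generally all surface as an RpcException.
So recording, for each provider machine, the number of successful and failed responses over a recent period of time is the most important yardstick for judging whether its performance is good or bad.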
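As a rough sketch of that bookkeeping, the class below counts successes and failures per provider address. The name ProviderCallStats and its methods are hypothetical, and a generic Callable stands in for the real remote call; in a Dubbo consumer the counting would presumably sit where the RpcException is caught.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-provider success/failure counter.
class ProviderCallStats {

    static final class Counters {
        final LongAdder success = new LongAdder();
        final LongAdder failure = new LongAdder();
    }

    private final Map<String, Counters> byProvider = new ConcurrentHashMap<>();

    // Runs the call and records whether it succeeded or threw.
    <T> T record(String providerAddress, Callable<T> call) throws Exception {
        Counters counters = byProvider.computeIfAbsent(providerAddress, key -> new Counters());
        try {
            T result = call.call();
            counters.success.increment();
            return result;
        } catch (Exception e) {
            // Assumption: any exception counts as a failed call (an RpcException in Dubbo).
            counters.failure.increment();
            throw e;
        }
    }

    // Fraction of recorded calls that failed; 0.0 when nothing has been recorded.
    double errorRate(String providerAddress) {
        Counters counters = byProvider.get(providerAddress);
        if (counters == null) {
            return 0.0;
        }
        long failed = counters.failure.sum();
        long total = counters.success.sum() + failed;
        return total == 0 ? 0.0 : (double) failed / total;
    }

    public static void main(String[] args) {
        ProviderCallStats stats = new ProviderCallStats();
        String provider = "10.0.0.1:20880"; // made-up address for illustration
        try {
            for (int i = 0; i < 3; i++) {
                stats.record(provider, () -> "ok");
            }
            stats.record(provider, () -> {
                throw new RuntimeException("simulated timeout");
            });
        } catch (Exception e) {
            // The simulated failure is expected here.
        }
        System.out.println(stats.errorRate(provider)); // 0.25
    }
}
```

These counters never expire, so on their own they cannot show recovery; that is what the time windows in the next part are for.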

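2. When is the provider considered recovered?

a. Business analysis

As time passes, if the machine really no longer has a performance problem, the provider will show the following behavior: the closer a request window is to the current time, the smaller the share of errors in its total requests.

b. Deciding recovery by time window

Because this is a sliding window, the windows can be reused:
(1) Suppose the pointer is currently on the second window and that window contains no failed requests.
(2) Then check the previous window; if the previous window has no requests either, wrap around to the last window and evaluate its error rate.
(3) Counting from the current window, right to left, the windows carry the weights 1, 4, 16 and 64.
(4) Whether the provider counts as recovered at the current window depends on the ratio: (errors in the current window + errors in the historical windows) / total number of historical requests.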

|                |                 |                |               |
+-------------------------------------------------------------------+ Sliding Windows
1                15                30               45             60
                        ^
                        |
         ^           current 1 (100)
     weight:4                               ^ 
         |                                weight:64             
         |                                  |                ^
         |                                  ------------- weight:16
         |                                                  |
         |---------------------------------------------------
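One possible reading of the window scheme above is sketched below: a fixed ring of buckets that is reused as time advances, with each bucket's counts multiplied by a weight that depends on its distance from the current bucket. The class WindowedErrorRate is hypothetical. Applying the weights to both errors and totals is an assumption, since the text does not spell out the formula, and rotating the buckets on a timer is left out.

```java
// Hypothetical ring of reusable windows with age-dependent weights (not thread-safe).
class WindowedErrorRate {

    private final long[] total;
    private final long[] errors;
    private final int[] ageWeights; // index 0 = current window, 1 = previous window, ...
    private int current = 0;

    WindowedErrorRate(int[] ageWeights) {
        this.ageWeights = ageWeights.clone();
        this.total = new long[ageWeights.length];
        this.errors = new long[ageWeights.length];
    }

    // Records one call outcome in the current window.
    void record(boolean success) {
        total[current]++;
        if (!success) {
            errors[current]++;
        }
    }

    // Moves to the next window, clearing the oldest bucket so it can be reused.
    void advance() {
        current = (current + 1) % total.length;
        total[current] = 0;
        errors[current] = 0;
    }

    // Weighted errors divided by weighted requests over all windows; 0.0 if there were none.
    double weightedErrorRate() {
        double weightedErrors = 0;
        double weightedTotal = 0;
        for (int age = 0; age < total.length; age++) {
            // Walk backwards from the current window, wrapping around the ring.
            int index = Math.floorMod(current - age, total.length);
            weightedErrors += (double) ageWeights[age] * errors[index];
            weightedTotal += (double) ageWeights[age] * total[index];
        }
        return weightedTotal == 0 ? 0.0 : weightedErrors / weightedTotal;
    }

    public static void main(String[] args) {
        WindowedErrorRate windows = new WindowedErrorRate(new int[] {1, 4, 16, 64});
        for (int i = 0; i < 10; i++) {
            windows.record(i >= 5); // first window: 5 failures out of 10 calls
        }
        windows.advance();
        for (int i = 0; i < 10; i++) {
            windows.record(true); // second window: 10 clean calls
        }
        // (1*0 + 4*5) / (1*10 + 4*10) = 20 / 50
        System.out.println(windows.weightedErrorRate()); // 0.4
    }
}
```

A provider could then be treated as recovered once this rate falls below some threshold; the threshold itself is a tuning choice that the text does not fix.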

3. Assigning weights dynamically

a. Dubbo's default weight is 100.
b. While a provider is warming up, its weight also keeps increasing as time passes:

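// uptime: time since the provider started; warmup: length of the warm-up period
// (both in the same unit); weight: the provider's configured weight.
private int calculateWarmupWeight(int uptime, int warmup, int weight) {
    // Scale the weight linearly with elapsed warm-up time: weight * uptime / warmup.
    int ww = (int) ( uptime / ((float) warmup / weight));
    // Clamp to [1, weight]; e.g. 2 minutes into a 10-minute warm-up with weight 100 gives 20.
    return ww < 1 ? 1 : (Math.min(ww, weight));
}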

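c. If the current provider is judged to be performing poorly, its weight is lowered to X (this is the key part to implement).
d. Finally, sum the weights of all providers and pick one machine following the weight-based load-balancing approach.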
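The text leaves X open. As one possible choice, the sketch below keeps the configured weight while the error rate stays at or below a threshold, and otherwise scales it by (1 - errorRate) with a floor of 1, so that a degraded provider still receives a trickle of traffic and can prove it has recovered. The class DegradedWeight, the method effectiveWeight and the threshold value are all hypothetical.

```java
// Hypothetical rule for lowering a provider's weight based on its error rate.
class DegradedWeight {

    static int effectiveWeight(int configuredWeight, double errorRate, double threshold) {
        // Clamp the rate to [0, 1] to guard against bad input.
        double rate = Math.min(1.0, Math.max(0.0, errorRate));
        if (rate <= threshold) {
            return configuredWeight; // healthy: keep the configured weight
        }
        // Degraded: shrink the weight in proportion to the share of successful calls.
        int scaled = (int) Math.round(configuredWeight * (1.0 - rate));
        // Never drop below 1, so the provider can still be probed.
        return Math.max(1, scaled);
    }

    public static void main(String[] args) {
        double threshold = 0.2; // assumed tuning value
        System.out.println(effectiveWeight(100, 0.0, threshold)); // 100
        System.out.println(effectiveWeight(100, 0.5, threshold)); // 50
        System.out.println(effectiveWeight(100, 1.0, threshold)); // 1
    }
}
```

The resulting weights feed directly into the weight-based selection from part I: a provider with weight 1 among healthy peers at 100 is picked only rarely.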