This post is part of the series Microsoft Orleans - Problems & Solutions.
Problem

Upon startup, one of the silos would be marked as alive and the other would be marked as dead. This behavior would flip-flop, and the only consistent thing was that one silo was alive and one was dead.

FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table.
Configuration
  • Two silos hosted in Kubernetes.
  • No CPU limits in the Kubernetes deployment config file, so no aggressive throttling was being applied.
kind: Pod
metadata:
  ...
  labels:
    orleans/clusterId: my-cluster
    orleans/serviceId: my-service-1
    ...

--------------------------------------

kind: Pod
metadata:
  ...
  labels:
    orleans/clusterId: my-cluster
    orleans/serviceId: my-service-2
    ...
Solution

The solution lies in the Kubernetes deployment config file: the serviceId must have the same value across the deployment configs.

kind: Pod
metadata:
  ...
  labels:
    app: my-app
    orleans/clusterId: my-cluster
    orleans/serviceId: my-service
    ...

--------------------------------------

kind: Pod
metadata:
  ...
  labels:
    app: my-app
    orleans/clusterId: my-cluster
    orleans/serviceId: my-service
    ...
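
To catch this kind of mismatch before deploying, the labels can be checked programmatically. The following is a minimal sketch, not part of Orleans or Kubernetes tooling: the function name `find_service_id_mismatches` is hypothetical, and it assumes the pod labels have already been loaded into plain dictionaries (for example from the manifests above). It reports every clusterId whose pods disagree on serviceId.

```python
from collections import defaultdict

# Label keys as they appear in the pod manifests above.
CLUSTER_LABEL = "orleans/clusterId"
SERVICE_LABEL = "orleans/serviceId"


def find_service_id_mismatches(pod_labels):
    """Return {clusterId: sorted serviceIds} for clusters whose pods disagree on serviceId.

    Hypothetical helper for illustration; pod_labels is a list of label dicts.
    """
    services_by_cluster = defaultdict(set)
    for labels in pod_labels:
        services_by_cluster[labels[CLUSTER_LABEL]].add(labels[SERVICE_LABEL])
    return {
        cluster: sorted(services)
        for cluster, services in services_by_cluster.items()
        if len(services) > 1
    }


# Labels from the misconfigured pods in the Configuration section.
broken = [
    {CLUSTER_LABEL: "my-cluster", SERVICE_LABEL: "my-service-1"},
    {CLUSTER_LABEL: "my-cluster", SERVICE_LABEL: "my-service-2"},
]

# Labels from the corrected pods in the Solution section.
fixed = [
    {"app": "my-app", CLUSTER_LABEL: "my-cluster", SERVICE_LABEL: "my-service"},
    {"app": "my-app", CLUSTER_LABEL: "my-cluster", SERVICE_LABEL: "my-service"},
]

print(find_service_id_mismatches(broken))  # reports my-cluster with both serviceIds
print(find_service_id_mismatches(fixed))   # empty dict: nothing to report
```

An empty result means every pod in a given cluster shares one serviceId, which is the condition the fix above establishes.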
Explanation

Viewed hierarchically, a service is a superset of one or more clusters. A service is not bound to a process, physical machine, or even data center.

To keep the cluster healthy, the silos ping each other to see if they are alive. Whichever silo started first could not reach the other, because the config placed each silo in a different service. Because of this misconfiguration, the silos also ended up in different clusters, even though both clusters had the same name, my-cluster.

This led to the first silo suspecting the second one, and then proceeding to mark it as dead.

Before the fix: the silos could not reach each other, so one of them (whichever was started first) marked the other as dead.

After the fix: the silos are part of the same service and cluster, so they can reach each other.

A configuration with multiple clusters (think geo-availability) that still share the same service is also valid.
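
The service-to-cluster hierarchy described in this section can be pictured as a nested mapping. This is only an illustrative sketch: the function `group_silos` and the silo and cluster names (`silo-a`, `cluster-eu`, and so on) are hypothetical and do not come from Orleans itself.

```python
from collections import defaultdict


def group_silos(silos):
    """Build {serviceId: {clusterId: [silo names]}} from (name, serviceId, clusterId) tuples."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for name, service_id, cluster_id in silos:
        hierarchy[service_id][cluster_id].append(name)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {service: dict(clusters) for service, clusters in hierarchy.items()}


# Hypothetical silo and cluster names, for illustration only.
silos = [
    ("silo-a", "my-service", "cluster-eu"),
    ("silo-b", "my-service", "cluster-eu"),
    ("silo-c", "my-service", "cluster-us"),
]

layout = group_silos(silos)
print(layout)
```

Here one service spans two clusters, which matches the multi-cluster case; the misconfiguration from the Problem section would instead show up as two top-level services with one silo each.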

