Today, in many business models, product or service stability is an essential element and often the key factor in determining customer retention and company revenue.
On-call rotation is a widely used and proven practice, but it is also a major contributor to the erosion of work-life balance. Adopting it may seem like an unavoidable decision, but the details of how it is run deserve attention, because differences in team culture can leave a lasting impact on the people involved.
This series of articles will cover several related topics. I use them as indicators to evaluate the engineering culture of other teams during interviews, and I also see them as ways to drive progress within a team once I have more influence later in my career.
How Companies Treat On Call Rotation
Management’s stance on an issue is the starting point for everything. It determines how resources are allocated and how humane the system and processes are.
Software products cannot be perfect, and incidents may occur after release, but that does not mean putting engineers on call solves the problem once and for all. Turning a blind eye to the underlying problems once an on-call system is in place is the worst strategy.
However, these are abstract concepts, and it can be hard to make progress when raising them within an organization. Interviewers may also gloss over such sensitive issues. To make the discussion more concrete, I have outlined a few specific issues below for reference.
Supporting measures such as shift scheduling, SOPs, and incident classification
These supporting measures exist to make on-call rotations more efficient and humane. Early in my career I experienced how much difference their presence makes, and I have placed great importance on this aspect of the culture ever since.
For example, when a national holiday is approaching and no one is willing to be on call, it matters how shifts get assigned, whether there is a maintenance SOP document to follow, and whether incident classification makes clear which incidents require immediate attention. These factors have a significant impact on daily work.
These measures reveal whether a company is willing to invest resources to alleviate the pressure on frontline personnel and minimize the frequency of disrupting employees during non-working hours.
Trying to prevent the same problem from happening again and again
No one wants to be called back to work during non-working hours. However, if an incident does occur and disrupts someone’s personal life, efforts should be made to prevent it from happening again in the future.
While not all problems can be fundamentally solved, many can be mitigated through techniques such as automatic retry when encountering errors in network requests or implementing new technologies that allow for failover.
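As a rough illustration of the two mitigations mentioned above, here is a minimal Python sketch of automatic retry with exponential backoff, plus a simple failover across several backends. The function names (call_with_retry, call_with_failover), the default values, and the choice of which exceptions count as transient are all assumptions made for this example; a real system would adapt them to its own client library and error types.

```python
import time

# Assumption: these built-in exceptions stand in for "transient network errors".
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def call_with_retry(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: let the error surface to a human
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def call_with_failover(backends, **retry_options):
    """Try each backend callable in order, moving on when one keeps failing."""
    last_error = None
    for backend in backends:
        try:
            return call_with_retry(backend, **retry_options)
        except TRANSIENT_ERRORS as error:
            last_error = error  # this backend looks down; try the next one
    if last_error is None:
        raise ValueError("backends must not be empty")
    raise last_error
```

With something like this in place, a brief network blip is absorbed by the retries and an outage of one backend is absorbed by the failover, so only failures that survive both would need to page the person on call.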
If a large number of problems continue to occur without mitigation, it may indicate that On Call is being used as a long-term solution rather than a short-term emergency measure. This can lead to increased pressure on the team.
Willingness to compensate those affected
Compensation means recognizing the impact on an employee’s life caused by On Call responsibilities and being willing to communicate constructively. Having corresponding compensation rules is a positive signal, especially when compared to organizations that believe engineers should solely be responsible for the normal operation of the system.
The most common compensation methods are compensatory time off and overtime pay, each with its own advantages and disadvantages depending on the team and individual circumstances.
For instance, taking compensatory leave soon after a shift can be difficult when the workload is heavy, and overtime pay is not advantageous for employees with low base salaries. Relatively low overtime pay can also lead to situations where routine overtime exceeds the legal limit.
Clarifying ambiguities or exceptions to the rules can help to obtain more practical information, reduce information gaps, and align mutual expectations within the organization.
This time, we have discussed how organizational culture is reflected in On Call rotation through some specific questions. These can serve as a reference for evaluating the culture of a new team or promoting improvements in the current organization.
While many companies may not be able to provide a more humane On Call system due to limited resources or the nature of their product, readers are encouraged to think about how to have a constructive discussion after raising concerns. After all, there should always be someone who fights for the opportunity to bring about change.
The following articles will take a pragmatic approach to the daily life of software engineering teams. They will analyze the root causes of problems and discuss the harm they can cause to an organization.