2021-07-14 19:50:56
Dear all,
The last major upgrade, v8.0 (https://github.com/Zilliqa/Zilliqa/releases/tag/v8.0.0), was feature-packed. It reduced the block time, revised the consensus protocol, introduced remote state reads for Scilla, adjusted priorities for new miners, and included several other core optimizations and bug fixes. However, since that upgrade, the network has at times become unstable, requiring the core team to intervene on several occasions to introduce patches and resulting in unwanted network downtime.
We understand the frustration and inconvenience that the upgrade has caused to token holders, exchanges, wallet providers, miners and other relevant parties. The intention of this post is to share details around this topic and our general development and testing process, in the interest of transparency, as well as the steps we plan to take to reduce the chances of such events going forward.
As you already know, the underlying tech at Zilliqa is cutting-edge, and we are constantly innovating and adding new features to the protocol. Our approach to introducing these new features follows established industry standards of writing unit tests and running correctness and stress tests. Once the changes and unit tests are fully ready, we run the new changes on a small-scale private network for a period of time, followed by a large-scale integration test at mainnet scale, and then deploy the changes on a public testnet open for all to interact with. If a bug is found during testing on any of these networks, the bug is fixed and we start again from the first step: writing a unit test that captures the bug, deploying the fix on a private network, and so on.
Many of these cutting-edge innovations, however, present a key challenge: until the changes go live on the production mainnet, it is impossible to be certain that they are free of bugs. While we discourage testing anything in production, the reality is that the ultimate test is always in production.
We would also like to highlight that some issues lie beyond our codebase and our control, as there are dependencies on external systems, libraries and infrastructure. The most recent memory issue, in which the allocator held on to freed memory instead of returning it to the OS, is a good example of this (https://www.algolia.com/blog/engineering/when-allocators-are-hoarding-your-precious-memory/). In many such cases, we are forced to push a fix upstream to well-known open-source packages, or to apply a workaround if no upstream fix is viable.
Guaranteeing that a given piece of software is free of bugs is close to impossible, but we strive to get as close to that as we can. Despite these challenges, we are committed to constantly innovating and adding new features to the chain with minimal disruption and bugs.
To this end, and given the recent downtime, we are taking our testing process a step further by using formal verification tools to model the system and formally verify the generated model and, wherever possible, the implementation. Even though formal verification requires extensive engineering hours, it is extremely helpful in verifying the correctness, reliability and dependability of mission-critical software systems. The goal is to combine unit testing and formal verification, which complement each other in detecting design or implementation issues in critical parts of the codebase.

To mitigate the issues further, we are planning to reduce the frequency of network upgrades that contain new features, giving the network more time to exercise each feature in the wild. We will also reduce the number of features in a given release to minimize the amount of new code that gets exposed.
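To give a sense of what formal verification adds over testing alone, here is a toy example in Lean (purely illustrative, not our actual tooling or codebase): a machine-checked proof that a property holds for every possible input, something no finite set of unit tests can establish.

```lean
-- Unit tests can check reverse (reverse [1, 2, 3]) = [1, 2, 3] for a
-- handful of lists; a proof covers all lists at once.
theorem reverse_involutive (xs : List Nat) :
    xs.reverse.reverse = xs := by
  simp
```

Applied to consensus or state-transition logic, the same idea lets invariants be checked exhaustively at design time rather than sampled at test time.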
Amrit Kummer