12-10-2021 | By Robin Mitchell
The recent outage of Facebook's services, including Messenger and WhatsApp, disrupted messaging worldwide. It is also an excellent example of why the unification of platforms and the standardisation of protocols can be problematic. What challenges do unification and standards present, why does the Facebook outage demonstrate them, and how will engineers of the future account for these challenges?
Standardisation is a practice that dates back to antiquity; the Romans issued their soldiers with the same equipment and weapons, while ruling powers have long overseen the use of standardised weights across markets to ensure fairness. In the engineering world, the standardisation of components, tools, and equipment has enabled engineers to develop projects faster while allowing easy repairs via alternative parts.
Standardisation became particularly important in the computing world during the computing boom of the 1980s. High demand for computers gave many hundreds of companies the opportunity to make their own machines, but this resulted in a market with no standardised hardware or software. As such, computers from different manufacturers could not communicate easily or even read each other's files. The introduction of the IBM PC and the ISA bus helped create a platform where computers made by competing manufacturers could run the same software and read the same files.
However, standardisation and unification are not all rainbows and flowers, as any system designed to operate identically with interchangeable parts is also highly vulnerable to mistakes. For example, most PCs on the market use the Windows operating system. While this provides tools for developers and a huge market to target, any bugs in the operating system can make all users vulnerable. In fact, the widespread use of Windows has made it a prime target for hackers; if one can be hacked, then almost all can be hacked using the same method.
Recently, Facebook suffered a major outage that lasted over 6 hours, leaving billions around the planet unable to use Facebook and its other services. When one site goes down, other services are rarely affected, but in the case of Facebook, the other services it offers, including WhatsApp and Messenger, also stopped functioning.
The specific reason behind the blackout has not been described in great detail, but what is known is that the backend services were given an incorrect configuration because a tool that checks for bad commands failed to catch it. What made matters worse for engineers is that the tool that checks for bad commands also helps with debugging, so its failure hampered the diagnosis as well.
Losing access to Facebook for a few hours is inconvenient for many, but having multiple messaging platforms fail at once is potentially serious. The unification of Messenger and WhatsApp onto a single platform created by Facebook means that any failure on Facebook can cascade into these other services, and between them WhatsApp and Messenger are used by billions of people.
While unifying all of its services can be beneficial for Facebook's engineers when making improvements and servicing, any underlying bugs or errors can propagate to every service reliant on the unified system. Furthermore, combining multiple services that provide the same functionality onto a single platform causes havoc when that platform fails, because the services that would otherwise act as alternatives become unavailable at the same time.
Protecting services and systems from such blackouts is no small feat when the service in question is used by billions around the world. The very reason Facebook moved all of these services onto a single platform was so that repairs and changes across all services could be done with ease and speed. Fragmenting the various services onto their own data centres and platforms would require more engineers, hardware, and investment.
However, fragmentation is often the only way to create systems that cannot interfere with each other. In the case of Facebook, having an identical data centre on standby whose sole purpose is to take over from the primary servers is one potential solution. This would, however, require an entire data centre that spends most of its time unused, which would be extraordinarily expensive to operate.
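The standby idea can be illustrated with a very small sketch. This is not how Facebook routes traffic; the site names, the `pick_active` function, and the dictionary-based health check are all hypothetical stand-ins used only to show the principle of falling back to a second site when the first one fails.

```python
from typing import Callable, List, Optional


def pick_active(sites: List[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first site, in priority order, that passes its health check."""
    for site in sites:
        if is_up(site):
            return site
    return None  # every site is down


# Hypothetical health status: the primary has failed, the standby is healthy.
status = {"primary": False, "standby": True}

active = pick_active(["primary", "standby"], lambda site: status[site])
print(active)
```

A standby only helps if it does not share the fault. If the same incorrect configuration were pushed to both sites, both health checks would fail and `pick_active` would return `None`.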
Another potential solution is for a miniature version of Facebook's services to be used as a testing ground before any changes are made. Assuming that all staff at Facebook use Facebook services, this miniature data centre could be responsible for handling all staff, and any change could be applied to this platform first. As such, only staff would be affected by errors, which helps prevent catastrophic changes from being made to the main data centres.
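A staff-first rollout of this kind might look something like the following minimal sketch. The `Environment` class, the `staged_rollout` function, the `service_enabled` key, and the health check are all assumptions made for illustration and do not describe Facebook's actual tooling. The change is applied to one environment at a time, and the rollout stops and reverts as soon as an environment reports itself unhealthy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Environment:
    """A hypothetical deployment target holding one active configuration."""
    name: str
    config: Dict[str, str] = field(default_factory=dict)


def staged_rollout(
    new_config: Dict[str, str],
    stages: List[Environment],
    is_healthy: Callable[[Environment], bool],
) -> List[str]:
    """Apply new_config stage by stage, stopping at the first unhealthy stage.

    Returns the names of the stages that kept the new configuration.
    """
    applied: List[str] = []
    for env in stages:
        previous = dict(env.config)
        env.config = dict(new_config)
        if not is_healthy(env):
            env.config = previous  # revert the failing stage and stop
            break
        applied.append(env.name)
    return applied


# Hypothetical health check: the environment must still report itself enabled.
def still_serving(env: Environment) -> bool:
    return env.config.get("service_enabled") == "yes"


staff = Environment("staff-only", {"service_enabled": "yes"})
production = Environment("production", {"service_enabled": "yes"})

bad_change = {"service_enabled": "no"}
result = staged_rollout(bad_change, [staff, production], still_serving)
print(result, production.config)
```

Because the staff environment comes first in the list, the bad change is caught and reverted there, and the production configuration is never touched.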
Overall, engineers need to carefully consider how they construct platforms used by multiple services, as having a single platform that underpins everything can go horribly wrong very quickly. The standardisation of services and APIs can be beneficial when performing maintenance, but fragmenting them, with various versions of APIs that each have their own methods, can help to protect against bugs that could otherwise knock out all services. In the case of Facebook, having all of its messaging services run on the same platform is a mistake that should not have been allowed, for the simple reason that many customers may use both services and only these services.