Nerd alert: TLA+ is helping mold and debug the cloud with math

A new project from the Linux Foundation is looking to let more people in on the best kept secret in the cloud world: TLA+. And already the initiative – known as the TLA+ Foundation – has the backing of major cloud players Amazon Web Services (AWS), Microsoft and Oracle. Though not listed as a founding member, Google was also mentioned in the TLA+ Foundation launch release, as was NVIDIA. 

Before we continue, we have to issue a nerd alert. Things are about to get dense, but we’ll break it down as best we can.  

What is TLA+?

In the simplest possible terms, TLA+ is a mathematics-based programming language and toolset which is commonly used to model, test and debug complex systems. TLA+ is built to meet the needs of distributed systems for which other testing tools are inadequate. It can also help reduce the amount of code required for a given system through mathematical optimization. 

Both of those things are great news for cloud providers with already sprawling and rapidly growing distributed storage and compute systems. 

TLA+ was developed by computer scientist and mathematician Leslie Lamport and was first published in the 1990s. Lamport is now a distinguished scientist at Microsoft Research and absolutely looks the part. 

Now, though, it will be up to the Linux Foundation to shepherd TLA+ going forward under the guise of the TLA+ Foundation. In addition to promoting adoption of the language, the new Foundation will also provide educational and training resources, fund research and ensure the continuous development of TLA+ via language enhancements and the incorporation of user feedback.  

Singing TLA+ praises

There’s a reason why AWS, Microsoft and Oracle have all signed on as founding members for the new project. While it may not be as well-known a language as Java, Python or C++, behind the scenes TLA+ has actually played a key role in the cloud’s development to date, in large part because of its utility when it comes to streamlining and debugging distributed systems. 

AWS has used TLA+ since 2011 to help build the distributed systems it uses to store and process data. According to a paper AWS published in 2014, a key benefit of TLA+ is its ability to help engineers sort through very large, complex issues. That includes verifying interactions between algorithms across distributed systems.

“TLA+ has added significant value, either preventing subtle serious bugs from reaching production, or giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness,” the paper’s author, former AWS Principal Engineer Chris Newcombe wrote at the time. 

Tim Rath, who is now senior principal technologist at AWS, detailed the company’s use of TLA+ for testing its distributed systems at QCon in 2015. You can check that out here, with relevant bits starting around 36 minutes in. 

In 2021, AWS added Automated Reasoning to its list of research priorities, looking to apply this to its Simple Storage Service (S3). Formal specifications and model checking – such as that enabled by TLA+ – play a key role in automated reasoning.  

Byron Cook, VP and distinguished scientist at AWS, said in a statement this week that by joining the TLA+ Foundation, the company aims to “support the advancement of TLA+ tooling and further improve the state of the art in distributed systems design.” 

Interestingly, shortly after the 2014 paper was published, Newcombe left AWS for Oracle, which has also sung TLA+’s praises

In a blog published last year, Oracle noted it uses TLA+ to “help catch subtle bugs” in its systems before they cause damage in public deployments. With TLA+’s help, Oracle can “ensure that our distributed system designs are correct under every possible thread interleaving, network fault and power outage scenario.” At the time, it said TLA+ had been applied to strengthen more than 20 of its systems, including its Object Storage service. 

Microsoft, too, has long been using TLA+, according to Lamport’s website about TLA+. This stated with occasional applications as early as 2004, with its use becoming more regular around 2015. A video from 2018 discusses how TLA+ can be applied to Microsoft’s Cosmos DB service.  

Citing Microsoft’s Principal Engineering Manager Cheng Huang, Lamport’s website notes the tech giant doesn’t require knowledge of TLA+ for job candidates – at least not yet – but it does weigh in their favor. 


Want to learn more about data strategies for the cloud? Register for our Cloud Data Center Strategies virtual event here.