Earlier this year, the single greatest site reliability engineering (SRE) lesson unfolded out in space. Last week we saw the very first, better-than-even-expected images from the James Webb Space Telescope, or JWST.
After ten years of design and build on a $9 billion budget, this was an effort in testing 344 single points of failure — all before deploying to production, with the distributed system a million miles and one month away.
Needless to say, there are a lot of reliability lessons to be learned from this endeavor. At his WTF is SRE talk last month, Robert Barron brought his perspective as an IBM SRE architect and a hobby space photographer to uncover the patterns of reliability that enabled this feat, and how NASA was able to trust its automation so much that it would release something with no hope of fixing it. It’s a real journey into observability at scale.
Universe-Scale Functional and Nonfunctional Requirements
“It’s a great platform for demonstrating site reliability engineering concepts because this is reliability to the extreme,” Barron said of the James Webb Space Telescope. “If something goes wrong, if it’s not reliable, then it doesn’t work. We can’t just deploy it again. It’s not something logical, it’s something physical that has to work properly and I think there are a lot of lessons and a lot of inspiration that we can take from this work into our day-to-day lives.”
After 30 years of amazing photos from the Hubble Telescope, there was a demand for new business and technical capabilities, including the ability to see through and past clouds as they are created.
Computer, enhance! Compare the same target — seen by Spitzer & in Webb’s calibration images. Spitzer, NASA’s first infrared Great Observatory, led the way for Webb’s larger primary mirror & improved detectors to see the infrared sky with even more clarity: https://t.co/dIqEpp8hVi pic.twitter.com/g941Ug2rJ8
—NASA Webb Telescope (@NASAWebb) May 9, 2022
When designing the Webb telescope, the design engineers kicked off with the functional requirements, which in turn drove a lot of non-functional requirements. For instance, it needed to be much more powerful and larger than Hubble, but to achieve that it needed a significantly larger mirror. However, an operational constraint arose that the mirror is so large that it doesn’t fit into any rocket, so it needed to be broken up into pieces. The non-functional requirement became to create a foldable mirror. A solution arose to break the mirror up into smaller hexagons, which can be aligned together to form a honeycomb-shaped mirror.
The second non-functional requirement of the JWST was to go beyond Hubble by seeing not only visible light but also infrared light, which is radiated as heat. But, to be accurate, the mirror needs to be kept cold. “Not just colder, but we need to be able to control the temperatures exactly. Because any variation and we’re going to look at something and think, ‘Oh, this is a star. This is a galaxy.’ No, that’s just something there on Webb itself, which is slightly cooler or warmer than it should be,” Barron explained.
Unlike Hubble, which orbits the Earth, Webb cannot orbit the Earth because its temperature would then vary greatly between sun and shade. Plus, it needs to be much farther away from Earth than Hubble has ever gone. With this in mind, the controls and antennas face Earth and the telescope faces away, with the honeycomb set of mirrors reflecting into a second set of mirrors, which then sends the images back to the cameras located in the middle of the honeycomb mirrors. Behind it all is a massive set of sunshades that work to control the temperature of the telescope.
When Overhead Costs Soar
When NASA decided back in 1995 to make this next-generation space telescope, the agency assumed it’d cost about a billion dollars. In 2003, they started to design it, “and they realized that it’s not just scaling up Hubble, we need technological breakthroughs: the foldable mirrors, precise control of the temperature, the unfurling of the heat shields, and so on,” said Barron. Over the next four years of high-level design, they moved the budget to $3.5 billion and planned on another billion for a decade of operations.
Then between 2007 and 2021, NASA dove into the design, build and test phase of what was named the James Webb Space Telescope.
“Like good SREs we test and, because we have ten technological breakthroughs that we need to achieve, we have a lot of failures,” Barron said. “So we retest and fail, and retest and fail. And this takes a lot of time, and the project is nearly canceled many times. And eventually it costs $9.5 billion just to build it. And that $1 billion that we thought would be enough to operate for 10 years is only going to be enough to operate for five years.”
All things considered, the JWST was launched in December of last year, kicking off its operation, and what Barron referred to as “pirouetting and ballet moves” through space.
“You can see that over a period of 13 days that the telescope, like a butterfly, opens up, spreads its wings, and starts reporting home. And then starts going further away from Earth until it reaches the location where it will remain for the next decade,” he explained. This journey took a total of 30 days.
As of the WTF is SRE event at which Barron spoke at the end of April, the JWST was considered mid-deployment: “Before reaching production we’re doing the final tests before we can say that the system is working and can start giving actual scientific data.”
During this deployment phase, so many components and pieces were moving and changing that it exposed many single points of failure: 344, to be exact.
“Webb is famous for having over 300 single points of failure during this process of 30 days, each of which has to go perfectly, each of which, if it fails, means the entire telescope will not be able to function,” Barron explained.
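To get a feel for why hundreds of single points of failure are so daunting, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every step is independent and equally reliable. Neither assumption comes from NASA or from Barron's talk, and the per-step reliability figures are invented.

```python
def chance_all_steps_succeed(step_reliability: float, steps: int = 344) -> float:
    """Probability that every step succeeds.

    Assumes (hypothetically) that steps are independent and that each one
    succeeds with the same probability `step_reliability`.
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int = 344) -> float:
    """Per-step reliability needed for all steps to succeed with probability `target`."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # Illustrative per-step reliabilities, not real mission figures.
    for r in (0.99, 0.999, 0.9999):
        print(f"{r} per step -> {chance_all_steps_succeed(r):.1%} overall")
    print(f"needed per step for 95% overall: {required_step_reliability(0.95):.6f}")
```

Under these assumptions, even steps that each work 99% of the time would leave only about a 3% chance that all 344 go right, which is why each one had to be tested again and again.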
When those first exceptional photos came back, discovering new, fainter galaxies, was it luck or a feat of extreme site reliability engineering?
“How did NASA reach the point where they could send $10 billion worth of satellite out into space without being able to fix anything, without being able to reach out with an astronaut to say, ‘Oh, I need to move something, I need to restart something, I need to do something manual.’ How can the system be completely fully automated? And can I trust that no dragons will come from outer space and do something to the telescope which will cause it to fail?”
—Robert Barron @FlyingBarron
Redundancy. Repairability. Reliability.
You could say this is more than a leap of faith. The trust that NASA had in all this working properly, Barron believes, comes from its decades-long history of sending craft into space, which is grounded in the values of redundancy, repairability and reliability.
Both the Voyager spacecraft that went to Jupiter, Saturn, Uranus, and Neptune and the Mars Rover were actually sets of identical twin craft, in case one failed. Similarly, constellations of satellites work in tandem as fail-safes. This redundancy has long been embraced by NASA, but wasn’t an option with the JWST price tag.
When redundancy is out, NASA next reaches for repairability. The Hubble Telescope has been repaired and upgraded multiple times for both fixes and preventive maintenance. And, according to Barron, 50% of the astronaut time on the International Space Station is actually spent on maintenance.
“If the astronauts left the International Space Station, then, in a very short period of time, it would just break down and they’d be forced to send it back down into the atmosphere to burn up,” he explained.
But, again, the non-functional requirement of repairability was also not an option for the Webb Telescope because it is floating far beyond the current capability of astronauts.
So the next step toward reliability came from building the JWST with a component architecture.
Barron went through a brief history of the Space Race between the Soviet Union and the US from 1960 to 1988. He uncovered the pattern that redundancy didn’t actually matter much because the failure modes were shared by both craft each time, like an alloy that wasn’t durable enough or a launch during a sandstorm. He did note that the Soviet space program chose not to publish its mistakes, so it was less likely than NASA to learn from them.
“Redundancy is very good, but sometimes at a system level, it doesn’t solve a problem because the problem is much wider,” which Barron said happens to SREs as well. Kubernetes, for example, has componentization, redundancy and load balancing built in, but that doesn’t matter if the problem is with the DNS or an application bug. Often reliability demands more than simple redundancy.
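The point about shared failure modes can be made concrete with a small probability sketch. It assumes a deliberately simple model that is not from the talk: each replica fails on its own with probability `p_independent`, and a common-cause fault (a bad alloy, a DNS outage, an application bug) takes out every replica at once with probability `p_common`. All numbers are invented for illustration.

```python
def redundant_system_failure(p_independent: float, p_common: float, replicas: int = 2) -> float:
    """Probability that a redundant system is completely down.

    Hypothetical model: the system fails if the shared (common-cause) fault
    occurs, or if it does not occur but every replica fails independently.
    """
    for p in (p_independent, p_common):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    all_replicas_fail = p_independent ** replicas
    return p_common + (1.0 - p_common) * all_replicas_fail


if __name__ == "__main__":
    # Illustrative values only.
    no_shared_fault = redundant_system_failure(0.01, 0.0)
    with_shared_fault = redundant_system_failure(0.01, 0.001)
    print(f"twins, no shared fault:   {no_shared_fault:.6f}")
    print(f"twins, with shared fault: {with_shared_fault:.6f}")
    # Adding a third replica barely helps once the shared fault dominates.
    print(f"triplets, shared fault:   {redundant_system_failure(0.01, 0.001, 3):.6f}")
```

In this toy model, once a shared fault is even modestly likely it dominates the total, and adding more identical replicas changes the result very little.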
The monolithic Hubble was designed from the start with repairability and upgradeability in mind. With repairability out of the picture, there had to be a lot more testing on Webb than on Hubble, for each single point of failure. For example, each mirror was a smaller component that could be realigned remotely. He likened this to Kubernetes, where you want to allocate the right amount of CPU, memory and other resources to each and every microservice.
In fact, Webb saw some observability trade-offs because it could only allow for so many selfie cameras to observe its own condition because adding more could affect the temperature and alter its observations.
The Webb SRE Strategy
There’s no doubt that the James Webb Space Telescope SRE strategy has higher stakes than any enacted on Earth. It still makes for a fantastic example of how site reliability engineering and observability needs vary with circumstances, and of how sometimes chaos engineering can only be performed before a system goes into production.
Barron observed some of the JWST’s SRE strategy:
- Aim for 100% availability (no room for an error budget)
- Embrace new technologies for a new product
- Invest all efforts in one major deployment
- Maximize functional capacity by reducing monitoring and observability load
- Prioritize nonfunctional requirements, balancing with functional ones
- Create redundant systems, as far as possible
- Reduce technical debt and avoid problems detected in previous deployments
- Identify as many single points of failure as possible, then test for them again and again
- Balance observability requirements — cost, load, complexity — with benefits
- Always test and recognize how testing increases business value
The JWST experiment is also a good reminder that, with lower stakes than NASA, a much more frequent cadence of smaller deployments, and less than 100% uptime required, you can experiment more with redundancy, repairability and reliability to continuously improve your systems, ideally under significantly less pressure.
“As SREs, we don’t want to aim for 100% availability. We want the right amount of availability, and we don’t want to overspend — neither resources nor budget — in order to get there. We don’t want to embrace too many new technologies for new products,” Barron said. “A lot of the lessons from Webb are what not to do.”
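The contrast between Webb's "no room for an error budget" and the "right amount of availability" can be illustrated with a minimal sketch of how an error budget is commonly derived from an availability target. The function name, the 30-day window and the example targets are assumptions chosen for illustration, not figures from the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window.

    `slo` is the availability objective as a fraction (0.999 means 99.9%).
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window


if __name__ == "__main__":
    # Illustrative targets; 1.0 corresponds to Webb-style 100% availability.
    for slo in (0.99, 0.999, 0.9999, 1.0):
        print(f"{slo:.2%} over 30 days -> {error_budget_minutes(slo):.1f} minutes of budget")
```

A 100% target leaves zero minutes to spend on risky releases or experiments, which is the position Webb was in. A 99.9% target over 30 days leaves a little over 43 minutes.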
Disclosure: The author of this article was a host of the WTF is SRE conference.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Saturn.