What's the difference?
The problem in a nutshell
You have a large, detailed, complex blueprint describing the world as you want it to be. You have another equally hairy document, a description of the world as it is today. If you decide to reify your blueprint, to breathe life into it, what about the world is going to change?
Answering this question is critical, because someone has to review this blueprint for mistakes. You want to find those mistakes before they impact customers, not after. When your blueprint is large, no human can effectively scrutinize every line. You must narrow it down to what is changing.
Why I care
I've thought about this problem a lot. Since 2018 I've been on a Builder Tools team at Amazon that specializes in Infrastructure-As-Code (IAC). The product we own, "Live Pipeline Templates" (LPT), is in essence a scripting library that lets our customers describe what infrastructure they want, across every AWS region. If you've worked with AWS, you probably know CloudFormation. LPT is a lot like CloudFormation, with some key differences, most of which I'll gloss over. Critically, though, the two products serve different customer bases: CloudFormation serves external AWS customers, while LPT serves Amazonian engineers and talks to internal services (the ones that underlie and predate AWS). When AWS spins up a new region, a lot of internal services have to recreate their infrastructure there, and that infrastructure needs to be mostly the same as in other regions -- except, of course, in the ways it must be different.
This problem of identifying differences between two blueprints isn't confined to any one infrastructure framework. In fact it's a surprisingly general problem to face. The instructions to a 3-D printer are a blueprint. If you're churning out successive versions of a 3-D printed widget, comparing blueprints might yield insights. We also needn't limit ourselves to describing built things. A description of a complex process, like the steps to assemble a car (the layout of an assembly line, the order in which parts are assembled), could be written as a structured and hierarchical set of steps, and likewise subjected to a comparison between "what we did last year" and "what we're doing today".
"Blueprint", then, really means any sort of specification that can be broken down into logical parts. The details will vary a lot, but I'd expect the underlying principles to be the same.
When is it signal and when is it noise?
When you compare two blueprint versions and find no differences at all, that is a powerful insight. It tells you a lot.
When you compare two blueprint versions and the report contains two hundred differences, or two thousand, suddenly the report is useless. You cannot drink from the firehose.
When your comparison yields just a few key differences, perhaps seven or fewer, it is once again useful and actionable. You can comprehend the output you're reading. You can focus on identifying whether the differences you see are those you want and expect.
This is the paradox: A few differences are signal. Too many differences, and they become noise. The dividing line between these outcomes is far from clear. But you certainly develop a feel for it.
Data versus information
Let's suppose my blueprint is a recipe for ice cubes. One version of the recipe says "refrigerate to 32 degrees Fahrenheit". The other says "chill to zero degrees Celsius". Is there a difference? Sort of. You might offer different copies of this recipe to different audiences (Americans versus Europeans), but it won't affect the quality of the ice.
When you're writing code to compare the blueprints, your poor stupid code sees two strings, one containing "refrigerate to 32 degrees Fahrenheit" and the other containing "chill to zero degrees Celsius". This is the data, and it's different. Now you have to understand its meaning, to recognize that the difference is immaterial. Otherwise you'll generate noise. You must parse the data to turn it into information.
Canonica-what-now?
I've heard different terms used for this part of the process. Some call it standardization. I don't favor this term, because a "standard" often allows for multiple legal representations of the same state. Another term I dislike is sanitization. This is too easily confused with the process of removing sensitive information, such as passwords or credit card numbers.
Anyone familiar with comic books, or sci-fi shows like Star Trek, has likely heard the fan arguments about what is or isn't canon. If an event is canon ("Superman was born on Krypton"), it is not just widely accepted but authoritative. A character may have an alternate timeline in some book ("Superman was created in a laboratory"), but there is only one character timeline that is canonical.
The word I prefer is canonicalization. It means there is one acceptable format for the data, and you coerce everything to that format.
Temperatures can be expressed in Fahrenheit or in Celsius. They can also be expressed in Kelvin, but that's primarily for the physicists. When your audience is readers of recipes, either Fahrenheit or Celsius would be standard (they're both common and widely accepted). If you want to canonicalize your recipe, you have to pick one system of measurement or the other. When your two recipe versions have been coerced to say "0 degrees Celsius" and "0 degrees Celsius", even your poor stupid code can see they are the same. Properly canonicalized data will be identical if, and only if, it has the same meaning. It has graduated from data to information.
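Here's a minimal sketch of what that coercion might look like in code. The phrase patterns it recognizes, and the choice of Celsius as the canonical unit, are my own assumptions for illustration:

```python
# A minimal sketch: coerce free-form temperature phrases to one canonical form.
# The phrase patterns below are hypothetical; real recipe text would need a richer parser.
import re

def canonicalize_temperature(phrase: str) -> str:
    match = re.search(r"(-?\d+|zero)\s*degrees?\s*(fahrenheit|celsius)", phrase, re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized temperature: {phrase!r}")
    value = 0.0 if match.group(1).lower() == "zero" else float(match.group(1))
    if match.group(2).lower() == "fahrenheit":
        value = (value - 32) * 5 / 9          # coerce everything to Celsius
    return f"{value:g} degrees Celsius"

# Both versions of the recipe now canonicalize to "0 degrees Celsius".
assert canonicalize_temperature("refrigerate to 32 degrees Fahrenheit") == \
       canonicalize_temperature("chill to zero degrees Celsius")
```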
Who chooses the format?
Sometimes an authority emerges. Should you write a street address as "306 north Seventh Street"? The US Post Office has an opinion on that, and would kindly prefer you to write "306 N 7th St".
Email addresses are case insensitive. Whether you address a message to "JohnDoe@MyCompany.Com" or "johndoe@mycompany.com" doesn't matter: the message will still arrive. An easy way to canonicalize, then, is to lowercase the whole thing. There isn't a clear authority on this, though; you could just as easily uppercase (and I've seen some government websites do this). For the purposes of eliminating noise from the diff, either approach works -- and so would stupid ideas like uppercasing every third letter, "joHndOe@mYcoMpaNy.cOm". Any such scheme would satisfy our needs as long as we are consistent. Whatever we do to one version, we must do the same to the other. Two canonicalized email addresses must be identical if, and only if, they deliver to the same inbox. That is the email address's meaning, or essence.
Canonicalizing email addresses is of particular interest to the authors of websites whose customers log in with their email address. You don't want to force your customers to remember which letters they capitalized when they created their account. Lowercase that thing when you write it to the database, and every time you search for it.
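The code for this is almost embarrassingly small, which is rather the point; what matters is applying the same transform everywhere, on write and on lookup alike. A sketch:

```python
def canonicalize_email(address: str) -> str:
    # Two addresses that deliver to the same inbox must canonicalize to the same string.
    return address.lower()

assert canonicalize_email("JohnDoe@MyCompany.Com") == canonicalize_email("johndoe@mycompany.com")
```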
What's in a name?
For this endeavor, names are everything.
Email addresses have a wonderful property: They are globally unique. You don't have to worry that two users, in different countries for example, have the same email address.
Phone numbers are a bit trickier. Each country has its own numbering scheme. A phone number can't be considered globally unique unless you include the country code prefix, yet it's common to write, and store, phone numbers without this crucial piece of context.
Within AWS, some resources are truly global, like S3 bucket names. No two customers, anywhere in the world, are allowed to create S3 buckets with the same name. Others are particular to an AWS account: another customer may own a resource with the same name as yours (a role named "Administrator", say), and neither of you will know about the other. The AWS account is the resource's world, and the two worlds are kept separate. Still others are scoped to an AWS account and a region. You can have one instance "MyResource" in us-west-2, and another "MyResource" in us-east-1 that leads an independent existence.
When you're comparing two blueprints, it's essential that you avoid mixing these things up. The general principle here is to find a way to make the name globally unique. Treat it as if its name is "123456789012-us-west-2-MyResource", prepending the AWS account ID and the region. To get this right, you have to know the resource's scope. If you scope the ID too narrowly or too widely, you'll run into trouble.
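Here's a rough sketch of scope-aware naming. The scope categories and the dash-separated format are illustrative inventions on my part, not anyone's real scheme (the real answer for AWS resources is coming in a moment):

```python
from enum import Enum

class Scope(Enum):
    GLOBAL = "global"      # e.g. S3 bucket names
    ACCOUNT = "account"    # e.g. IAM role names
    REGIONAL = "regional"  # e.g. most resources

def canonical_name(name: str, scope: Scope, account: str, region: str) -> str:
    """Widen a resource name until it is globally unique."""
    if scope is Scope.GLOBAL:
        return name
    if scope is Scope.ACCOUNT:
        return f"{account}-{name}"
    return f"{account}-{region}-{name}"

# Two resources that merely share a short name stay distinct:
assert canonical_name("MyResource", Scope.REGIONAL, "123456789012", "us-west-2") != \
       canonical_name("MyResource", Scope.REGIONAL, "123456789012", "us-east-1")
```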
The string "123456789012-us-west-2-MyResource" starts to look a bit like an Amazon Resource Name (ARN), which is no accident. Of course this is why ARNs exist. You need not develop a naming scheme for AWS resources: you can use Amazon's. Within Amazon's internal services, many of which predate AWS, understanding whether each service's data spans two given regions or not (something my team calls service cardinality) has been one of the challenges. Without a standard practice to follow, internal service owners made a variety of different choices. Among other things, my team ended up developing an ARN-like naming scheme for internal resources, and spent a lot of time and effort to get it right.
The arrow of time
When you're comparing two copies of a blueprint, then, your first job is to ensure that every piece, every element of the complex nested structure of each blueprint, has a clear and distinct name. Having tackled that, you can line up the elements in one blueprint with the corresponding elements in the other.
Now there are three sorts of changes you can detect:
- Additions: An element is in the newer blueprint that was absent in the older one.
- Deletions: An element was present in the older blueprint that has no analogue in the newer one.
- Mutations: There's an element in both blueprints with the same name, but they're not identical.
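If you flatten each blueprint to a mapping from canonical name to element, the three categories fall out of a few set operations. A sketch (the dict-of-dicts representation is an assumption for illustration, not how any particular product models its templates):

```python
def diff_blueprints(old: dict, new: dict) -> dict:
    """Compare two blueprints, each a mapping of canonical name -> element."""
    old_names, new_names = set(old), set(new)
    return {
        "additions": sorted(new_names - old_names),
        "deletions": sorted(old_names - new_names),
        "mutations": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }
```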
The problem of renames
There is a fourth kind of change that the scheme above can't see directly. Consider these two pairs of records:
- Before: librarian@unseen-university.edu with species "human"
- After: librarian@unseen-university.edu with species "orangutan"
- Before: GalderWeatherwax@unseen-university.edu with role "archchancellor"
- After: MustrumRidcully@unseen-university.edu with role "archchancellor"
The first pair shares a name, so the comparison correctly reports a single mutation. In the second pair the identifier itself changed: the comparison reports one deletion and one addition, even though you and I can see it's really one element whose name changed. A name-based comparison cannot tell a rename apart from an unrelated deletion and addition.
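One heuristic you could layer on top (a sketch, not something I claim any product does): pair up deletions and additions whose remaining attributes match exactly, and surface them as possible renames for a human to confirm.

```python
def find_possible_renames(old: dict, new: dict, diff: dict) -> list:
    """Pair deletions with additions whose attributes match exactly.

    `old` and `new` map canonical names to attribute dicts; `diff` is the
    output of diff_blueprints() above. Ambiguous matches are left alone.
    """
    candidates = []
    unmatched_additions = set(diff["additions"])
    for deleted in diff["deletions"]:
        matches = [a for a in unmatched_additions if new[a] == old[deleted]]
        if len(matches) == 1:                 # only flag unambiguous pairs
            candidates.append((deleted, matches[0]))
            unmatched_additions.remove(matches[0])
    return candidates
```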
Dealing with bad data
Our product LPT has been around for a number of years, and some customer templates are almost as old as the product. Some of these have (shockingly) not been maintained, and no longer accurately describe resources as they exist. Some contain data that makes no sense.
Here's an exercise in categorization. Suppose we're considering four identities:
- A: The FBI Director of Kansas
- B: The FBI Director of Missouri
- C: The FBI Director of Discworld
- D: The FBI Director of Ankh-Morpork
Which of these are the same as the others?
The office of FBI Director makes sense within the United States and exists at the federal level; individual US states don't have FBI Directors. The FBI Director of a state is really the FBI Director of the country. That means 'A' and 'B' are (rather awkward, but unambiguous) ways of referring to the same person or role.
Discworld, the round, flat world on the back of four elephants, themselves carried by Great A'Tuin (Chelys galactica) as it swims through the inky blackness of space, is regrettably fictional and cannot be said to be part of the United States, so clearly 'C' does not refer to the same person or role as 'A' or 'B'. It refers to nothing real; it is bad data.
Ankh-Morpork is a fictional city on the fictional Discworld. Do 'C' and 'D' refer to the same person or role? Can you do the same widening conversion, since Ankh-Morpork is a part of Discworld?
This is tricky, but I've concluded the answer is no. It would be a mistaken attempt to assign meaning to bad data where there is none. If you encounter invalid identifiers, then, don't try to interpret them well enough to match them up. If possible, you should reject such bad data. However, if you have no choice but to make a best effort to process a blueprint containing bad data, you should treat any illegally-identified object as an island unto itself.
Read/write parity: The curse of the round-trip
Nothing built from a blueprint will ever look exactly like the blueprint. You know what they say: "In theory, practice and theory are the same. In practice, they are not."
Let's return to the world of Infrastructure-As-Code. When your blueprint calls for creating a resource in a service, you package up the relevant values into an API call to a write operation, and send them on their way across the Internet. The service receives them, and hopefully stores them faithfully in its database. Later on, you may want to know the differences between what's in the blueprint and what's actually stored in the service. In short, you want to compare the blueprint with reality. So you make a corresponding read call to the same API. Then you discover that services often make your seemingly straightforward task more challenging than it should be.
- Servers may trim, standardize, or canonicalize your input in ways you were unaware of, so that the value you read is not the value you wrote.
- Servers may accept optional fields, and supply default values for those you leave unspecified, then present the defaults to you such that they look like unsolicited additions.
- Servers may invent new fields, especially new identifiers (often monotonically increasing numbers or randomized UUIDs), that you aren't allowed to specify in your request to create the resource, and whose value you cannot pre-calculate. These need to be backfilled into the blueprint after creation, or they too will appear to be additions.
- Servers may accept a value on write and do something with it, but refuse to give it back to you on a subsequent read. It looks like a deletion.
- Servers may use a different form or structure of the data when creating it than when retrieving it.
All of this falls under the umbrella of read/write parity. Service designers may have put little thought or care into designing their API for parity. As long as the service fulfills its function correctly, what does the interface matter? Right? (Here, I have some aspirin for you.)
How to cope with parity issues
There are many ways service designers can make this easier on their customers who are seeking to do before/after comparisons:
- Think carefully about the right ways to canonicalize fields, and publish a description of how to do that canonicalization along with the other API documentation.
- Devise a way of identifying resources that can be pre-calculated by callers.
- Unless the default value is extremely obvious, just don't use optional fields. Make every field mandatory! Give customers an example call to copy/paste, and then tweak. If that's too onerous, then publish the defaults as another part of the documentation.
- Whatever you do, don't change the default value of a field over the lifetime of an API operation. Instead, create a new operation in the API that uses the new default, and encourage callers to migrate. Otherwise, you are changing the behavior for callers who have made no changes to their inputs. As well-intentioned as such a change may be, it is liable to surprise some portion of your customers.
- Strive to return in read operations everything you were given in write operations, and in the same form. If that seems difficult to do, perhaps it indicates an issue with the design of your API or of your service.
As the author of the code doing the comparison, when you encounter these obstacles, you're forced to learn many of the service's implementation details so you can replicate them. When the blueprint leaves an optional field blank and the service is known to apply a default value, you must know what that default is, and apply it to the blueprint so that they match. When the read call does not return a value for a field at all, you must (temporarily) remove the corresponding field from the blueprint. And so on.
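In practice, the comparison code accumulates a layer of service-specific adjustments. A hedged sketch of that layer, with the field names and default values made up for illustration:

```python
# Hypothetical, service-specific knowledge the comparison code is forced to carry.
KNOWN_DEFAULTS = {"max_connections": 25}      # applied by the service when omitted
WRITE_ONLY_FIELDS = {"initial_password"}      # accepted on write, never returned on read

def normalize_for_comparison(blueprint_element: dict, read_result: dict) -> tuple:
    """Adjust both sides so only meaningful differences remain."""
    desired = dict(blueprint_element)
    actual = dict(read_result)

    # Apply the service's known defaults to the blueprint side.
    for field, default in KNOWN_DEFAULTS.items():
        desired.setdefault(field, default)

    # Drop fields the read operation will never echo back.
    for field in WRITE_ONLY_FIELDS:
        desired.pop(field, None)
        actual.pop(field, None)

    return desired, actual
```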
Metadata and other oddities
Assuming you're past all of the above hurdles, the report is likely still too noisy.
Your blueprinting system is, after all, an attempt to add value on top of what the underlying systems already offer. It would be unusual if you could achieve this without adding more data that is specific to your own system: categorizing it, adding conditionals, encoding customer preferences of all sorts that reside on a higher level of abstraction. When this metadata is tied to specific elements in the blueprint, remember to purge those values before doing a comparison.
Another piece of "metadata", if you can call it that, is authentication. How your system identifies itself to the underlying system is an important piece of information. After all, if we're talking about AWS accounts, the AWS account ID is part of the object's logical identifier. You need to know, say, the access key and secret key of an IAM role in that account. But none of this is an argument to any read/write operation and won't be part of the return value. It makes sense to tie this information logically (directly or indirectly) to the object, but it must be stripped out before the comparison, because it will never appear in what you read back.
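A small sketch of that purge step. The key names are invented; the point is only that anything your own layer attached to the object, credentials included, has to disappear from both sides before the diff runs:

```python
# Hypothetical keys added by the blueprinting layer, never sent to the underlying service.
TOOL_ONLY_KEYS = {"tool_metadata", "deployment_stage", "credentials"}

def strip_tool_metadata(element: dict) -> dict:
    """Return a copy of the element without tool-specific or credential fields."""
    return {k: v for k, v in element.items() if k not in TOOL_ONLY_KEYS}
```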
It's still too noisy
Did you think all of the above was the hard part? Now comes the hard part. You've purged the report of completely unhelpful differences, what we might simply call "garbage". What's left are meaningful, important differences, but the report still contains hundreds of differences, not a manageable number. The reader is still drinking from a firehose.
Suppose you're looking at the infrastructure of a service that spans fifty regions. Forty-four of the differences look like this:
- Region us-east-1: Max connections changed from 25 to 50
- Region us-west-2: Max connections changed from 25 to 50
- Region ap-south-1: Max connections changed from 25 to 50
At first glance, there is a pattern here, which is clear to you or me, although not to an algorithm. Spotting similarities is the sort of work that you might task a GenAI system with doing, but trusting an unpredictable system to make important observations carries risk. These are the sorts of questions this problem poses:
- When did all these differences arise? Is your blueprint wrong? Did somebody change something without your knowledge? Why?
- Can all the differences be grouped like this? How similar is similar enough? Are any of them qualitatively different? What if just one item says "25 to 30", or "Max queue size" instead of "Max connections"?
- There are 44 differences, but 50 regions. Which are the outliers here? The 44 or the unmentioned 6?
- What if some regions happen to receive higher traffic than others and therefore have different requirements? If you execute your blueprint, will you squash an important difference?
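The crudest form of grouping, collapsing differences that are identical once the region is stripped out, is easy enough to write, and it also shows why the questions above don't go away: anything that isn't an exact repetition falls straight through. A sketch:

```python
from collections import defaultdict
import re

def group_differences(differences: list) -> dict:
    """Group difference strings that are identical once the region is removed.

    Assumes each difference reads like
    "Region us-east-1: Max connections changed from 25 to 50".
    """
    groups = defaultdict(list)
    for diff in differences:
        match = re.match(r"Region (\S+): (.*)", diff)
        if match:
            groups[match.group(2)].append(match.group(1))
        else:
            groups[diff].append(None)          # doesn't fit the pattern: its own group
    return dict(groups)

# Identical changes collapse to one line; the odd one out stays visible.
report = group_differences([
    "Region us-east-1: Max connections changed from 25 to 50",
    "Region us-west-2: Max connections changed from 25 to 50",
    "Region ap-south-1: Max queue size changed from 25 to 50",
])
# {'Max connections changed from 25 to 50': ['us-east-1', 'us-west-2'],
#  'Max queue size changed from 25 to 50': ['ap-south-1']}
```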