Success is invisible

A former Amazon teammate, Neil Macneale, posted on Dolt's blog about the project we worked on together in the 2010s: easing the company off Perforce (and some lesser-used source control systems including Subversion and CVS) and onto Git. I recommend reading it if you haven't already:

Enterprise Git - The Amazon Story

I was unfamiliar with Git until Neil introduced me to it. He made me a believer, then encouraged me to join his team. I remember the excitement of building a newer, better source control system to power all the company's developers, and making it as sturdy as possible. And I can say we succeeded. These days at Amazon, pushing code changes just works. It's a utility that you don't even think about. I imagine a lot of Amazon developers wouldn't even know what you were referring to, if you said "GitFarm" to them. They know they use Git, but they have no need to know the details.

That's true success: when the system works so well that it becomes invisible.

Before we could get there, though, we had to reckon with Perforce.

It doesn't scale

Amazon is infamous for its criticism of shrink-wrap software: "It doesn't scale." The company licenses externally developed systems, like Remedy for ticketing or ReviewBoard for code reviews, and they work for a while, but ultimately its engineers outgrow them. The legacy system's Amazonian maintainers flail about in search of a replacement, and often give up and build something themselves. In this case, Perforce was the system we outgrew, and GitFarm was the in-house replacement.

I should be clear that I'm not criticizing any of these products. For most use cases, they do a fine job. They just couldn't handle the sheer volume of traffic and data that Amazonians generate.

As Neil noted, Amazon's primary Perforce server "was running on the largest host we could get our hands on", an absolute monster with multiple terabytes of disk space RAID-linked to form one logical drive. Amazon's data centers held no larger machines than this.

Under the hood, Perforce used RCS and some tremendously large BDB files to track source files and their revisions. Each change was tagged with a number, the CLN ("change list number"). Querying Perforce about a CLN would tell you who submitted the change, when they submitted it, and all the new file revisions added in that transaction.
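
A minimal Ruby sketch of that kind of query, just for illustration (the helper name and the CLN are invented; "p4 describe -s" is the real command, which reports the change without file diffs):

    # Hypothetical sketch: ask Perforce what a given CLN contains.
    # The output names the submitter, the date, the description, and the
    # file revisions added in that change.
    def describe_cln(cln)
      output = `p4 describe -s #{cln}`
      raise "p4 describe failed for CLN #{cln}" unless $?.success?
      output
    end

    puts describe_cln(1234567)   # "Change 1234567 by someuser@client on ..."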

In Amazon's early days, no system had yet been developed to organize its files into separate projects. Directory structures in Perforce were somewhat arbitrary and ad hoc. Somewhere along the way, the company devised the Brazil build system, which organized the code into independent packages with metadata, including declared dependencies on other Brazil packages. A directory structure was established within Perforce, placing these packages into some larger categories, and ultimately assigning each package a unique path within the repository. These paths were, as far as Brazil was concerned, windows into entirely separate projects. But to Perforce, they were all just folders in one gigantic versioned file-space.

Since the server was already on the largest hardware Amazon had, scaling it vertically -- migrating it to a still-larger machine -- was not an option. At one point, the company tried to scale the system horizontally, creating a second Perforce server and attempting to allocate some packages there. As it turned out, though, this was very labor-intensive and not very successful. The division of resources between servers was nowhere near 50/50. The vast majority of packages and files continued to accumulate in the main server.

Over time, the big system behaved worse and worse. There were multiple occasions when corruption of the underlying BDBs, or discrepancies between them, created headaches for developers and for the system's maintainers (my team). Even when this corruption had been pinpointed, it wasn't always possible to clear it up. The system was in danger of failing at a primary duty of source control systems: to say in detail, and with certainty, exactly what changes were made to a file, when, and by whom.

There was also a ticking-time-bomb aspect to the situation. Day by day, the monster's disk was filling up, and we knew it was impractical to acquire a larger one.

The loopy devil

There were other problems with Amazon's Perforce setup that weren't really Perforce's fault. These had to do with the scripts Amazonians had written to interface with Perforce.

For one thing, our Perforce setup had authorization, but not authentication. That is, you could establish who did or didn't have permission to write to a directory. But it was easy enough for anyone to pass an option to a p4 command to impersonate someone else. The users did not, generally, have assigned passwords, because the scripts often needed to assume those users' permissions. Doubtless this problem could have been solved in some way, but it wasn't.
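
To make the gap concrete, here's a hypothetical sketch (assuming the global -u option; the user name and description are invented). With no password behind the user name, nothing more than this was needed to act as someone else:

    # Sketch of authorization without authentication: nothing stopped a
    # script (or a person) from naming another user on the command line.
    system("p4", "-u", "some_other_user", "submit",
           "-d", "A change that appears to come from someone else")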

It got worse. Certain operations, such as establishing a new package (backed by a new directory), required administrative access to the server. This happened often enough that it had to be automated. The necessary scripts (mostly written as Perl modules) were installed directly onto Amazon developers' desktop machines. (In the earlier days, these were physical desktop boxes running Linux and sitting on or under the developer's desk. Gradually these were phased out in favor of cloud computers, but they remained systems to which the developer had unrestricted access.)

Buried within those libraries, if you knew where to look, was the administrator password of the Perforce server -- in plain text. Yes, this password was distributed in the clear to every Amazon developer, and if someone had realized this and intended mischief, they could have used it to execute nasty commands like obliterate (a hard delete operation). Obviously we did not publicize this fact.

Since the password was required for those operations to complete, and was hard-coded into the scripts, it also went unchanged for years. The password had a "666" in it, and part of it looked a bit like the word "loopy", so within our team it became informally known as the loopy devil password.

I can safely reveal all of this now, because the security gap was eventually closed, and Perforce is now years in the rearview mirror. But just for fun, I'll note that after we finally secured the password (so it was no longer distributed in plaintext) and then changed it, we all got T-shirts bearing the loopy devil password, the ultimate in-joke.

Our unfortunate Perforce setup taught us all some important lessons about what not to do. GitFarm was designed with security in mind from the beginning. Each Git repository in the 'Farm had an associated ownership group, and since ssh was used to connect to GitFarm, every user was authenticated every time. If administrative actions needed to be taken, they could only be performed by members of our team calling the server's API.

Trust, but verify

It's one thing to build a new source control system. It's another to carry forth all the data from the old one.

At some earlier point, another engineer named Brian Maher had assembled a collection of scripts, something like 95% Perl and 5% Python, capable of querying Perforce for the history of a directory-path and assembling a corresponding Git repository, covering just those file-revisions. As Neil noted in his blog, this script became a part of GitFarm. Initially it served as a sort of backup of Perforce, in case the worst were to happen. It also served as a load test of the new GitFarm system, generating traffic that highlighted issues and allowed us to build confidence in the replacement.
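
Stripped of every edge case, the core of such an import loop can be sketched in a few lines. This is hypothetical Ruby rather than the real Perl and Python, and the depot path is invented; the actual importer also had to cope with deletes, renames, symlinks, author mapping, and much more:

    # Hypothetical sketch of an import loop: replay each CLN that touched
    # a depot path as a Git commit, oldest first.
    DEPOT_PATH = "//depot/some/package/..."   # invented path

    # "p4 changes" lists matching changes, newest first, one per line.
    clns = `p4 changes #{DEPOT_PATH}`.lines.map { |line| line[/^Change (\d+)/, 1] }.compact.reverse

    clns.each do |cln|
      # Bring the client workspace to the state of this change...
      system("p4", "sync", "#{DEPOT_PATH}@#{cln}") or raise "sync failed at CLN #{cln}"
      # ...then record that state as a Git commit tagged with the CLN.
      system("git", "add", "-A")
      system("git", "commit", "--allow-empty", "-m", "Imported from Perforce CLN #{cln}")
    end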

A few brave customers were willing, early on, to switch their workflows entirely to Git. To facilitate this, client-side scripts ensured that new Git commits would be translated back into Perforce CLNs, since Perforce remained the authoritative system. A customer's newly created commits were, in essence, translated into p4 commands and then discarded, and GitFarm subsequently imported the resulting CLNs to create slightly different commits.

As it happened, I became the primary maintainer of that Perl-and-Python import script.

Before too long, I noticed it was challenging to determine whether the Perforce-derived Git history of a package was actually correct. It generally looked right, but since I was tasked with investigating customer issues, there were times when I knew something was off. There was no verification on hand to prove that the imported history matched the original. In an effort to debug some of those customer issues, I set out to write a verifier. I elected not to use any of the existing code, for fear of replicating any bad assumptions. Also, I decided to write the verifier in Ruby.

This was not a random or off-the-wall choice of language. The heart of GitFarm at that time consisted of the standard Git executable, invoked by wrapper scripts written in Ruby. I believe Ruby was chosen because it was flexible and quick to write, and good at interfacing with the Git executable's I/O streams. Fun fact: The primary GitFarm server was written in Java, because that was, at the time of GitFarm's creation, one of a handful of languages compatible with Amazon's Coral framework for devising server APIs. The Java server would invoke the Ruby scripts and parse their outputs. This meant that together with the Perforce importer, GitFarm at that time leveraged four languages: Java for the server, Ruby for the primary scripts, Perl for most of the Perforce import logic, and Python to import raw file revisions.
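
The core question the verifier asked was simple, even if the edge cases weren't: for a given file revision, does the content stored in Perforce match the content in the corresponding Git commit? A hypothetical Ruby sketch of that check (the paths, revision number, and commit are invented; the real script walked entire histories and handled many more cases):

    # Hypothetical sketch of the verifier's core comparison.
    def content_matches?(depot_file, p4_rev, git_commit, repo_path)
      perforce_content = `p4 print -q #{depot_file}##{p4_rev}`   # raw content of that revision
      git_content      = `git show #{git_commit}:#{repo_path}`   # same file at the imported commit
      perforce_content == git_content
    end

    unless content_matches?("//depot/pkg/mainline/lib/foo.rb", 7, "abc1234", "lib/foo.rb")
      puts "MISMATCH: //depot/pkg/mainline/lib/foo.rb#7"
    end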

That way madness lies

At first, as I developed the verifier script, I kept encountering bugs in the verifier itself. I would find a discrepancy, investigate, and determine that the Perl-based importer was right and my verifier was wrong.

With further progress, I started to encounter cases where the importer's logic could be considered incorrect, or at least incomplete. There were edge cases, and occasionally these sparked debates within the team as to what the right behavior was. I made some fixes to the importer and moved on.

In time, I found out exactly how deep the rabbit-hole goes.

As it turns out, Perforce stores files in file-paths. Now that sounds obvious when you say it like that. But evidently, Perforce's designers did not consider some thorny edge cases.

What happens, for example, when the file-path contains a symbolic link ("symlink")? The answer is, a lot of things could happen. Checking out a directory from Perforce could result in recreating a symbolic link as it was originally submitted, then writing a file (or entire directory of files) through the symbolic link.

Git is robust enough never to do this. Git will let you commit a symlink, but won't let you commit another file "through" that link. You cannot import Perforce history like this and produce Git behavior identical to Perforce's, and frankly, you shouldn't want to.

There were few restrictions on what you could do with Perforce's symlinks. You could have a regular directory within your project folder, and also a symlink pointing to the same directory. Then you could submit one copy of a file through the regular directory, and another copy via the symlink path. Perforce would maintain both copies independently. They had different paths, so Perforce considered them different files. It was as if the file were leading a double life.

What, then, would happen when you checked out the folder locally? One copy would overwrite the other.
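
To illustrate the double life with invented paths: suppose the depot holds a real directory, a symlink pointing at it, and one copy of a file submitted through each. A hypothetical Ruby sketch of how the collision shows up once the symlinks are followed:

    # Hypothetical illustration: two depot paths that collapse onto one
    # on-disk file once the symlink is resolved. All paths are invented.
    symlinks = { "//depot/pkg/linkdir" => "//depot/pkg/realdir" }   # link => target

    def resolve(path, symlinks)
      symlinks.reduce(path) { |p, (link, target)| p.sub(/\A#{Regexp.escape(link)}/, target) }
    end

    paths = ["//depot/pkg/realdir/config.xml", "//depot/pkg/linkdir/config.xml"]
    paths.group_by { |p| resolve(p, symlinks) }.each do |on_disk, depot_paths|
      puts "collision at #{on_disk}: #{depot_paths.join(', ')}" if depot_paths.size > 1
    end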

Another fun fact: It even proved possible to create a symlink pointing to some location entirely outside the end-user's working directory, such as the system's /etc directory (wherein lurks the password file and other critical stuff). If a file was then created through that symlink, Perforce would read from, and at least try to write to, that file.

It became necessary for me to reach out to the owners of impacted projects, explain what had happened, and suggest ways to correct the contents in Perforce to eliminate the contradictions. Only then did it make sense to import the history.

Automated verification

Gradually our confidence grew, both in the importer and in its supportive twin, the verifier. Meanwhile, we were putting the pieces in place for GitFarm repositories to be considered the definitive, authoritative source of a software package, leaving Perforce behind. This transition was to happen one package at a time.

A system called ASCI (Aggregated Source Control Information service) was responsible for tracking what source changes existed, and where to find them. Whenever a change was submitted to any system (Perforce, Subversion, CVS, or the newcomer GitFarm), metadata about the change would be added to ASCI. The main Brazil system, with its own database, also bore responsibility for stating, with authority, which source control system "owned" a given package.

It turned out that Brazil, ASCI, and Perforce did not always agree on which packages and changes existed. The duties of the verifier expanded to include identifying these cross-system discrepancies as well. In some cases, it was able to correct the discrepancies itself. In other cases, we had to investigate and perform operational maintenance. We gradually accumulated administrative scripts to perform commonly repeated corrective actions.

We reached the point where an automated action could transfer ownership of a package from Perforce to GitFarm. The verifier would double-check everything, then throw the switch in Brazil. The files would go on existing in Perforce, but as stale copies.

The verification process could take several minutes to complete, so I added automatic, proactive verification. Following the import of any new Perforce CLN into GitFarm, the verifier would run and set a flag indicating the outcome. If it had succeeded, and if no further CLNs had been submitted since, migration to GitFarm could happen almost instantly.
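
The bookkeeping behind that was simple in principle. A hypothetical Ruby sketch (the names are invented, and the real flag lived in the service rather than an in-memory hash):

    # Hypothetical sketch of proactive-verification bookkeeping: remember the
    # CLN at which a package's imported history last verified cleanly.
    verified_at = {}   # package name => CLN whose import passed verification

    def ready_to_migrate?(verified_at, package, latest_cln_in_perforce)
      # Ready only if verification passed and nothing newer has landed since.
      verified_at[package] == latest_cln_in_perforce
    end

    verified_at["SomePackage"] = 1234567
    puts ready_to_migrate?(verified_at, "SomePackage", 1234567)   # true: migrate now
    puts ready_to_migrate?(verified_at, "SomePackage", 1234570)   # false: re-verify first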

Tracking this information also gave us a window into which package histories were still not verifiable. This served as a progress indicator that was previously unavailable to us. We could now see how many packages were ready to move.

The big push

By this point, many of our customers had moved their packages because they wanted to. We began reaching out to the other package owners in earnest, telling them they would soon have to migrate. We made clear our intention to shut Perforce off.

As you can imagine, Amazon owns a lot of code. Not all of it has a clear owning team. Some of this code was written years ago, might still be in use, but rarely needed changes.

Neil mentioned in his blog post that there were about 40K Brazil packages in 2009. By the time we were preparing to shut Perforce off, this number was approaching 100K.

After giving the company sufficient heads-up, I kicked off a script to auto-migrate more than 90K of them.

I learned an interesting lesson in the years following this exercise: When you're mass-migrating something, don't stamp your own name on it. Because I had initiated the package moves using my account, my username was left imprinted on all 90K+ Brazil packages. It was chiseled there indelibly, as part of the Git history. Even though it was only visible in the metadata of the package (migration did not change the contents of any files), my name was still there to be found. More than eight years later, I occasionally get contacted by developers hoping I can explain something about a very old package. I have to explain to them that I don't own the package and bear no responsibility for, nor knowledge of, its contents.

Fortunately, though, what we did not hear was a huge outcry. We didn't break the company. We freed all those packages from Perforce's limitations, and life continued on.

The stragglers

In any large and long-lived data plane with insufficient safeguards, you can expect things to get messy. Within the long tail of Brazil packages hosted in Perforce, we encountered some goofy cases.

What stands out in my memory is a single package that contained several other packages. Brazil packages weren't meant to be nested. But recall that, as far as Perforce was concerned, it was all just file-paths. The concept of a "Brazil package" did not exist in Perforce-world. As long as developers were able to check out the relevant directory and make changes, and as long as Brazil was able to build the code within that directory and produce working output, the fact that the packages were nested made no practical difference to either. But it played merry hell with the importer.

We had some conversations with the package owners, and worked out a way to move forward. If I recall, we managed to import the nested packages, but not the "parent" package. The code's owners found another way to preserve the outlier package's code.

In time, the number of Perforce-hosted packages dwindled -- not quite to zero, but to a handful, few enough that we could justify letting the rest drop.

After another heads-up announcement, we turned Perforce off. We kept the server around for a while, inactive. When no further problems arose, we kept only the backups and released the server.

We haven't looked back.

A good fit for the project

I enjoyed this work because it solved an important problem impacting the entire company, while playing to my strengths:

  • I'm detail-oriented. I used the verifier to hunt down discrepancies. When we migrated a package, I was confident that its current contents, and its entire history, matched what was in Perforce (to the extent that what was in Perforce was sensible).
  • I'm a deep-diver and a persistent cuss. The longer I work on a thorny problem, the more I learn about it. Given time, I become the subject-matter expert. I've forgotten a lot of the details in the intervening eight-plus years, but back then, I could have told you more about Perforce's oddities than you would ever care to know.
  • I like to build tools to solve problems. When I can automate a process to find issues, fix issues, and make things better in some way, I'll do that. The more tools I make, the more progress I can make with them. Sometimes I build bigger tools on top of the original ones.
  • I'm pragmatic. I have a clear favorite language (I'm a Ruby enthusiast) but I never rewrote the Perl/Python importer. Once I could prove it worked, I just maintained it as-is because that took less time and effort. I kept maintaining it until the migration rendered it moot, then we threw it in the bin.
  • I obsess over quality, and it shows. My teammates and I took great care in making GitFarm as hardy as we could. We built it like a tank; we built it to last. The company doesn't experience broad outages any more. The source control system has become invisible, as it should be.

