Seibel: I’m curious about the process that you went through of scaling LiveJournal, in terms of where you started and how you learned the lessons you needed to learn along the way.
Fitzpatrick: So it started on one shared Unix box with other customers and pretty much killed that.
Seibel: Running as CGIs?
Fitzpatrick: Yes. Yeah. I think it was probably a literal CGI, fork up the whole world and die. There was a guy assigned to me at this ISP. I was having problems with my server dying all the time. I’m like, “I paid my $10.00 a month. Why isn’t it working?” So he would say, “Oh, do this.” Pretty soon I was learning Unix and learning what was actually going on.
Then I converted to FastCGI. Then I tuned Apache and turned off reverse DNS lookups. All these steps you go through. Finally, I was I/O-bound or CPU-bound. Then I got my own dedicated server, but it was still just one and it was dying and I was out of capacity. I had originally opened it up for my friends and I just left the signup page alive. Then they invited their friends who invited their friends—it was never really supposed to be a public site. It just had an open signup page by accident. So then I put something up on the LiveJournal news page and I said, “Help. We need to buy servers.”
I think that raised maybe six or seven thousand dollars or something to buy these two big Dells and put them in Speakeasy in downtown Seattle. Somebody recommended some servers, Dells, these huge 6U things, like ninety pounds each. The logical split was the database server and the web server. That was the only division I knew because I was running a MySQL process and an Apache process.
That worked well for a while. The web servers spoke directly to the world and had two network cards and had a little crossover cable to the database server. Then the web server got overloaded, but that was still fairly easy. At this point I got 1U servers. Then we had three web servers and one database server. At that point, I started playing with three or four HTTP load balancers—mod_backhand and mod_proxy and Squid and hated them all. That started my hate for HTTP load balancers.
The next thing to fall over was the database, and that’s when I was like, “Oh, shit.” The web servers scale out so nicely. They’re all stateless. You just throw more of them and spread load. So that was a long stressful time. “Well, I can optimize queries for a while,” but that only gives you another week until it’s loaded again. So at some point, I started thinking about what does an individual request need.
That’s when—I thought I was the first person in the world to think of this—I was like, we’ll shard it out—partition it. So I wrote up a design doc with pictures saying how our code would work. “We’ll have our master database just for metadata about global things that are low traffic and all the per-blog and per-comment stuff will be partitioned onto a per-user database cluster. These user IDs are on this database partition.” Obvious in retrospect—it’s what everyone does. Then there was a big effort to port the code while the service was still running.
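In rough code, the partitioning he describes might look something like the sketch below. This is not LiveJournal’s actual code; the user IDs, cluster numbers, and DSNs are invented. The point is only that the global master knows which cluster owns a user, and everything per-user lives on that cluster’s database.

```perl
#!/usr/bin/perl
# Hypothetical sketch of per-user partitioning: a global master maps each
# user ID to a cluster number, and all per-blog/per-comment data for that
# user lives on that cluster's database. All names here are illustrative.
use strict;
use warnings;

# Pretend "master" lookup: userid => cluster number (really a MySQL table).
my %user_cluster = (
    1001 => 1,
    1002 => 2,
    1003 => 1,
);

# Connection strings for the per-user database clusters.
my %cluster_dsn = (
    1 => "DBI:mysql:database=userdb;host=cluster1",
    2 => "DBI:mysql:database=userdb;host=cluster2",
);

# Given a user ID, return the DSN of the database holding that user's data.
sub dsn_for_user {
    my ($userid) = @_;
    my $cluster = $user_cluster{$userid}
        or die "unknown user $userid";
    return $cluster_dsn{$cluster};
}

print dsn_for_user(1002), "\n";   # => DBI:mysql:database=userdb;host=cluster2
```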
Seibel: Was there a red-flag day where you just flipped everything over?
Fitzpatrick: No. Every user had a flag basically saying what cluster number they were on. If it was zero, they were on the master; if it was nonzero, they were partitioned out. Then there was a “Your Account Is Locked” version number. So it would lock and try to migrate the data and then retry if you’d done some mutation in the meantime—basically, wait ’til we’ve done a migration where you hadn’t done any write on the master, and then pivot and say, “OK, now you’re over there.”
This migration took months to run in the background. We calculated that if we just did a straight data dump and wrote something to split out the SQL files and reload it, it would have taken a week or something. We could have a week of downtime or two months of slow migration. And as we migrated, say, 10 percent of the users, the site became bearable again for the other ones, so then we could turn up the rate of migration off the loaded cluster.
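A rough sketch of that migration loop, with in-memory stand-ins for the real master database and account locking (the exact ordering of the copy and lock steps here is an assumption): copy while the user is still live, lock briefly, and pivot the cluster flag only if no writes landed on the master during the copy.

```perl
#!/usr/bin/perl
# Sketch only: migrate one user to a target cluster, retrying until a copy
# completes with no intervening writes on the master. The hashes below
# stand in for real database state.
use strict;
use warnings;

my %master_version = ( 1001 => 7 );   # bumped on every write to the master copy
my %cluster_flag   = ( 1001 => 0 );   # 0 = still on master, nonzero = partitioned out
my %locked;

sub lock_account   { $locked{$_[0]} = 1 }       # "Your Account Is Locked"
sub unlock_account { delete $locked{$_[0]} }
sub master_version { $master_version{$_[0]} }
sub copy_user_rows { }                          # pretend slow bulk copy to the target cluster

sub migrate_user {
    my ($userid, $target_cluster) = @_;
    while (1) {
        my $before = master_version($userid);
        copy_user_rows($userid, $target_cluster);   # user is still live during this

        lock_account($userid);                      # brief locked window
        if (master_version($userid) == $before) {
            $cluster_flag{$userid} = $target_cluster;   # pivot: "now you're over there"
            unlock_account($userid);
            return 1;
        }
        unlock_account($userid);                    # a write snuck in; copy again
    }
}

migrate_user(1001, 3);
print "user 1001 now on cluster $cluster_flag{1001}\n";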
Seibel: That was all pre-memcached and pre-Perlbal.
Fitzpatrick: Yeah, pre-Perlbal for sure. Memcached might have come after that. I don’t think I did memcached until like right after college, right when I moved out. I remember coming up with the idea. I was in my shower one day. The site was melting down and I was showering and then I realized we had all this free memory all over the place. I whipped up a prototype that night, wrote the server in Perl and the client in Perl, and the server just fell over because it was just way too much CPU for a Perl server. So we started rewriting it in C.
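The core idea, roughly sketched below (the server pool and keys are made up, and a real client also handles expiry, dead servers, and the wire protocol): hash each cache key to pick which machine’s spare memory holds it.

```perl
#!/usr/bin/perl
# Sketch of the basic memcached idea: treat spare memory on a pool of
# machines as one big cache by hashing each key to choose a server.
use strict;
use warnings;
use Digest::MD5 qw(md5);

my @servers = ("web1:11211", "web2:11211", "web3:11211");

# Map a cache key to one server in the pool.
sub server_for_key {
    my ($key) = @_;
    my $n = unpack("N", md5($key));    # first 32 bits of the MD5 as an integer
    return $servers[$n % @servers];
}

print "user:1001 -> ", server_for_key("user:1001"), "\n";
print "post:42   -> ", server_for_key("post:42"),   "\n";
```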
Seibel: So that saved you from having to buy more database servers.
Fitzpatrick: Yeah, because they were expensive and slow to migrate. Web servers were cheap and we could add them and they would take effect immediately. You buy a new database and it’s like a week of setup and validation: test its disks, and set it all up and tune it.
Seibel: So all the pieces of infrastructure you built, like memcached and Perlbal, were written in response to the actual scaling needs of LiveJournal?
Fitzpatrick: Oh, yeah. Everything we built was because the site was falling over and we were working all night to build a new infrastructure thing. We bought one NetApp ever. We asked, “How much does it cost?” and they’re like, “Tell us about your business model.” “We have paid accounts.” “How many customers do you have? What do you charge?” You just see them multiplying. “The price is: all the disposable income you have without going broke.” We’re like, “Fuck you.” But we needed it, so we bought one. We weren’t too impressed with the I/O on it and it was way too expensive and there was still a single point of failure. They were trying to sell us a configuration that would be high availability and we were like, “Fuck it. We’re not buying any more of these things.”
So then we just started working on a file system. I’m not even sure the GFS paper had been published at this point—I think I’d heard about it from somebody. At this point I was always spraying memory all over just by taking a hash of the key and picking the shard. Why can’t we do this with files? Well, files are permanent. So we should actually record where each file is, because the configuration will change over time as we add more storage nodes. That’s not much I/O, just keeping track of where stuff is, but how do we make that high availability? So we figured that part out, and I came up with a scheme: “Here’s all the reads and writes we’ll do to find where stuff is.” And I wrote the MySQL schema first for the master and the tracker for where the files are. Then I was like, “Holy shit! Then this part could just be HTTP. This isn’t hard at all!”
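As a rough sketch of that scheme (the hosts, paths, and keys are invented, and the real tracker lives in MySQL rather than a Perl hash): the database only records where each file’s replicas live, and the bytes themselves move over plain HTTP.

```perl
#!/usr/bin/perl
# Sketch of the file-tracker idea: the tracker records which HTTP locations
# hold a copy of each file key; clients ask the tracker, then fetch over
# HTTP. The in-memory hash stands in for the tracker's MySQL tables.
use strict;
use warnings;

# tracker: file key => list of HTTP locations holding a replica
my %locations = (
    "userpic:1001:3" => [
        "http://stor1.internal/dev5/0/000/123.fid",
        "http://stor4.internal/dev2/0/000/123.fid",
    ],
);

# Ask the tracker where a file is, then fetch it from any replica over HTTP.
sub fetch_file {
    my ($key) = @_;
    my $paths = $locations{$key} or die "tracker has no entry for $key";
    for my $url (@$paths) {
        print "would GET $url\n";   # a real client would try the next replica on failure
        return;
    }
}

fetch_file("userpic:1001:3");
```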
I remember coming into work after I’d been up all night thinking about this. We had a conference room downstairs in the shared office building—a really dingy, gross conference room. “All right, everyone, stop. We’re going downstairs. We’re drawing.” Which is pretty much what I said every time we had a design—we’d go find the whiteboards to draw.
I explained the schema and who talks to who, and who does what with the request. Then we went upstairs and I think I first ordered all the hardware because it takes two weeks or something to get it. Then we started writing the code, hoping we’d have the code done by the time the machines arrived.
Everything was always under fire. Something was always breaking so we were always writing new infrastructure components.
Seibel: Are there things that if someone had just sat you down at the very beginning and told you, “You need to know X, Y, and Z,” that your life would have been much easier?
Fitzpatrick: It’s always easier to do something right the first time than to do a migration with a live service. That’s the biggest pain in the ass ever. Everything I’ve described, you could do on a single machine. Design it like this to begin with. You no longer make assumptions about being able to join this user data with this user data or something like that. Assume that you’re going to want to load these 20 assets—your implementation can be to load them all from the same table but your higher-level code that just says, “I want these 20 objects” can have an implementation that scatter-gathers over a whole bunch of machines. If I would have done that from the beginning, I’d have saved a lot of migration pain.
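One way that higher-level “I want these 20 objects” call could be structured, sketched here under invented names: group the requested IDs by the partition that owns them, issue one bulk fetch per partition, and merge the results, so callers never assume the objects share a table.

```perl
#!/usr/bin/perl
# Sketch of a scatter-gather "load objects by ID" interface. The partition
# map and load_from_partition() are stand-ins; on a single machine the same
# interface could be backed by one table.
use strict;
use warnings;

my %partition_of = ( 1 => "dbA", 2 => "dbB", 3 => "dbA", 4 => "dbC" );

# Stand-in for a per-partition bulk query: returns (id => object) pairs.
sub load_from_partition {
    my ($partition, @ids) = @_;
    my %rows;
    $rows{$_} = "object $_ from $partition" for @ids;
    return %rows;
}

# Higher-level API: callers just ask for IDs and get objects back.
sub load_objects {
    my (@ids) = @_;

    # Group requested IDs by the partition that owns them.
    my %by_partition;
    push @{ $by_partition{ $partition_of{$_} } }, $_ for @ids;

    # One bulk fetch per partition, merged into a single result.
    my %objects;
    for my $partition (keys %by_partition) {
        my %fetched = load_from_partition($partition, @{ $by_partition{$partition} });
        @objects{ keys %fetched } = values %fetched;
    }
    return \%objects;
}

my $objs = load_objects(1, 2, 3, 4);
print "$_ => $objs->{$_}\n" for sort keys %$objs;
```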