A Case for an Early Rewrite

I’ve started listening to Maintainable, and one of Robby’s standard questions is whether you’re on Team Refactor or Team Rewrite.

I’m on Team Refactor. The best way to achieve a rewrite is through a series of incremental refactors.

So it’s a little awkward to admit that the best refactor I ever shipped started as a rewrite. The two aren’t opposites. Done right, the rewrite is the thing that buys you the right to refactor. And a looming deadline is the best excuse you’ll ever get to start early.

Background

I joined MealPal in 2016 as the first engineer in New York. The first version of the app, the one that proved product-market fit and earned the company a Series A, was built on Parse, a hosted backend-as-a-service. At the time it was essentially a fully-managed, lambda-like API layer sitting on MongoDB, with a spreadsheet-like admin interface bolted on top. You wrote “cloud code” functions, pushed them, and Parse ran them. It was a fantastic way to get to market fast without an ops team.

Facebook bought Parse in 2013. In early 2016 they announced it would shut down for good on January 28th, 2017. By the time we got serious about a replacement in late September, we had four months.

That deadline was a gift. We had 15,000 subscribers across six cities, four codebases nobody could hold in their head at once, and roughly 300 requests a second hitting api.parse.com every lunch rush. Most bugs were non-obvious. Some failures went unnoticed for days. Everyone agreed the foundation wasn’t enough for what came next, but agreement isn’t time, and there’s never time. The shutdown gave us the one thing a rewrite always lacks: permission to say no to new features and rebuild the thing underneath.

The Constraint

Here’s the part people skip when they argue rewrite versus refactor. We had iOS, Android, and Angular web clients in the field, in the hands of paying customers, that we could not change on our schedule. App Store review alone meant a bad cut could strand users for a week.

So the rewrite had a hard rule baked in from day one: customers could not feel it. Whatever we built had to speak Parse’s exact wire protocol: same routes, same payloads, same cloud-code contracts. The apps already on phones would never know the backend had been swapped out from under them.

We split the work into two parallel tracks.

Project Shadow was the insurance policy. Stand up our own open-source Parse server on our own MongoDB, point the clients at api.mealpal.com instead of api.parse.com, and prove we could carry the load ourselves. If everything else slipped, Shadow alone got us off Parse before the lights went out. A lift-and-shift: same architecture, new landlord.

Project Idol was the bet. The codename is the Indiana Jones gag. Indy eases the golden idol off its pedestal and drops a bag of sand in its place, weighed to the gram, hoping the temple never notices the swap. That was the assignment exactly: lift Parse out and set Rails down in its place at precisely the same weight on the wire, so the booby trap, every app already in a customer’s pocket, never sprang.

[Image: Indiana Jones swapping the golden idol for a bag of sand of matching weight]

Idol was a brand-new API in Ruby on Rails, backed by PostgreSQL, extremely faithful to the Parse API and nothing like Parse underneath. The engine that powered it has a one-line summary in its gemspec that still makes me smile:

JSON API Endpoints that act just like Parse.

Idol is where the actual rewrite lived. Shadow let us survive; Idol let us move.

Quacking like Parse

The shape of Idol is easiest to see in the routes. Parse exposes cloud-code functions as POST endpoints under /1/functions/. So we did too, and pointed each one at a clean, conventional Rails action:

scope path: 'functions' do
  post :getCurrentUser, to: 'users#get_current_user'
  post :checkKitchen3,  to: 'kitchens#check_kitchen_three'
  post :getByCity,      to: 'meals#by_city'
  post :reserveMeal2,   to: 'meals#reserve_meal_two'
  # ...fifty more
  post :saveAuditLog,   to: 'stubs#noop'
end

Parse on the outside, idiomatic Rails on the inside. checkKitchen3 is Parse’s name; check_kitchen_three is ours. Parse sends camelCase params, so a before_action ran params.deep_transform_keys!(&:underscore), and the rest of the controller never had to know. And saveAuditLog, a function some old client still called on a timer, we just wired to a no-op. The client kept POSTing into the void, perfectly happy.
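To make the key translation concrete, here is a minimal plain-Ruby sketch of what that before_action does to a payload. It does not use ActiveSupport: the underscore helper below is a simplified stand-in for String#underscore, and the payload field names are invented for illustration.

```ruby
# Simplified stand-in for ActiveSupport's String#underscore; it only
# handles camelCase-style keys, which is all this sketch needs.
def underscore(key)
  key.to_s
     .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
     .gsub(/([a-z\d])([A-Z])/, '\1_\2')
     .downcase
end

# Recursively rewrite every hash key, descending into nested hashes and arrays.
def deep_underscore_keys(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, nested), result|
      result[underscore(key)] = deep_underscore_keys(nested)
    end
  when Array
    value.map { |item| deep_underscore_keys(item) }
  else
    value
  end
end

# Hypothetical Parse-style payload; these field names are not from the real app.
payload = {
  "cityId" => "abc123",
  "pickupWindow" => { "startTime" => "12:00" },
  "mealIds" => [{ "objectId" => "x1" }]
}
normalized = deep_underscore_keys(payload)
```

After this runs, normalized is keyed by city_id, pickup_window, and meal_ids, with the nested keys converted the same way, so a controller written against snake_case never sees the original spelling.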

“Faithful to the protocol” sounds tidy. It isn’t. The devil lives in every place Parse made a decision we wouldn’t have.

Parse uses POST for everything. Even reads. getCurrentUser is a POST. getByCity is a POST. That was the bad day. Somewhere around week three I realized every naive caching strategy I’d sketched was dead on arrival, because browsers and CDNs generally won’t cache the response to a POST. We ended up explicitly opting reads back into caching, which is exactly as backwards as it reads:

def caching_allowed?
  (request.post? || request.get? || request.head?) && response.status == 200
end
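A POST has no URL that identifies what was asked for, so any cache in front of these reads needs a key built from the request body. The sketch below is one way to do that and is an assumption, not the code Idol shipped: post_cache_key and canonicalize are hypothetical helpers, and a real key for per-user data would also need something like the user's id in it.

```ruby
require "digest"
require "json"

# Recursively sort hash keys so logically equal payloads serialize identically.
def canonicalize(value)
  case value
  when Hash
    value.sort_by { |key, _| key.to_s }
         .to_h { |key, nested| [key.to_s, canonicalize(nested)] }
  when Array
    value.map { |item| canonicalize(item) }
  else
    value
  end
end

# Build a cache key from the cloud-function name plus a digest of its params.
def post_cache_key(function_name, params)
  digest = Digest::SHA256.hexdigest(JSON.generate(canonicalize(params)))
  "functions/#{function_name}/#{digest}"
end

# Same params in a different order produce the same key.
key_a = post_cache_key("getByCity", { "cityId" => "nyc", "date" => "2016-11-01" })
key_b = post_cache_key("getByCity", { "date" => "2016-11-01", "cityId" => "nyc" })
```

Sorting the keys before hashing matters because two clients can serialize the same JSON object in different orders, and without it they would never share a cache entry.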

Parse hands out opaque object ids; PostgreSQL wanted UUIDs. The first real milestone wasn’t a route at all. It was proving we could get our data out. ParseModel carried a parse_client that spoke Parse’s own REST API and pulled every record down into Rails.

Once the data was ours, we had a re-keying problem. The apps in the field knew every record by its short Parse objectId, but we’d rebuilt the schema with UUID primary keys, and both stores, Mongo under Shadow and Postgres under Idol, had to point at the same record during the bi-directional sync. We mapped each Parse objectId onto a UUID primary key and kept every original id alongside it while the two stores ran in parallel, then dropped those columns once Parse was gone.
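As a rough illustration of that re-keying, here is an in-memory sketch. IdMap is a hypothetical stand-in for a table with a UUID primary key and a legacy column holding the Parse objectId; the real mapping lived in PostgreSQL columns, and the sample objectId is made up.

```ruby
require "securerandom"

# In-memory stand-in for "UUID primary key plus a legacy Parse objectId column".
class IdMap
  def initialize
    @uuid_by_object_id = {}
    @object_id_by_uuid = {}
  end

  # Return the UUID for a Parse objectId, minting one the first time it is seen.
  def uuid_for(parse_object_id)
    @uuid_by_object_id.fetch(parse_object_id) do
      uuid = SecureRandom.uuid
      @uuid_by_object_id[parse_object_id] = uuid
      @object_id_by_uuid[uuid] = parse_object_id
      uuid
    end
  end

  # Translate back at the edge, for clients that only know the Parse id.
  def object_id_for(uuid)
    @object_id_by_uuid[uuid]
  end
end

ids = IdMap.new
uuid = ids.uuid_for("xWMyZ4YEGZ") # hypothetical objectId
```

Calling uuid_for again with the same objectId returns the same UUID, and object_id_for(uuid) recovers the original id, which is the property both stores needed while they ran in parallel.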

Parse’s session tokens had a particular shape, and thousands of phones were holding them. Parse prefixes its tokens with r:. Our login minted them the same way:

def parse_session_id
  "r:#{session.id}"
end

A small piece of Rack middleware sat in front of the app to promote an old Parse token into a real Rails session on the way in. If a request showed up with no Rails cookie but a valid 32-character Parse token in the header, we forged the cookie and moved on:

def call(env)
  @env = env
  promote_parse_session_token! if !rails_session? && valid_parse_session_token?
  @app.call(env)
end

def valid_parse_session_token?
  parse_session_token.length == 32
end

Every phone in the field stayed logged in through the cutover. Not one forced re-login.
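The excerpt above leaves out the helpers, so here is a self-contained sketch of the same idea that runs without Rails or the rack gem. The helper bodies and the _session_id cookie name are assumptions for illustration; the only outside fact it leans on is that Parse clients send their token in the X-Parse-Session-Token header, which Rack exposes as HTTP_X_PARSE_SESSION_TOKEN.

```ruby
# Rack-style middleware: any object with call(env) that wraps another app.
class ParseSessionPromoter
  HEADER = "HTTP_X_PARSE_SESSION_TOKEN" # Rack's name for X-Parse-Session-Token
  COOKIE = "_session_id"                # hypothetical session cookie name

  def initialize(app)
    @app = app
  end

  def call(env)
    token = parse_session_token(env)
    if !rails_session?(env) && token && token.length == 32
      # Append the forged session cookie, keeping any unrelated cookies.
      existing = env["HTTP_COOKIE"].to_s
      env["HTTP_COOKIE"] = [existing, "#{COOKIE}=#{token}"].reject(&:empty?).join("; ")
    end
    @app.call(env)
  end

  private

  def rails_session?(env)
    env["HTTP_COOKIE"].to_s.include?("#{COOKIE}=")
  end

  # Strip the "r:" prefix; nil when the header is absent or not Parse-shaped.
  def parse_session_token(env)
    raw = env[HEADER].to_s
    raw.start_with?("r:") ? raw.delete_prefix("r:") : nil
  end
end

# A stand-in downstream app that echoes the cookie header it received.
echo_app = ->(env) { [200, {}, [env["HTTP_COOKIE"].to_s]] }
stack = ParseSessionPromoter.new(echo_app)
_status, _headers, body = stack.call("HTTP_X_PARSE_SESSION_TOKEN" => "r:#{'a' * 32}")
```

This version passes env through method arguments rather than storing it on the instance, since one middleware instance usually serves many requests at once.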

Why Rails

Not everyone was thrilled to leave Node behind. I won’t relitigate it here, but the reason was concrete. This is the function I wrote one night early on to check whether a city’s kitchen was open, leaning entirely on ActiveSupport’s timezone support:

class Kitchen
  def self.open_at?(local_time)
    !(ExceptionDate.closed_at?(local_time) || ExceptionDate.closed_at?(local_time.tomorrow)) &&
      local_time.wday.between?(0, 4) &&
      local_time.to_time_of_day >= open_time &&
      local_time < local_time.tomorrow.to_date.at(close_time)
  end
end

The old City model stored a bare integer offset, -4, and leaned on a hand-maintained hash mapping offsets back to zone names. The comment, in full: # because we store -4 in the cities table /facepalm. That’s fine right up until daylight saving time files for divorce. We moved cities to named zones, America/New_York instead of -4, and let ActiveSupport::TimeZone do the arithmetic. At the edge, for the old clients, we translated the named zone back into the integer offset they still expected. Clean model on the inside, Parse-shaped JSON on the wire. That pattern repeated everywhere.

Idol wasn’t a monolith, either. We built it as a hub-and-spoke monorepo. One shared core engine held the ActiveRecord models and the business logic, and each audience got its own Rails app: customer web, the customer API, the merchant portal, the kitchen tools. Every one of them mounted core. One audience per app meant a consumer could be authorization-free by construction, and no two apps had to agree on a gem version to share the same database. Deploying that to Heroku, which expects one app per repo, took a couple of small tools I wrote: unibus and a hub-and-spoke buildpack. It’s also why slicing a consumer off the Parse facade later was tractable: each audience was already its own app.

The Actual Refactor

Here’s what I wrote to the team in that September memo, before any of it had shipped:

The time we spent thinking about and designing the future database schema wasn’t wasted. With every feature, we’ll grab pieces from our ideal schema. Incrementally, we’ll get closer and closer to it, solving little headaches along the way. We’ll also start to add standard, RESTful models to the client applications to gradually phase out our use of the Parse SDK — piecemeal, so as not to take on too much risk at one time.

That’s the whole argument. The rewrite was never the destination. It was the foundation that made incremental work possible at all.

Once Idol was live and faithful, we started eating the elephant. Every legacy POST /1/functions/... route became a candidate to hollow out and replace with a canonical REST route like GET /api/v2/menu, one at a time, behind the compatible shell. New features got built against the clean schema. Old clients kept talking to the Parse-shaped facade until, app release by app release, they didn’t need to anymore.
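One way to picture the hollowing-out, as a plain-Ruby sketch rather than the actual Rails routes: a legacy function route and a canonical REST route both call the same core method, and only the envelope differs. Pairing getByCity with /api/v2/menu, the menu_for_city method, and the sample data are all assumptions for illustration. Wrapping the legacy response in a "result" key follows how Parse returns cloud-function results.

```ruby
require "json"

# Hypothetical shared implementation living in the clean core.
def menu_for_city(city)
  [{ "name" => "Falafel bowl", "city" => city }]
end

# Both routes call the same core method; only the response envelope differs.
def routes
  {
    # Legacy facade: Parse-shaped, wrapped in a "result" key.
    ["POST", "/1/functions/getByCity"] => ->(params) { { "result" => menu_for_city(params["city"]) } },
    # Canonical route: same data in a plain REST shape.
    ["GET", "/api/v2/menu"] => ->(params) { menu_for_city(params["city"]) }
  }
end

def dispatch(verb, path, params)
  handler = routes.fetch([verb, path]) # raises KeyError for unknown routes
  JSON.generate(handler.call(params))
end

legacy = JSON.parse(dispatch("POST", "/1/functions/getByCity", { "city" => "nyc" }))
modern = JSON.parse(dispatch("GET", "/api/v2/menu", { "city" => "nyc" }))
```

Once no released client calls the legacy entry any more, deleting it touches one line of the table and nothing in the core.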

You cannot do that on a foundation you rent and can’t control. Refactoring assumes you own the ground you’re standing on. We didn’t, until the rewrite gave it to us.

What It Took

The rewrite took about three months. The deadline did the political work no architecture diagram ever could: nothing aligns a leadership team like the threat of your production database being deleted on a date certain. Buy-in was instant. Headcount appeared. “No” to new features became an acceptable sentence.

It was not clean to the end. Two days before go-live, somebody asked who had our SSL certificate. The answer was a silence, then “I thought you had it.” We sorted it out. We shipped. The phones in the field never noticed the ground had moved.

So I’m still on Team Refactor. I just no longer think that puts me on the opposite side of a rewrite. Sometimes the fastest way to earn the right to refactor is to rewrite the foundation first. And if a vendor hands you a shutdown date, take it. The best rewrite is the one that leaves you with a codebase you can keep refactoring for years.
