
Modern computing is fast

There's a bug I fixed that I was reminded of recently. It was a small bug – a one-line fix – and it didn't take too long to track down, but it left an impression of just how fast modern computing is.

We had some code in an (old) request handler for when a user updates their profile, like this:

user.profile.update( profile_fields )
user.update( user_fields )

We had some other code in another part of the system (asynchronous jobs) that listened for profile updates, and would send an email:

on 'profile.update' do
  send_email( 'profile_update', user: user, profile: user.profile )
end
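I've elided the plumbing here. For a rough sketch of what a consumer like this can look like using the bunny gem – where the exchange and queue names, User, and send_email are illustrative stand-ins, not our actual code – see below:

require 'bunny'
require 'json'

conn = Bunny.new   # assumes a local RabbitMQ; connection options omitted
conn.start

channel  = conn.create_channel
exchange = channel.topic('app.events')            # illustrative exchange name
queue    = channel.queue('profile_update_emails') # illustrative queue name
queue.bind(exchange, routing_key: 'profile.update')

queue.subscribe(block: true) do |_delivery, _props, body|
  event = JSON.parse(body)
  # Re-query Postgres for the user. This read sees only *committed* rows,
  # which is exactly why the timing of the commit matters.
  user = User[event['user_id']]  # User and send_email stand in for real code
  send_email('profile_update', user: user, profile: user.profile)
end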

We upgraded our RabbitMQ boxes to more powerful machines, and then the email would intermittently have updated profile information but outdated user information.

You may be able to intuit the bug even from this totally dumbed-down example. The asynchronous job's 'profile.update' event handler was running between the two update calls. There are four virtual machines involved here, so I'll illustrate what each of them is doing and where the race condition existed.

The four machines are the Request Handler (Ruby), Postgres, RabbitMQ, and the Async Job (Ruby). The original flow:

1. Request Handler: generate profile row update SQL.
2. Request Handler -> Postgres: send profile row update SQL.
3. Postgres: write changes to profile row.
4. Postgres -> Request Handler: send reply.
5. Request Handler: receive reply.
6. Request Handler -> RabbitMQ: publish profile update event.
7. RabbitMQ: write event.
8. RabbitMQ -> Request Handler: send receipt.
9. Request Handler: receive receipt.
10. Request Handler: generate user row update SQL. In parallel, RabbitMQ: send event to the Async Job for consumption.
11. Async Job: parse and route event.
12. Async Job: generate user query SQL.
13. Async Job -> Postgres: send user row query SQL.
14. Postgres: query user row.
15. Postgres -> Async Job: send reply (the not-yet-updated user row).
16. Async Job: receive reply.
17. Request Handler -> Postgres: send user row update SQL; Postgres: write changes to user row.

The race condition is that the Async Job code was running before the user.update call on the Request Handler was committed to the database, so the Async Job had an "inconsistent" view of the data: a fresh profile row alongside a stale user row. Before the upgrade, the RabbitMQ machines took long enough on 'send event for consumption' (step 10) that the Request Handler's 'send user row update SQL' (step 17) completed before the Async Job's 'generate user query SQL' (step 12). As often happens, a small speedup, negligible in the context of the application, swapped the order of those two steps and exposed the race condition.
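You don't need four virtual machines to see the shape of this bug. Here's a toy single-process sketch: threads stand in for the machines, hashes for the database rows, and a sleep for the network hop whose speedup exposed the race.

user    = { name: 'old name' }
profile = { bio:  'old bio'  }
events  = Queue.new

async_job = Thread.new do
  events.pop  # block until the 'profile.update' event arrives
  # Reads whatever is visible right now: a fresh profile but a stale user.
  puts "email rendered with: #{user[:name]} / #{profile[:bio]}"
end

profile[:bio] = 'new bio'   # user.profile.update( profile_fields )
events << :profile_update   # publish the event; the fast broker delivers it immediately
sleep 0.01                  # the async job wins the race during this gap
user[:name]  = 'new name'   # user.update( user_fields ) -- too late for the email
async_job.join
# => email rendered with: old name / new bio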

The fix was simple: wrap the updates in a transaction, which causes our event publishing to be deferred until the transaction commits:

user.db.transaction do
  user.profile.update( profile_fields )
  user.update( user_fields )
end
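I won't reproduce our publishing library here, but the underlying technique is to register the publish as an after-commit hook. A minimal sketch of that technique, assuming Sequel (EventBus and publish_event are illustrative names, not our actual API):

require 'sequel'

# Illustrative stand-in for a real message-bus client.
module EventBus
  def self.publish(name, payload)
    puts "publishing #{name}: #{payload.inspect}"
  end
end

# Publish immediately outside a transaction; otherwise defer until after
# COMMIT, so consumers can never observe half-finished updates.
def publish_event(db, name, payload)
  if db.in_transaction?
    db.after_commit { EventBus.publish(name, payload) }
  else
    EventBus.publish(name, payload)
  end
end

DB = Sequel.sqlite  # in-memory database, just for the sketch
DB.transaction do
  # ... the profile and user row updates would happen here ...
  publish_event(DB, 'profile.update', user_id: 42)
  puts 'still inside the transaction; nothing published yet'
end
# => the event is published only once the COMMIT above succeeds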

Notice that the events are only published after all DB work has happened:

1. Request Handler -> Postgres: begin transaction.
2. Postgres: begin transaction.
3. Postgres -> Request Handler: send reply.
4. Request Handler: receive reply.
5. Request Handler: generate profile row update SQL.
6. Request Handler -> Postgres: send profile row update SQL.
7. Postgres: write changes to profile row.
8. Postgres -> Request Handler: send reply.
9. Request Handler: receive reply.
10. Request Handler: generate user row update SQL.
11. Request Handler -> Postgres: send user row update SQL.
12. Postgres: write changes to user row.
13. Postgres -> Request Handler: send reply.
14. Request Handler -> Postgres: commit transaction.
15. Postgres: commit transaction.
16. Postgres -> Request Handler: send reply.
17. Request Handler: receive reply.
18. Request Handler -> RabbitMQ: publish events.
19. RabbitMQ: write events.

Now when the email is sent, all updates have been committed to the database.

The thought-provoking part, to me, is the original bug flow. An incredible number of things happened between RabbitMQ sending the 'profile.update' receipt and the user row update SQL being written to the database. In the time it takes to run what seems like a trivial line of code (updating another database row), an event is consumed by a RabbitMQ consumer, Ruby code deserializes and routes it, and a SELECT query is sent to and executed by Postgres. Even with a relatively good knowledge of all the moving parts, I was surprised by how quickly modern computing and infrastructure can send bits flying from one place to another. It's nice to be reminded how large, complex, and, for the most part, well-behaved and performant modern computing is.
