09 March 2016

Active Record is great. Being able to split up large SQL queries into meaningful bits, and traversing relations with ease is invaluable. Yet some discipline remains needed. Using eager loading to avoid N+1 queries, to name something well-known to any Rails developer who has some real-world experience.

For smaller projects, this is usually fine: the controller does the preloading, all set. But when it grows, and more and more functionality is moved to concerns and other places, it’s not so straightforward anymore which associations are required. Now the controller needs to know all tables that are referenced from all methods called down the line.

ActiveRecord’s over-eagerness

Adding an includes to a relation always does a database call, you know.

To show this, consider a product model with nutritional info:

class Product < ActiveRecord::Base
  has_many :product_nutrients, dependent: :destroy, inverse_of: :product
  has_many :nutrients, through: :product_nutrients
end

class ProductNutrient < ActiveRecord::Base
  belongs_to :product
  belongs_to :nutrient
  belongs_to :nutrient_unit
  validates :product, presence: true
  validates :nutrient, presence: true
  validates :nutrient_unit, presence: true
  validates :value, presence: true, numericality: { greater_than_or_equal_to: 0 }
end

class Nutrient < ActiveRecord::Base
  has_many :product_nutrients, dependent: :restrict_with_error
  has_many :products, through: :product_nutrients
  validates :name, presence: true, uniqueness: { case_sensitive: false }
end

class NutrientUnit < ActiveRecord::Base
  has_many :product_nutrients, dependent: :restrict_with_error
  validates :name, presence: true, uniqueness: { case_sensitive: false }
end

ActiveRecord understands that referencing the Product.product_nutrients association twice yields the same result, and does just one database call. But if we add includes(:nutrient), it’s different.

p = Product.find(...);
# Product Load (1.4ms)  SELECT  "products".* FROM "products" WHERE "products"."id" = $1 LIMIT 1

p.product_nutrients.to_a; p.product_nutrients.to_a;
# ProductNutrient Load (1.0ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" = $1

p.product_nutrients.includes(:nutrient).to_a; p.product_nutrients.includes(:nutrient).to_a;
# ProductNutrient Load (0.5ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" = $1
# Nutrient Load (0.4ms)  SELECT "nutrients".* FROM "nutrients" WHERE "nutrients"."id" IN (...)
# ProductNutrient Load (0.3ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" = $1
# Nutrient Load (0.3ms)  SELECT "nutrients".* FROM "nutrients" WHERE "nutrients"."id" IN (...)

There’s probably some good reason for this behaviour, but it’s pretty annoying.

A workaround

In the product model, we have a method to check if all basic ingredients are there. If not, the data we have is most probably incorrect - in all but some corner-cases manufacturers need to specify saturated fat, energy and salt or natrium. And on some places this has consequences for our data pipeline.

class Product
  def has_basic_nutrients?
    codes = product_nutrients.map {|pn| pn.nutrient.code }.compact.uniq
    codes.include?('energy') &&
      codes.include?('fat_saturated') &&
      (codes.include?('salt') || codes.include?('natrium'))
  end
end

Sounds easy enough, right? But for one thing: all methods that call this method, need to know to eager load both product_nutrients and nutrients. That’s inconvenient, since in many cases this is only called for a single product, and #has_basic_nutrients? could just eager load it by itself when needed.

Single-level

This works by default. For querying multiple products, you can preload explicitely, while for a single product, ActiveRecord already loads the whole array at once. Great!

But that’s not enough for our case: we don’t only look at product_nutrients, but also its association nutrient.

Dual-level

That’s where ActiveRecord breaks down, unfortunately. So make it somewhat managable in our codebase (to avoid needing eager loading in all code-paths to Product#has_basic_nutrients?), we check whether the association is loaded, as well as all nutrients are loaded.

class Product
  def has_basic_nutrients?
    # hack to avoid all callers needing to know about preloading
    scope = product_nutrients
    if !scope.loaded? || scope.any? {|pn| !pn.association(:nutrient).loaded? }
      scope = scope.includes(:nutrient)
    end

    codes = scope.map {|pn| pn.nutrient.code }.uniq
    codes.include?('energy') &&
      codes.include?('fat_saturated') &&
      (codes.include?('salt') || codes.include?('natrium'))
  end
end

If product_nutrients aren’t loaded at all, we need to load them anyway - and we just as well include the nutrients too. If product_nutrients are loaded, we need to check if nutrients are loaded. This is needed for each product_nutrient, and I’m afraid we need to iterate them all (assuming database calls are more expensive then this).

While ActiveRecord relations have a #loaded? method, this is not the case for a belongs_to or has_one relation, and so we use the association method instead.

And does it work?

p = Product.includes(:product_nutrients => :nutrient).find(...)
# Product Load (0.7ms)  SELECT  "products".* FROM "products" WHERE "products"."id" = $1 LIMIT 1
# ProductNutrient Load (0.6ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" IN (.)
# Nutrient Load (0.3ms)  SELECT "nutrients".* FROM "nutrients" WHERE "nutrients"."id" IN (...)
p.has_basic_nutrients?
=> false

p = Product.find(...)
# Product Load (1.4ms)  SELECT  "products".* FROM "products" WHERE "products"."id" = $1 LIMIT 1
p.has_basic_nutrients?
# ProductNutrient Load (0.8ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" = $1
# Nutrient Load (0.4ms)  SELECT "nutrients".* FROM "nutrients" WHERE "nutrients"."id" IN (...)
=> false

Yes, it eager loads correctly, both when specified explicitely at product-load and when that is omitted.

Unfortunately, calling the method twice without eager loading does not keep the data :(

p.has_basic_nutrients?
# ProductNutrient Load (0.8ms)  SELECT "product_nutrients".* FROM "product_nutrients" WHERE "product_nutrients"."product_id" = $1
# Nutrient Load (0.4ms)  SELECT "nutrients".* FROM "nutrients" WHERE "nutrients"."id" IN (...)
=> false

But for now, we’re reasonably happy.