Improving Database performance and overcoming common N+1 issues in Active Record using includes, preload, eager_load, pluck, select, exists?


The performance of our Rails application depends on many variables, and one of those variables is the number of queries executed for completing an action. The less the number of calls to the database, the less the memory allocations and subsequently less the duration to complete the operation.

One such problem is the N + 1 query problem. If we have two tables projects and commits and we load two projects and subsequently all their commits it makes 1 query to fetch the project and N queries to fetch the commits for each project. Since addition is commutative we can write 1 + N as N + 1.

# projects = Project.where(id: [1, 2])
> SELECT "projects".* FROM "projects" WHERE "projects"."id" IN (1, 2)

# projects.map { |project| project.commits }
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" = $1  [["project_id", 1]]
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" = $1  [["project_id", 2]]
....

The general way to solve N + 1 is to use includes to eager load the associations.

How does includes work

includes is a shorthand for preload and eager_load.

preload initiates two queries, the first to fetch the primary model and the second to fetch associated models whereas eager_load does a left join which initiates one query to fetch both primary and associated models.

preload is much better than eager_load in terms of memory usage. It’s the default strategy used by Active Record unless we force it to use the other strategy used by eager_load. The reason is documented in the ActiveRecord::Associations::Preloader class.

The old documentation for Rails 4.1.7 over at rubydoc is even better.

Suppose that you have an Author model with columns ‘name’ and ‘age’, and a Book model with columns ‘name’ and ‘sales’. Using this strategy, Active Record would try to retrieve all data for an author and all of its books via a single query:

SELECT * FROM authors
LEFT OUTER JOIN books ON authors.id = books.author_id
WHERE authors.name = 'Ken Akamatsu'

However, this could result in many rows that contain redundant data. After having received the first row, we already have enough data to instantiate the Author object. In all subsequent rows, only the data for the joined ‘books’ table is useful, the joined ‘authors’ data is just redundant, and processing this redundant data takes memory and CPU time. The problem quickly becomes worse and worse as the level of eager loading increases (i.e., if Active Record is to eager load the associations’ associations as well).

How does preload work

It loops over all the associations to preload and makes a query per association.

# Project.preload(:commits)
> Project Load (1.8ms)  SELECT "projects".* FROM "projects"
> Commit Load (128.3ms)  SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2) [["project_id", 1], ["project_id", 2]]

How does eager_load work

eager_load uses LEFT OUTER JOIN for eager loading the records rather than to preserve all rows on the left side of the join, regardless of a match on the right side.

It’s implemented inside of join dependency where the result returned will only contain one row for each record from the left table.

# Project.eager_load(:commits)
> SELECT "projects"."id" AS t0_r0, "projects"."name" AS t0_r1, "projects"."org" AS t0_r2, "projects"."connected_branch" AS t0_r3,
  "projects"."enabled_by" AS t0_r4, "projects"."created_at" AS t0_r5, "projects"."updated_at" AS t0_r6, "projects"."permalink" AS t0_r7,
  "projects"."domain" AS t0_r8, "commits"."id" AS t1_r0, "commits"."author" AS t1_r1, "commits"."committer" AS t1_r2, "commits"."message"
  AS t1_r3, "commits"."sha" AS t1_r4, "commits"."parents" AS t1_r5, "commits"."project_id" AS t1_r6, "commits"."created_at" AS t1_r7,
  "commits"."updated_at" AS t1_r8, "commits"."status" AS t1_r9, "commits"."committed_at" AS t1_r10 FROM "projects"
  LEFT OUTER JOIN "commits" ON "commits"."project_id" = "projects"."id"

# Project.includes(:commits).references(:commits).count
# using includes with references also triggers eager_load
> ... same query as above ...

You may notice that the DISTINCT keyword is automatically added when doing the count operation combined with eager_load.

The DISTINCT is added by Active Record when it detects eager loading and the SQL operation to be count.


# Project.eager_load(:commits).count
# notice 'DISTINCT' has been automatically added
  SELECT COUNT(DISTINCT "projects"."id") FROM "projects" LEFT OUTER JOIN "commits" ON "commits"."project_id" = "projects"."id"
> 2

Eager loading subset of association

Sometimes we just want to eager load a subset of data, like all commits in the queued status. Below we explain how to do just that in Chaining methods to the scope and Scope cannot be eager loaded, convert them to associations

Eager loading with dynamic condition

Following on from the previous section, what if we want to eager load a subset of data where the filtering criteria is decided dynamically. For example, eager loading all the commits of the currently logged in user. We will need to use the undocumented API provided by ActiveRecord::Associations::Preloader.

# projects = Project.all.to_a
> SELECT "projects".* FROM "projects"

# ActiveRecord::Associations::Preloader.new.preload(
  projects,
  :commits,
  Commit.where("author ->> 'email' = ?", current_user.email)
  )
> SELECT "commits".* FROM "commits" WHERE (author ->> 'email' = > 'current_user_email')

Note: This feature is currently broken in Rails version 6

Eager loading aggregate queries

Rails provides a built-in solution in the form of counter cache which solves the problem for the COUNT aggregate function. There are gems like activerecord-precounter if we don’t wish to add and maintain a counter cache.

For other types of aggregate functions like AVG, SUM we can use the eager_group gem.

We can also solve the problem without adding any extra dependencies.

# projects.map {|project| project.commits.count}
> SELECT COUNT(*) FROM "commits" WHERE "commits"."project_id" = $1  [["project_id", 1]]
> SELECT COUNT(*) FROM "commits" WHERE "commits"."project_id" = $1  [["project_id", 2]]

# Commit.where(project_id: projects).group(:project_id).count
> SELECT COUNT(*) AS count_all, "commits"."project_id" AS
  commits_project_id FROM "commits" WHERE "commits"."project_id"
  IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN ($1, $2))
  GROUP BY "commits"."project_id"  [["id", 1], ["id", 2]]

# Commit.where(project_id: projects).group(:project_id).sum(:comments)
> SELECT SUM(comments) AS sum_comments, "commits"."project_id" AS
  commits_project_id FROM "commits" WHERE "commits"."project_id"
  IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN ($1, $2))
  GROUP BY "commits"."project_id"  [["id", 1], ["id", 2]]

Common mistakes with eager loading

Association has been eager loaded on the wrong owner

Let’s take an example.

class Project < ApplicationRecord
  has_many :commits
  has_many :posts, through: :commits
end

class Commit < ApplicationRecord
  belongs_to :project
  has_one :post
end

class Post < ApplicationRecord
  belongs_to :commit
end

If we eager load posts on the project but call post on the commit, it makes a query to the database. Since the posts ( target ) have been loaded on the project ( owner ), we can only fetch the loaded posts on the project.

# projects = Project.includes(:commits, :posts)
> SELECT "projects".* FROM "projects"
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2, ...)
> SELECT "posts".* FROM "posts" WHERE "posts"."commit_id" IN ($1, $2, ...)

# projects.first.commits.first.post
>

If we want to call post on an individual commit, we have to eager load it on the commit.

# Project.includes(commits: :post)
> ...

# projects.first.commits.first.post

Using pluck instead of select

Let’s take an example, if we want to load all the commits for projects under a given organization.

# Commit.where(project_id: Project.where(org: 'rails').pluck(:id))
> SELECT "projects"."id" FROM "projects" WHERE "projects"."org" = $1  [["org", "rails"]]
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2)  [["project_id", 1], ["project_id", 2]]

The above code makes two queries, first to select the id of the projects and next to fetch the commits. If pluck is replaced by select, only 1 query is made with the help of a subquery.

# Commit.where(project_id: Project.where(org: 'rails').select(:id))
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."org" = $1)  [["org", "rails"]]

Note: Use pluck when you want to select a subset of values rather than the entire row.

Chaining methods to the scope

If we want to load all posts for a project along with a subset of the loaded posts like only the ones which have been published we can perform the filtering in memory ( if the result set is small ) to avoid making two queries to the database

# projects.posts # 1 DB query
# projects.posts.published # 1 DB query

# will save 1 query to the database
# projects.posts.select {|p| p.status.eql?('published') }

Scope cannot be eager loaded, convert them to associations

class Commit < ApplicationRecord
  ...

  scope :queued, -> { where(status: :queued) }

  ...
end

# projects = Project.includes(:commits)

# makes a query to fetch queued commits
# projects.commits.queued

If we only want to eager load the queued commits ( a subset of commits ), we have to create a scoped association and eager load it instead of all the commits.

class Project < ApplicationRecord
  ...

  has_many :queued_commits, -> { where(status: :queued) }, class_name: 'Commit'

  ...
end

# projects = Project.includes(:queued_commits)
# projects.queued_commits

Calling exists? on preloaded data

exists? always makes a database query. If the data is already loaded, then we should use present? to check if the record is not blank. Alternatively, if we know that we would load the data after checking it exists, just load the record and check for presence afterwards.

Using size instead of count

count always makes a database query. If the data is already loaded, then we can call size to get the size of the association. Alternatively, we can always use size, it makes the database query only if the association is not loaded.

Summary

In this post, we were introduced to some of the query methods on Active Record that we can leverage to overcome common N+1 issues. Considering different scenarios we can provide hints to Rails on how to reduce the number of objects loaded or queries performed. We will follow up on more updates for speed improvements in a Rails Application.