The performance of our Rails application depends on many variables, and one of those variables is the number of queries executed for completing an action. The less the number of calls to the database, the less the memory allocations and subsequently less the duration to complete the operation.
One such problem is the N + 1 query problem.
If we have two tables projects
and commits
and we load two projects and subsequently all their commits it makes 1 query
to fetch the project and N queries to fetch the commits for each project.
Since addition is
commutative
we can write 1 + N as N + 1.
# projects = Project.where(id: [1, 2])
> SELECT "projects".* FROM "projects" WHERE "projects"."id" IN (1, 2)
# projects.map { |project| project.commits }
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" = $1 [["project_id", 1]]
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" = $1 [["project_id", 2]]
....
The general way to solve N + 1 is to use includes
to eager load the associations.
How does includes
work
includes
is a shorthand for preload
and eager_load
.
preload
initiates two queries, the first to fetch the primary model
and the second to fetch associated models whereas
eager_load
does a left join which initiates one query to fetch both
primary and associated models.
preload
is much better than eager_load
in terms of memory usage.
It’s the default strategy used by Active Record unless we force it to use
the other strategy used by eager_load
.
The reason is documented in the
ActiveRecord::Associations::Preloader
class.
The old documentation for Rails 4.1.7 over at rubydoc is even better.
Suppose that you have an Author model with columns ‘name’ and ‘age’, and a Book model with columns ‘name’ and ‘sales’. Using this strategy, Active Record would try to retrieve all data for an author and all of its books via a single query:
SELECT * FROM authors
LEFT OUTER JOIN books ON authors.id = books.author_id
WHERE authors.name = 'Ken Akamatsu'
However, this could result in many rows that contain redundant data. After having received the first row, we already have enough data to instantiate the Author object. In all subsequent rows, only the data for the joined ‘books’ table is useful, the joined ‘authors’ data is just redundant, and processing this redundant data takes memory and CPU time. The problem quickly becomes worse and worse as the level of eager loading increases (i.e., if Active Record is to eager load the associations’ associations as well).
How does preload
work
It loops over all the associations to preload and makes a query per association.
# Project.preload(:commits)
> Project Load (1.8ms) SELECT "projects".* FROM "projects"
> Commit Load (128.3ms) SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2) [["project_id", 1], ["project_id", 2]]
How does eager_load
work
eager_load
uses LEFT OUTER JOIN for eager loading the records rather than to
preserve all rows on the left side of the join, regardless of a match on the right side.
It’s implemented inside of join dependency where the result returned will only contain one row for each record from the left table.
# Project.eager_load(:commits)
> SELECT "projects"."id" AS t0_r0, "projects"."name" AS t0_r1, "projects"."org" AS t0_r2, "projects"."connected_branch" AS t0_r3,
"projects"."enabled_by" AS t0_r4, "projects"."created_at" AS t0_r5, "projects"."updated_at" AS t0_r6, "projects"."permalink" AS t0_r7,
"projects"."domain" AS t0_r8, "commits"."id" AS t1_r0, "commits"."author" AS t1_r1, "commits"."committer" AS t1_r2, "commits"."message"
AS t1_r3, "commits"."sha" AS t1_r4, "commits"."parents" AS t1_r5, "commits"."project_id" AS t1_r6, "commits"."created_at" AS t1_r7,
"commits"."updated_at" AS t1_r8, "commits"."status" AS t1_r9, "commits"."committed_at" AS t1_r10 FROM "projects"
LEFT OUTER JOIN "commits" ON "commits"."project_id" = "projects"."id"
# Project.includes(:commits).references(:commits).count
# using includes with references also triggers eager_load
> ... same query as above ...
You may notice that the DISTINCT
keyword is automatically added
when doing the count
operation combined with eager_load
.
The DISTINCT
is
added
by Active Record when it detects eager loading
and the SQL operation to be count.
# Project.eager_load(:commits).count
# notice 'DISTINCT' has been automatically added
SELECT COUNT(DISTINCT "projects"."id") FROM "projects" LEFT OUTER JOIN "commits" ON "commits"."project_id" = "projects"."id"
> 2
Eager loading subset of association
Sometimes we just want to eager load a subset of data, like all commits in the queued status. Below we explain how to do just that in Chaining methods to the scope and Scope cannot be eager loaded, convert them to associations
Eager loading with dynamic condition
Following on from the previous section, what if we want to eager load
a subset of data where the filtering criteria is decided dynamically.
For example, eager loading all the commits of the currently logged in user.
We will need to use the undocumented API provided by
ActiveRecord::Associations::Preloader
.
# projects = Project.all.to_a
> SELECT "projects".* FROM "projects"
# ActiveRecord::Associations::Preloader.new.preload(
projects,
:commits,
Commit.where("author ->> 'email' = ?", current_user.email)
)
> SELECT "commits".* FROM "commits" WHERE (author ->> 'email' = > 'current_user_email')
Note: This feature is currently broken in Rails version 6
Eager loading aggregate queries
Rails provides a built-in solution in the form of counter cache
which solves the problem for the COUNT
aggregate function.
There are gems like
activerecord-precounter
if we don’t wish to add and maintain a counter cache.
For other types of aggregate functions like AVG
, SUM
we can use the
eager_group gem.
We can also solve the problem without adding any extra dependencies.
# projects.map {|project| project.commits.count}
> SELECT COUNT(*) FROM "commits" WHERE "commits"."project_id" = $1 [["project_id", 1]]
> SELECT COUNT(*) FROM "commits" WHERE "commits"."project_id" = $1 [["project_id", 2]]
# Commit.where(project_id: projects).group(:project_id).count
> SELECT COUNT(*) AS count_all, "commits"."project_id" AS
commits_project_id FROM "commits" WHERE "commits"."project_id"
IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN ($1, $2))
GROUP BY "commits"."project_id" [["id", 1], ["id", 2]]
# Commit.where(project_id: projects).group(:project_id).sum(:comments)
> SELECT SUM(comments) AS sum_comments, "commits"."project_id" AS
commits_project_id FROM "commits" WHERE "commits"."project_id"
IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN ($1, $2))
GROUP BY "commits"."project_id" [["id", 1], ["id", 2]]
Common mistakes with eager loading
Association has been eager loaded on the wrong owner
Let’s take an example.
class Project < ApplicationRecord
has_many :commits
has_many :posts, through: :commits
end
class Commit < ApplicationRecord
belongs_to :project
has_one :post
end
class Post < ApplicationRecord
belongs_to :commit
end
If we eager load posts on the project but call post on the commit, it makes a query to the database. Since the posts ( target ) have been loaded on the project ( owner ), we can only fetch the loaded posts on the project.
# projects = Project.includes(:commits, :posts)
> SELECT "projects".* FROM "projects"
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2, ...)
> SELECT "posts".* FROM "posts" WHERE "posts"."commit_id" IN ($1, $2, ...)
# projects.first.commits.first.post
>
If we want to call post on an individual commit, we have to eager load it on the commit.
# Project.includes(commits: :post)
> ...
# projects.first.commits.first.post
Using pluck
instead of select
Let’s take an example, if we want to load all the commits for projects under a given organization.
# Commit.where(project_id: Project.where(org: 'rails').pluck(:id))
> SELECT "projects"."id" FROM "projects" WHERE "projects"."org" = $1 [["org", "rails"]]
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN ($1, $2) [["project_id", 1], ["project_id", 2]]
The above code makes two queries, first to select the id
of the projects
and next to fetch the commits.
If pluck is replaced by select
, only 1 query is made with the help of a subquery.
# Commit.where(project_id: Project.where(org: 'rails').select(:id))
> SELECT "commits".* FROM "commits" WHERE "commits"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."org" = $1) [["org", "rails"]]
Note: Use pluck
when you want to select a subset of values rather than the
entire row.
Chaining methods to the scope
If we want to load all posts
for a project
along with a subset
of the loaded posts
like only the ones which have been published
we can perform the filtering in memory ( if the result set is small )
to avoid making two queries to the database
# projects.posts # 1 DB query
# projects.posts.published # 1 DB query
# will save 1 query to the database
# projects.posts.select {|p| p.status.eql?('published') }
Scope cannot be eager loaded, convert them to associations
class Commit < ApplicationRecord
...
scope :queued, -> { where(status: :queued) }
...
end
# projects = Project.includes(:commits)
# makes a query to fetch queued commits
# projects.commits.queued
If we only want to eager load the queued commits ( a subset of commits ), we have to create a scoped association and eager load it instead of all the commits.
class Project < ApplicationRecord
...
has_many :queued_commits, -> { where(status: :queued) }, class_name: 'Commit'
...
end
# projects = Project.includes(:queued_commits)
# projects.queued_commits
Calling exists?
on preloaded data
exists?
always makes a database query. If the data is already loaded,
then we should use present?
to check if the record is not blank.
Alternatively, if we know that we would load the data after checking it exists,
just load the record and check for presence afterwards.
Using size instead of count
count
always makes a database query. If the data is already loaded,
then we can call size
to get the size of the association.
Alternatively, we can always use size
, it makes the database query
only if the association is not loaded.
Summary
In this post, we were introduced to some of the query methods on Active Record that we can leverage to overcome common N+1 issues. Considering different scenarios we can provide hints to Rails on how to reduce the number of objects loaded or queries performed. We will follow up on more updates for speed improvements in a Rails Application.