Rails optimizes ActiveRecord batching for whole table iterations


Active Record batching allows us to iterate over a large number of records in batches. This is useful for processing records in smaller chunks rather than loading them all at once, which could otherwise exhaust memory.

Before

Let’s create a sample table called “cards” that has only an id column.
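A minimal schema for this setup might look like the following (a sketch assuming a standard Rails app; the migration class name and version are illustrative):

```ruby
# Hypothetical migration: creates a cards table with only the default
# bigint primary key column, backing a Card model.
class CreateCards < ActiveRecord::Migration[7.1]
  def change
    # create_table adds the id primary key by default; no other columns needed
    create_table :cards
  end
end
```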

Now we populate the table with 10 million records and time how long it takes to iterate over all of them.

playground=# INSERT INTO cards SELECT i FROM generate_series(1, 10000000) AS i;
INSERT 0 10000000

Now let’s iterate over all the records in batches and count the total.

irb(main):001:1* Benchmark.bm do |x|
irb(main):002:1*   count = 0
irb(main):003:2*   x.report do
irb(main):004:3*     Card.in_batches do |batch|
irb(main):005:3*       count += batch.count
irb(main):006:2*     end
irb(main):007:1*   end
irb(main):008:0> end
Card Pluck (2.0ms) SELECT "cards"."id" FROM "cards" ORDER BY "cards"."id" ASC LIMIT $1 [["LIMIT", 1000]]
Card Count (11.7ms) SELECT COUNT(*) FROM "cards" WHERE "cards"."id" IN ($1, $2, $3, ... $998, $999, $1000) [["id", 1], ["id", 2], ["id", 3], ...  ["id", 998], ["id", 999], ["id", 1000]]
...
...
...
user system total real
56.641315 2.102757 58.744072 (140.822185)
=> [#<Benchmark::Tms:0x00000001123523d0 @cstime=0.0, @cutime=0.0, @label="", @real=140.82218499993905, @stime=2.102757, @total=58.744071999999996, @utime=56.641315>]

The in_batches method first fetches all the required ids and then constructs an IN query for each batch. This is a very expensive operation and takes a long time to complete, especially when iterating over whole tables.
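The old behavior can be sketched in plain Ruby (a simplified illustration, not the actual Active Record internals; the in_clause_batches helper is made up for this example):

```ruby
# Simplified sketch of the IN-based strategy: take the ids fetched for a
# batch and embed every one of them in the WHERE clause.
# (Illustrative only, not the real Active Record implementation.)
def in_clause_batches(ids, batch_size: 1000)
  ids.each_slice(batch_size).map do |batch_ids|
    "SELECT COUNT(*) FROM cards WHERE id IN (#{batch_ids.join(', ')})"
  end
end

queries = in_clause_batches((1..6).to_a, batch_size: 3)
puts queries.first
# SELECT COUNT(*) FROM cards WHERE id IN (1, 2, 3)
```

With the default batch size of 1000 over 10 million rows, every batch query carries 1000 bind parameters, which is why the logs above are so noisy and the queries so slow.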

The queries above have actually pulled every id from 1 to 10,000,000 just to perform a simple count.

The same strategy is used for batch update and delete queries.

After

Thanks to this PR, the batch query strategy now uses a range-based approach.

irb(main):012:1* Benchmark.bm do |x|
irb(main):013:1*   count = 0
irb(main):014:2*   x.report do
irb(main):015:3*     Card.in_batches do |batch|
irb(main):016:3*       count += batch.count
irb(main):017:2*     end
irb(main):018:1*   end
irb(main):019:0> end
Card Pluck (1.8ms) SELECT "cards"."id" FROM "cards" ORDER BY "cards"."id" ASC LIMIT $1 [["LIMIT", 1000]]
Card Count (3.1ms) SELECT COUNT(*) FROM "cards" WHERE "cards"."id" <= $1 [["id", 1000]]
...
...
...
user system total real
7.335565 0.757182 8.092747 ( 26.716952)
=> [#<Benchmark::Tms:0x00000001148dbdb8 @cstime=0.0, @cutime=0.0, @label="", @real=26.71695200004615, @stime=0.7571819999999999, @total=8.092747, @utime=7.335564999999999>]

The new strategy uses an id > x AND id <= y query to select records within a range. This is far more efficient and saves significant time when iterating over large tables.
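The range-based behavior can be sketched the same way (again a simplified illustration, not the actual implementation; the range_batches helper is made up for this example):

```ruby
# Simplified sketch of the range-based strategy: each batch query only
# needs its two boundary values, no matter how many rows the batch spans.
# (Illustrative only, not the real Active Record implementation.)
def range_batches(min_id, max_id, batch_size: 1000)
  queries = []
  lower = min_id - 1
  while lower < max_id
    upper = [lower + batch_size, max_id].min
    queries << "SELECT COUNT(*) FROM cards WHERE id > #{lower} AND id <= #{upper}"
    lower = upper
  end
  queries
end

puts range_batches(1, 6, batch_size: 3)
# SELECT COUNT(*) FROM cards WHERE id > 0 AND id <= 3
# SELECT COUNT(*) FROM cards WHERE id > 3 AND id <= 6
```

Two bind values per batch instead of one per row, which matches the much shorter Card Count query in the log above.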

However, this strategy is not suitable for all use cases. For example, if we need to query a small number of records in a large dataset, the range-based strategy is less efficient than pulling just the required records with the IN strategy. For such cases, we can turn off the new strategy by passing use_ranges: false to in_batches.
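In a Rails 7.1+ app, that looks like the following (a sketch only; it assumes the Card model from above, and the status column is hypothetical):

```ruby
# Opt out of the range-based strategy and fall back to IN queries,
# e.g. when a selective WHERE clause matches only a few scattered rows.
Card.where(status: "expired").in_batches(use_ranges: false) do |batch|
  batch.delete_all
end
```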
