Active Record batching lets us iterate over a large number of records in batches.
Processing records in smaller chunks avoids loading every record into memory at once,
which can cause memory issues on large tables.
Before
Let’s create a sample table called “cards” that has only an ID column.
Next, we populate the table with 1 million records
and time how long it takes to iterate over all of them.
Now let’s select all the records and count the total.
For each batch, the in_batches method first fetches the required IDs
and then constructs an IN query that lists them.
This is an expensive operation
that takes a long time to complete,
especially when iterating over whole tables.
The query above has actually pulled every ID from 1 to 1,000,000
just to perform a simple select operation.
The same strategy is used for batched update
and delete queries.
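To make the cost concrete, here is a minimal plain-Ruby sketch of the shape of the queries the IN strategy produces. It is a simplified model rather than the actual Rails source: the in_strategy_queries helper is hypothetical, the array of integers stands in for the IDs a real table would return, and the SQL text is only illustrative.

```ruby
# Simplified model of the IN-based batching strategy (not the Rails source).
# `ids` stands in for the primary keys that would be fetched from the table.
def in_strategy_queries(ids, batch_size:)
  ids.sort.each_slice(batch_size).map do |batch_ids|
    # Every ID in the batch is fetched first, then repeated in the query text.
    "SELECT * FROM cards WHERE id IN (#{batch_ids.join(', ')})"
  end
end

in_queries = in_strategy_queries((1..10).to_a, batch_size: 5)
puts in_queries
```

With a batch size of 1,000, each generated query would carry 1,000 literal IDs, and every one of those IDs has to be read from the database before the query can be built.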
After
Thanks to this PR, the batch query strategy
now uses a range-based approach.
The new strategy uses an id >= x AND id <= y
query to select the records within a range.
This is far more efficient and saves significant time when iterating over large tables.
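The range-based query shape can be sketched the same way. Again, this is a simplified plain-Ruby model and not the Rails implementation: the range_strategy_queries helper is hypothetical, and the array of integers stands in for the table's IDs.

```ruby
# Simplified model of the range-based batching strategy (not the Rails source).
# Only the first and last ID of each batch appear in the generated query.
def range_strategy_queries(ids, batch_size:)
  ids.sort.each_slice(batch_size).map do |batch_ids|
    "SELECT * FROM cards WHERE id >= #{batch_ids.first} AND id <= #{batch_ids.last}"
  end
end

range_queries = range_strategy_queries((1..10).to_a, batch_size: 5)
puts range_queries
```

The query text stays the same size no matter how large the batch is, because only the two boundary IDs are embedded in it.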
However, this strategy is not suitable for all use cases.
For example, if we need to query for a small number of records in a large dataset,
each range can span many rows that do not match the query, so the range-based strategy
is less efficient than just pulling the required records using the IN strategy.
For such cases, we can turn off the new strategy by passing use_ranges: false to the in_batches query.
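As a rough sketch of what opting out could look like, the helper below passes use_ranges: false to in_batches. The count_with_in_strategy name and the Card model in the usage comment are assumptions made for illustration; the helper only expects an object that responds to in_batches the way an Active Record relation does, in a Rails version that includes this change.

```ruby
# Hypothetical helper: counts records batch by batch with the range
# strategy turned off, so each batch is selected with an IN query.
# `relation` is expected to behave like an Active Record relation.
def count_with_in_strategy(relation, batch_size: 1000)
  total = 0
  relation.in_batches(of: batch_size, use_ranges: false) do |batch|
    total += batch.count
  end
  total
end

# Usage inside a Rails app (Card is an assumed model for the cards table):
#   count_with_in_strategy(Card.where(id: [10, 500_000, 999_999]))
```

Keeping the option in one small helper makes it easy to switch back to the range strategy later if the query starts matching most of the table.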