Looping through large and slow datasets in Ruby

by Martin Westin

I recently had the pleasure of needing to load 30,000 records from MongoDB and then perform slow, memory-intensive processing on them. Basically, you can imagine it as a database of videos: MongoDB held the metadata and other bits, but the actual video files were on disk somewhere. My parsing involved loading the entire video into memory and doing "stuff" with it as part of my model object. This is how I did it.
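For reference, here is a minimal sketch of the kind of model involved. The class name, the fields and do_heavy_processing are my hypothetical stand-ins, not the original code:

require "mongoid"

class Video
  include Mongoid::Document

  field :title,     type: String
  field :file_path, type: String # the actual video file lives on disk

  def do_heavy_processing
    data = File.binread(file_path) # loads the whole video into memory
    # ... do "stuff" with the raw bytes ...
    data.bytesize
  end
end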

Take 1

At first I just did the normal Model.all.each... This worked fine for smaller datasets, but on larger sets the whole thing would crash after 40 to 60 minutes (I never timed it in detail). The MongoDB cursor had timed out: I figured out that my ODM (Mongoid) was keeping an open cursor in MongoDB, fetching one document at a time from the DB... and after an hour or so the DB had had enough.
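In code it was roughly this, using the hypothetical Video model from above:

Video.all.each do |video|
  video.do_heavy_processing # the MongoDB cursor stays open the whole time
end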

Take 2

It was, of course, trivial to force Mongoid to load the whole dataset in one go using Model.all.to_a.each... Before thinking further I set this version going. It crashed a lot faster than the first version. The reason is that each of my objects stays in the array, and in memory, and adding anywhere from 5 to 500 MB of video data to each quickly ate all the RAM I had.
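Again a sketch, with the same hypothetical model:

Video.all.to_a.each do |video|
  video.do_heavy_processing # every processed object stays referenced by the array
end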

Take 3

A small and funky change fixed this, making my script both timeout- and RAM-proof. This is how I will start out next time I have a long-running task.

all = Model.all.to_a # these are just simple Rails models
while one = all.pop  # this is memory management
  one.do_heavy_processing # this loads in a ton of crap
end

By popping the records off one by one and re-using the local variable, each iteration drops the last reference to the previous object, so Ruby's GC takes pretty good care of keeping memory to a minimum.
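If even the bare model objects are too many to hold up front, a variation on the same idea (my own extension, not part of the original script) is to pop ids instead and reload each document fresh:

ids = Video.all.map(&:id) # hold only the ids, not the documents
while id = ids.pop
  Video.find(id).do_heavy_processing # the document is collectable after each pass
end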