Looping through large and slow datasets in Ruby

by Martin Westin


I recently had the pleasure of needing to load 30,000 records from MongoDB and then perform slow and memory-intensive processing on them. Basically, you can imagine it as a database of videos where MongoDB held the metadata and other bits but the actual video files were on disk somewhere. My parsing involved loading the entire video into memory and doing "stuff" with it as part of my model object. This is how I did it.

Take 1

At first I just did the normal Model.all.each... This worked fine for smaller datasets but on larger sets the whole thing would crash after 40 to 60 minutes (I never timed this in detail). MongoDB had timed out and I figured out that my ODM (Mongoid) was keeping an open iterator in MongoDB and fetching one document at a time from the DB... and after an hour or so the DB had had enough.

Take 2

It was of course trivial to force Mongoid to load the whole dataset in one go using Model.all.to_a.each... Before thinking further I set this version going. It crashed a lot faster than the first version. The reason is that each of my objects stays in the array, and in memory, and adding anywhere from 5 to 500 MB of video data to each quickly ate all the RAM I had.

Take 3

A small and funky change fixed this, making my script both time and RAM "proof". This is how I will start out next time I have a long-running task.

all = Model.all.to_a # these are just simple Rails models
while one = all.pop # this is memory management
    one.do_heavy_processing # this loads in a ton of crap
end

By popping them off one by one and re-using the local variable Ruby's GC takes pretty good care of keeping the memory to a minimum.
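For anyone who wants to try the pattern without MongoDB, here is a minimal self-contained sketch. The Record struct and its payload field are hypothetical stand-ins for the model and the video data it loads; they are assumptions for illustration and not part of my real script.

```ruby
# Hypothetical stand-in for a model; not the real Mongoid class.
Record = Struct.new(:id, :payload) do
  # Pretend to load a large file and keep it on the object,
  # the way my real models held on to their video data.
  def do_heavy_processing
    self.payload = "x" * 100_000
    payload.bytesize
  end
end

# Process every item while emptying the array, so that finished
# items are no longer referenced and can be garbage collected.
def drain(items)
  processed = 0
  while (item = items.pop) # pop returns nil once the array is empty
    item.do_heavy_processing
    processed += 1
  end
  processed
end

records = Array.new(100) { |i| Record.new(i) }
count = drain(records)
puts "processed #{count}, #{records.size} left in the array"
```

Note that pop takes items from the end of the array, so they are processed in reverse order. If the order matters, shift does the same job from the front.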


Transitioning to more secure passwords

by Martin Westin


With all the news of hacked databases (mostly at Sony) and the clear-text or poorly hashed passwords in their datasets, I thought I might offer my standard trick for transitioning to a more secure form of hashing. I think some sites don't change their password security for fear of annoying users or because of the workload involved in managing a transition. This simple technique is completely invisible to the user and very low maintenance for the developer.

I will be giving examples from the Devise library for Rails apps, since I recently implemented it there.

The technique is very very simple

You configure your authentication to check passwords against both the old and the new form of hashed password. And when you find a match for the old hash you update your database with the version of the password encoded using the new hash. You keep this dual check in place until all (or most likely most) of your users have logged in and had their passwords changed. The unlucky few can use your password recovery feature if you have one.

Metacode of the basic principle:

if new_hash(password) == stored_password
  # ALLOW LOGIN USING AN UP-TO-DATE PASS
elsif old_hash(password) == stored_password
  # UPDATE PASSWORD IN DB
  # ALLOW LOGIN
else
  # DISALLOW LOGIN
end
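The same principle as a small runnable Ruby sketch. This is only an illustration under loud assumptions: plain unsalted SHA-1 and SHA-256 stand in for the "old" and "new" schemes, and a Hash stands in for the database row. A real app would use its own salted hashers and a constant-time comparison.

```ruby
require "digest"

# Illustration only: unsalted digests stand in for the two schemes.
def old_hash(password)
  Digest::SHA1.hexdigest(password)
end

def new_hash(password)
  Digest::SHA256.hexdigest(password)
end

# `user` is a hypothetical Hash with a :stored_password key, standing in
# for a database row. Returns true when the login should be allowed.
# Plain == is used for brevity; it is not a constant-time comparison.
def login_allowed?(user, password)
  return true if new_hash(password) == user[:stored_password]

  if old_hash(password) == user[:stored_password]
    # The old hash matched: upgrade the stored value on the fly.
    user[:stored_password] = new_hash(password)
    return true
  end

  false
end

user = { :stored_password => old_hash("secret") }
puts login_allowed?(user, "wrong")   # false, nothing changes
puts login_allowed?(user, "secret")  # true, and the hash is upgraded
puts user[:stored_password].length   # 64 hex characters for SHA-256
```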

How to implement this transition in Devise

I implemented this by overriding the valid_password? method that Devise injects into your User model.

class User

  def valid_password?(incoming_password)
    result = super incoming_password
    if !result
      # try old encryptor during transition
      digest = Devise::Encryptors::LegacyEncryptor.digest(incoming_password, self.class.stretches, self.password_salt, self.class.pepper)
      result = Devise.secure_compare(digest, self.encrypted_password)
      if result
        # update password to use new encryptor when there is a match
        self.password = incoming_password
        self.save
      end
    end
    result
  end

end

Fairly simple. You may need to hard-code some parameters (salt, stretching, pepper) if they cause problems.

If you are changing from, say, sha1 to sha256, you can easily check the character lengths of the passwords in your database to check the "adoption rate" of the new hashes.
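A rough sketch of that length check, assuming the hashes are stored hex-encoded (a SHA-1 hex digest is 40 characters, a SHA-256 hex digest is 64). The adoption_rate helper and the sample array are hypothetical; in a real app the strings would come from the password column of your users table.

```ruby
# Length of a hex-encoded SHA-256 digest.
SHA256_HEX_LENGTH = 64

# `hashes` is assumed to be an array of stored password strings.
# Returns the fraction (0.0 to 1.0) that already use the new hash.
def adoption_rate(hashes)
  return 0.0 if hashes.empty?

  migrated = hashes.count { |hash| hash.length == SHA256_HEX_LENGTH }
  migrated.to_f / hashes.size
end

# Two 40-character (old) and two 64-character (new) dummy values.
stored = ["a" * 40, "b" * 64, "c" * 64, "d" * 40]
puts adoption_rate(stored) # 0.5
```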

Implications on Security

You should realize that you ARE lowering your security level slightly by effectively allowing two different password checks. In reality this problem is small and only really matters if you are transitioning from plain-text passwords (and you really shouldn't have those). The problem then becomes real, since I could log in using a stolen new (supposedly secure) hash as the given password. In this case I would definitely disallow any password of the same length as, or matching a simple regex for, your new hashing system to avoid this hole.

You will also not fully benefit from the new hashing system until you remove the "dual check" after a reasonable period of time.

If you can live with that to gain the benefits of a clean migration for you and your users, this is a nice technique. I know from reading and talking to developers that I am far from the only or the first one to come up with something like this. Many apps and sites have used and continue to use this kind of technique to beef up password hash strength without bothering users.


Graylog2 on Mac OS X

by Martin Westin


I have been playing with Graylog2 on my Mac today. Since the setup guides are all for Debian and not fully compatible with Mac OS X, I thought I'd mention the changes I needed to make to get things rolling smoothly. The guides are good, so go read them in the wikis on GitHub. I won't reiterate them, only point out the minor changes and tweaks I had to make.

Graylog2 comes in two main parts. The server and the web interface. I'll start with the server component.

Install The Server

https://github.com/Graylog2/graylog2-server/wiki/Installing

Mac OS X has Java bundled with the OS (for now). There is no need to install anything. The configuration file needs one non-obvious tweak.


mongodb_host = 127.0.0.1 # localhost

Java resolves localhost to the strangest thing. It tries to connect to the Bonjour name and external IP (e.g. Martin's Mac/192.168.0.2) instead of 127.0.0.1 which is what you want. Instead of opening MongoDB up to external access I changed the configuration to point to the loopback IP directly.

Starting The Server

https://github.com/Graylog2/graylog2-server/wiki/Starting-the-server

I didn't get the daemon script to start and did not investigate it, since I run Graylog2 for evaluation and development and like seeing the output. Starting by running the jar file requires that you sudo.


sudo java -jar graylog2-server.jar debug

That gets Graylog2 running and spitting out a lot of fun info so you know you are logging things as you expect.

Installing The Web Interface

https://github.com/Graylog2/graylog2-web-interface/wiki/Installing-the-web-interface-on-Debian-5.0

You can follow most of those steps if you don't have Rails, Bundler and that stuff installed. For testing and development, I would suggest running the interface using Passenger Standalone instead of Apache. And you install Passenger as a gem and not via apt, of course.

http://www.modrails.com/documentation/Users%20guide%20Standalone.html

The cool thing about installing passenger standalone is that it will compile and run itself the first time you call passenger start. It will take a few minutes that first time but after that it will start instantly.

Logging from your Rails app

https://github.com/Graylog2/graylog2_exceptions

In the Rails app I want to log from I installed Graylog2 Exceptions. It is a small Rack middleware with practically no configuration. Only problem is that it has not been updated to comply with the current version of the Graylog2 server. Until it is updated, you have to modify the source for it. A very small mod. For me it is ok as long as I am still on my Mac and not a server.

First:

> cd /to/my/app/dir
> bundle open graylog2_exceptions

This should get you the installed gem open in your editor. In the file lib/graylog2_exceptions.rb you need to add the version parameter to the notification message. Possibly this should be added to the gelf gem instead. I am not sure how that version string is supposed to be used.

Here is the modified method that does the actual notification:

  def send_to_graylog2 err
    begin
      notifier = GELF::Notifier.new(@args[:hostname], @args[:port])
      puts notifier.notify!(
        :version => "1.0", # <- this line is new!!!
        :short_message => err.message,
        :full_message => err.backtrace.join("\n"),
        :level => @args[:level],
        :host => @args[:local_app_name],
        :file => err.backtrace[0].split(":")[0],
        :line => err.backtrace[0].split(":")[1]
      )
    rescue => i_err
      puts "Graylog2 Exception logger. Could not send message: " + i_err.message
    end
  end

So, that is it. Finally I get all my exceptions in Graylog2. To try it out you can just raise some dummy exception – raise "Dummy Exception Error" – here and there and see them pop up in Graylog2.


Rails migration of indexes

by Martin Westin


A small gotcha when changing indexes in a migration. To change an index one has to first remove it and then add it again. Removing an index is the tricky part. The documentation states: remove_index(table_name, index_name): Removes the index specified by index_name.

This is not strictly true as it turns out. The docs should probably say: remove_index(table_name, column_name)

The crux is that one cannot use this syntax to remove a named index. Rails assumes the index is named something like "tablename_columnname_index" or something similar.

To remove a named index one has to use the block syntax afaik:

change_table :tablename do |t|
  t.remove_index :name => :indexname
  t.index ["columnname"], :name => "indexname", :unique => true
end

Offensively lazy web developers

by Martin Westin


There exist many websites and web applications with elements of poor usability, engineering, design and so on. Some exhibit features so poor I can only attribute them to laziness. Some actually make me feel offended that I have to jump through hoops to accommodate their laziness. Top of my list are form fields for postcodes, phone numbers, dates, times or any similar type of data. Simple numeric data. Easily validated and normalized. Yet I often see requirements to enter the data in a very specific format, often at odds with how humans customarily write such data. To most people (who are not developers) it can be very confusing to enter data in a perfectly normal way and have a computer tell them it is invalid. For me it is offensive since I know it is practically always the result of data not being normalized. A developer that does not normalize input data before validating it can only be described as lazy or, if you prefer, incompetent.

Postcodes

Nothing can be simpler, right? In many countries it is simply a few digits. Some throw a few alphabetic characters in there. I'll focus on local Swedish websites and Swedish postcodes. We have 5-digit postcodes. They are typically written "123 45". So why would any Swedish website validate that input and claim it is invalid? As a developer I know it is probably because a space is not a numeric character, and because adding it also makes the string 6 characters long and not 5.

Phone numbers

A phone number is a string of numeric characters. Anything else (spaces, parentheses, dashes and other chrome) is just that: chrome. None of that is part of the data. No server at AT&T, Telia, Vodafone or any other network carrier reads these things or needs them in order to route a call. Quite the opposite. Any phone network, and cellular networks in particular, requires a very specific internationally standardized format. Guess what? It is all numeric. A Swedish cellphone number might be typed "0701 - 23 45 67" in my address book but the phone sends 46701234567 to the network any time I make a call.

Dates and Times

These are a bit more complex than the above. But the same principles apply. Then again, most web-focused programming languages have functions to parse a myriad of ways one could type a date or time and create a proper date or time object or data type. If you still find it does not work for you then, in this case, you should provide something other than a blank text field. Try googling for datepicker or timepicker. The problem will be one of choosing your favorite rather than finding anything at all.
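As a small illustration of such parsing functions, here is a sketch using Date.parse from Ruby's standard library. The sample inputs are my own made-up examples of how a person might type the same date.

```ruby
require "date"

# A few ways a person might type the same date.
inputs = ["2011-06-01", "1 June 2011", "June 1, 2011", "20110601"]

# Date.parse guesses the format and returns a proper Date object.
dates = inputs.map { |text| Date.parse(text) }
dates.each { |date| puts date.iso8601 }
```

Ambiguous forms such as 01/06/2011 depend on regional convention, so for those it is safer to state the expected format explicitly with Date.strptime, or to use a picker.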

Being constructive

I have ranted a bit now so I thought I'd put my code where my mouth is. Since I am not allowed to share the phone number normalizer code I wrote at work, I thought I'd at least share the secret to all the normalizing tasks above... It's regex. Regular Expressions.

Getting rid of whitespace

The following will "match" whitespace characters. By substituting the matches with nothing you will simply remove all whitespace from a string. /\s/

Remove anything that is not a number

And the following will remove non digits if you substitute the matches with nothing. /\D/

A very very very simple example in Ruby


num = "(0)701 - 23 45 67".gsub(/\D/,'')

Seriously. That is all it takes to turn an offensive web form into a more humane one.
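To show normalize-then-validate end to end, here is a slightly longer sketch. It is not my work code. The helper names are made up, and the phone part assumes Swedish numbers: a leading 0 is the national prefix, 46 is the country code, and a leading 00 is the international dialing prefix.

```ruby
# Swedish postcode: strip whitespace first, then require exactly 5 digits.
# Returns the normalized string, or nil when the input is not valid.
def normalize_postcode(input)
  digits = input.to_s.gsub(/\s/, "")
  digits =~ /\A\d{5}\z/ ? digits : nil
end

# Swedish phone number to the all-numeric international form.
# Assumption: national numbers start with 0, the country code is 46.
def normalize_phone(input)
  digits = input.to_s.gsub(/\D/, "") # drop spaces, dashes, parentheses, +
  digits = digits.sub(/\A00/, "")    # drop an international 00 prefix
  digits.sub(/\A0/, "46")            # swap the national 0 for 46
end

p normalize_postcode("123 45")          # "12345"
p normalize_postcode("1234")            # nil
p normalize_phone("0701 - 23 45 67")    # "46701234567"
p normalize_phone("+46 701 23 45 67")   # "46701234567"
```

The point is the order of operations: clean up what the human typed first, and only then decide whether it is valid.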


SOAP with Attachments in Ruby

by Martin Westin


I found myself once again facing SOAP. This abomination of a protocol they even have the nerve to call "web services" is not my favorite type of API to interface with (how did you guess?). I think probably the only language with any decent support is Java and possibly .NET. Neither ranks among my favorite languages either. Funny that. My bigger problem is that the service I am interfacing with is nothing as simple as sending an integer and getting an integer back. It requires that I post a multipart/mime SOAP message (aka SOAP with Attachments afaik). This is something that most SOAP libraries are not too keen on supporting.

What are multipart SOAP messages?

In short, they are encoded kind of like email messages and their attachments but sent using HTTP to a SOAP endpoint. The normal SOAP message becomes one of the mime parts, and any other parts are called attachments and are usually referenced from inside the SOAP message.
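To make that structure concrete, here is a rough sketch in plain Ruby that assembles such a body by hand. It is not the Savon patch described further down. The build_multipart_soap helper, the Content-ID values and the upload element are all hypothetical, and a real service may expect different headers or encodings.

```ruby
require "securerandom"

# Build a multipart/related body: the SOAP envelope is the first (root)
# part and each attachment follows as its own mime part.
# `attachments` maps a Content-ID (made-up values) to raw content.
def build_multipart_soap(envelope_xml, attachments)
  boundary = "MIME_boundary_#{SecureRandom.hex(8)}"

  parts = []
  parts << [
    "Content-Type: text/xml; charset=UTF-8",
    "Content-Transfer-Encoding: 8bit",
    "Content-ID: <soap-envelope>",
    "",
    envelope_xml
  ].join("\r\n")

  attachments.each do |content_id, data|
    parts << [
      "Content-Type: application/octet-stream",
      "Content-Transfer-Encoding: binary",
      "Content-ID: <#{content_id}>",
      "",
      data
    ].join("\r\n")
  end

  # Every part is introduced by "--boundary"; the body ends with "--boundary--".
  body = parts.map { |part| "--#{boundary}\r\n#{part}\r\n" }.join
  body << "--#{boundary}--\r\n"

  headers = {
    "Content-Type" => "multipart/related; type=\"text/xml\"; " \
                      "start=\"<soap-envelope>\"; boundary=\"#{boundary}\""
  }
  [headers, body]
end

# The envelope references the attachment by its Content-ID ("cid:" link).
envelope = '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">' \
           '<soap:Body><upload><file href="cid:file-1"/></upload></soap:Body>' \
           '</soap:Envelope>'
headers, body = build_multipart_soap(envelope, { "file-1" => "raw bytes here" })
puts headers["Content-Type"]
puts body
```

The headers and body would then be posted to the SOAP endpoint with whatever HTTP client you prefer.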

A little history

A few years ago in PHP I was stuck using NuSOAP and ended up basically bypassing most of NuSOAP and encoding the attachments and doing all that myself. The code was a real mess.

Last week I got to do it all over again. This time in Ruby. At work, we are porting our entire platform to Ruby, but detailing that process might be a post in itself. I was so happy when I found that soap4r has support for mime messages. Then I tried to use soap4r. Long story short. I liked it so much I chose to go with Savon instead... which has no mime support.

What I ended up with

The result of my efforts is not pretty by Ruby standards but a lot better than my old code in PHP. I patched Savon in two places. One patch enables any namespace on the SOAP body (which is otherwise hard-coded to "wsdl") and has little to do with mime messages.

The other place was to intercept the output and check if the SOAP object had any attachments (parts) added to it. If so, it will take the intended output and encode that as a mime part and then encode the other parts and put it all together as a nice big http packet ready for posting.

I think it best if I just show the code now.

Any questions posted to the gist or here will be addressed to the best of my abilities.


Nginx + Wordpress caching that actually works

by Martin Westin


I spent a lot of time yesterday trying to enable WP Super Cache, and subsequently W3 Total Cache, for this website. Since none of the hits I got on Google did the trick, I thought I'd post my working settings for page caching with W3 Total Cache. I went with this plugin mainly because it uses a logical hierarchy of readable folders and files. WP Super Cache did not, which is why I eventually dropped it and tried the Total Cache plugin.

I have Nginx and PHP FastCGI. No Apache. The "problem" with this setup is the rewrite rules needed to point visitors to cached pages if they exist. Installing the plugin is as simple as anything in WP these days so I won't go into that.

Here it is:

## W3 Total CACHE BEGIN
set $totalcache_file '';
set $totalcache_uri $request_uri;

if ($request_method = POST) {
  set $totalcache_uri '';
}

# Using pretty permalinks, so bypass the cache for any query string
if ($query_string) {
  set $totalcache_uri '';
}

if ($http_cookie ~* "comment_author_|wordpress|wp-postpass_" ) {
  set $totalcache_uri '';
}

# if we haven't bypassed the cache, specify our totalcache file
if ($totalcache_uri ~ ^(.+)$) {
  set $totalcache_file /wp-content/w3tc/pgcache/$1/_index.html;
}

# only rewrite to the totalcache file if it actually exists
if (-f $document_root$totalcache_file) {
  rewrite ^(.*)$ $totalcache_file;
  break;
}                 

##W3 Total CACHE END
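If the chain of if-blocks is hard to follow, the same decisions can be written out as a small Ruby function. This is purely a thinking aid and not something Nginx runs. The function name and arguments are made up, and the final "does the file exist" check is left out.

```ruby
# Cookies that mean "this visitor must not get a cached page".
BYPASS_COOKIES = /comment_author_|wordpress|wp-postpass_/i

# Returns the cache file path the rewrite rules would look for,
# or nil when the request bypasses the page cache.
def totalcache_file(request_method, request_uri, query_string, cookie)
  return nil if request_method == "POST"          # never cache form posts
  return nil unless query_string.to_s.empty?      # any query string bypasses
  return nil if cookie.to_s =~ BYPASS_COOKIES     # logged in or commenting
  return nil if request_uri.to_s.empty?

  # Same concatenation as in the config; a URI with leading and trailing
  # slashes therefore yields doubled slashes in the path.
  "/wp-content/w3tc/pgcache/#{request_uri}/_index.html"
end

p totalcache_file("GET", "/about/", "", "")
p totalcache_file("POST", "/about/", "", "")
p totalcache_file("GET", "/about/", "", "wordpress_logged_in=1")
```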

If you are wondering what to do with these lines of code... I did not come up with them myself. I made a minor change to make them work with the current version of W3 Total Cache and my installation.

The blueprint came from here: http://wpveda.com/nginx-rewrite-rules-for-w3-total-cache-plugin/

What I did was to change the filename. Also, if you, like me, have WP in a folder you would add that to the path as well.

# WordPress in the site root:
set $totalcache_file /wp-content/w3tc/pgcache/$1/_index.html;
# WordPress in a subfolder:
set $totalcache_file /your-wp-folder/wp-content/w3tc/pgcache/$1/_index.html;

There are a dozen other sites with variations on this code. None worked for me straight away... If you have similar experience, maybe my version will work for you.

When, or if, I get some of the other rewrite-dependent features working I'll add another post.


Implementing GUI persistence in an iPhone App

by Martin Westin


With iOS 4, Apple pushes everyone to build apps that preserve the state of the application when it terminates. This is because to the normal user there is no difference between an app being "pushed" to the background and an app being terminated. Apple wants users to feel like our apps never terminate, that they just leave them in the background a while. I'll explain how I implemented this behaviour using NSUserDefaults in my app, Extraction. It may not be the most advanced technique or the best in any way. I just know it works for me.


Modifying a Projekktor Theme

by Martin Westin


Projekktor is a most excellent open-source video player for the web. It requires JavaScript but that is totally worth it. You get cross-browser compatibility, Flash fallback, a consistent GUI across browsers and much more. What I will explain in this short article is how to modify the GUI part. The first thing I wanted to know was: What does the HTML look like? Since it is generated by JavaScript it is not readily apparent, so here is the HTML structure of the video "controls".

Wrath - YOUTUBE Flash API 2/4
00:18 / 03:46

The enclosing div (classed ppcontrols) is itself enclosed inside a set of tags that are positioned and make a fixed-size box (the size of the video element). This makes things nice and workable. You can position and set the size for ppcontrols and then place each element in relation to this div.

I will continue to call each control element by its class name since these are what you target in your CSS. Just to make things really obvious, here is a screenshot with the class names added.

Projekktor Controls

If you want a different look but like the general dimensions you can just start replacing the graphics in the theme folder of projekktor.

To my eyes the standard theme looks very nice. I wanted to keep the same look but make it resize to fit various video sizes. To achieve this I needed to make the size of the main ppcontrols div relative to the video frame. Instead of a fixed width I ended up with this css.

.ppcontrols {  
    position: relative;
    left: 20px;
    margin-right: 40px;
    display: block;
    width: auto;
    height: 87px;
    margin-top: -110px;
    border: solid 2px #fff;
    -webkit-border-radius: 12px;
    -moz-border-radius: 12px;
    border-radius: 12px;
    background: transparent url(projekktor/ctrl-bg.png) 0 0 repeat-x;
    z-index:8000;
    padding: 0px;
}

This creates a control area that always stretches to 20px from the left and right edges. As you can see I also added a css border (with radius) and a new background to get close to the original graphics of the theme while still being flexible horizontally.

Then I proceeded to position the elements relative to one of the edges or the middle of the controls area. For example:

.ppfsenter {
...
    top: 50px;
    right:10px;
...
}
.ppplay {
...
    left: 50%;
    margin-left: -17px;
...
}
.ppscrubber {
...
    left: 13px;
    right: 13px;
...
}

With these minor changes to the css the full-screen buttons stick to the right edge and the play/pause buttons stick to the center. Items like the scrubber and the title are positioned in a similar way to the main ppcontrols div. I chose a different css to get the results this time ... just for the variation. Use whichever one you prefer.

The results of my modifications are available for download if you want to take a look at all of it or give it a test-drive.

"demo" (just the v0.6.1 release unpacked and edited) http://dev.eimermusic.com/projekktor/readme.html

the css and images in a zip-file http://dev.eimermusic.com/projekktor/projekktor_theme_fluid.zip
http://dev.eimermusic.com/projekktor/projekktor_theme_fluid.tar.gz