TL;DR: If you just want to see the solution to this coding challenge, the files are available on my GitHub page.
The Challenge
I was recently asked to solve the following coding challenge:
Write a working function to remove all duplicates from an unsorted list of email addresses. Your resulting list should remain in the original order. Your function should be able to handle around 100,000 email addresses containing 50% randomly placed duplicates in well under 1 second on a typical modern laptop.
I was allowed to assume any input data structure I wanted (I chose to shove all the emails into an array). I was asked not to use any built-in library functions that make this problem trivial. (That means I can’t use Ruby’s Array#uniq function, darnit!)
Build Some Tests
I generally use RSpec for my Rails application testing. Lately, however, I’ve been trying to improve my MiniTest skills, so that’s what I chose to use here.
I put together a really simple MiniTest suite with enough tests to make sure my code is working. First, I wanted to make sure I set up my email list correctly. I wrote tests to verify that my generate_email_list function generates a list that contains the number of emails I expected, and that half of those emails are duplicates.
Because I’m essentially writing a function that does exactly the same thing Array#uniq does, it’s really easy to write a quick test. I simply make sure that my uniquify function does the same thing with the email array that Array#uniq does.
My simple test suite:
require 'minitest/autorun'
require_relative 'email_sort'

class EmailSortTest < Minitest::Test
  def test_email_list_setup
    num_emails = 100_000
    list = generate_email_list(num_emails)
    assert_equal list.count / 2, list.uniq.count
    assert_equal num_emails, list.count
  end

  def test_uniquify
    list = generate_email_list(100_000)
    assert_equal list.uniq, uniquify(list)
  end
end
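Since the suite requires minitest/autorun, running the test file directly with ruby is all it takes to execute every test; no separate runner is needed.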
Generate All the Emails
Armed with my test suite, I’m ready to get to work. Let’s build that email list.
It’s trivial to build up an array of unique emails that is half my desired length using a simple loop. Once I have that in hand, I multiply the array by 2. Array multiplication is interesting. Here’s a quick example of what happens when you multiply an array by an integer:
["a", "b", "c"] * 2  #=> ["a", "b", "c", "a", "b", "c"]
Finally, I use shuffle to make sure my array elements are in random order. In the end, I have an array on my hands in which half the elements are duplicates, all in random order.
Here’s the full code to generate the email list, given a specific number of emails:
def generate_email_list(num_emails)
  emails = []
  # Build half the requested number of unique addresses...
  (1..num_emails / 2).each do |n|
    emails << "user_#{n}@example.com"
  end
  # ...then double the array and shuffle to scatter the duplicates.
  (emails * 2).shuffle
end
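As a quick sanity check, here’s what a hypothetical irb session might look like (the exact order will vary because of the shuffle):
list = generate_email_list(6)
list.length       #=> 6
list.uniq.length  #=> 3, since half the entries are duplicates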
The Meat of the Problem: Removing Duplicates
There are multiple ways to solve this problem. I could iterate through each email in my array, then iterate again through the array looking for and eliminating duplicates, but that means roughly N² comparisons for an array of N elements. That’s never good news.
Enter the Hash. Looking up an element in a Hash is an O(1) operation on average; no time wasted traversing an array: the element is there or it’s not. In the uniquify function below, we iterate through the list of emails, adding the unique ones to the Hash as keys as we go.
def uniquify(dup_array)
  unique_entries = Hash.new
  dup_array.each do |element|
    # Record each address the first time we see it; later copies are skipped.
    unique_entries[element] = 1 unless unique_entries[element]
  end
  unique_entries.keys
end
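As a side note, the same check-then-assign bookkeeping could be written with Ruby’s conditional-assignment operator, which assigns only when the key is missing (or maps to nil or false):
unique_entries[element] ||= 1
Either form behaves the same way here.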
After we’ve traversed the original array of emails, we’re left with the Hash of unique emails. Each unique email is a key inside this hash, so we can use the Hash#keys method to pull those out and return them from the function. (It’s good to note that in Ruby 1.9 and later, Hash#keys returns the keys in the order they were added to the Hash, thus preserving the order of the original email list.)
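A quick example of that ordering guarantee (the addresses here are made up for illustration):
h = {}
h["zed@example.com"] = 1
h["abe@example.com"] = 1
h.keys  #=> ["zed@example.com", "abe@example.com"] (insertion order, not alphabetical)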
Benchmarking - How’d I do?
I wanted to make sure my solution processed 100,000 emails in well under 1 second, and I also wanted to see how I stood up against Ruby’s built-in Array#uniq method, so I set up some benchmarking to measure my code:
require 'benchmark'
require_relative 'email_sort'

Benchmark.bmbm do |bm|
  emails = generate_email_list(100_000)

  bm.report("uniquify:") do
    unique_emails = uniquify(emails)
  end

  bm.report("uniq:") do
    unique_emails = emails.uniq
  end
end
If you’re not familiar with Benchmark.bmbm, it’s handy because garbage collection can sometimes skew the results when you call plain Benchmark.bm. With bmbm, the tests are run twice, once as a “rehearsal” and once “for real”, and Ruby’s GC is forced to run after the rehearsal. The idea is that your results will be more accurate this way.
Here are the results from benchmarking this solution on my MacBook Air:
Rehearsal ---------------------------------------------
uniquify: 0.090000 0.010000 0.100000 ( 0.096580)
uniq: 0.060000 0.000000 0.060000 ( 0.064417)
------------------------------------ total: 0.160000sec
user system total real
uniquify: 0.060000 0.000000 0.060000 ( 0.064737)
uniq: 0.060000 0.000000 0.060000 ( 0.062987)
Not bad! My uniquify function seems to be holding up against Ruby’s built-in Array#uniq method.
A Struggle Between Space and Time
For 1 million emails, this solution takes about 1 second to run:
Rehearsal ---------------------------------------------
uniquify: 1.270000 0.050000 1.320000 ( 1.351997)
uniq: 1.540000 0.030000 1.570000 ( 1.605405)
------------------------------------ total: 2.890000sec
user system total real
uniquify: 1.050000 0.020000 1.070000 ( 1.072724)
uniq: 1.130000 0.020000 1.150000 ( 1.156912)
Once we get up to about 10 million emails in our array, it takes about 32 seconds to remove the duplicates, and the Ruby process on my machine starts eating up memory.
Benchmarking for 10 million emails:
Rehearsal ---------------------------------------------
uniquify: 31.010000 1.610000 32.620000 ( 34.354450)
uniq: 38.600000 0.930000 39.530000 ( 40.525922)
----------------------------------- total: 72.150000sec
user system total real
uniquify: 30.800000 0.390000 31.190000 ( 31.790559)
uniq: 36.070000 0.460000 36.530000 ( 37.216896)
For 10 million emails, the Ruby process maxes out at about 1.3 GB of memory. I only have 4 GB of memory on my machine, so processing a billion emails using my uniquify function is not going to be good news.
For anything over 10 million emails, I’d consider eliminating the hash altogether. It might even be worth moving back to the N² solution, where we iterate through the array one email at a time, then iterate over the remaining elements to remove duplicates. That approach would unquestionably take much longer than the Hash method, but we’d be storing a lot less data in memory.
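Here’s a rough sketch of what that could look like. This uniquify_in_place! helper is hypothetical, not part of my actual solution:
# O(N^2) comparisons, but no auxiliary Hash: duplicates are deleted
# from the original array as they're found, keeping the first occurrence.
def uniquify_in_place!(array)
  i = 0
  while i < array.length
    j = i + 1
    while j < array.length
      if array[j] == array[i]
        array.delete_at(j)  # removes the duplicate; later elements shift left
      else
        j += 1
      end
    end
    i += 1
  end
  array
end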
If I could just get my hands on a computer with a huge amount of memory, life would sure be easy, wouldn’t it?