Many algorithms cannot handle duplicates in their input. The specific problems we were working on don't require the input to be in any particular order. So: what is the fastest way to remove duplicates from a list?
The problem
Up to this point, what I used to deduplicate a list was the simple:
import Set


deduplicate : List comparable -> List comparable
deduplicate list =
    list
        |> Set.fromList
        |> Set.toList
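For example (note that the result comes back sorted, because Set.toList returns the elements in ascending order):

deduplicate [ 3, 1, 3, 2 ]
    == [ 1, 2, 3 ]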
And if the value type is not comparable:
import Dict


deduplicateBy : (a -> comparable) -> List a -> List a
deduplicateBy toComparable list =
    List.foldl
        (\value accum -> Dict.insert (toComparable value) value accum)
        Dict.empty
        list
        |> Dict.values
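A quick hypothetical usage example (the User type and values are mine): deduplicating records by a comparable field, here an id. Dict.values returns the values in key order, and for duplicate keys the last occurrence wins:

type alias User =
    { id : Int, name : String }


-- keeps one user per id: [ { id = 1, ... }, { id = 2, ... } ]
uniqueUsers : List User
uniqueUsers =
    deduplicateBy .id
        [ { id = 1, name = "alice" }
        , { id = 2, name = "bob" }
        , { id = 1, name = "alice" }
        ]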
I think this is an idiom that I took from Python, and I always assumed it was fast (enough). But Ian (@ianmackenzie) suggested a List.sort followed by a function that removes equal successive elements.
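In other words, something along these lines, where deduplicate is the helper from the Code section below that drops equal adjacent elements from a sorted list (the name sortThenDeduplicate is mine):

sortThenDeduplicate : List comparable -> List comparable
sortThenDeduplicate list =
    deduplicate (List.sort list)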
My intuition said that this should be slower. List.sort uses JavaScript array sorting, which means that to sort an Elm (linked) list, it has to be converted into a JavaScript array, sorted, then converted back into an Elm list. After that, another pass over the list is needed to remove the duplicates.
But, well, the benchmark results speak for themselves:
Taking as input
almostUniqueNumbers =
    List.range 0 256
        |> List.reverse
        |> (\list -> 42 :: 79 :: list)
The list-based method is more than 10 times as fast:
The input is close to ideal for the sort-based method (few duplicates, almost sorted, albeit from large to small), but not unrealistic. In practice it's unlikely that there are many duplicates in the input. When there are a lot of duplicates, the difference is still pretty big:
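For reference, the duplicate-heavy input used for that second measurement is (taken from the Code section below):

numbersWithDuplicates =
    List.repeat 50 (List.range 0 16)
        |> List.concat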
Wut?
Where does this difference come from? I don't think I'm the person who can answer that conclusively, but it seems to come down to the fact that the memory footprint of the Set is larger than that of an array, and allocation is expensive.
The list-based method performs 4 traversals over the data. An important detail is that JavaScript engines typically sort arrays with an in-place algorithm such as quicksort, which needs very little extra space.
List a
=> JSArray a -- convert to js array
=> JSArray a -- sort
=> List a -- convert to elm list
=> List a -- deduplicate that list
In contrast, the set-based method only performs 2 traversals.
List a
=> Set a -- convert to set
=> List a -- convert to list
My naive mind just stopped there: 2 vs. 4 traversals, surely 2 is better. But somehow, creating the set is much slower than creating and sorting a JavaScript array. I see two reasons:
- Sets use more memory per item, because they are a tree structure. The array sort happens in place and needs only log n auxiliary space.
- The input seems to cause lots of rebalancing: there are few duplicates, so the tree has to grow for almost every insert. When there are more duplicates, the Set approach performs less poorly.
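To make the allocation cost concrete: Set.fromList is essentially a fold of insert over the input (this is roughly how elm/core implements it; the sketch and its name are mine). Every insert walks down the tree, allocates fresh nodes along the path it took, and may trigger a rebalance, so an input with few duplicates pays that cost for nearly every element:

import Set exposing (Set)


-- a rough sketch of Set.fromList: one insert per element,
-- each allocating new tree nodes along its search path
fromListSketch : List comparable -> Set comparable
fromListSketch list =
    List.foldl Set.insert Set.empty list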
So in conclusion, allocation is really expensive, and can be more expensive than traversals over the data.
Code
import Benchmark exposing (..)
import Benchmark.Runner exposing (BenchmarkProgram, program)
import Html exposing (Html)
import Set


main : BenchmarkProgram
main =
    program arrayDictBenchmark


arrayDictBenchmark =
    let
        uniqueNumbers =
            List.range 0 256
                |> List.reverse
                |> (\list -> 42 :: 79 :: list)

        numbersWithDuplicates =
            List.repeat 50 (List.range 0 16)
                |> List.concat
    in
    describe "unique elements in list"
        [ -- nest as many descriptions as you like
          Benchmark.compare "without duplicates"
            "Set"
            (\_ -> Set.toList (Set.fromList uniqueNumbers))
            "Sort + deduplicate"
            (\_ -> deduplicate (List.sort uniqueNumbers))
        , Benchmark.compare "with duplicates"
            "Set"
            (\_ -> Set.toList (Set.fromList numbersWithDuplicates))
            "Sort + deduplicate"
            (\_ -> deduplicate (List.sort numbersWithDuplicates))
        ]
{-| Remove equal successive elements from a (sorted) list.
Note: the relative order of the elements is not preserved,
but for deduplication that does not matter.
-}
deduplicate : List a -> List a
deduplicate list =
    let
        -- walk the list, remembering the previously kept element
        -- and skipping any element equal to it
        helper accum previousElement remaining =
            case remaining of
                [] ->
                    accum

                first :: rest ->
                    if first == previousElement then
                        helper accum previousElement rest
                    else
                        helper (first :: accum) first rest
    in
    case list of
        [] ->
            []

        x :: xs ->
            x :: helper [] x xs