Many algorithms cannot handle duplicates in their input. The specific problems we were working on don't require the input to be in any particular order. So: what is the fastest way to remove duplicates from a list?
The problem
Up to this point, what I used to deduplicate a list was the simple:
import Set


deduplicate : List comparable -> List comparable
deduplicate list =
    list
        |> Set.fromList
        |> Set.toList
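For example (note that the result comes back sorted, because Set.toList returns the elements in ascending order):

deduplicate [ 3, 1, 3, 2 ]
    == [ 1, 2, 3 ]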
And if the value type is not comparable:
import Dict


deduplicateBy : (a -> comparable) -> List a -> List a
deduplicateBy toComparable list =
    List.foldl
        (\value accum -> Dict.insert (toComparable value) value accum)
        Dict.empty
        list
        |> Dict.values
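A quick hypothetical usage example (the User type and values are mine): deduplicating records by a comparable field, here an id. Dict.values returns the values in key order, and for duplicate keys the last occurrence wins:

type alias User =
    { id : Int, name : String }


-- keeps one user per id: [ { id = 1, ... }, { id = 2, ... } ]
uniqueUsers : List User
uniqueUsers =
    deduplicateBy .id
        [ { id = 1, name = "alice" }
        , { id = 2, name = "bob" }
        , { id = 1, name = "alice" }
        ]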
I think this is an idiom that I took from Python, and I always assumed it was fast (enough). But Ian (@ianmackenzie) suggested a List.sort followed by a function that removes equal successive elements.
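In other words, something along these lines, where deduplicate is the helper from the Code section below that drops equal adjacent elements from a sorted list (the name sortThenDeduplicate is mine):

sortThenDeduplicate : List comparable -> List comparable
sortThenDeduplicate list =
    deduplicate (List.sort list)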
My intuition said that this should be slower. List.sort uses JavaScript array sorting, which means that to sort an Elm (linked) list, it has to be converted into a JavaScript array, sorted, then converted back into an Elm list. After that, another pass over the list is needed to remove the duplicates.
But, well, the benchmark results speak for themselves:
Taking as input
almostUniqueNumbers =
    List.range 0 256
        |> List.reverse
        |> (\list -> 42 :: 79 :: list)
The list-based method is more than 10 times as fast:
The input is close to ideal for the sort-based method (few duplicates, almost sorted, albeit from large to small), but not unrealistic. In practice it's unlikely that there are many duplicates in the input. When there are a lot of duplicates, the difference is still pretty big:
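For reference, the duplicate-heavy input used for that second measurement is (taken from the Code section below):

numbersWithDuplicates =
    List.repeat 50 (List.range 0 16)
        |> List.concat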
Wut?
Where does this difference come from? I don't think I'm the person who can answer that conclusively, but it seems to come down to the fact that the memory footprint of the Set is larger than that of an array, and allocation is expensive.
The list-based method performs 4 traversals over the data. An important detail is that JavaScript engines typically sort arrays with an in-place algorithm such as quicksort, which needs very little extra space.
List a
=> JSArray a -- convert to js array
=> JSArray a -- sort
=> List a -- convert to elm list
=> List a -- deduplicate that list
In contrast, the set-based method only performs 2 traversals.
List a
=> Set a -- convert to set
=> List a -- convert to list
My naive mind just stopped there: 2 vs. 4 traversals, surely 2 is better. But somehow, creating the set is much slower than creating and sorting a JavaScript array. I see two reasons:
- Sets use more memory per item, because they are a tree structure. The array sort happens in place and needs only log n auxiliary space.
- The input seems to cause lots of rebalancing: there are few duplicates, so the tree has to grow for almost every insert. When there are more duplicates, the Set approach performs less poorly.
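To make the allocation cost concrete: Set.fromList is essentially a fold of insert over the input (this is roughly how elm/core implements it; the sketch and its name are mine). Every insert walks down the tree, allocates fresh nodes along the path it took, and may trigger a rebalance, so an input with few duplicates pays that cost for nearly every element:

import Set exposing (Set)


-- a rough sketch of Set.fromList: one insert per element,
-- each allocating new tree nodes along its search path
fromListSketch : List comparable -> Set comparable
fromListSketch list =
    List.foldl Set.insert Set.empty list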
So in conclusion, allocation is really expensive, and can be more expensive than traversals over the data.
Code
import Benchmark exposing (..)
import Benchmark.Runner exposing (BenchmarkProgram, program)
import Html exposing (Html)
import Set


main : BenchmarkProgram
main =
    program arrayDictBenchmark


arrayDictBenchmark =
    let
        uniqueNumbers =
            List.range 0 256
                |> List.reverse
                |> (\list -> 42 :: 79 :: list)

        numbersWithDuplicates =
            List.repeat 50 (List.range 0 16)
                |> List.concat
    in
    describe "unique elements in list"
        [ -- nest as many descriptions as you like
          Benchmark.compare "without duplicates"
            "Set"
            (\_ -> Set.toList (Set.fromList uniqueNumbers))
            "Sort + deduplicate"
            (\_ -> deduplicate (List.sort uniqueNumbers))
        , Benchmark.compare "with duplicates"
            "Set"
            (\_ -> Set.toList (Set.fromList numbersWithDuplicates))
            "Sort + deduplicate"
            (\_ -> deduplicate (List.sort numbersWithDuplicates))
        ]
{-| Remove equal successive elements from a (sorted) list.
Note: the relative order of the elements is not preserved,
but for deduplication that does not matter.
-}
deduplicate : List a -> List a
deduplicate list =
    let
        -- walk the list, remembering the previously kept element
        -- and skipping any element equal to it
        helper accum previousElement remaining =
            case remaining of
                [] ->
                    accum

                first :: rest ->
                    if first == previousElement then
                        helper accum previousElement rest
                    else
                        helper (first :: accum) first rest
    in
    case list of
        [] ->
            []

        x :: xs ->
            x :: helper [] x xs