
Simple tips to write efficient (data science) code

Written by Vincent Béchard on 2020-04-21

Writing efficient code starts with these rules

Are you programming CPU-heavy data processing and analysis tasks? Some of my clients are amazed by the execution speed of my code, even though I am not a professional software developer! Here are some simple tricks I keep in mind when writing code in any language:

Do calculations only once: avoid repeating the same arithmetic; store results in memory and reuse them later (yes, this means more bookkeeping!).
Watch what goes inside loops: any inefficient code placed there is executed on every iteration!
Example of inefficient code inside a loop: if/else and switch statements that process user options. Process the options outside the loop and write several variants of the loop!
Avoid writing a ton of one- or two-line functions just to make things “cute”. The function-call overhead adds up and can slow execution.
Use the built-in language features, for example the vectorized operations in R, Matlab or Python, or the native functions in VBA and JavaScript (they will almost always be faster than your own implementation).
With object-oriented languages: write classes! The code is usually cleaner and simpler to debug.
Have a code profiler? Profile the execution and work on the most time-consuming steps first.
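
As a minimal Python sketch of the first tips in this list: the function names, the "mode" option and the sample data below are made up for illustration. The slow variant recomputes a square root and re-tests the option on every pass; the fast variant computes the value once and chooses the loop variant before looping.

```python
import math


def scale_slow(values, mode, factor):
    """Repeats the same arithmetic and re-checks the option on every pass."""
    out = []
    for v in values:
        if mode == "log":  # option tested on each iteration
            out.append(math.log(v) * math.sqrt(factor))  # sqrt recomputed each time
        else:
            out.append(v * math.sqrt(factor))
    return out


def scale_fast(values, mode, factor):
    """Computes the invariant once and selects the loop variant up front."""
    root = math.sqrt(factor)  # calculated only once
    if mode == "log":  # option processed outside the loop
        return [math.log(v) * root for v in values]
    return [v * root for v in values]


data = [1.0, 2.0, 4.0, 8.0]
# Both variants give identical results; only the amount of repeated work differs.
assert scale_slow(data, "log", 9.0) == scale_fast(data, "log", 9.0)
assert scale_slow(data, "linear", 9.0) == scale_fast(data, "linear", 9.0)
```

The price is a little duplication (one loop per option), which is the trade-off the tip accepts in exchange for speed.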

Again, these are just some of the things I keep in mind when I code something… But they help a lot!
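
To illustrate the profiling tip, here is a small sketch using Python's standard cProfile and pstats modules. The functions slow_step, fast_step and pipeline are hypothetical stand-ins for real processing steps; the point is only to show how a profile report reveals where the time goes.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical expensive step: an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_step(n):
    """Hypothetical cheap step: relies on the built-in sum."""
    return sum(range(n))


def pipeline():
    return slow_step(200_000) + fast_step(200_000)


# Record timing information while the pipeline runs.
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Sort by cumulative time so the most expensive calls appear first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Reading the report from the top tells you which step deserves optimization effort first.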

The specific case of statistical programming in data science

In addition to the practical tips mentioned above, there are specific items to consider in the context of data science. Why am I spending time writing about this? Because I am a data scientist and I love programming in the most efficient language or environment! Short illustration: if you need to shuffle a vector, an efficient implementation in C# is:

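using System;
using System.Collections.Generic;
using System.Linq;

// Usage: shuffle the integers 0 to 10.
int[] randomNumbers = Enumerable.Range(0, 11).Shuffle(new Random()).ToArray();

public static class ShuffleExtensions
{
    // Fisher-Yates shuffle: returns a shuffled copy and leaves the source untouched.
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random random)
    {
        T[] list = source.ToArray();
        int count = list.Length;
        while (count > 1)
        {
            // Pick an index in [0, count), then shrink the unshuffled range by one.
            int index = random.Next(count--);
            T temp = list[index];
            list[index] = list[count];
            list[count] = temp;
        }
        return list;
    }
}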

And in R, it is:

v <- sample(v)
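
For comparison, Python's standard library offers similar one-liners. This is only a sketch: the seed is an arbitrary choice made here so the run is reproducible, and it is not part of the original example.

```python
import random

rng = random.Random(42)  # seeded only for reproducibility (assumption)

v = list(range(11))

# Shuffled copy, leaving v untouched (closest to R's sample(v)).
shuffled = rng.sample(v, k=len(v))

# In-place shuffle of the list itself.
rng.shuffle(v)

# Both results are permutations of the integers 0 to 10.
assert sorted(shuffled) == list(range(11))
assert sorted(v) == list(range(11))
```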

Data science code performance comparison

I am sharing some interesting results on code performance. I needed to implement in R an algorithm that I had previously written in pure C#. I ended up with two versions:

Version 1: a direct translation of the C# code to R, with explicit loops and only custom functions
Version 2: an implementation that leverages native features such as implicit loops and vectorized operations


In a statistical programming environment, data manipulations are much easier! Coming back to my translated algorithm: not only was the Version 2 code much shorter, it was also much faster across various problem sizes:

Size	Version 1 (s)	Version 2 (s)	Ratio V1/V2
1,000	1.07	0.15	7.1
10,000	5.06	1.27	4.0
25,000	11.86	3.21	3.7
50,000	28.78	6.37	3.7
100,000	48.93	14.23	3.4

Bottom line: leveraging the native built-in features of a programming environment can save both the time spent writing the code and the time spent executing it!
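
The same contrast can be sketched in Python. This is not the algorithm measured above (which is not shown here); column means are used as a hypothetical stand-in task, with one function written in the explicit-loop style of Version 1 and one relying on built-ins in the style of Version 2. Timings depend on the machine, so none are quoted.

```python
import timeit


def column_means_loop(rows):
    """Version 1 style: explicit index loops, no built-in helpers."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        total = 0
        for i in range(n_rows):
            total = total + rows[i][j]
        means.append(total / n_rows)
    return means


def column_means_builtin(rows):
    """Version 2 style: zip transposes the table, sum does the looping."""
    n_rows = len(rows)
    return [sum(col) / n_rows for col in zip(*rows)]


# Made-up integer data: 10,000 rows by 5 columns.
rows = [[i + j for j in range(5)] for i in range(10_000)]

# Same answer from both versions.
assert column_means_loop(rows) == column_means_builtin(rows)

loop_time = timeit.timeit(lambda: column_means_loop(rows), number=20)
builtin_time = timeit.timeit(lambda: column_means_builtin(rows), number=20)
print(f"explicit loops: {loop_time:.3f} s, built-ins: {builtin_time:.3f} s")
```

The built-in version is also the shorter one, which matches the observation that native features save editing time as well as run time.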

Want to learn more?

At Différence, our core expertise is centered on statistics and data science, Lean applications and operational excellence, and simulation! We can train, coach and help practitioners learn how to use statistical programming. Don’t hesitate to contact us for more information.