Simple tips to write efficient (data science) code
Written by Vincent Béchard on 2020-04-21
Writing efficient code starts with these rules
Programming CPU-heavy data processing and analysis tasks? Some of my clients are amazed by the execution speed of my code, even though I am not a professional software developer. Here are some simple tricks I keep in mind when writing code in any language:
- Do calculations only once: avoid repeating the same arithmetic; store results in memory and reuse them later (yes, this means more bookkeeping!).
- Watch what goes inside loops: any inefficient code in a loop body runs on every iteration!
- Example of inefficient code inside a loop: if/else and switch statements that process user options. Process the options outside the loop and write a separate variant of the loop for each case!
- Avoid writing a ton of one- or two-line functions just to make things “cute”. Every call adds overhead, and a deep call stack can slow execution.
- With object-oriented languages: write classes! The code is usually cleaner and simpler to debug.
- Have a code profiler? Profile the execution and work on the most time-consuming steps first.
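To make the first three tips concrete, here is a minimal Python sketch. The function names `scale_slow` and `scale_fast` and the `mode` option are hypothetical, invented for this illustration. The slow variant re-checks the user option and recomputes the same statistics on every iteration; the fast variant processes the option once, computes the constants once, and then runs a lean loop for each case.

```python
import math


def scale_slow(values, mode):
    """Checks the option and recomputes the same constants on every iteration."""
    out = []
    for v in values:
        if mode == "zscore":
            # These two statistics never change, yet they are rebuilt each pass.
            mean = sum(values) / len(values)
            var = sum((x - mean) ** 2 for x in values) / len(values)
            out.append((v - mean) / math.sqrt(var))
        else:
            # min() rescans the whole list on every iteration.
            out.append(v - min(values))
    return out


def scale_fast(values, mode):
    """Handles the option once, computes constants once, then loops."""
    if mode == "zscore":
        mean = sum(values) / len(values)
        std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
        return [(v - mean) / std for v in values]
    lowest = min(values)
    return [v - lowest for v in values]
```

Both functions return the same numbers, but the slow one does work proportional to the square of the input length, while the fast one stays linear.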
The specific case of statistical programming in data science
In addition to the practical tips above, there are specific items to consider in the context of data science. Why am I spending time writing about this? Because I am a data scientist and I love programming in the most efficient language or environment for the task! Short illustration: if you need to shuffle a vector, an efficient implementation in C# is:
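    // Fisher-Yates shuffle. Needs System, System.Collections.Generic and System.Linq, inside a static class.
    // Assumption: the original signature was cut off; passing `random` as a parameter is a guess.
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random random)
    {
        T[] list = source.ToArray();
        int count = list.Length;
        while (count > 1)
        {
            int index = random.Next(count--);
            T temp = list[index];
            list[index] = list[count];
            list[count] = temp;
        }
        return list;
    }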
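For readers who do not use C#, the same Fisher-Yates logic can be sketched in Python. This is an illustrative translation, not code from the original project; the function name `shuffled` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def shuffled(source, rng=None):
    """Return a shuffled copy of `source` using the Fisher-Yates algorithm."""
    if rng is None:
        rng = random.Random()
    items = list(source)          # work on a copy, leave the input untouched
    count = len(items)
    while count > 1:
        index = rng.randrange(count)   # random position among the unshuffled items
        count -= 1
        # Move the chosen item to the end of the unshuffled region.
        items[index], items[count] = items[count], items[index]
    return items
```

In practice the standard library already covers this: `random.shuffle(items)` shuffles a list in place, and `random.sample(items, len(items))` returns a shuffled copy.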
Data science code performance comparison
Here are some interesting results on code performance. I needed to implement in R an algorithm that I had previously written in pure C#. I ended up with two versions:
- Version 1: direct translation of C# code to R, with explicit loops and using only custom functions
- Version 2: an implementation that leverages native features such as implicit loops and vectorized operations
In a statistical programming environment, data manipulations are much easier! Coming back to my to-be-translated algorithm, not only was the Version 2 code much shorter, it was also much faster across various problem sizes:
| Size | Version 1 (s) | Version 2 (s) | Ratio V1/V2 |
Bottom line: leveraging the native built-in features of a programming environment can save both the time spent writing the code and the time spent executing it!
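The algorithm behind this comparison is not shown here, so the following is only a small stand-in, written in Python rather than R, to show the difference between the two styles. It computes a dot product with an explicit index loop, and then with the built-ins `sum` and `map`, whose loop runs inside the interpreter's compiled code. The function names and the `compare` helper are hypothetical, and timings will vary by machine and Python version.

```python
import timeit
from operator import mul


def dot_explicit(xs, ys):
    """Explicit loop, the style of a direct line-by-line translation."""
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_builtin(xs, ys):
    """Implicit loop: the iteration happens inside the built-ins sum and map."""
    return sum(map(mul, xs, ys))


def compare(size, repeats=5):
    """Return the best-of-`repeats` time in seconds for each version."""
    xs = [float(i) for i in range(size)]
    ys = [float(size - i) for i in range(size)]
    t_explicit = min(timeit.repeat(lambda: dot_explicit(xs, ys), number=1, repeat=repeats))
    t_builtin = min(timeit.repeat(lambda: dot_builtin(xs, ys), number=1, repeat=repeats))
    return t_explicit, t_builtin
```

Calling `compare(1_000_000)` on your own machine gives a rough idea of the gap for this toy case; it says nothing about the ratios measured for the R algorithm above.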
Want to learn more?
At Différence, our core expertise is centered on statistics and data science, Lean applications and operational excellence, and simulation! We can train, coach and help practitioners learn how to use statistical programming. Don’t hesitate to contact us for more information.