How to use all your cores in your super Linux PC

My new hardcore PC has 24 cores. Twenty-four! Most of them sit idle when I run my code. It’s a bloody waste! Also, it is inefficient. I recently needed to run an import process that extracted and converted some data along the way. It was fairly complicated and took about 20 minutes per “day of data”. I had approximately 3 months of data, so every run would take 3 * 30 * 20 minutes. That’s a whopping 1800 minutes, or 30 hours. And since the requirements sometimes changed, I had to run it again, and again.

I needed to fix that! But how?

Understanding the Need for GNU Parallel

Python’s threading limitations often result in underutilized multi-core systems. While Python offers threading and the multiprocessing package, these can be complex and limited, and you have to tailor your code to that particular solution. GNU Parallel simplifies the execution of multiple Python processes concurrently, enabling each to run on a separate core.

How GNU Parallel Works

GNU Parallel is a shell tool that executes jobs in parallel across one or several computers.

Basically, you could open 24 terminals and start your application 24 times, using different parameters. That works (I tried it), but it is manual and cumbersome. GNU Parallel handles the nitty-gritty stuff for you.

Setting Up Your Application for Parallel Processing

To utilize GNU Parallel effectively, your Python application should be designed to work in segments or batches. This means partitioning the task so that each instance of the application operates independently on a chunk of data. It’s crucial to enable command-line arguments to specify operations for each segment.
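
A minimal sketch of what that looks like on the Python side (the function and argument names here are illustrative, not from the original post): a CLI where each invocation handles exactly one day of data.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Process one segment of data")
    parser.add_argument("--date", required=True, help="Day to process, YYYYMMDD")
    parser.add_argument("--force", action="store_true",
                        help="Redo the work even if output already exists")
    return parser

def process_day(date: str) -> str:
    # Placeholder for the real extract/convert work on one day of data.
    return f"processed {date}"

def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return process_day(args.date)

# Each invocation operates independently on its own segment:
#   python your_script.py --date 20240101
```

Because every instance touches only its own day, the processes never step on each other.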

Practical Example: Parallelizing Data Processing Tasks

Consider a Python script designed to calculate indices on financial data. Here’s how you can set up the script to run for a range of dates, using GNU Parallel to utilize up to 24 cores concurrently:

Note that 24 is hardcoded in my case, since I have 24 cores, but you can use nproc to find out how many cores your CPU has. Also note that I use venv, a Python virtual environment.
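
Instead of hardcoding the count, you can ask the system at runtime, a small sketch:

```shell
# nproc (GNU coreutils) reports the number of processing units available.
cores=$(nproc)
echo "Will run $cores jobs at a time"
# ...which you would then pass along as:  parallel -j "$cores" ...
```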

The application is called like this (your_script.py is a stand-in for the actual script name):

python your_script.py --date YYYYMMDD

The date YYYYMMDD is generated by the loop below for every day between $start_date and $end_date. The output is fed to parallel, which does the actual execution of the processes.

echo "Starting processing"
current_date=$start_date
while [[ $current_date -le $end_date ]]; do
    echo $current_date
    current_date=$(increment_date $current_date)
done | parallel -j 24 --bar --ungroup --joblog "./parallel-log.txt" "./venv/bin/python your_script.py --force --date {} > ./parallel-log-{}.txt 2>&1"
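
The increment_date helper isn't shown in the post; a minimal sketch using GNU date (assumes GNU coreutils) could look like this:

```shell
# Advance a YYYYMMDD date by one day using GNU date's relative-date parsing.
increment_date() {
    date -d "$1 + 1 day" +%Y%m%d
}

increment_date 20240228   # -> 20240229 (2024 is a leap year)
```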

Depending on the number of cores you have, you cut the running time to roughly 1/#cores, which in my case meant I was done in about an hour (1800 minutes / 24 cores ≈ 75 minutes). And utilization was 100% on all my cores during that time.

When you have more jobs than cores, GNU Parallel queues them up and starts a new job as soon as the previous is finished.

Benefits of Using GNU Parallel

  • Maximized Resource Utilization: By allowing each Python instance to run on a separate core, GNU Parallel maximizes the use of the system’s CPUs, reducing overall computation time.
  • Simplified Process Management: GNU Parallel manages the complexity of job distribution and load balancing, simplifying multi-process application management.
  • Scalability: Easily scale your processing across more data or machines with minimal adjustments.

Requirements

  • Application must be partitioned: i.e., it must be able to take arguments and work on that particular subset of the data.


Knowing about tools like GNU Parallel helps you understand when partitioning your application is a good idea. CPU cores need to be exercised regularly or they will rust! It is wasteful to run single-core code. And you will save time, lots of time, which you can spend writing even more code.
