How to use all your cores in your super Linux PC
My new hardcore PC has 24 cores. Twenty-four! Most of them sit idle when I run my code. It’s a bloody waste! It is also inefficient. I recently needed to run an import process that extracted and converted some data along the way. It was fairly complicated and took about 20 minutes per “day of data”. I had approximately 3 months of data, so every run would take 3 * 30 * 20 minutes. That’s a whopping 1800 minutes, or 30 hours. And since the requirements sometimes changed, I had to run it again, and again.
I needed to fix that! But how?
Understanding the Need for GNU Parallel
Python’s threading limitations often leave multi-core systems underutilized. While Python offers threading and a “multiprocessing” package, these can be complex and limited, and you have to tailor your code to that particular solution. GNU Parallel simplifies running multiple Python processes concurrently, enabling each to run on a separate core.
How GNU Parallel Works
GNU Parallel is a shell tool that executes jobs in parallel across one or several computers.
Basically, you could open 24 terminals and start your application 24 times with different parameters. That works (I tried it), but it is manual and cumbersome. GNU Parallel handles the nitty-gritty stuff for you.
Setting Up Your Application for Parallel Processing
To utilize GNU Parallel effectively, your Python application should be designed to work in segments or batches. This means partitioning the task so that each instance of the application operates independently on a chunk of data. It’s crucial to enable command-line arguments to specify operations for each segment.
Practical Example: Parallelizing Data Processing Tasks
Consider a Python script designed to calculate indices on financial data. Here’s how you can set up the script to run for a range of dates, using GNU Parallel to utilize up to 24 cores concurrently:
Note that 24 is hardcoded in my case, since I have 24 cores, but you can use nproc to find out how many cores your CPU has. Also note that I use venv, a Python virtual environment.
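For example, instead of hardcoding the core count, you can let nproc supply it (in fact, if you omit -j entirely, GNU Parallel defaults to one job per core):

```shell
# nproc prints the number of processing units available to this process.
echo "This machine has $(nproc) cores"

# Use it as the slot count instead of a hardcoded 24
# (the command after the options is a placeholder):
# parallel -j "$(nproc)" ...
```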
The application is called like this:
python myapp.py --date YYYYMMDD
The date YYYYMMDD is calculated in this script, between $start_date and $end_date. The output is fed to parallel, which does the actual execution of the processes.
start_date=20240101  # example range; adjust to your data
end_date=20240331
increment_date() {   # step one day forward (GNU date)
    date -d "$1 + 1 day" +%Y%m%d
}
echo "Starting processing"
current_date=$start_date
while [[ $current_date -le $end_date ]]; do
    echo $current_date
    current_date=$(increment_date $current_date)
done | parallel -j 24 --bar --ungroup --joblog "./parallel-log.txt" "./venv/bin/python myapp.py --force --date {} > ./parallel-log-{}.txt 2>&1"
Depending on the number of cores you have, you cut the running time to roughly 1/#cores of the original. In my case that meant 1800 / 24 ≈ 75 minutes, so I was done in about an hour and a quarter. And the utilization was 100% on all my cores during that time.
When you have more jobs than cores, GNU Parallel queues them up and starts a new job as soon as the previous is finished.
Benefits
- Maximized Resource Utilization: By allowing each Python instance to run on a separate core, GNU Parallel maximizes the use of the system’s CPUs, reducing overall computation time.
- Simplified Process Management: GNU Parallel manages the complexity of job distribution and load balancing, simplifying multi-process application management.
- Scalability: Easily scale your processing across more data or machines with minimal adjustments.
Negatives
- Application must be partitioned: i.e. it must be able to take arguments and work against that particular subset of the data
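One mitigation when a few segments fail: the joblog from the script above records each job's exit status, and GNU Parallel can re-run only the failures with --resume-failed. A sketch, reusing the file names from my script:

```shell
# Exit values live in column 7 of the joblog; non-zero means the job failed.
awk 'NR > 1 && $7 != 0' ./parallel-log.txt

# Re-run only the failed dates; jobs already completed in the joblog are skipped.
# (Same command line as before, plus --resume-failed; the date list is piped in again.)
# ... | parallel -j 24 --joblog "./parallel-log.txt" --resume-failed \
#   "./venv/bin/python myapp.py --force --date {} > ./parallel-log-{}.txt 2>&1"
```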
Conclusion
Knowing about tools like GNU Parallel helps you recognize when partitioning your application is a good idea. CPU cores need to be exercised regularly or they will rust! Running single-core code is wasteful. And you will save time, lots of time, which you can spend writing even more code.