Channel: Raspberry Pi Forums

Teaching and learning resources • Re: Advent of Code 2024

If I understood the internal workings of Chapel, maybe that limitation would make more sense.
Though perhaps not the most efficient way to see what Chapel is doing, I decided to make a literal translation of my day 24 solution into the Go programming language for comparison.

The result was

Code:

$ ./go24 # Pi 4B at 1500 MHz
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=4)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 46.839487374 seconds.
Since

62.579 / 46.839 = 1.336

the Go solution is about 33 percent faster than the Chapel solution on the Pi.

On the other hand when running on the 12-core Xeon server I obtained

Code:

$ ./go24 # Xeon E5-2620 12c/24t
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=24)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 22.600797331 seconds.
Since

6.65257 / 22.601 = 0.2943

the Go solution is about 70 percent slower than the Chapel solution on the Xeon.

The kittens looked confused. Does this mean the blue gopher is easier or more difficult to catch?

My suspicion is that the slowdown is related to having multiple NUMA zones when running the code on both sockets. This theory is supported by the speed doubling when running on only one socket with 6 threads.

Code:

$ numactl -C 0-5 ./go24 # Xeon E5-2620 6c/6t
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=6)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 9.587104874 seconds.
Note that one socket with 6c/12t was slower, as was two sockets with 12c/12t.

I decided to let the kittens have a go to see if they could make that gopher run faster. If I were a gopher, I'd certainly run faster with Scratchy, Shy and Purr headed my way.

According to Purr, the rungates routine spends an unnecessary amount of time allocating memory from the heap and releasing it. When running on both sockets the heap is slower because the memory allocator is actually NUMA aware and spends additional time placing the memory in a suitable zone. I'm skeptical that heap allocations are NUMA aware, but it seems a reasonable idea to allocate the slices ahead of time and pass them in, avoiding allocations in the inner loop.
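The actual rungates routine isn't shown here, so the sketch below only illustrates the allocation pattern in question with a made-up gate type: a version that calls make in the hot path versus one that writes into a caller-supplied buffer.

```go
package main

import "fmt"

// gate is a hypothetical stand-in for the puzzle's logic gates; it is not
// the data structure from the actual go24 solution.
type gate struct{ a, b int }

// runGatesAlloc allocates a fresh result slice on every call, producing
// constant heap traffic for the garbage collector inside a hot loop.
func runGatesAlloc(gates []gate) []int {
	out := make([]int, len(gates)) // allocation in the inner loop
	for i, g := range gates {
		out[i] = g.a ^ g.b
	}
	return out
}

// runGatesPrealloc writes into a caller-supplied slice instead, so the
// buffer is allocated once and reused across iterations.
func runGatesPrealloc(gates []gate, out []int) {
	for i, g := range gates {
		out[i] = g.a ^ g.b
	}
}

func main() {
	gates := []gate{{1, 1}, {1, 0}, {0, 1}}

	// Allocating version: one make() per iteration.
	for i := 0; i < 3; i++ {
		_ = runGatesAlloc(gates)
	}

	// Preallocated version: one make() total, reused each iteration.
	out := make([]int, len(gates))
	for i := 0; i < 3; i++ {
		runGatesPrealloc(gates, out)
	}
	fmt.Println(out) // prints [0 1 1]
}
```

Whether this helps go24 as much as Purr predicts would need profiling, but hoisting allocations out of hot loops is a standard Go optimisation regardless of NUMA.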

It would be interesting to know whether Chapel benefits from a similar workaround or not.

Statistics: Posted by ejolson — Mon Apr 28, 2025 8:42 pm


