Though maybe not the most efficient way to see what Chapel is doing, I decided to make a literal translation of my day24 solution into the Go programming language for comparison. If I understood the internal workings of Chapel, maybe that limitation would make more sense.
The result was
Code:
$ ./go24 # Pi 4B at 1500 MHz
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=4)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 46.839487374 seconds.

Since 62.579 / 46.839 = 1.336, the Go solution is about 33 percent faster than the Chapel solution on the Pi.
On the other hand when running on the 12-core Xeon server I obtained
Code:
$ ./go24 # Xeon E5-2620 12c/24t
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=24)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 22.600797331 seconds.

Since 6.65257 / 22.601 = 0.2943, the Go solution is about 70 percent slower than the Chapel solution on the Xeon.
The kittens looked confused. Does this mean that blue gopher is easier or more difficult to catch?
My suspicion is the slowdown is related to having multiple NUMA zones when running the code on both sockets. This theory is supported by the speed doubling when running on only one socket with 6 threads.
Code:
$ numactl -C 0-5 ./go24 # Xeon E5-2620 6c/6t
Advent of Code 2024 Day 24 Crossed Wires (GOMAXPROCS=6)
Part 1 The z wires output 61495910098126
Part 2 Swap wires css,cwt,gdd,jmv,pqt,z05,z09,z37
Total execution time 9.587104874 seconds.

I decided to let the kittens have a go to see if they could make that gopher run faster. If I were a gopher, I'd certainly run faster with Scratchy, Shy and Purr headed my way.
According to Purr, the rungates routine spends an unnecessary amount of time allocating memory from the heap and releasing it. Purr's theory is that the heap is slower when running on both sockets because the memory allocator is NUMA aware and spends additional time placing the memory in a suitable zone. I'm skeptical that heap allocations are NUMA aware, but I think it's a reasonable idea to allocate the slices ahead of time and pass them in to avoid allocations in the inner loop.
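For anyone following along, the idea of allocating ahead of time might be sketched as below. This is only an illustration and not the actual rungates code: the gate struct, the wire numbering and the names runGatesAlloc and runGatesInto are all made up for the example. The first routine calls make on every invocation, while the second works in scratch slices handed in by the caller.

```go
package main

import "fmt"

// gate is a hypothetical representation of one logic gate: it reads wires
// a and b and writes wire out. Wires are indices into a value slice.
type gate struct {
	op        byte // '&' AND, '|' OR, '^' XOR
	a, b, out int
}

// runGatesAlloc allocates its scratch slices on every call, which is the
// pattern that puts pressure on the heap when called from a hot loop.
func runGatesAlloc(gates []gate, inputs []uint8, nwires int) []uint8 {
	vals := make([]uint8, nwires)
	known := make([]bool, nwires)
	return runGatesInto(gates, inputs, vals, known)
}

// runGatesInto does the same work using scratch slices supplied by the
// caller, so it performs no heap allocation of its own.
// vals and known must have the same length, one entry per wire.
func runGatesInto(gates []gate, inputs, vals []uint8, known []bool) []uint8 {
	for i := range vals { // reset the scratch space left by the last call
		vals[i] = 0
		known[i] = false
	}
	for i, v := range inputs { // input wires occupy the first indices
		vals[i] = v
		known[i] = true
	}
	// Sweep until no gate fires; adequate for a small acyclic circuit.
	for changed := true; changed; {
		changed = false
		for _, g := range gates {
			if known[g.out] || !known[g.a] || !known[g.b] {
				continue
			}
			switch g.op {
			case '&':
				vals[g.out] = vals[g.a] & vals[g.b]
			case '|':
				vals[g.out] = vals[g.a] | vals[g.b]
			case '^':
				vals[g.out] = vals[g.a] ^ vals[g.b]
			}
			known[g.out] = true
			changed = true
		}
	}
	return vals
}

func main() {
	// Wires 0..2 are inputs; wire 3 = w0 ^ w1; wire 4 = w3 & w2.
	gates := []gate{{'&', 3, 2, 4}, {'^', 0, 1, 3}}
	const nwires = 5

	// Allocate everything once, outside the loop.
	inputs := make([]uint8, 3)
	vals := make([]uint8, nwires)
	known := make([]bool, nwires)

	for trial := 0; trial < 4; trial++ {
		inputs[0] = uint8(trial & 1)
		inputs[1] = uint8((trial >> 1) & 1)
		inputs[2] = 1
		out := runGatesInto(gates, inputs, vals, known)
		fmt.Println(inputs, "->", out[4])
	}
}
```

The returned slice aliases the caller's vals buffer, so each result has to be read before the next call overwrites it. With several goroutines, each worker would need its own pair of scratch slices.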
It would be interesting to know whether Chapel benefits from a similar workaround or not.
Statistics: Posted by ejolson — Mon Apr 28, 2025 8:42 pm