> To add to the list: make sure optimisations are turned on for the compiler. Also check if you have any debug code taking a long time, e.g. dumping data to a UART.

We have been playing with compiler flags for some time, and I have a UART for debugging, but I keep an eye on it to be sure, yeah. I have it all wrapped in macros so that I can entirely remove it with a flag.
> An ISR running at 919 Hz should only be doing the audio/DSP stuff. The "reading analog inputs (pots)" part does not have to run at 919 Hz; besides, we might not get 2 µs per adc_read(), and it can be much more if certain cache lines get evicted. I think the audio ISR should be lean and mean -- move everything else out. Maybe shift ADC reads to some kind of DMA scheme and run it at 100 Hz or so; then the program can read the pot settings at leisure with minimal impact on XIP cache lines.
> [Edit] A test with the ADC pot reads commented out may give an idea of whether it impacts utilization. The main UI loop should also be mostly waiting, to reduce pressure on the cache.

That's a good suggestion; however, we have it set up so that each time the function for reading analog inputs is called, we save the last value read by the ADC into its corresponding variable, change the multiplexer, and trigger a read. The ADC then does its stuff and waits until it's called again. The size of that code is practically nothing, and we have to run it within the audio loop to cover all 20 inputs in a reasonable time. I will look into it, though.

All that being said, it doesn't really address why the audio loop would get slower when only the UI loop changes.
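For what it's worth, the round-robin scheme described above (store the last conversion result, advance the multiplexer, trigger the next read) can be sketched roughly like this. The `hw_*` functions are hypothetical stand-ins for the real ADC/multiplexer calls, so the sketch also runs off-target:

```c
#include <stdint.h>

#define NUM_INPUTS 20

/* Stand-ins for the real hardware calls -- on the RP2040 these would be
 * adc_read() and GPIO writes to the mux; here they just simulate a
 * conversion whose result is the channel number times 100. */
static uint16_t fake_adc_result = 0;
static uint16_t hw_adc_fetch_result(void)     { return fake_adc_result; }
static void     hw_select_mux_channel(int ch) { fake_adc_result = (uint16_t)(ch * 100); }
static void     hw_adc_trigger(void)          { /* conversion runs in the background */ }

static uint16_t pot_values[NUM_INPUTS];
static int current_channel = 0;

/* Called once per audio block: store the conversion that was started on
 * the previous call (current_channel still names that channel), advance
 * the mux, and kick off the next conversion. One pass over all 20 inputs
 * therefore takes 20 audio blocks. */
void poll_one_input(void) {
    pot_values[current_channel] = hw_adc_fetch_result();
    current_channel = (current_channel + 1) % NUM_INPUTS;
    hw_select_mux_channel(current_channel);
    hw_adc_trigger();
}
```

At a 919 Hz block rate this refreshes every pot roughly 46 times a second, which is why it has to stay inside the audio loop in this design.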
> Are you double-buffering your ADC inputs? If so, then your app seems very similar to mine, though I'm running my stereo audio inputs at 48 kHz. A 919 Hz refresh with 48 samples per buffer implies you are running at 44.1 kHz stereo?
> What I did was put all my DSP within the ISR that is triggered each time an input buffer is filled, and kept that running on core 0.
> As the program starts (on core 0), it starts core 1, which is basically just polling the ADC inputs, doing UI stuff, etc., and updating global variables to handle the signalling between core 1 and core 0 (rather than the FIFOs). I figured I'd use spinlocks to protect them if I ran into race conditions but, so far, that hasn't been an issue.
> Note that in order to update the global variables from within the ISRs, I found I needed to declare them as volatile.
> Later on in the project I also switched to the RP2350 (well, Pico 2) due to its FPU and increased RAM.
> I was designing a clockable Eurorack modular delay called the Camberwell Parrot. My code is all on GitHub here: https://github.com/AudioMorphology/ParrotRev2
> And I documented the whole process as a video series: https://www.youtube.com/playlist?app=de ... hIvjSDlH-1

That is quite similar, yeah; thanks for the links, I'll check them out and see if anything hits me. We were originally also thinking about splitting the tasks, doing UI and such on core 1 and only DSP on core 0, but the DSP is simply too large for one core to handle.

We wish we could switch to the RP2350, but it's too late now.
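The volatile-global signalling between the cores described above might look roughly like this; the names are hypothetical, and the comments note what `volatile` does and does not buy you:

```c
#include <stdint.h>

#define NUM_POTS 20

/* Shared between core 1 (writer, polling the ADC) and the core 0 audio
 * ISR (reader). volatile stops the compiler from caching the value in a
 * register across reads; it does NOT make multi-word updates atomic,
 * which is why the poster mentions falling back to spinlocks if races
 * ever show up. */
static volatile uint16_t pot_values[NUM_POTS];

/* Core 1 side: publish a freshly read pot value. */
void publish_pot(int idx, uint16_t raw) {
    pot_values[idx] = raw;
}

/* Core 0 / ISR side: a single aligned 16-bit load is atomic on the
 * Cortex-M0+, so reading one value needs no lock. */
uint16_t read_pot(int idx) {
    return pot_values[idx];
}
```

If a consumer ever needed several related values to be consistent with each other (say, a coarse and fine pair), that is where a spinlock or a generation counter would come in.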
> Run all the real-time code from RAM on the RP2040 (IO mode). Move the UI code to XIP (compute mode, using frames). Worst case, use an RTOS (from RAM) to manage priorities.
> Von Neumann is not a significant factor here. Pass pointers in the FIFO. (No L1 cache.)

I really don't want to use an RTOS, but moving the DSP to RAM is good and we'll try that. What do you mean by moving the UI code to XIP? I thought the XIP was just the part of the MCU that fetches data from the flash into the cache.
> Let's just guess two things (based on the variability you are seeing vs whether your code is fast or slow):

1. Our code might very well be that big.
> 1. You could certainly be having XIP cache churn if you are running from flash … it is only 16K of cache. Hopefully your hot code is not that big - beware of lookup tables as well as code - you can use __not_in_flash_func and friends to move code/data into RAM. Note that I say code and data, as the cache is 2-way, and I doubt you'd see that much eviction if both cores were tight-looping most of the time.
> 2. Main RAM is striped by default to make the majority of cases fast (managing memory between linear banks is hard); that said, you will get faster code execution by reserving scratch X for hot core 1 code and scratch Y for hot core 0 code (and their stacks) - note the same issue applies to running the code from flash on both cores.
> 3. This includes duplicating the code into two places if both cores are doing the same thing; again - hard to know exactly what your tightest loops are, since I don't know what you are doing, but hopefully they fit in 1-2K each.
> 4. Go look for the bus performance counters in the datasheet - those will tell you a bunch.
> 5. You can also check the XIP cache counters in XIP_CTRL.
> 6. You may find you want to put lookup tables in main RAM or in the scratch with the code, depending on how much you hit them.
2. Isn't that done by default in the memory-map file? Also, now that I'm thinking about it, if the cache is 16K total, does that mean it leaves 4K for cache and 4K for stack per core? Sorry, I'm still not entirely clear on the whole flash/RAM caching business.
3. You mean duplicating the code if it runs from RAM instead of flash? I imagine the flash utilization being the same whether it loads one piece of code twice or two pieces of code once.
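A minimal sketch of the `__not_in_flash_func` and scratch-bank placement suggested above. The fallback macro definitions are no-ops so the sketch also compiles off-target; on a Pico SDK build the real macros come from pico/platform.h, and the table and function names here are made up for illustration:

```c
#include <stdint.h>

/* Pico SDK builds define these in pico/platform.h to place the symbol
 * in RAM / the 4K scratch banks; the fallbacks below are no-ops so the
 * sketch also builds and runs on a host machine. */
#ifndef __not_in_flash_func
#define __not_in_flash_func(fn) fn
#endif
#ifndef __scratch_x
#define __scratch_x(group) /* lands in striped main RAM instead */
#endif

/* Hypothetical hot lookup table pinned to the scratch X bank so hits on
 * it never compete with the XIP cache or the other core's bus traffic. */
static __scratch_x("core0_dsp") int16_t gain_lut[256];

/* Hot function copied to RAM at boot so it never faults through XIP.
 * The body is placeholder DSP: scale the sample by 3/4. */
int32_t __not_in_flash_func(process_sample)(int32_t s) {
    return (s * 3) >> 2;
}
```

The same pattern applies to data as to code, which matches the quoted point about lookup tables: `__not_in_flash("group")` on a variable keeps it out of flash entirely.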
4. That's pretty awesome, I didn't know about those. Thanks!
5. Def will!
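A hedged sketch of turning those XIP cache counters into a hit rate. On the RP2040 the hit and access counts are the CTR_HIT / CTR_ACC registers in XIP_CTRL (exposed as `xip_ctrl_hw` in the Pico SDK); the host branch fakes the counts so the arithmetic can be checked off-target:

```c
#include <stdint.h>

#if PICO_ON_DEVICE
#include "hardware/structs/xip_ctrl.h"
static uint32_t xip_hits(void)     { return xip_ctrl_hw->ctr_hit; }
static uint32_t xip_accesses(void) { return xip_ctrl_hw->ctr_acc; }
#else
/* Host build: fabricated counts purely to exercise the maths below. */
static uint32_t xip_hits(void)     { return 900; }
static uint32_t xip_accesses(void) { return 1000; }
#endif

/* Percentage of XIP accesses served from the 16K cache. Sample this
 * before and during the audio ISR: a low number while the DSP runs
 * means hot code or tables are being refetched from flash and are good
 * candidates for __not_in_flash_func / RAM placement. */
unsigned xip_hit_rate_percent(void) {
    uint32_t acc = xip_accesses();
    return acc ? (unsigned)((100u * xip_hits()) / acc) : 100u;
}
```

Per the datasheet, writing any value to the counter registers clears them, so you can reset between measurement windows.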
6. I hit the one for distortion about six times per pair of samples. The LUT is 32K, as I said, and I don't know if it makes sense to put it into RAM, given that removing it doesn't have that much of an impact.
Thank you all for the feedback, definitely a huge help!
Statistics: Posted by machmar — Thu Mar 06, 2025 10:29 am