
Avoid index clone in next #1584

Open

scho-furiosa wants to merge 3 commits into rust-ndarray:master from scho-furiosa:no-clone

Conversation

@scho-furiosa commented Feb 10, 2026

Hello, everyone.

My colleague @chaehhyun found during profiling that there are some performance regressions due to the clone in next. We suspect ix.clone() inside Baseiter::next is the culprit, but we are not 100% sure as of now. We are still experimenting with its impact internally.
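
For context, the pattern in question looks roughly like this (a paraphrase of the internal iterator, not the verbatim source; details may differ across versions):

// Sketch of Baseiter::next: the current index is cloned on every step,
// which for IxDyn can mean a heap allocation per element visited.
fn next(&mut self) -> Option<*mut A> {
    let index = match self.index {
        None => return None,
        Some(ref ix) => ix.clone(), // the clone we suspect
    };
    let offset = D::stride_offset(&index, &self.strides);
    self.index = self.dim.next_for(index);
    unsafe { Some(self.ptr.offset(offset)) }
}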

I have a couple of questions:

  • I am curious whether the cloning of the index in next is intentional.
  • If not, is there any pre-existing perf test in CI?

Thank you.

There is an avoidable `clone` in `next`.

This commit introduces `next_for_mut`, which is similar to `next_for`
but updates the key in place, and uses it in `next` instead.
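
In sketch form, the in-place odometer step could look like the standalone function below (hypothetical signature for illustration; the actual patch hangs this off ndarray's Dimension trait and may differ in detail):

fn next_for_mut(shape: &[usize], index: &mut [usize]) -> bool {
    // Advance `index` odometer-style within `shape`, updating it in
    // place instead of consuming and returning a fresh index.
    for (&dim, ix) in shape.iter().zip(index.iter_mut()).rev() {
        *ix += 1;
        if *ix < dim {
            return true; // found the next position
        }
        *ix = 0; // this axis wrapped; carry into the next-slower axis
    }
    false // the whole odometer wrapped: iteration is finished
}

fn main() {
    let shape = [2, 3];
    let mut ix = vec![0, 0];
    loop {
        println!("{ix:?}"); // visits [0,0], [0,1], ..., [1,2] in order
        if !next_for_mut(&shape, &mut ix) {
            break;
        }
    }
}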
@akern40 (Collaborator) commented Feb 11, 2026

Hello! Sorry to hear about a regression. Can you provide a minimal working example of the regression? Or some additional debug information?

@akern40 (Collaborator) commented Feb 11, 2026

Also, what version caused the regression?

@scho-furiosa (Author)

> Also, what version caused the regression?

We are using version 0.15.6. That said, I do not think this issue is related to the version, since the line ix.clone() has been there since the initial commit in 2014.

> Can you provide a minimal working example of the regression? Or some additional debug information?

AFAIU, this can be problematic when there are many dimensions (more than 6, i.e. IxDyn), which makes cloning the key expensive. Basically, I just think cloning the key is unnecessary for normal iteration.
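
To illustrate (a sketch; the exact inline capacity is an implementation detail of ndarray):

use ndarray::IxDyn;

fn main() {
    // Past a small inline capacity, a dynamic index stores its usizes on
    // the heap, so every clone() allocates. Inside next() that clone
    // would run once per element visited.
    let ix = IxDyn(&[2; 24]);
    let copy = ix.clone();
    assert_eq!(ix, copy);
}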

Re the minimal example, I am still curious whether there are perf tests in the repo, or whether you would want it to have some. This "clone more or less" question is not about functional correctness, so it is unclear what type of minimal example you have in mind.

@akern40 (Collaborator) commented Feb 11, 2026

Ah, thank you for the clarification. A "regression" usually means a drop in performance or correctness compared to a previous version, so I thought this was behavior you were seeing after an upgrade.

Unfortunately, performance is a known issue when using dynamic-dimensioned arrays. I haven't had a chance yet to look into the cause, but it's possible that this fix would help. I'd be happy to follow up and test to see if it improves speed. This is where a minimal example would help: the smallest possible code that exhibits the slow behavior that you're seeing.

There are some limited benchmarks in the library, but nothing comprehensive.

@akern40 (Collaborator) commented Feb 11, 2026

These changes look good, I appreciate the contribution! I'll put together a small benchmark to ensure that it has a positive impact on performance (I'm 99% sure that it will) and then I will merge the changes.

Comment on lines 199 to 201 (Collaborator)

Why can't this function call next_for_mut now?

let mut index = index;
self.next_for_mut(&mut index)

@scho-furiosa (Author)

Here is a small example. The perf change does not seem to be dramatic, though.

use ndarray::ArrayD;

#[test]
fn test_iter_perf()
{
    // A 24-dimensional array of 2^24 elements, iterated in reversed-axes
    // order to avoid the fast contiguous path.
    let a = ArrayD::from_shape_vec(
        vec![
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2,
        ],
        vec![3.0; 1024 * 1024 * 16],
    )
    .unwrap();
    for _x in a.reversed_axes().iter() {}
}

before:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 7.02s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.90s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.89s

after:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.60s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.40s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.39s

A "regression" is usually meant as a drop in performance or correctness when compared to a previous version

Sorry for the confusion. My bad. /o\

@nilgoyette (Collaborator)

@scho-furiosa Can you please fix the errors?

IIUC, this PR can only make ArrayD faster; it can't have a significant impact on other arrays? Because we'd be copying N usizes instead of allocating and copying a small vector?

@scho-furiosa (Author) commented Feb 12, 2026

> Can you please fix the errors?

Thank you. I fixed the clippy errors, but I am not sure how to handle the miri-related one.

> IIUC, this PR can only make ArrayD faster; it can't have a significant impact on other arrays? Because we'd be copying N usizes instead of allocating and copying a small vector?

Exactly, that is my understanding. The index copies in Array1 through Array6 should have been cheap regardless of this PR, since those fixed-rank index types are Copy in Rust.
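
To illustrate the distinction (a sketch using the public Dim constructor):

use ndarray::{Dim, Ix6, IxDyn};

fn main() {
    // Fixed-rank indices are backed by [usize; N], which is Copy:
    // duplicating one is a cheap stack copy with no allocation.
    let fixed: Ix6 = Dim([1, 2, 3, 4, 5, 6]);
    let a = fixed; // bitwise copy
    let b = fixed; // `fixed` is still usable afterwards
    assert_eq!(a, b);

    // Dynamic-rank indices are not Copy; duplicating one is a clone()
    // that may heap-allocate, which is what this PR avoids in next().
    let dynamic = IxDyn(&[1, 2, 3, 4, 5, 6, 7]);
    let c = dynamic.clone();
    assert_eq!(dynamic, c);
}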

@nilgoyette (Collaborator)

LGTM, but I'll let @akern40 run the small benchmark.

@akern40 (Collaborator) commented Feb 13, 2026

The last thing I want to see is the above ArrayD benchmark replicated for a fixed-rank array, say Array6. I see that there is a significant speedup from the change for ArrayD, but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

@scho-furiosa (Author) commented Feb 13, 2026

> but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

Interesting point!

  • To compare the iteration time precisely, I added a timer.
  • For a fair comparison of ArrayD and Array6, I revised the shape.
use std::time::Instant;

use ndarray::{Array6, ArrayD};

#[test]
fn test_iter_perfd()
{
    let a = ArrayD::from_shape_vec(vec![1024, 1024, 2, 2, 2, 2], vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

#[test]
fn test_iter_perf6()
{
    let a = Array6::from_shape_vec((1024, 1024, 2, 2, 2, 2), vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

Here are the results.

ArrayD before

Time elapsed is: 4.147015005s
Time elapsed is: 4.185004973s
Time elapsed is: 4.152269005s

Array6 before

Time elapsed is: 2.877281448s
Time elapsed is: 2.874772023s
Time elapsed is: 2.847792364s

======

ArrayD after

Time elapsed is: 2.870766284s
Time elapsed is: 2.859789082s
Time elapsed is: 2.862473005s


Array6 after

Time elapsed is: 2.761921391s
Time elapsed is: 2.771982213s
Time elapsed is: 2.749912549s
  • A significant speedup in ArrayD: about -31%.
  • A minor speedup in Array6: about -3%.
  • The gap between ArrayD and Array6 was 1.45x before and is 1.04x with this PR.

@akern40 (Collaborator) commented Feb 15, 2026

@scho-furiosa thank you for the analysis! And for finding this easy optimization. PR accepted; it will be available in the next release.

@akern40 (Collaborator) commented Feb 15, 2026

@scho-furiosa sorry, I'm actually holding off on the merge. There's some odd stuff going on when I benchmark this with Criterion. I'm still figuring it out, but I want to make sure there isn't a regression here before we merge. I will post more details soon!

@akern40 (Collaborator) commented Feb 15, 2026

Ok, here's what happened. When I went to run a benchmark with Criterion, this is what performance looked like on the existing codebase:

Benchmarking arrayd: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 45.3s, or reduce sample count to 10.
arrayd                  time:   [457.40 ms 458.80 ms 460.41 ms]

array6                  time:   [4.9280 ms 4.9316 ms 4.9360 ms]

Note that the times are on the order of milliseconds; the timings you provided were from a debug build, not a release build.
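
For reference, here is a sketch of a Criterion benchmark along these lines (not necessarily the exact one I ran):

use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::{Array6, ArrayD};

fn bench_iter(c: &mut Criterion) {
    // Same shapes as the tests above, iterated in reversed-axes order
    // to avoid the fast contiguous path.
    let d = ArrayD::from_shape_vec(vec![1024, 1024, 2, 2, 2, 2], vec![3.0; 1 << 24])
        .unwrap()
        .reversed_axes();
    let f = Array6::from_shape_vec((1024, 1024, 2, 2, 2, 2), vec![3.0; 1 << 24])
        .unwrap()
        .reversed_axes();
    c.bench_function("arrayd", |b| b.iter(|| d.iter().count()));
    c.bench_function("array6", |b| b.iter(|| f.iter().count()));
}

criterion_group!(benches, bench_iter);
criterion_main!(benches);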

Then, I switched to the PR branch, and I saw this:

arrayd                  time:   [32.523 ms 32.553 ms 32.588 ms]                   
                        change: [-92.903% -92.878% -92.856%] (p = 0.00 < 0.05)
                        Performance has improved.

array6                  time:   [9.7406 ms 9.7562 ms 9.7732 ms]                   
                        change: [+94.345% +96.156% +97.270%] (p = 0.00 < 0.05)
                        Performance has regressed.

This is great news for ArrayD, but a serious regression for Array6! I tried changing the benchmark in several different ways, and each time the story was the same.

After some analysis, I suspect that the regression in Array6 might have something to do with introducing mutability; maybe there is some sort of reordering going on, I don't know. I did manage to find a middle ground, though. By using take in the Iterator impl for Baseiter, we can get the following (percentage comparisons are against the current ndarray codebase):

arrayd                  time:   [123.77 ms 124.16 ms 124.62 ms]                   
                        change: [-72.803% -72.716% -72.606%] (p = 0.00 < 0.05)
                        Performance has improved.

array6                  time:   [4.9249 ms 4.9281 ms 4.9323 ms]                    
                        change: [-0.5427% -0.2732% -0.0514%] (p = 0.02 < 0.05)
                        Change within noise threshold.

I think that a 72% performance improvement is pretty good, especially if the alternative is a regression in Array6. I've tried every way I can think of to get the 32 ms ArrayD performance without the regression, but didn't manage to pull it off.

The proposed code for the final measurement is in my no-clone branch.
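
In sketch form, the take approach looks roughly like this (paraphrased; see the branch for the real code):

// Option::take moves the index out of self.index without cloning it;
// next_for then consumes that index and hands back its successor.
fn next(&mut self) -> Option<*mut A> {
    let index = self.index.take()?;
    let offset = D::stride_offset(&index, &self.strides);
    self.index = self.dim.next_for(index);
    unsafe { Some(self.ptr.offset(offset)) }
}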
