
Avoid index clone in next #1584

Open

scho-furiosa wants to merge 3 commits into rust-ndarray:master from scho-furiosa:no-clone

Conversation

@scho-furiosa commented Feb 10, 2026

Hello, everyone.

My colleague @chaehhyun found during profiling that there are some performance regressions due to the clone in next. We suspect ix.clone() inside Baseiter::next is the culprit, but we are not 100% sure as of now. We are still experimenting with its impact internally.
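
For context, the pattern in question looks roughly like this (a paraphrase of the internal iterator, not the verbatim source; details may differ across versions):

// Sketch of Baseiter::next: the current index is cloned on every step,
// which for IxDyn can mean a heap allocation per element visited.
fn next(&mut self) -> Option<*mut A> {
    let index = match self.index {
        None => return None,
        Some(ref ix) => ix.clone(), // the clone we suspect
    };
    let offset = D::stride_offset(&index, &self.strides);
    self.index = self.dim.next_for(index);
    unsafe { Some(self.ptr.offset(offset)) }
}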

I have a couple of questions:

  • I am curious whether the cloning of the index in next is intentional.
  • If not, is there any pre-existing perf test in CI?

Thank you.

There is an avoidable `clone` in `next`.

This commit introduces `next_for_mut`, which is similar to `next_for`
but updates the key in place, and uses it in `next` instead.
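
In sketch form, the in-place odometer step could look like the standalone function below (hypothetical signature for illustration; the actual patch hangs this off ndarray's Dimension trait and may differ in detail):

fn next_for_mut(shape: &[usize], index: &mut [usize]) -> bool {
    // Advance `index` odometer-style within `shape`, updating it in
    // place instead of consuming and returning a fresh index.
    for (&dim, ix) in shape.iter().zip(index.iter_mut()).rev() {
        *ix += 1;
        if *ix < dim {
            return true; // found the next position
        }
        *ix = 0; // this axis wrapped; carry into the next-slower axis
    }
    false // the whole odometer wrapped: iteration is finished
}

fn main() {
    let shape = [2, 3];
    let mut ix = vec![0, 0];
    loop {
        println!("{ix:?}"); // visits [0,0], [0,1], ..., [1,2] in order
        if !next_for_mut(&shape, &mut ix) {
            break;
        }
    }
}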
@akern40 (Collaborator) commented Feb 11, 2026

Hello! Sorry to hear about a regression. Can you provide a minimal working example of the regression? Or some additional debug information?

@akern40 (Collaborator) commented Feb 11, 2026

Also, what version caused the regression?

@scho-furiosa (Author)

> Also, what version caused the regression?

We are using version 0.15.6. That said, I do not think this issue is related to the version, since the line ix.clone() has been there since the initial commit in 2014.

> Can you provide a minimal working example of the regression? Or some additional debug information?

AFAIU, this can be problematic when there are many dimensions (more than 6, i.e. IxDyn), which makes cloning the key expensive. Basically, I just think cloning the key is unnecessary for normal iteration.
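
To illustrate (a sketch; the exact inline capacity is an implementation detail of ndarray):

use ndarray::IxDyn;

fn main() {
    // Past a small inline capacity, a dynamic index stores its usizes on
    // the heap, so every clone() allocates. Inside next() that clone
    // would run once per element visited.
    let ix = IxDyn(&[2; 24]);
    let copy = ix.clone();
    assert_eq!(ix, copy);
}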

Re the minimal example, I am still curious whether there are perf tests in the repo, or whether you would want it to have some. This "clone more or less" question is not about functional correctness, so it is unclear what type of minimal example you have in mind.

@akern40 (Collaborator) commented Feb 11, 2026

Ah, thank you for the clarification. A "regression" usually means a drop in performance or correctness compared to a previous version, so I thought this was behavior you were seeing after an upgrade.

Unfortunately, performance is a known issue when using dynamic-dimensioned arrays. I haven't had a chance yet to look into the cause, but it's possible that this fix would help. I'd be happy to follow up and test to see if it improves speed. This is where a minimal example would help: the smallest possible code that exhibits the slow behavior that you're seeing.

There are some limited benchmarks in the library, but nothing comprehensive.

@akern40 (Collaborator) commented Feb 11, 2026

These changes look good, I appreciate the contribution! I'll put together a small benchmark to ensure that it has a positive impact on performance (I'm 99% sure that it will) and then I will merge the changes.

Comment on lines 199 to 201 (Collaborator)

Why can't this function call next_for_mut now?

let mut index = index;
self.next_for_mut(&mut index)

@scho-furiosa (Author)

Here is a small example. The perf change does not seem to be dramatic, though.

use ndarray::ArrayD;

#[test]
fn test_iter_perf()
{
    // A 24-dimensional array of 2^24 elements, iterated in reversed-axes
    // order to avoid the fast contiguous path.
    let a = ArrayD::from_shape_vec(
        vec![
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2,
        ],
        vec![3.0; 1024 * 1024 * 16],
    )
    .unwrap();
    for _x in a.reversed_axes().iter() {}
}

before:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 7.02s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.90s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.89s

after:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.60s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.40s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.39s

A "regression" is usually meant as a drop in performance or correctness when compared to a previous version

Sorry for the confusion. My bad. /o\

@nilgoyette (Collaborator)

@scho-furiosa Can you please fix the errors?

IIUC, this PR can only make ArrayD faster; it can't have a significant impact on other arrays? Because we'd be copying N usizes instead of allocating and copying a small vector?

@scho-furiosa (Author) commented Feb 12, 2026

> Can you please fix the errors?

Thank you. I fixed the clippy errors, but I am not sure how to handle the miri-related one.

> IIUC, this PR can only make ArrayD faster; it can't have a significant impact on other arrays? Because we'd be copying N usizes instead of allocating and copying a small vector?

Exactly, that is my understanding. The index copies in Array1 through Array6 should have been cheap regardless of this PR, since those fixed-rank index types are Copy in Rust.
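
To illustrate the distinction (a sketch using the public Dim constructor):

use ndarray::{Dim, Ix6, IxDyn};

fn main() {
    // Fixed-rank indices are backed by [usize; N], which is Copy:
    // duplicating one is a cheap stack copy with no allocation.
    let fixed: Ix6 = Dim([1, 2, 3, 4, 5, 6]);
    let a = fixed; // bitwise copy
    let b = fixed; // `fixed` is still usable afterwards
    assert_eq!(a, b);

    // Dynamic-rank indices are not Copy; duplicating one is a clone()
    // that may heap-allocate, which is what this PR avoids in next().
    let dynamic = IxDyn(&[1, 2, 3, 4, 5, 6, 7]);
    let c = dynamic.clone();
    assert_eq!(dynamic, c);
}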

@nilgoyette (Collaborator)

LGTM, but I'll let @akern40 run the small benchmark.

@akern40 (Collaborator) commented Feb 13, 2026

The last thing I want to see is the above ArrayD benchmark replicated for a fixed-rank array, say Array6. I see that there is a significant speedup from the change for ArrayD, but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

@scho-furiosa (Author) commented Feb 13, 2026

> but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

Interesting point!

  • To compare the iteration time precisely, I added a timer.
  • For a fair comparison of ArrayD and Array6, I revised the shape.
use std::time::Instant;

use ndarray::{Array6, ArrayD};

#[test]
fn test_iter_perfd()
{
    let a = ArrayD::from_shape_vec(vec![1024, 1024, 2, 2, 2, 2], vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

#[test]
fn test_iter_perf6()
{
    let a = Array6::from_shape_vec((1024, 1024, 2, 2, 2, 2), vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

Here are the results.

ArrayD before

Time elapsed is: 4.147015005s
Time elapsed is: 4.185004973s
Time elapsed is: 4.152269005s

Array6 before

Time elapsed is: 2.877281448s
Time elapsed is: 2.874772023s
Time elapsed is: 2.847792364s

======

ArrayD after

Time elapsed is: 2.870766284s
Time elapsed is: 2.859789082s
Time elapsed is: 2.862473005s


Array6 after

Time elapsed is: 2.761921391s
Time elapsed is: 2.771982213s
Time elapsed is: 2.749912549s
  • A significant speedup in ArrayD: about -31%.
  • A minor speedup in Array6: about -3%.
  • The gap between ArrayD and Array6 was 1.45x before and is 1.04x with this PR.

@akern40 (Collaborator) commented Feb 15, 2026

@scho-furiosa thank you for the analysis! And for finding this easy optimization. PR accepted; it will be available in the next release.

@akern40 (Collaborator) commented Feb 15, 2026

@scho-furiosa sorry, I'm actually holding off on the merge. There's some odd stuff going on when I benchmark this with Criterion. I'm still figuring it out, but I want to make sure there isn't a regression here before we merge. I will post more details soon!

@akern40 (Collaborator) commented Feb 15, 2026

Ok, here's what happened. When I went to run a benchmark with Criterion, this is what performance looked like on the existing codebase:

Benchmarking arrayd: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 45.3s, or reduce sample count to 10.
arrayd                  time:   [457.40 ms 458.80 ms 460.41 ms]

array6                  time:   [4.9280 ms 4.9316 ms 4.9360 ms]

Note that the times are on the order of milliseconds; the timings you provided were from a debug build, not a release build.
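
For reference, here is a sketch of a Criterion benchmark along these lines (not necessarily the exact one I ran):

use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::{Array6, ArrayD};

fn bench_iter(c: &mut Criterion) {
    // Same shapes as the tests above, iterated in reversed-axes order
    // to avoid the fast contiguous path.
    let d = ArrayD::from_shape_vec(vec![1024, 1024, 2, 2, 2, 2], vec![3.0; 1 << 24])
        .unwrap()
        .reversed_axes();
    let f = Array6::from_shape_vec((1024, 1024, 2, 2, 2, 2), vec![3.0; 1 << 24])
        .unwrap()
        .reversed_axes();
    c.bench_function("arrayd", |b| b.iter(|| d.iter().count()));
    c.bench_function("array6", |b| b.iter(|| f.iter().count()));
}

criterion_group!(benches, bench_iter);
criterion_main!(benches);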

Then, I switched to the PR branch, and I saw this:

arrayd                  time:   [32.523 ms 32.553 ms 32.588 ms]                   
                        change: [-92.903% -92.878% -92.856%] (p = 0.00 < 0.05)
                        Performance has improved.

array6                  time:   [9.7406 ms 9.7562 ms 9.7732 ms]                   
                        change: [+94.345% +96.156% +97.270%] (p = 0.00 < 0.05)
                        Performance has regressed.

This is great news for ArrayD, but a serious regression for Array6! I tried changing the benchmark in several different ways, and each time the story was the same.

After some analysis, I suspect that the regression in Array6 might have something to do with introducing mutability; maybe there is some sort of reordering going on, I don't know. I did manage to find a middle ground, though. By using take in the Iterator impl for Baseiter, we can get the following (percentage comparisons are against the current ndarray codebase):

arrayd                  time:   [123.77 ms 124.16 ms 124.62 ms]                   
                        change: [-72.803% -72.716% -72.606%] (p = 0.00 < 0.05)
                        Performance has improved.

array6                  time:   [4.9249 ms 4.9281 ms 4.9323 ms]                    
                        change: [-0.5427% -0.2732% -0.0514%] (p = 0.02 < 0.05)
                        Change within noise threshold.

I think that a 72% performance improvement is pretty good, especially if the alternative is a regression in Array6. I've tried every way I can think of to get the 32 ms ArrayD performance without the regression, but didn't manage to pull it off.

The proposed code for the final measurement is in my no-clone branch.
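
In sketch form, the take approach looks roughly like this (paraphrased; see the branch for the real code):

// Option::take moves the index out of self.index without cloning it;
// next_for then consumes that index and hands back its successor.
fn next(&mut self) -> Option<*mut A> {
    let index = self.index.take()?;
    let offset = D::stride_offset(&index, &self.strides);
    self.index = self.dim.next_for(index);
    unsafe { Some(self.ptr.offset(offset)) }
}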
