rust doesn't optimize closure in scan iterator

Created on 20 Dec 2013  ·  28Comments  ·  Source: rust-lang/rust

Good evening. I ran across a missed optimization I was trying to convert std::result::collect into using FromIteratator. My initial version used iter::Scan, but it proved to be 2 times slower than at --opt-level=3 than the original implementation. I also created a custom iterator that doesn't have the closure, and it compiles down to the same speed as the original std::result::collect. I'm guessing llvm isn't able to inline the closure for some reason.

I've gathered up this test in a gist. Note that in this example, the Scan1::size_hint() is different than the std::iter::Scan::size_hint() method.

cc @thestinger

C-enhancement I-slow P-low

Most helpful comment

Okay I think I see what happened here, at least procedurally. (I do not yet have an explanation for the performance difference I am observing, but I want to first document the details of the observations that I have made, because the historical trajectory is interesting.)

First, here's what I did (in details block):

  • I took the dotdash-benchmark, and I made two versions of the bench_from_iter_with_scan: the one that @dotdash had is bench_from_iter_with_scan_passing_state, and I made a variant that uses a free-variable in the closure and just passes () for the parameter (just because I wanted to see if it ever mattered). I didn't notice any signficant difference in my results, so I do not think the free-variable vs explicit-parameter styles matter.
  • Here is the source code I used: https://gist.github.com/pnkfelix/7cb66d32c3cd9e1f2ea1a61ef15a0a3a
  • I ran the dotdash-benchmark on each stable compiler version from 1.13 through 1.33 (using RUSTC_BOOTSTRAP=1 to get around the #![feature(test)] restriction).
  • I ran each benchmark executable five times, via this shell invocation:
% for VERSION in 1.{13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33}.0 ; do (\
  echo $VERSION && rustup update $VERSION && rustc +$VERSION --version && \
  RUSTC_BOOTSTRAP=1 rustc +$VERSION -C opt-level=3  --test main-bench.rs && \
   ./main-bench --bench && ./main-bench --bench && ./main-bench --bench && ./main-bench --bench && ./main-bench --bench \
) ; done | tee version-bench.txt
  • Full log is at this gist: https://gist.github.com/2db1e313af68ad6936a3da28c0c272f5
  • I averaged the numbers in a spreadsheet (maybe I should have taken minimums rather than averages, but from skimming the data I don't think it matters), and looked at the performance of the variant implementations compared to the built-in explicit-adapter based one.
  • While I took averages for my analysis, in the text below I just grabbed representative output from the gist that is similar enough to the averages I observed.

Observations (in details block):

  • I definitely see that while back in 1.13 the performance was about the same for the explicit-adapter (i.e. the old implementation) as for the scan-based implementation (and a little slower for filter_map). That supports @pmarcelll 's findings from 2016. Here is an sample run from 1.33 (on my Linux desktop):
rustc 1.13.0 (2c6933acc 2016-11-07)

running 4 tests
test bench_from_iter                         ... bench:       1,691 ns/iter (+/- 9)
test bench_from_iter_with_filter_map         ... bench:       1,930 ns/iter (+/- 230)
test bench_from_iter_with_scan_passing_state ... bench:       1,682 ns/iter (+/- 10)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,686 ns/iter (+/- 55)
  • Between 1.14 and 1.15, the relative performance of the filter_map-based implementation regressed, but the relative-performance of the scan-based implementation remained competitive.

  • Between 1.18 and 1.19, both the explicit-adapter (bench_from_iter) and the scan-based implementations got a little bit faster. So the relative performance of the scan-based implementation still remained competitive.

  • But then between 1.19 and 1.20 the explicit-adapter regressed a lot, so that the filter_map and scan-based implementations were now significantly faster than the explicit-adapter:

rustc 1.20.0 (f3d6973f4 2017-08-27)

running 4 tests
test bench_from_iter                         ... bench:       2,274 ns/iter (+/- 54)
test bench_from_iter_with_filter_map         ... bench:       1,793 ns/iter (+/- 14)
test bench_from_iter_with_scan_passing_state ... bench:       1,716 ns/iter (+/- 7)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,716 ns/iter (+/- 26)
  • Between 1.22 and 1.23, the performance of the scan-based implementation regressed significantly while the others remained constant:
rustc 1.23.0 (766bd11c8 2018-01-01)

running 4 tests
test bench_from_iter                         ... bench:       2,238 ns/iter (+/- 2)
test bench_from_iter_with_filter_map         ... bench:       1,755 ns/iter (+/- 6)
test bench_from_iter_with_scan_passing_state ... bench:       3,506 ns/iter (+/- 6)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,508 ns/iter (+/- 2)
  • Then between 1.23 and 1.24, the performance of the filter_map-based implementation regressed to the point where it was no longer beating bench_from_iter by any significant degree.
rustc 1.24.0 (4d90ac38c 2018-02-12)

running 4 tests
test bench_from_iter                         ... bench:       2,323 ns/iter (+/- 8)
test bench_from_iter_with_filter_map         ... bench:       2,535 ns/iter (+/- 4)
test bench_from_iter_with_scan_passing_state ... bench:       3,796 ns/iter (+/- 15)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,792 ns/iter (+/- 11)
  • Between 1.26 and 1.27 the performance of the filter_map-based implementation regressed again, quite significantly:
rustc 1.27.0 (3eda71b00 2018-06-19)

running 4 tests
test bench_from_iter                         ... bench:       2,442 ns/iter (+/- 5)
test bench_from_iter_with_filter_map         ... bench:       5,080 ns/iter (+/- 55)
test bench_from_iter_with_scan_passing_state ... bench:       3,594 ns/iter (+/- 12)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,591 ns/iter (+/- 2)


So: that is a series of steps that sort of masks the bigger story here: these benchmarks went from this under 1.13:

rustc 1.13.0 (2c6933acc 2016-11-07)

running 4 tests
test bench_from_iter                         ... bench:       1,691 ns/iter (+/- 9)
test bench_from_iter_with_filter_map         ... bench:       1,930 ns/iter (+/- 230)
test bench_from_iter_with_scan_passing_state ... bench:       1,682 ns/iter (+/- 10)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,686 ns/iter (+/- 55)

to this under 1.27:

rustc 1.27.0 (3eda71b00 2018-06-19)

running 4 tests
test bench_from_iter                         ... bench:       2,442 ns/iter (+/- 5)
test bench_from_iter_with_filter_map         ... bench:       5,080 ns/iter (+/- 55)
test bench_from_iter_with_scan_passing_state ... bench:       3,594 ns/iter (+/- 12)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,591 ns/iter (+/- 2)

Which does not seem like a good trajectory over time, for any of the variants, in both absolute and relative terms.

(Note that there is non-zero chance that some other confounding factor in my benchmark apparatus is actually to blame for the gradual performance shift over time. I have thoughts on this but I am too tired to get into them here.)

All 28 comments

Do unboxed closures fix this?

Do unboxed closures fix this?

Not completely, it seems. I rewrote the original tests provided.

The implementation in the standard library uses the adapter iterator, so I only compared it to a version using scan. I also added one with filter_map, just to see how that would play out. I didn't expect those to be different, but:

running 3 tests
test bench_from_iter                 ... bench:       6,991 ns/iter (+/- 941)
test bench_from_iter_with_filter_map ... bench:       8,324 ns/iter (+/- 1,320)
test bench_from_iter_with_scan       ... bench:       7,299 ns/iter (+/- 1,172)

So it seems like the adapter iterator is still the most performant.

@shepmaster Thanks for updating the benchmarks! Unfortunately they primarily measure the time it takes to create the v source vectors. The results are highly unstable and re-running the benchmark a few times had a different winner everytime, because the results vary enough to make up for the differences.

Improved benchmark at https://gist.github.com/dotdash/5ce3f0bfa2e652ca58b5 -- I simply moved the vector creation out of the loop and user iter().cloned() instead of into_iter(). As the vectors are made up of i32 values, that shouldn't affect the result compared for the actual loop that we want to measure.

Results:

test bench_from_iter                 ... bench:       2,304 ns/iter (+/- 23)
test bench_from_iter_with_filter_map ... bench:       4,996 ns/iter (+/- 122)
test bench_from_iter_with_scan       ... bench:       3,071 ns/iter (+/- 33)

they primarily measure the time it takes to create the v source vectors

Yeah, I should have noted that. I left if that way just to be consistent with the original version, but I also figured that since each was doing the same over-work they would come out consistent. A small number (~5) runs had the same winners for me, so I shipped it! :-)

Thanks for improving it even more!

Results with the improved benchmark on nightly:

running 3 tests
test bench_from_iter                 ... bench:       2,158 ns/iter (+/- 24)
test bench_from_iter_with_filter_map ... bench:       2,474 ns/iter (+/- 89)
test bench_from_iter_with_scan       ... bench:       2,145 ns/iter (+/- 35)

There is a FIXME related to this issue,
https://github.com/rust-lang/rust/blob/master/src/libcore/option.rs#L1092
https://github.com/rust-lang/rust/blob/master/src/libcore/result.rs#L1117
Is this now resolved? If so can the FIXME be fix, or made more specific?

This issue can be closed, as the scan performance issue is void.

However, since Scan no longer exposes state, these FIXMEs are void too - I see no way of rewriting these FromIterator implementations using scan.

I think I see a way to rewrite them using scan. I'll just post a PR with it if I'm right; if I'm wrong, I'll post a PR removing the FIXME comments.

  • A reason that it can be tricky to use scan is that one might think that the accumulated value being built should be owned by the state parameter that is threaded through the scan; but that would be a mistake, because there is no way to recover that state after the scan is finished, as @ljedrz noted above.
  • From what I can tell, the reason the FIXME commenter mentioned scan here is not because of the threaded state parameter, but rather because scan, unlike e.g. fold, can terminate early without taking all of the elements from the given iter, which is necessary for correctly implementing FromIterator for Option and Result.

Update: reading over the previous comments more carefully, @erickt, @shepmaster and @dotdash have all provided benchmarks that illustrate how to implement FromIterator for Result using scan. (And Option is similar.)

(or perhaps try_fold would be more idiomatic here than scan) ((no that doesn't work))

Interestingly, for me I do observe a 2x slowdown on the @dotdash -provided benchmark.

(And now I need to see if I can reproduce the previous results that claims there was no slowdown...)

Okay I think I see what happened here, at least procedurally. (I do not yet have an explanation for the performance difference I am observing, but I want to first document the details of the observations that I have made, because the historical trajectory is interesting.)

First, here's what I did (in details block):

  • I took the dotdash-benchmark, and I made two versions of the bench_from_iter_with_scan: the one that @dotdash had is bench_from_iter_with_scan_passing_state, and I made a variant that uses a free-variable in the closure and just passes () for the parameter (just because I wanted to see if it ever mattered). I didn't notice any signficant difference in my results, so I do not think the free-variable vs explicit-parameter styles matter.
  • Here is the source code I used: https://gist.github.com/pnkfelix/7cb66d32c3cd9e1f2ea1a61ef15a0a3a
  • I ran the dotdash-benchmark on each stable compiler version from 1.13 through 1.33 (using RUSTC_BOOTSTRAP=1 to get around the #![feature(test)] restriction).
  • I ran each benchmark executable five times, via this shell invocation:
% for VERSION in 1.{13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33}.0 ; do (\
  echo $VERSION && rustup update $VERSION && rustc +$VERSION --version && \
  RUSTC_BOOTSTRAP=1 rustc +$VERSION -C opt-level=3  --test main-bench.rs && \
   ./main-bench --bench && ./main-bench --bench && ./main-bench --bench && ./main-bench --bench && ./main-bench --bench \
) ; done | tee version-bench.txt
  • Full log is at this gist: https://gist.github.com/2db1e313af68ad6936a3da28c0c272f5
  • I averaged the numbers in a spreadsheet (maybe I should have taken minimums rather than averages, but from skimming the data I don't think it matters), and looked at the performance of the variant implementations compared to the built-in explicit-adapter based one.
  • While I took averages for my analysis, in the text below I just grabbed representative output from the gist that is similar enough to the averages I observed.

Observations (in details block):

  • I definitely see that while back in 1.13 the performance was about the same for the explicit-adapter (i.e. the old implementation) as for the scan-based implementation (and a little slower for filter_map). That supports @pmarcelll 's findings from 2016. Here is an sample run from 1.33 (on my Linux desktop):
rustc 1.13.0 (2c6933acc 2016-11-07)

running 4 tests
test bench_from_iter                         ... bench:       1,691 ns/iter (+/- 9)
test bench_from_iter_with_filter_map         ... bench:       1,930 ns/iter (+/- 230)
test bench_from_iter_with_scan_passing_state ... bench:       1,682 ns/iter (+/- 10)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,686 ns/iter (+/- 55)
  • Between 1.14 and 1.15, the relative performance of the filter_map-based implementation regressed, but the relative-performance of the scan-based implementation remained competitive.

  • Between 1.18 and 1.19, both the explicit-adapter (bench_from_iter) and the scan-based implementations got a little bit faster. So the relative performance of the scan-based implementation still remained competitive.

  • But then between 1.19 and 1.20 the explicit-adapter regressed a lot, so that the filter_map and scan-based implementations were now significantly faster than the explicit-adapter:

rustc 1.20.0 (f3d6973f4 2017-08-27)

running 4 tests
test bench_from_iter                         ... bench:       2,274 ns/iter (+/- 54)
test bench_from_iter_with_filter_map         ... bench:       1,793 ns/iter (+/- 14)
test bench_from_iter_with_scan_passing_state ... bench:       1,716 ns/iter (+/- 7)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,716 ns/iter (+/- 26)
  • Between 1.22 and 1.23, the performance of the scan-based implementation regressed significantly while the others remained constant:
rustc 1.23.0 (766bd11c8 2018-01-01)

running 4 tests
test bench_from_iter                         ... bench:       2,238 ns/iter (+/- 2)
test bench_from_iter_with_filter_map         ... bench:       1,755 ns/iter (+/- 6)
test bench_from_iter_with_scan_passing_state ... bench:       3,506 ns/iter (+/- 6)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,508 ns/iter (+/- 2)
  • Then between 1.23 and 1.24, the performance of the filter_map-based implementation regressed to the point where it was no longer beating bench_from_iter by any significant degree.
rustc 1.24.0 (4d90ac38c 2018-02-12)

running 4 tests
test bench_from_iter                         ... bench:       2,323 ns/iter (+/- 8)
test bench_from_iter_with_filter_map         ... bench:       2,535 ns/iter (+/- 4)
test bench_from_iter_with_scan_passing_state ... bench:       3,796 ns/iter (+/- 15)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,792 ns/iter (+/- 11)
  • Between 1.26 and 1.27 the performance of the filter_map-based implementation regressed again, quite significantly:
rustc 1.27.0 (3eda71b00 2018-06-19)

running 4 tests
test bench_from_iter                         ... bench:       2,442 ns/iter (+/- 5)
test bench_from_iter_with_filter_map         ... bench:       5,080 ns/iter (+/- 55)
test bench_from_iter_with_scan_passing_state ... bench:       3,594 ns/iter (+/- 12)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,591 ns/iter (+/- 2)


So: that is a series of steps that sort of masks the bigger story here: these benchmarks went from this under 1.13:

rustc 1.13.0 (2c6933acc 2016-11-07)

running 4 tests
test bench_from_iter                         ... bench:       1,691 ns/iter (+/- 9)
test bench_from_iter_with_filter_map         ... bench:       1,930 ns/iter (+/- 230)
test bench_from_iter_with_scan_passing_state ... bench:       1,682 ns/iter (+/- 10)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,686 ns/iter (+/- 55)

to this under 1.27:

rustc 1.27.0 (3eda71b00 2018-06-19)

running 4 tests
test bench_from_iter                         ... bench:       2,442 ns/iter (+/- 5)
test bench_from_iter_with_filter_map         ... bench:       5,080 ns/iter (+/- 55)
test bench_from_iter_with_scan_passing_state ... bench:       3,594 ns/iter (+/- 12)
test bench_from_iter_with_scan_with_freevar  ... bench:       3,591 ns/iter (+/- 2)

Which does not seem like a good trajectory over time, for any of the variants, in both absolute and relative terms.

(Note that there is non-zero chance that some other confounding factor in my benchmark apparatus is actually to blame for the gradual performance shift over time. I have thoughts on this but I am too tired to get into them here.)

making this a P-high issue to at least diagnose what has happened to our performance in the cases noted in previous comment.

(It might be better to fork such investigation off into its own separate issue. but again, I am tired and am just trying to do the minimal amount necessary to keep this on my radar in the short term with respect to tagging and assigning myself.)

@pnkfelix Could you share the code for your benchmark?

perf shows this as the hotspot for my old bench_from_iter_with_scan benchmark:

  0.09 │       movaps 0x20(%rsp),%xmm0
 34.33 │       movups %xmm0,(%r15)
  5.85 │       movups (%r15),%xmm0
 49.67 │       movaps %xmm0,0x20(%rsp)

which seems pretty hilarious.

Compiled with rustc -O --test:

running 3 tests
test bench_from_iter                 ... bench:       2,170 ns/iter (+/- 20)
test bench_from_iter_with_filter_map ... bench:       3,412 ns/iter (+/- 67)
test bench_from_iter_with_scan       ... bench:       4,416 ns/iter (+/- 40)

Compiled with rustc -O --test -Ccodegen-units=1:

running 3 tests
test bench_from_iter                 ... bench:         977 ns/iter (+/- 18)
test bench_from_iter_with_filter_map ... bench:       3,296 ns/iter (+/- 36)
test bench_from_iter_with_scan       ... bench:       2,280 ns/iter (+/- 40)

@dotdash here is the file I'm compiling for my benchmark: https://gist.github.com/pnkfelix/7cb66d32c3cd9e1f2ea1a61ef15a0a3a

Okay; I had forgotten about the change to the defaults for codegen-units. (Maybe -C opt-level=3 should change the default for codegen-units back to 1? Just a thought...)

Results for my linux machine, with --test -C opt-level=3 -C codegen-units=1:

rustc 1.33.0 (2aa4c46cf 2019-02-28)

running 4 tests
test bench_from_iter                         ... bench:       1,432 ns/iter (+/- 44)
test bench_from_iter_with_filter_map         ... bench:       1,504 ns/iter (+/- 3)
test bench_from_iter_with_scan_passing_state ... bench:       2,088 ns/iter (+/- 4)
test bench_from_iter_with_scan_with_freevar  ... bench:       2,050 ns/iter (+/- 12)
rustc 1.13.0 (2c6933acc 2016-11-07)

running 4 tests
test bench_from_iter                         ... bench:       1,688 ns/iter (+/- 50)
test bench_from_iter_with_filter_map         ... bench:       1,953 ns/iter (+/- 10)
test bench_from_iter_with_scan_passing_state ... bench:       1,677 ns/iter (+/- 62)
test bench_from_iter_with_scan_with_freevar  ... bench:       1,683 ns/iter (+/- 17)

One additional caveat I will add: while I remain concerned about some of the performance differences observed here, I also worry about controlling for factors introduced by the benchmark harness itself.

In particular, I made two variants of bench_from_iter (one with the adapter-implementation written inline (bench_from_iter_using_adapter), and another that continues to call out to the stdlib implementation that it was copied from (bench_from_iter_using_baseline)), source code here, and I am currently observing this:

12-45-10 Mozilla/issue11084 % rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench
12-45-45 Mozilla/issue11084 % ./main-bench --bench

running 5 tests
test bench_from_iter_using_adapter____ ... bench:       1,531 ns/iter (+/- 6)
test bench_from_iter_using_baseline___ ... bench:       1,277 ns/iter (+/- 10)
test bench_from_iter_with_filter_map__ ... bench:       3,508 ns/iter (+/- 64)
test bench_from_iter_with_scan_freevar ... bench:       2,066 ns/iter (+/- 53)
test bench_from_iter_with_scan_param__ ... bench:       2,055 ns/iter (+/- 30)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured; 0 filtered out

12-45-51 Mozilla/issue11084 % ./main-bench --bench

running 5 tests
test bench_from_iter_using_adapter____ ... bench:       1,514 ns/iter (+/- 6)
test bench_from_iter_using_baseline___ ... bench:       1,305 ns/iter (+/- 12)
test bench_from_iter_with_filter_map__ ... bench:       3,491 ns/iter (+/- 56)
test bench_from_iter_with_scan_freevar ... bench:       2,061 ns/iter (+/- 5)
test bench_from_iter_with_scan_param__ ... bench:       2,094 ns/iter (+/- 53)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured; 0 filtered out

12-45-53 Mozilla/issue11084 % ./main-bench --bench adapter

running 1 test
test bench_from_iter_using_adapter____ ... bench:       1,574 ns/iter (+/- 12)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out

12-45-57 Mozilla/issue11084 % ./main-bench --bench adapter

running 1 test
test bench_from_iter_using_adapter____ ... bench:       1,530 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out

12-45-59 Mozilla/issue11084 % ./main-bench --bench baseline

running 1 test
test bench_from_iter_using_baseline___ ... bench:       1,569 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out

12-46-01 Mozilla/issue11084 % ./main-bench --bench baseline

running 1 test
test bench_from_iter_using_baseline___ ... bench:       1,576 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out

12-46-03 Mozilla/issue11084 %

Note in particular that the timing for bench_from_iter_using_baseline jumped from [1277, 1305] up to [1569, 1576], just by being called on its own rather than in sequence with the other benchmarks. These results seem to be fairly consistent on continued repeated runs.

That means there may be cache effects or memory-allocator issues that are causing the effects of running one benchmark to bleed into the execution of another.

(That in turn means I need to be very careful about which executions I consider to be worrisome with respect to performance. That, or I need to be more careful about how I run the benchmarks.)

(one way to control for some of the aforementioned effects would be to rewrite this benchmark such that it does not accumulate into a Vec; the performance of that operation is not what the original code was trying to measure, and it certainly is not what should be blocking how we implement FromIterator for Result and Option, right? )

Okay, so here is a revised version of the code that has added a new target type Last<T> for FromIterator. It just remembers the final element in the iteration (via .last()), so there's no noise from any heap management as the benchmark runs. (Also, since there is no mem-mgmt overhead, I pumped up the input size a little so that the ns/iter would still be pretty large.)

https://gist.github.com/80fe57277434ef31f018c08fa8e21f84

Hopefully this can provide additional focus on the FromIterator implementation artifacts. (Notably: the measured ns/iter for the Last<T> benchmark variants seem to be stable regardless of whether I run them on their own or as part of the overall suite; that was not the case for the Vec<T> variants.)

With this, I get the following results:

% RUSTC_BOOTSTRAP=1 rustc +1.13.0 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,692 ns/iter (+/- 8)
test bench__vec_from_iter_using_baseline___ ... bench:       1,692 ns/iter (+/- 67)
test bench__vec_from_iter_with_filter_map__ ... bench:       1,963 ns/iter (+/- 10)
test bench__vec_from_iter_with_scan_freevar ... bench:       1,692 ns/iter (+/- 11)
test bench__vec_from_iter_with_scan_param__ ... bench:       1,690 ns/iter (+/- 8)
test bench_last_from_iter_using_adapter____ ... bench:       2,156 ns/iter (+/- 10)
test bench_last_from_iter_using_baseline___ ... bench:       2,158 ns/iter (+/- 8)
test bench_last_from_iter_with_filter_map__ ... bench:       3,022 ns/iter (+/- 8)
test bench_last_from_iter_with_scan_freevar ... bench:       2,156 ns/iter (+/- 12)
test bench_last_from_iter_with_scan_param__ ... bench:       2,155 ns/iter (+/- 6)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

% RUSTC_BOOTSTRAP=1 rustc +1.33.0 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,211 ns/iter (+/- 3)
test bench__vec_from_iter_using_baseline___ ... bench:       1,280 ns/iter (+/- 4)
test bench__vec_from_iter_with_filter_map__ ... bench:       1,504 ns/iter (+/- 4)
test bench__vec_from_iter_with_scan_freevar ... bench:       2,088 ns/iter (+/- 61)
test bench__vec_from_iter_with_scan_param__ ... bench:       2,054 ns/iter (+/- 5)
test bench_last_from_iter_using_adapter____ ... bench:       2,186 ns/iter (+/- 5)
test bench_last_from_iter_using_baseline___ ... bench:       2,186 ns/iter (+/- 19)
test bench_last_from_iter_with_filter_map__ ... bench:       1,659 ns/iter (+/- 12)
test bench_last_from_iter_with_scan_freevar ... bench:       2,692 ns/iter (+/- 2)
test bench_last_from_iter_with_scan_param__ ... bench:       2,693 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

% RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,270 ns/iter (+/- 53)
test bench__vec_from_iter_using_baseline___ ... bench:       1,265 ns/iter (+/- 54)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,469 ns/iter (+/- 7)
test bench__vec_from_iter_with_scan_freevar ... bench:       2,076 ns/iter (+/- 1)
test bench__vec_from_iter_with_scan_param__ ... bench:       2,128 ns/iter (+/- 47)
test bench_last_from_iter_using_adapter____ ... bench:       2,186 ns/iter (+/- 5)
test bench_last_from_iter_using_baseline___ ... bench:       2,185 ns/iter (+/- 57)
test bench_last_from_iter_with_filter_map__ ... bench:       2,650 ns/iter (+/- 13)
test bench_last_from_iter_with_scan_freevar ... bench:       2,678 ns/iter (+/- 12)
test bench_last_from_iter_with_scan_param__ ... bench:       2,674 ns/iter (+/- 15)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

So: there is still some regression when you compare the scan-based implementations with the adapter-based ones, even when building a Last<T> rather than a Vec<T>, but it is not as pronounced a difference: the slowdown is a factor of 1.22x for Last<T> (compared to something like 1.67x for Vec<T>).

Here's a graph of @pnkfelix's original data showing the variance (creation scripts).

graph

Then someone went and changed the testing methods, so this may not be useful anymore...

Then someone went and changed the testing methods

(well, all I did was add new variants; the original benchmarks are still there...)

Update: I will admit that one need to figure out some sort of normalization function before attempting to compare the Vec-building variants with the Last-building ones; but we shouldn't be trying to do that anyway, right?

Ah here is one last run: a note from @eddyb elsewhere pointed out that using codegen-units=1 disables link-time optimizations.

Lets explicitly control link-time optimizations via -C lto=... and see what happens:

lto=off (same as default with codegen-units=1)

% RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench -C lto=off  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,260 ns/iter (+/- 6)
test bench__vec_from_iter_using_baseline___ ... bench:       1,306 ns/iter (+/- 48)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,467 ns/iter (+/- 3)
test bench__vec_from_iter_with_scan_freevar ... bench:       2,082 ns/iter (+/- 12)
test bench__vec_from_iter_with_scan_param__ ... bench:       2,087 ns/iter (+/- 52)
test bench_last_from_iter_using_adapter____ ... bench:       2,186 ns/iter (+/- 6)
test bench_last_from_iter_using_baseline___ ... bench:       2,187 ns/iter (+/- 21)
test bench_last_from_iter_with_filter_map__ ... bench:       2,650 ns/iter (+/- 3)
test bench_last_from_iter_with_scan_freevar ... bench:       2,683 ns/iter (+/- 4)
test bench_last_from_iter_with_scan_param__ ... bench:       2,688 ns/iter (+/- 8)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

lto=fat

% RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench -C lto=fat  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,289 ns/iter (+/- 19)
test bench__vec_from_iter_using_baseline___ ... bench:       1,232 ns/iter (+/- 9)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,481 ns/iter (+/- 19)
test bench__vec_from_iter_with_scan_freevar ... bench:       1,273 ns/iter (+/- 20)
test bench__vec_from_iter_with_scan_param__ ... bench:       1,273 ns/iter (+/- 8)
test bench_last_from_iter_using_adapter____ ... bench:       1,680 ns/iter (+/- 18)
test bench_last_from_iter_using_baseline___ ... bench:       1,683 ns/iter (+/- 21)
test bench_last_from_iter_with_filter_map__ ... bench:       3,047 ns/iter (+/- 2)
test bench_last_from_iter_with_scan_freevar ... bench:       1,793 ns/iter (+/- 81)
test bench_last_from_iter_with_scan_param__ ... bench:       1,773 ns/iter (+/- 49)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

lto=thin

% RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=1 --test main-bench.rs -o main-bench -C lto=thin  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,263 ns/iter (+/- 8)
test bench__vec_from_iter_using_baseline___ ... bench:       1,238 ns/iter (+/- 49)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,702 ns/iter (+/- 32)
test bench__vec_from_iter_with_scan_freevar ... bench:       1,248 ns/iter (+/- 3)
test bench__vec_from_iter_with_scan_param__ ... bench:       1,248 ns/iter (+/- 13)
test bench_last_from_iter_using_adapter____ ... bench:       1,616 ns/iter (+/- 6)
test bench_last_from_iter_using_baseline___ ... bench:       1,612 ns/iter (+/- 6)
test bench_last_from_iter_with_filter_map__ ... bench:       3,024 ns/iter (+/- 7)
test bench_last_from_iter_with_scan_freevar ... bench:       1,427 ns/iter (+/- 7)
test bench_last_from_iter_with_scan_param__ ... bench:       1,427 ns/iter (+/- 11)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

Notably, using lto (fat or thin) brings scan-performance in line with the baseline that we currently provide (or improves upon it).

I think has convinced me that there is not much we should do about this case in particular (apart from trying to resolve broader issues with link-time optimizations and/or codegen-unit partitioning).

And I am going to go ahead and put up a PR with the simplified implementations of FromIterator for Option and Result, linking to the results of this investigation.

Well, okay, to be fair, I realized I should do a similar experiment, but this time show the results when you vary lto but you have codegen-units ≠ 1:

lto=off codegen-units=16

11-54-00 Mozilla/issue11084 % RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=16 --test main-bench.rs -o main-bench -C lto=off  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       6,830 ns/iter (+/- 66)
test bench__vec_from_iter_using_baseline___ ... bench:       6,848 ns/iter (+/- 30)
test bench__vec_from_iter_with_filter_map__ ... bench:       5,547 ns/iter (+/- 20)
test bench__vec_from_iter_with_scan_freevar ... bench:       5,592 ns/iter (+/- 84)
test bench__vec_from_iter_with_scan_param__ ... bench:       5,322 ns/iter (+/- 12)
test bench_last_from_iter_using_adapter____ ... bench:       9,820 ns/iter (+/- 216)
test bench_last_from_iter_using_baseline___ ... bench:      14,730 ns/iter (+/- 116)
test bench_last_from_iter_with_filter_map__ ... bench:       9,869 ns/iter (+/- 16)
test bench_last_from_iter_with_scan_freevar ... bench:      11,351 ns/iter (+/- 131)
test bench_last_from_iter_with_scan_param__ ... bench:      10,593 ns/iter (+/- 51)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

lto=fat codegen-units=16

11-54-19 Mozilla/issue11084 % RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=16 --test main-bench.rs -o main-bench -C lto=fat  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       1,914 ns/iter (+/- 12)
test bench__vec_from_iter_using_baseline___ ... bench:       1,914 ns/iter (+/- 2)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,406 ns/iter (+/- 10)
test bench__vec_from_iter_with_scan_freevar ... bench:       1,354 ns/iter (+/- 4)
test bench__vec_from_iter_with_scan_param__ ... bench:       1,340 ns/iter (+/- 7)
test bench_last_from_iter_using_adapter____ ... bench:       1,648 ns/iter (+/- 17)
test bench_last_from_iter_using_baseline___ ... bench:       1,607 ns/iter (+/- 11)
test bench_last_from_iter_with_filter_map__ ... bench:       8,309 ns/iter (+/- 23)
test bench_last_from_iter_with_scan_freevar ... bench:       1,648 ns/iter (+/- 8)
test bench_last_from_iter_with_scan_param__ ... bench:       1,648 ns/iter (+/- 14)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

lto=thin codegen-units=16

11-55-02 Mozilla/issue11084 % RUSTC_BOOTSTRAP=1 rustc +nightly-2019-03-20 -C opt-level=3 -C codegen-units=16 --test main-bench.rs -o main-bench -C lto=thin  && ./main-bench --bench

running 10 tests
test bench__vec_from_iter_using_adapter____ ... bench:       2,076 ns/iter (+/- 82)
test bench__vec_from_iter_using_baseline___ ... bench:       2,076 ns/iter (+/- 3)
test bench__vec_from_iter_with_filter_map__ ... bench:       3,792 ns/iter (+/- 11)
test bench__vec_from_iter_with_scan_freevar ... bench:       6,046 ns/iter (+/- 21)
test bench__vec_from_iter_with_scan_param__ ... bench:       6,064 ns/iter (+/- 46)
test bench_last_from_iter_using_adapter____ ... bench:       1,429 ns/iter (+/- 12)
test bench_last_from_iter_using_baseline___ ... bench:       1,501 ns/iter (+/- 21)
test bench_last_from_iter_with_filter_map__ ... bench:       8,945 ns/iter (+/- 91)
test bench_last_from_iter_with_scan_freevar ... bench:      15,404 ns/iter (+/- 15)
test bench_last_from_iter_with_scan_param__ ... bench:      15,385 ns/iter (+/- 17)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

The regressions for the scan-based iterator when you have codegen-units=16 and lto=thin is pretty scary (and for this desktop machine, codegen-units=16 is the default)... should that block landing the proposed change?

Update: @michaelwoerister has informed me that the default for LTO when you do not specify anything is not necessarily LTO off. (which, given the numbers above, makes sense, since there is probably something going on that is making the baseline and scan-based implementations competitive when codegen-units=16 when nothing is stated about LTO). (The numbers above do not capture this case, because I was not aware that it did not have an explicit -C lto encoding.)

  • In particular, if you do not specify -C lto _and_ you compile with multiple codegen units, then something called "local ThinLTO" will be used.
  • If you compile with just one codegen unit and do not specify -C lto, then no LTO at all is used.

There is no way to explictly opt into "local ThinLTO" via the -C lto flag. I am more used to systems where there is some way to explicitly encode a setting that matches whatever the default is, which is why I was confused about what was happening, and some of the headings in my data above might be confused or at least confusing.

making this a P-high issue to at least diagnose what has happened to our performance in the cases noted in previous comment.

Okay I am satisfied with the results of my investigation.

So I'm immediately going to mark this as P-low for now.


But furthermore, I do not think there is much else that is actionable on this specific issue.

I have posted a PR that demonstrates the change to impl FromIterator that motivated the original issue here, as well as benchmarks for it. That's in PR #59605.

And I do wonder whether we need to revise some of our defaults for the compiler command UI, in terms of how the various LTO settings are specified or at least documented, and also in terms of changing what -C opt-level=3 implies (e.g. I think that should imply codegen-units=1 if the user has not already specified some other setting for codegen-units). But that is not related to this issue as written.

Therefore, I am inclined to actually close this issue. But I think I'll hold off on doing so until I get some feedback on PR #59605.

@pnkfelix Could you share the code for your benchmark?

perf shows this as the hotspot for my old bench_from_iter_with_scan benchmark:

  0.09 │       movaps 0x20(%rsp),%xmm0
 34.33 │       movups %xmm0,(%r15)
  5.85 │       movups (%r15),%xmm0
 49.67 │       movaps %xmm0,0x20(%rsp)

which seems pretty hilarious.

Its possible that we might be able to do something about this phenomenon. It might be low-hanging fruit for optimization, maybe.

based on my summary in previous comment, and the inconclusive results from PR #59605, I'm just going to close this.

Was this page helpful?
0 / 5 - 0 ratings