Rust: write!(wr, "foo") is 10% to 72% slower than wr.write("foo".as_bytes())

Created on 2 Dec 2013  ·  9 comments  ·  Source: rust-lang/rust

This example demonstrates that the trivial case of write!(wr, "foo") is much slower than calling wr.write("foo".as_bytes()):

extern mod extra;

use std::io::mem::MemWriter;
use extra::test::BenchHarness;

#[bench]
fn bench_write_value(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        for _ in range(0, 1000) {
            mem.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_ref(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            wr.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_macro1(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            write!(wr, "abc");
        }
    });
}

#[bench]
fn bench_write_macro2(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            write!(wr, "{}", "abc");
        }
    });
}

With no optimizations:

running 4 tests
test bench_write_macro1 ... bench:    280153 ns/iter (+/- 73615)
test bench_write_macro2 ... bench:    322462 ns/iter (+/- 24886)
test bench_write_ref    ... bench:     79974 ns/iter (+/- 3850)
test bench_write_value  ... bench:     78709 ns/iter (+/- 4003)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

With --opt-level=3:

running 4 tests
test bench_write_macro1 ... bench:     62397 ns/iter (+/- 5485)
test bench_write_macro2 ... bench:     80203 ns/iter (+/- 3355)
test bench_write_ref    ... bench:     55275 ns/iter (+/- 5156)
test bench_write_value  ... bench:     56273 ns/iter (+/- 7591)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Is there anything we can do to improve this? I can think of a couple options, but I bet there are more:

  • Special case no-argument write! to compile down into wr.write("foo".as_bytes()). If we go this route, it'd be nice to also handle calls where every argument is a plain str, as in write!(wr, "foo {} {}", "bar", "baz") (see the sketch after this list).
  • Revive wr.write_str("foo"). From what I understand, that's being blocked on #6164.
  • Figure out why LLVM isn't able to optimize away the write! overhead. Are there functions that should be getting inlined but are not? My scattershot attempt didn't get any results.
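For reference, this sketch is mine, not from the issue: in current Rust, write!(wr, "abc") expands to a write_fmt call on format_args! output, so the first option amounts to having the macro emit the plain byte write instead. Roughly, assuming the modern std::io::Write API:

use std::io::{self, Write};

// Roughly what write!(wr, "abc") does today: build a fmt::Arguments value and
// hand it to write_fmt, even though there is nothing to format.
fn via_macro<W: Write>(wr: &mut W) -> io::Result<()> {
    wr.write_fmt(format_args!("abc"))
}

// Roughly what the proposed special case would emit instead: a direct byte
// write that never touches the formatting machinery (write_all is used here so
// both functions return io::Result<()>).
fn special_cased<W: Write>(wr: &mut W) -> io::Result<()> {
    wr.write_all("abc".as_bytes())
}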
C-enhancement I-slow T-compiler T-libs

All 9 comments

I suspect that fn bench_write_ref does not actually make virtual calls. Adding this:

#[inline(never)]
fn writer_write(w: &mut Writer, b: &[u8]) {
    w.write(b);
}

#[bench]
fn bench_write_virt(bh: &mut BenchHarness) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0, 1000) {
            writer_write(wr, "abc".as_bytes());
        }
    });
}

With this added, I get (at --opt-level=3):

running 5 tests
test bench_write_macro1 ... bench:    680823 ns/iter (+/- 34497)
test bench_write_macro2 ... bench:    950790 ns/iter (+/- 72309)
test bench_write_ref    ... bench:    505846 ns/iter (+/- 41965)
test bench_write_value  ... bench:    511815 ns/iter (+/- 36681)
test bench_write_virt   ... bench:    553466 ns/iter (+/- 43716)

So the virtual call has an impact, but it does not fully explain the slowness of write!() shown by this benchmark.
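(For contrast, a monomorphized variant of the same helper avoids the virtual call entirely. A minimal sketch of my own, written against the current std::io::Write API rather than the old Writer trait:)

use std::io::Write;

// Hypothetical static-dispatch counterpart to writer_write above: the generic
// parameter is monomorphized per concrete writer type, so the call goes
// through no vtable even with #[inline(never)].
#[inline(never)]
fn writer_write_static<W: Write>(w: &mut W, b: &[u8]) {
    let _ = w.write(b);
}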

Visiting for bug triage. This seems to still be present. The code to run the benchmarks is now:

extern crate test;

use std::io::MemWriter;
use test::Bencher;

#[bench]
fn bench_write_value(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        for _ in range(0u, 1000) {
            mem.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_ref(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0u, 1000) {
            wr.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_macro1(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0u, 1000) {
            write!(wr, "abc");
        }
    });
}

#[bench]
fn bench_write_macro2(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = MemWriter::new();
        let wr = &mut mem as &mut Writer;
        for _ in range(0u, 1000) {
            write!(wr, "{}", "abc");
        }
    });
}

Without optimizations:

running 4 tests
test bench_write_macro1 ... bench:   1470468 ns/iter (+/- 291966)
test bench_write_macro2 ... bench:   1799612 ns/iter (+/- 316293)
test bench_write_ref    ... bench:   1336574 ns/iter (+/- 251664)
test bench_write_value  ... bench:   1317880 ns/iter (+/- 254668)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

With --opt-level=3:

running 4 tests
test bench_write_macro1 ... bench:    127671 ns/iter (+/- 1452)
test bench_write_macro2 ... bench:    196158 ns/iter (+/- 2053)
test bench_write_ref    ... bench:     43881 ns/iter (+/- 453)
test bench_write_value  ... bench:     43859 ns/iter (+/- 336)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Still an issue, using the following code:

#![allow(unused_must_use)]
#![feature(test)]

extern crate test;

use std::io::Write;
use std::vec::Vec;

use test::Bencher;

#[bench]
fn bench_write_value(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = Vec::new();
        for _ in 0..1000 {
            mem.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_ref(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = Vec::new();
        let wr = &mut mem as &mut Write;
        for _ in 0..1000 {
            wr.write("abc".as_bytes());
        }
    });
}

#[bench]
fn bench_write_macro1(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = Vec::new();
        let wr = &mut mem as &mut Write;
        for _ in 0..1000 {
            write!(wr, "abc");
        }
    });
}

#[bench]
fn bench_write_macro2(bh: &mut Bencher) {
    bh.iter(|| {
        let mut mem = Vec::new();
        let wr = &mut mem as &mut Write;
        for _ in 0..1000 {
            write!(wr, "{}", "abc");
        }
    });
}

Typically giving something like the following:

$ cargo bench
running 4 tests
test bench_write_macro1 ... bench:      21,604 ns/iter (+/- 82)
test bench_write_macro2 ... bench:      29,273 ns/iter (+/- 85)
test bench_write_ref    ... bench:       1,396 ns/iter (+/- 387)
test bench_write_value  ... bench:       1,391 ns/iter (+/- 163)

I got approximately the same results with Rust nightly.

Just for the record, running with v1.20.0-nightly:

$ cargo bench
running 4 tests
test bench_write_macro1 ... bench:      36,556 ns/iter (+/- 69)
test bench_write_macro2 ... bench:      54,377 ns/iter (+/- 958)
test bench_write_ref    ... bench:      13,730 ns/iter (+/- 24)
test bench_write_value  ... bench:      13,755 ns/iter (+/- 81)

Today:

running 4 tests
test bench_write_macro1 ... bench:      16,220 ns/iter (+/- 982)
test bench_write_macro2 ... bench:      25,542 ns/iter (+/- 2,220)
test bench_write_ref    ... bench:       4,889 ns/iter (+/- 314)
test bench_write_value  ... bench:       4,819 ns/iter (+/- 956)

Two years later:

running 4 tests
test bench_write_macro1 ... bench:      17,561 ns/iter (+/- 174)
test bench_write_macro2 ... bench:      23,285 ns/iter (+/- 2,771)
test bench_write_ref    ... bench:       3,234 ns/iter (+/- 194)
test bench_write_value  ... bench:       3,238 ns/iter (+/- 123)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

❯ rustc +nightly --version
rustc 1.47.0-nightly (30f0a0768 2020-08-18)

I was curious whether the difference lies in the implementation of io::Write vs. fmt::Write, or in the details of write_fmt requiring format_args!, which is relevant since write! can be asked to "serve" either case.
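For context, here is a sketch of my own (not from the gist) showing that write! simply forwards to whichever write_fmt is available on the receiver, so the same macro invocation can exercise either trait:

use std::fmt::Write as _; // brings <String as fmt::Write>::write_fmt into scope
use std::io::Write as _;  // brings <Vec<u8> as io::Write>::write_fmt into scope

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // fmt::Write path: String only implements fmt::Write, so this resolves to
    // <String as fmt::Write>::write_fmt(format_args!("{}", "abc")).
    let mut s = String::new();
    write!(s, "{}", "abc")?;

    // io::Write path: Vec<u8> only implements io::Write, so this resolves to
    // <Vec<u8> as io::Write>::write_fmt(...), which wraps the writer in the
    // error-capturing adaptor quoted further down.
    let mut v: Vec<u8> = Vec::new();
    write!(v, "{}", "abc")?;

    assert_eq!(s.as_bytes(), &v[..]);
    Ok(())
}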

To dig into this, I compared Vec to String, a seemingly "apples to apples...ish" comparison. https://gist.github.com/workingjubilee/2d2e3a7fded1c2101aafb51dc79a7ec5

running 10 tests
test string_write_fmt    ... bench:      10,053 ns/iter (+/- 1,141)
test string_write_macro1 ... bench:      10,177 ns/iter (+/- 2,363)
test string_write_macro2 ... bench:      17,499 ns/iter (+/- 1,847)
test string_write_ref    ... bench:       2,270 ns/iter (+/- 265)
test string_write_value  ... bench:       2,333 ns/iter (+/- 126)
test vec_write_fmt       ... bench:      15,722 ns/iter (+/- 1,673)
test vec_write_macro1    ... bench:      15,767 ns/iter (+/- 1,638)
test vec_write_macro2    ... bench:      23,968 ns/iter (+/- 8,942)
test vec_write_ref       ... bench:       2,296 ns/iter (+/- 178)
test vec_write_value     ... bench:       2,230 ns/iter (+/- 235)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out

Interestingly, this revealed that String actually has less overhead than Vec with write! (the benchmark had high variance, but I considered it indicative enough).

Then I looked and saw:

    fn write_fmt(&mut self, fmt: fmt::Arguments<'_>) -> Result<()> {
        // Create a shim which translates an io::Write to a fmt::Write and saves
        // off I/O errors instead of discarding them.
        struct Adaptor<'a, T: ?Sized + 'a> {
            inner: &'a mut T,
            error: Result<()>,
        }

        /* More code related to implementing and using the resulting shim,
         * seemingly involving a lot of poking a reference at runtime??? */
    }

So I am no longer surprised.
I believe it is possible to fast-path some cases, but this is not actually an apples-to-apples comparison: the orange is write_fmt, which has to go through a fair amount of formatting machinery. Either write! should skip write_fmt in simple cases where the formatting machinery is not required, or the formatting machinery itself should be much faster on the obvious fast path.
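To make the fast-path idea concrete, here is a minimal sketch of my own (not std's actual implementation) using fmt::Arguments::as_str, which returns the literal when the format string has nothing to interpolate:

use std::fmt;
use std::io::{self, Write};

// Hypothetical helper: if format_args! produced a plain literal (as it does
// for write!(wr, "abc")), write the bytes directly and skip the formatting
// machinery; otherwise fall back to the normal write_fmt path.
fn write_fmt_fast<W: Write + ?Sized>(wr: &mut W, args: fmt::Arguments<'_>) -> io::Result<()> {
    if let Some(s) = args.as_str() {
        wr.write_all(s.as_bytes())
    } else {
        wr.write_fmt(args)
    }
}

Called as write_fmt_fast(&mut mem, format_args!("abc")), this would behave like the bench_write_value case for constant format strings while still supporting the general case.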
