Julia: Return basic statistical functionality to Base

Created on 1 Jun 2018  ·  36 Comments  ·  Source: JuliaLang/julia

It seems incredibly unfortunate to me, and indeed almost actively user-hostile, to remove basic functionality such as std from Base. @sbromberger summarized it well in Slack: "if the function is generally understood by a layperson, it should not be removed." You don't have to be a statistics expert to understand what a standard deviation is. The collection of basic statistics functionality in Base in previous versions of Julia was fairly lean and I think struck the perfect balance of "here are the basics, now load a package if you want to get more advanced." Indeed, in the release-0.6 branch, base/statistics.jl is a couple hundred lines of code, including whitespace and extensive documentation. Who was it hurting to have in Base?

maths

All 36 comments

I note that mean and median are still in Base. What are the current criteria for moving things out?

The collection of basic statistics functionality in Base in previous versions of Julia was fairly lean and I think struck the perfect balance of "here are the basics, now load a package if you want to get more advanced."

This, 100%. I will fight the "npm-ification" of Julia as long as I'm a participating member of the community. Having things ready to go out of the box was a huge selling point for me.

Moving large chunks of "basic" functionality out of Base to other packages (_especially_ third-party packages; I'm willing to consider stdlib packages like Random and Iterators special cases, though I haven't really made up my mind as to whether this is a good thing yet) also makes the library ecosystem that much more fragile – while Pkg.add("StatsBase") isn't so much of a problem for an end user, it introduces a dependence fragility on any libraries that need a function like std now, which then propagates to other libraries / end users.

FWIW I agree.

Is there an official decision of where precisely the line is drawn between Base, stdlib and third-party packages?

I think this is a crucial definition that can totally affect the character and feel of a language, from Julia-the-language to Julia-the-platform. With the current strategy, the development of Julia-the-language is fast and nimble, but at the potential cost of a fragile/fragmented Julia-the-platform, as nicely pointed out above.

if the function is generally understood by a layperson, it should not be removed.

How about matrix multiplication? Or solving a linear system? These are things a layperson understands even better than std! Should we move LinearAlgebra back too?

Moving large chunks of "basic" functionality out of Base to other packages (especially third-party packages; I'm willing to consider stdlib packages like Random and Iterators special cases, though I haven't really made up my mind as to whether this is a good thing yet) also makes the library ecosystem that much more fragile – while Pkg.add("StatsBase") isn't so much of a problem for an end user, it introduces a dependence fragility on any libraries that need a function like std now, which then propagates to other libraries / end users.

I thought the plan was that StatsBase will be a stdlib (perhaps named Statistics?). It just felt unnecessary to move StatsBase into this repo, only to move it out soon again.

How about matrix multiplication?

Matrix multiplication is in stdlib, which is "close enough to Base" right now, I guess. I'm still not really sold on "small base, large stdlib" but as long as the functionality is being distributed and made available via a standard Julia install, I guess it's better than going third-party.

I thought the plan was that StatsBase will be a stdlib (perhaps named Statistics?).

That would perhaps change things somewhat, but currently, it's not that way, and in order to use std in 0.7, we now have to rely on a third-party package. That doesn't seem right.

Yes, the plan is to move StatsBase to stdlib/Statistics. I had proposed moving these functions into a new stdlib/Statistics module which would ship with Julia and then move the appropriate parts of StatsBase later but people preferred doing it this way.

@StefanKarpinski until that's done, can we please keep std (and any others) in Base?

No, because we can't move things out of Base later but we can move things into stdlib later.

So until then the burden falls on the library developers to change where they source their functions from? That seems ... wrong, somehow.

This wasn't an issue for Random, or any of the others (disregard Iterators for the moment). Why is it an issue now?

Why do these need to be removed from Base at all?

We already had this discussion and came to a pretty clear consensus at the time, I'm not inclined to rehash the whole thing.

Seems like there's fairly broad support now for undoing it.

PR open to revert the change. #27375

+1 for "moving these functions into a new stdlib/Statistics module which would ship with Julia" now.

Here's what I would suggest:

  • Return standard deviation to Base as stddev
  • Return variance to Base as variance
  • Leave the rest where they currently are in StatsBase

Yes, I'm 100% on board with that. Can you make a PR?

Yep, can do.

So before, the story was "trust me, std is a super special case that absolutely must be in Base". Now I see we're just casually extending that to var as well. W H A T E V E R

...they're basically the same thing though, so it would be weird to split them

I agree with @ararslan here. It's either both or none.

From a developer/maintainer perspective, it is just a mess to have code all over the place due to layman definitions (which really are more field-based than anything). The original base/statistics had the relevant code all together, and doing these splits will make it hard to look up code (a huge component in transparency for reproducibility, security, etc.). I would be very happy with taking these out of Base and just having a robust stdlib/Statistics. Personally, I can't wait for most of stdlib to move out, but other compromises could include loading certain stdlib packages by default, or a very easy setting to accomplish this (e.g., R provides several ways to do this), or easily customizable distributions (choose the packages or ecosystems you want and get a custom download/image). Not sure if it has to be managed by JuliaLang; it could be third-party as long as there is a way to easily accomplish this.

Are we still going to try to do something here, with the beta tagged? This issue is still on the milestone, so perhaps it is ok to try to make this change. I am personally ok with stddev and variance.

Since they're already removed from Base, it would be non-breaking to put them back, regardless of the name, which means it can happen any time. That said, if we're going to do this, I think we should try to do it for 0.7. I haven't moved forward with this change as there doesn't appear to be broad agreement over the names stddev and variance.

Wow I am thoroughly astonished by reading this post!!! I had no idea that such a basic thing would be removed!

julia> MathConstants.eulergamma
γ = 0.5772156649015...

julia> std(rand(1000))
ERROR: std has been moved to the package StatsBase.jl.
Run `Pkg.add("StatsBase")` to install it, restart Julia,
and then run `using StatsBase` to load it.

So, this means that "euler gamma" is much more common and used more often than the standard deviation? I wouldn't think so.

γ is not in Base. It is in a package just like std. You would be doing,

using MathConstants: γ
γ

or

using StatsBase: std
std(rand(1000))

Yes, that point was clarified to me. But I also should clarify that using γ in a scientific development context is equivalent to having a dependency on Julia 0.7, is it not? Since they come together, what difference does it make for the end user? Anyway, I am sorry this is off-topic; I don't want to pollute the issue with irrelevant stuff. I was just shocked that such a basic thing was considered less important than the Euler constant.

What's the issue in having a dependency? Especially one that is just math constants (extremely light)? If you were to have it in Base it would be loaded with every session for everyone regardless of whether they need it or use it. If you don't want to have it, you could just copy it in your project, but why not reuse good, maintained code (in this simple case it would really not matter)? How often do people just want plain vanilla standard deviation? Usually you would want to do something else in addition such as LinearAlgebra or DataFrames at the very least or maybe plotting which would require other packages.

γ is not in Base. It is in a package just like std. You would be doing,

This is a bit disingenuous. It's true that it's not in Base, but it is in stdlib, which does not require any additional packages to be installed in order to use it. This is in contrast to std, which as of now requires the explicit installation of StatsBase from an external repository.

How often do people just want plain vanilla standard deviation?

Are you suggesting that this doesn't happen? Because this is precisely why the original issue was opened.

Usually you would want to do something else in addition such as LinearAlgebra or DataFrames at the very least or maybe plotting which would require other packages.

Not in LightGraphs' case.

@sbromberger I think it would be good to mention or at least refer to your personal reasons for not wanting external repositories, in order to provide the full background to folks reading this.

Also, for other discussion in this thread, it doesn't help bringing up examples of other stuff and comparing if feature x is more widely used than std. If something is better maintained elsewhere, or is not reasonably commonly used functionality, please make an independent case for it and we can figure out the right place for it.

@ararslan I do agree we should do it for 0.7 and stddev and variance are consistent with our general naming conventions. It's not a good idea to take up 3 letter names.

I think it would be good to mention or at least refer to your personal reasons for not wanting external repositories, in order to provide the full background to folks reading this.

Sorry, @ViralBShah. I'm all talked out about the larger issue – most of the core team is probably sick of hearing from me by now – and we've worked around this one by removing our use of std. I'd rather not rehash the specific situation I'm in, but I hope you'll indulge me anyway.

Two things I'd like to put out there for consideration:

1) The argument that "it's common enough and small enough that everyone's going to have it installed anyway" is dangerous for two reasons: first, it makes the edge cases that don't / can't follow this convention that much more difficult to satisfy AND more prevalent, and second, it increases the overall fragility of the ecosystem: when everyone's depending on a common set of third-party utilities, then you've got what is essentially a "core install" made up of things that can change at the whim of devs who might not share the same commitment to stability or multi-use (see below) as the language core team. (This is not to disparage other devs; it's just the way it is.) My opinion is that if code is "common enough" that everyone should have it installed anyway, then it should at a minimum be in stdlib.

2) It would be great if we could give some thought to separating data structures from functions that commonly operate on those data structures. As an example, there are lots of things one can do with sparse matrices other than perform linear algebra on them. Moving the data structures into a package that is domain specific (like linear algebra) ignores these use cases at the expense of added complexity for those applications that don't treat the structures the way that others do. Continuing the example, in a language like Julia, it's easy and natural for people to place an emphasis on linear algebra. There are those of us who see the language as more than just a fast way to perform LA operations, and the talk of moving sparse matrices to a package primarily focused on linear algebra really makes it feel like we're second-class citizens in this language. At least when sparse matrices are in Base or stdlib, I can feel some assurance that someone will understand the non-traditional use cases, as it's more likely that there's someone on the team whose primary interest is not linear algebra.

Finally (I promise), it seems as if the bulk of the development work is not on the data structures themselves; rather, it's on optimizing functions that operate on them. That is, SparseMatrixCSC has not significantly changed in at least 3 years. The argument that we need to move things out to third party packages to improve our ability to make quick changes falls flat here when we're talking about the data structure.
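The point about sparse matrices having uses beyond linear algebra can be sketched with a small graph example. This is only an illustration, assuming Julia 0.7 or later where `SparseArrays` ships as a standard library; the graph and the helper names `out_neighbors` and `out_degree` are hypothetical and not taken from LightGraphs or any other package.

```julia
using SparseArrays  # standard library (assumption: Julia 0.7 or later)

# Hypothetical 4-vertex directed graph stored as a sparse adjacency structure.
# Entry (i, j) is stored when there is an edge j -> i.
# Edges: 1->2, 1->3, 2->3, 3->4, 4->1
dst = [2, 3, 3, 4, 1]   # row indices (edge targets)
src = [1, 1, 2, 3, 4]   # column indices (edge sources)
A = sparse(dst, src, fill(true, length(src)), 4, 4)

# In the CSC layout, the stored row indices of column v are exactly
# the out-neighbors of vertex v. No linear algebra is involved.
out_neighbors(A::SparseMatrixCSC, v::Integer) = rowvals(A)[nzrange(A, v)]

# The out-degree is the number of stored entries in column v.
out_degree(A::SparseMatrixCSC, v::Integer) = length(nzrange(A, v))

println(out_neighbors(A, 1))
println(out_degree(A, 1))
```

Here the matrix serves purely as a compact adjacency list: the helpers only read the stored structure through `rowvals` and `nzrange` and never multiply or factorize anything.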

Also, for other discussion in this thread, it doesn't help bringing up examples of other stuff and comparing if feature x is more widely used than std.

I didn't intend to do this, and I apologize if I did.

Thanks for the opportunity to weigh in, and I apologize both for not directly answering your question, and for the length of my (non-)response.

I'd like to second Viral's point that it doesn't help to point to other things in Base like eulergamma. It's quite likely a bunch of other stuff should be removed too! When the stdlib directory was first created, initially around 2 functions moved out of Base. One might have argued "What? Are you saying these are the two least important functions in Base that must be removed, while everything else gets to stay?" No, of course not --- pretty soon stdlib had ~30 packages.

I'll also point out that my original proposal was to move the functions to a stdlib package. In light of @sbromberger's situation we should reconsider that --- we didn't think stdlib vs. external StatsBase was such a big difference, but apparently in some environments it is.

I like the point about separating data structures and functions; Julia's design makes that especially easy and natural (though apparently some people call it type piracy :) ).

I'd like to second Viral's point that it doesn't help to point to other things in Base like eulergamma.

You are right, I see the flaw in my example. Of course I didn't state it as the "absolute argument against the change", but only to get a point across.

That is, SparseMatrixCSC has not significantly changed in at least 3 years. The argument that we need to move things out to third party packages to improve our ability to make quick changes falls flat here when we're talking about the data structure.

To note, several parties have long had in mind an overhaul of the sparse data structures. That those data structures have not changed in the last three years reflects lack of developer bandwidth rather than lack of utility in being able to rapidly iterate when developer bandwidth exists :). Best!

There is quite a bit of sparse matrix experimentation outside of Base that people have done, and having something in Base has also deterred others from trying alternate ideas (because it would be so difficult to get anyone to consider using them).

But I don't want to make this about sparse matrices. :-)

Resolved by #27834
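For readers arriving later, here is a minimal sketch of what the stdlib route discussed above looks like in practice. It assumes a Julia version that ships the `Statistics` standard library (0.7 and later, to the best of my knowledge), so no `Pkg.add` step is needed; the grade data is made up for illustration.

```julia
using Statistics  # standard library; assumption: Julia 0.7 or later

grades = [72.0, 85.0, 90.0, 64.0, 85.0]   # made-up class grades

m  = mean(grades)     # arithmetic mean
md = median(grades)   # middle value of the sorted data
v  = var(grades)      # sample variance (divides by n - 1 by default)
s  = std(grades)      # sample standard deviation

# std and var describe the same quantity: std is the square root of var.
@assert s ≈ sqrt(v)

println((m, md, v, s))
```

The final assertion reflects the remark in the thread that std and var are "basically the same thing", which is one reason for keeping them in the same module.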

Not a tutorial goes by without mean, median, and std being used. The argument is not only that they are known by a layperson (a fifth grader?); they are also what teachers reach for when analyzing their class grade data (which is far more common than general statistical analysis).

Can we please, please, please get these three functions in Base?
