Pandas: API: Define API for pandas plotting backends

Created on 9 Jun 2019 · 44 Comments · Source: pandas-dev/pandas

In #26414 we split the pandas plotting module into a general plotting framework able to call different backends, and the current matplotlib backend. The idea is that other backends can be implemented more simply, and used by pandas users through a common API.

The API defined by the current matplotlib backend includes the objects listed next, but this API can probably be simplified. Here is the list with questions/proposals:

Non-controversial methods to keep in the API (They provide the Series.plot(kind='line')... functionality):

LinePlot
BarPlot
BarhPlot
HistPlot
BoxPlot
KdePlot
AreaPlot
PiePlot
ScatterPlot
HexBinPlot

Plotting functions provided in pandas (e.g. pandas.plotting.andrews_curves(df))

andrews_curves
autocorrelation_plot
bootstrap_plot
lag_plot
parallel_coordinates
radviz
scatter_matrix
table

Should those be part of the API, so that other backends also have to implement them? Would it make sense to convert them to the .plot format (e.g. DataFrame.plot(kind='autocorrelation')...)? Or does it make sense to keep them out of the API, or move them to a third-party module?

Redundant methods that can possibly be removed:

hist_series
hist_frame
boxplot
boxplot_frame
boxplot_frame_groupby

In the case of boxplot, we currently have several ways of generating a plot (calling mainly the same code):

1. DataFrame.plot.boxplot()
2. DataFrame.plot(kind='box')
3. DataFrame.boxplot()
4. pandas.plotting.boxplot(df)

Personally, I'd deprecate number 4, and for number 3, deprecate or at least not require a separate boxplot_frame method in the backend, but try to reuse BoxPlot (for number 3 comments, same applies to hist).

For boxplot_frame_groupby, I didn't check in detail, but I'm not sure whether BoxPlot could be reused for it.

Functions to register converters:

register
deregister

Do those make sense for other backends?

Deprecated in pandas 0.23, to be removed:

tsplot

To see what each of these functions does in practice, this notebook by @liirusuk may be useful: https://github.com/python-sprints/pandas_plotting_library/blob/master/AllPlottingExamples.ipynb

CC: @pandas-dev/pandas-core @tacaswell, @jakevdp, @philippjfr, @PatrikHlobil

API Design Clean Needs Discussion Visualization

datapythonista

Most helpful comment

Here's an entry-points based implementation

diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py
index 0610780ed..c8ac12901 100644
--- a/pandas/plotting/_core.py
+++ b/pandas/plotting/_core.py
@@ -1532,8 +1532,10 @@ class PlotAccessor(PandasObject):

         return self(kind="hexbin", x=x, y=y, C=C, **kwargs)

+_backends = {}

-def _get_plot_backend(backend=None):
+
+def _get_plot_backend(backend="matplotlib"):
     """
     Return the plotting backend to use (e.g. `pandas.plotting._matplotlib`).

@@ -1546,7 +1548,14 @@ def _get_plot_backend(backend=None):
     The backend is imported lazily, as matplotlib is a soft dependency, and
     pandas can be used without it being installed.
     """
-    backend_str = backend or pandas.get_option("plotting.backend")
-    if backend_str == "matplotlib":
-        backend_str = "pandas.plotting._matplotlib"
-    return importlib.import_module(backend_str)
+    import pkg_resources  # slow import. Delay
+    if backend in _backends:
+        return _backends[backend]
+
+    for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
+        _backends[entry_point.name] = entry_point.load()
+
+    try:
+        return _backends[backend]
+    except KeyError:
+        raise ValueError("No backend {}".format(backend))
diff --git a/setup.py b/setup.py
index 53e12da53..d2c6b18b8 100755
--- a/setup.py
+++ b/setup.py
@@ -830,5 +830,10 @@ setup(
             "hypothesis>=3.58",
         ]
     },
+    entry_points={
+        "pandas_plotting_backends": [
+            "matplotlib = pandas:plotting._matplotlib",
+        ],
+    },
     **setuptools_kwargs
 )

I think it's quite nice. 3rd party packages will modify their setup.py (or pyproject.toml) to include something like

entry_points={
    "pandas_plotting_backends": ["altair = pdvega._pandas_plotting_backend"]
}

I like that it breaks the tight coupling between naming and implementation.
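
On recent Python versions, the same lookup could be sketched with the standard-library importlib.metadata instead of pkg_resources. This is only an illustration, not the pandas implementation: the helper names discover_backends and load_backend are hypothetical, and the group name is the one used in the diff above.

```python
from importlib.metadata import entry_points

GROUP = "pandas_plotting_backends"  # group name taken from the diff above


def discover_backends(group=GROUP):
    """Map entry-point names to entry points, without importing the backends."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable interface
        found = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name
        found = eps.get(group, [])
    return {ep.name: ep for ep in found}


def load_backend(name, group=GROUP):
    """Import the backend registered under `name` (hypothetical helper)."""
    backends = discover_backends(group)
    try:
        return backends[name].load()
    except KeyError:
        raise ValueError("No backend {}".format(name)) from None
```

Because discovery only reads installed package metadata, a backend becomes visible as soon as its package is installed, and it is imported only when it is actually requested.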

TomAugspurger on 19 Jul 2019

👍2

All 44 comments

I think we should keep things like autocorrelation out of the swappable backend API.

I think we’ve left things like df.boxplot and hist around because they have slightly different behavior than the .plot API. I wouldn’t recommend making them part of the backend API.

TomAugspurger on 9 Jun 2019

Here’s my start on a proposed backend API from a few months ago: https://github.com/TomAugspurger/pandas/commit/b07aba28a37b0291fd96a1f571848a7be2b6de8d

TomAugspurger on 9 Jun 2019

I think it's worth mentioning that at least hvplot (didn't check the rest) does already provide the functions like andrews_curves, scatter_matrix, lag_plot,...

Maybe if we don't want to force all backends to implement those, we can check whether the selected backend implements them, and fall back to the matplotlib plots otherwise?

I assumed boxplot and hist behaved exactly the same, but just had shortcuts Series.hist() for Series.plot.hist(). The "shortcut" shows the plot grid, but other than that I haven't seen any difference.
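
The fallback idea mentioned above (use the selected backend if it implements a function, otherwise the matplotlib one) could look roughly like the sketch below. The two backends are stand-in objects and resolve_plot_function is a hypothetical helper, not something that exists in pandas.

```python
from types import SimpleNamespace

# Stand-ins for real backend modules; both are hypothetical.
matplotlib_backend = SimpleNamespace(
    andrews_curves=lambda frame: "andrews_curves drawn by matplotlib",
    lag_plot=lambda series: "lag_plot drawn by matplotlib",
)
other_backend = SimpleNamespace(
    andrews_curves=lambda frame: "andrews_curves drawn by other",
)


def resolve_plot_function(func_name, selected, default=matplotlib_backend):
    """Prefer the selected backend, fall back to the default one."""
    func = getattr(selected, func_name, None)
    if func is None:
        func = getattr(default, func_name)
    return func


print(resolve_plot_function("andrews_curves", other_backend)(None))
print(resolve_plot_function("lag_plot", other_backend)(None))
```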

datapythonista on 9 Jun 2019

IMO, the main value of this option is the .plot namespace.

If users want hvplot's Andrews curves plot, they should import the function from hvplot and pass the dataframe there.

TomAugspurger on 10 Jun 2019

I think that makes sense, but if we do that, I think we should move them to pandas.plotting.matplotlib.andrews_curves, instead of pandas.plotting.andrews_curves.

@TomAugspurger I need to check in more detail, but I think the API you implemented in https://github.com/TomAugspurger/pandas/commit/b07aba28a37b0291fd96a1f571848a7be2b6de8d is the one that makes the most sense. I'll work on it once I finish #26753. I'll also experiment with whether it's feasible to move andrews_curves, scatter_matrix... to the .plot() syntax; I think that will make things simpler and easier for everyone (us, third-party libraries, and users).

datapythonista on 10 Jun 2019

What's the intention here regarding extra kwargs passed to plotting functions? Should additional backends attempt to duplicate the functionality of all matplotlib-style plot customizations, or should they allow keywords to be passed that correspond to those used by the particular backend?

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

The second option is not as nice from the perspective of interchangeability, but would allow other backends to be added with a more reasonable set of expectations.

jakevdp on 10 Jun 2019

👍2

I think that's up to the backend on what they do with them. Achieving 100% compatibility across backends isn't really feasible, since the return type isn't going to be a matplotlib Axes anymore. And if we aren't compatible on the return type, I don't think backends should bend over backwards to try to handle every possible keyword argument.

So I think pandas should document that **kwargs will be passed through to the underlying plotting engine, and they can do whatever they please with them.
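
As a rough sketch of that pass-through (all names here are hypothetical, and the backend is a toy that only records what it receives), the frontend would interpret kind and forward every other keyword untouched:

```python
class RecordingBackend:
    """Hypothetical backend that records what it receives."""

    def __init__(self):
        self.calls = []

    def plot(self, data, kind, **kwargs):
        self.calls.append((kind, kwargs))
        return {"kind": kind, "options": kwargs}


def plot(data, backend, kind="line", **kwargs):
    """Frontend sketch: interpret `kind`, forward everything else untouched."""
    return backend.plot(data, kind=kind, **kwargs)


backend = RecordingBackend()
result = plot([1, 2, 3], backend, kind="bar", color="red", width=400)
```

Here the frontend never inspects color or width; whether they mean anything is entirely up to the backend.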

TomAugspurger on 10 Jun 2019

👍1

I'm sorry if this is a stupid question, but if you define a plotting "API" which is basically a group of canned plots, wouldn't every backend produce more or less the same output? What new capability is this meant to enable? Something like a pandas to vega exporter perhaps?

ghost on 14 Jun 2019

I don't think it's correct to say that every backend produces more or less the same output.

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

On the other hand, bokeh, altair, et al. are great for interactive charts, but aren't quite as mature as matplotlib for static charts.

Being able to produce both with the same API would be a big win.

jakevdp on 15 Jun 2019

👍1

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

and also pins Matplotlib down even more than we already are API wise. I think it makes sense for pandas to declare what style knobs it wants to expose and expect the backend implementations to sort out what that means. This may mean _not_ blindly passing **kwargs through and instead ensuring that the returned objects are "the right thing" for the given backend to be able to do after-the-fact style customization.
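
One possible reading of "declare what style knobs to expose", as a sketch only: the set of knobs and the function name below are made up for illustration and are not a proposal from this thread.

```python
# Hypothetical set of style knobs the frontend chooses to expose.
SUPPORTED_STYLE = {"title", "color", "legend"}


def validate_style(**kwargs):
    """Accept only declared style knobs instead of forwarding arbitrary kwargs."""
    unknown = sorted(set(kwargs) - SUPPORTED_STYLE)
    if unknown:
        raise TypeError("Unsupported style options: {}".format(", ".join(unknown)))
    return dict(kwargs)


style = validate_style(title="Sales", color="red")
```

Anything outside the declared set is rejected up front, so each backend only has to decide what the declared knobs mean for it.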

tacaswell on 15 Jun 2019

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

Thanks @jakevdp, yes, supporting interactive charts is a good goal.

Before things go too far down this particular avenue, here's an alternative solution.

Instead of proclaiming the pandas plotting API to now be a specification, and asking viz packages to implement it specifically, why not generate an intermediate representation (like a vega JSON file) of the plot, and encourage backends to target that as their input?

Advantages include:

Not being tied to the expressive power of a reified pandas API, which wasn't designed as a specification.
The work done by plotting packages to support pandas, becomes available to other pydata packages which generate IR.
Promoting a common language for interchange visualization in the pydata space
Which makes new tools more powerful because they are more widely applicable
Which makes the effort of writing them more reasonable. Basically, improved incentives.

Vega/Vega-lite, as a modern, established, open, and JSON-based viz specification language, with several man-years put into its design and implementation, and existing tools built around it, seems like it was created expressly for this purpose. (just please don't).

You know, frontend->IR->backend, like compilers are designed.
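
To make the frontend->IR->backend idea concrete, here is a minimal sketch that builds a Vega-Lite-style specification as a plain dict. The function names and the toy "backend" are hypothetical; only the overall shape of the spec (data, mark, encoding) follows Vega-Lite.

```python
import json


def line_plot_spec(records, x, y):
    """Build a Vega-Lite-style description of a line plot as a plain dict."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": list(records)},
        "mark": "line",
        "encoding": {
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


def describe(spec):
    """Toy 'backend': consumes only the spec, never the original data object."""
    enc = spec["encoding"]
    return "{} of {} vs {} ({} rows)".format(
        spec["mark"], enc["y"]["field"], enc["x"]["field"], len(spec["data"]["values"])
    )


records = [{"t": 0, "v": 1.0}, {"t": 1, "v": 3.0}, {"t": 2, "v": 2.0}]
spec = line_plot_spec(records, x="t", y="v")
text = json.dumps(spec)  # the IR is serializable, so any tool can consume it
```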

ghost on 15 Jun 2019

At least three packages already implement the API. All pandas needs to do is offer an option for changing the backend and document its use, which seems like a good bang for our buck.

TomAugspurger on 16 Jun 2019

We now merged #26753, and the plotting backend can be changed from pandas. When we split the matplotlib code we left the SeriesPlotMethods and FramePlotMethods in the pandas (not matplotlib) side. That was mainly to leave the docstrings in the pandas side.

But I see that what the backends did was to reimplement those classes. So, currently we expect the backends to have one class per plot (e.g. LinePlot, BarPlot), but instead they implement a single class with one method per plot (e.g. hvPlot, or the same names as pandas for pdvega).

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.

If that makes sense for everyone, I'll start by creating the abstract class and adapting the matplotlib backend we have in pandas, and once this is done, we adapt hvplot and pdvega (the changes there should be quite small).
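
A minimal sketch of what such an abstract class might look like, assuming one method per plot kind. The class and method names are illustrative only; they are not the API that was eventually merged.

```python
from abc import ABC, abstractmethod


class BasePlotBackend(ABC):
    """Hypothetical base class: one method per plot kind."""

    def __init__(self, data):
        self.data = data

    @abstractmethod
    def line(self, **kwargs):
        ...

    @abstractmethod
    def bar(self, **kwargs):
        ...

    def __call__(self, kind="line", **kwargs):
        # Dispatch `kind` to the method of the same name.
        return getattr(self, kind)(**kwargs)


class TextBackend(BasePlotBackend):
    """Toy backend that returns strings instead of drawing."""

    def line(self, **kwargs):
        return "line plot of {} points".format(len(self.data))

    def bar(self, **kwargs):
        return "bar plot of {} points".format(len(self.data))


print(TextBackend([1, 2, 3])(kind="bar"))
```

A backend that forgets to implement one of the abstract methods cannot be instantiated, which gives an early and explicit error.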

Thoughts?

datapythonista on 21 Jun 2019

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.

I think that on balance this approach will be cleaner. I can't speak to other plotting backends but at least in hvPlot different plot methods share quite a bit of code, e.g. scatter, line and area are largely analogous, and I'd prefer not to rely on subclassing to share code between them. Additionally, I think different backends should have the option to add additional plot types and exposing those as additional public methods seems like the simplest, most natural approach.

philippjfr on 21 Jun 2019

Just to make sure I understand, when you say "I'd prefer not to rely on subclassing to share code between them" you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

I think I'm +1 on letting backends define plot types that are not in pandas. But I probably won't implement it right now. We're planning to release pandas in around one week, and I think this will require a bit more thinking than blindly calling the backend's method foo when the user provides kind='foo' and the backend provides that method (for example, parameter validation, or the fact that some kinds would be in the documentation and some would not).

datapythonista on 21 Jun 2019

Just to make sure I understand, when you say "I'd prefer not to rely on subclassing to share code between them" you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

Yes, that's right. More concretely I'd prefer not to have to do this kind of thing:

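# MPLPlot is the existing base class of the pandas matplotlib backend.
from pandas.plotting._matplotlib.core import MPLPlot


class MPL1dPlot(MPLPlot):
    # Intermediate class that exists only to share code between plot kinds.
    def _some_shared_method(self, *args, **kwargs):
        ...


class LinePlot(MPL1dPlot):
    ...


class AreaPlot(MPL1dPlot):
    ...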

Sorry if that was not clear.

philippjfr on 21 Jun 2019

👍1

Very much in favor of a simpler API that is publicly exposed as the single function instead of the classes as now proposed in https://github.com/pandas-dev/pandas/pull/27009.

General question/remark on how the backend option now works. Assume I am the pdvega developer and make this backend available. Does that mean that if users do pd.options.plotting.backend = 'pdvega', the pdvega library needs to have a top-level plot function?
1) as a library author, that's not necessarily the function you want to publicly expose (meaning, for the top-level plot method from the library's point of view, it is not necessarily the API that you want your users to use directly) and 2) for this case you might actually want to be able to do pd.options.plotting.backend = 'altair' ? (in case altair developers are fine with that)
So basically my question is: does there need to be an exact 1:1 mapping between the backend name and what is imported? (which is now needed since it simply does an import of the provided backend string).

EDIT: I see that actually something similar was discussed in the PR #26753

jorisvandenbossche on 26 Jun 2019

If we make the decision that pandas doesn't know/limit which backends can be used (which I'm strongly in favor of making), we need to decide on how/what to call in the backends.

What has been implemented and proposed in the PR I'm working on is that the option plotting.backend is a module (it can be pdvega, altair, altair.pandas, or whatever), and that module must have a public plot function, which is what we will call.

We can consider other options, like if the option is pdvega, we import pdvega.pandas, or we can name the function plot_pandas or whatever. I think the proposed way is the simplest, but if there are other proposals that make more sense, I'm happy to change it.
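
A sketch of the "option is a module name, module has a public plot function" approach described above. The module fake_plotting_backend is created on the fly so the example runs without installing anything, and call_backend is a hypothetical helper; the plot signature shown is an assumption.

```python
import importlib
import sys
import types

# Fake third-party module so the sketch runs without installing anything.
fake = types.ModuleType("fake_plotting_backend")
fake.plot = lambda data, kind, **kwargs: "{} plot of {!r}".format(kind, data)
sys.modules[fake.__name__] = fake


def call_backend(backend_name, data, kind="line", **kwargs):
    """Import the module named by the option and call its public `plot`."""
    module = importlib.import_module(backend_name)
    if not hasattr(module, "plot"):
        raise ValueError(
            "Backend {!r} has no public plot function".format(backend_name)
        )
    return module.plot(data, kind=kind, **kwargs)


print(call_backend("fake_plotting_backend", [1, 2], kind="bar"))
```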

Another discussion is if we want to force the users to import the backends manually:

import pandas
import hvplot

pandas.Series([1, 2, 3]).plot()

If we do that, the modules can register themselves, they can also register aliases (so set_option can understand other names than the name of the module). They can also implement custom functions or machinery (e.g. context managers) to plot with certain backends,... Personally I think the simpler we keep things the better.
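
For comparison, the self-registration alternative could be sketched like this. Nothing here exists in pandas: register_backend and get_backend are hypothetical hooks, and the registered object is a placeholder for a real backend.

```python
_registry = {}


def register_backend(backend, *names):
    """Called by a backend package at import time (hypothetical hook)."""
    for name in names:
        _registry[name] = backend


def get_backend(name):
    """Look up a backend by any of its registered names."""
    try:
        return _registry[name]
    except KeyError:
        raise ValueError(
            "Backend {!r} is not registered; was its package imported?".format(name)
        ) from None


# What a backend package might run in its own __init__.py,
# registering both its module name and a shorter alias:
register_backend(object(), "pandas_bokeh", "bokeh")
```

The error path shows the drawback raised in this comment: the alias only resolves after the backend package has been imported.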

And while it could be nice to do pandas.set_option('plotting.backend', 'bokeh') to plot in bokeh, I think that implies two things I personally don't like:

pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.
It also implies that there is only one module to plot in bokeh. Which doesn't need to be true, and gives the wrong impression to users that you're plotting directly with bokeh, and not with a pandas plotting backend for bokeh.

datapythonista on 26 Jun 2019

@datapythonista thanks for the detailed answer. I am fine with keeping it now as is for the initial release (possibility for alias can always be added later).

If users want hvplot's Andrew's curve plot, they should import the function from hvplot and pass the dataframe there.

+1, I would also not expose all the additional plotting functions through the backend.

But about moving them to pandas.plotting.matplotlib, that seems like an unnecessary backwards incompatible break to me (assuming you meant not only moving the implementation).

jorisvandenbossche on 1 Jul 2019

pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.

If we use entrypoints to register extensions, then this does not have to be the case: having the package installed on the system will register the entrypoint and make it visible to pandas. For example, this is what Altair uses to detect various renderers that the user might have installed.

jakevdp on 1 Jul 2019

Also, for what it's worth, once this goes in I think I'd probably deprecate pdvega and move the relevant code over to a new package named pandas_altair or something similar.

jakevdp on 1 Jul 2019

🎉1

@datapythonista I think we should decide about the scope of the plotting backend API before 0.25.0 (not for the RC though).

Are you in favor of keeping the misc plotting functions (as well as hist / boxplot)?

jorisvandenbossche on 3 Jul 2019

@datapythonista close this as we merged the PR?

jreback on 3 Jul 2019

@jreback I'd keep this open until we agree on the API, @TomAugspurger and @jorisvandenbossche didn't want to delegate to the backend anything except the accessor plots.

What I would do for the pandas plotting backend is the following.

For the release:

Leave things as they are. hvplot implements all the plots, both the ones from the accessors and the ones that are not. And I think delegating everything keeps things simple.
I'm not sure whether I'd exclude register_converters from the above. At least we should change the name from register_matplotlib_converters if we delegate them.

For the next release:

I'd deprecate all the duplicates: pandas.plotting.boxplot, Series.hist,...
I'd move all the plots to be called from accessors (andrews_curves, radviz, parallel_coordinates,...).

datapythonista on 3 Jul 2019

For an initial release of the backend API, I would rather be more conservative in what we expose, rather than including everything. It is much easier to add things later, than to remove.

I would personally also not move all those misc plots to the accessor (there might be some exceptions, like scatter matrix). IMO andrews_curves, radviz etc. are not "worth" a method.

That said: do we want to allow backends to implement additional "kinds" ? So we don't have to decide, as pandas, exactly which accessor methods can be available. If the user passes a certain kind or tries to access an attribute, we could still pass it to the backend plot with a custom __getattribute__.
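
A sketch of how an accessor could forward unknown kinds to the backend. It uses __getattr__ (called only when normal attribute lookup fails) rather than __getattribute__, which is usually enough for this purpose; the class names and the "violin" kind are invented for the example.

```python
class ToyBackend:
    """Hypothetical backend offering one standard and one extra kind."""

    @staticmethod
    def plot(data, kind, **kwargs):
        if kind not in {"line", "violin"}:
            raise ValueError("Unknown kind {!r}".format(kind))
        return "{} plot of {} values".format(kind, len(data))


class PlotAccessorSketch:
    def __init__(self, data, backend):
        self._data = data
        self._backend = backend

    def __call__(self, kind="line", **kwargs):
        return self._backend.plot(self._data, kind=kind, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so real attributes win.
        if name.startswith("_"):
            raise AttributeError(name)
        return lambda **kwargs: self(kind=name, **kwargs)


accessor = PlotAccessorSketch([1, 2, 3], ToyBackend)
print(accessor.violin())
```

With this design pandas does not need to know the list of kinds in advance, but it also cannot validate or document them, which is the concern raised in the replies.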

jorisvandenbossche on 6 Jul 2019

Just to explain a bit why things are the way they are now. It's relevant because I'm not quite sure how to implement the changes you propose, or not exposing things in general. Not saying here that it can't be done in a different way, it's just to enrich the discussion.

The first decision was to move all the code using matplotlib to a separate module (pandas.plotting._matplotlib). By doing that, that module somehow became the matplotlib backend.

Everything that was public in pandas.plotting has been kept as public there. And to make things as simple as possible, every one of these functions, once called, loads the backend (a call to _get_plot_backend) and calls the function of the same name there.

The public API for the user has no change at all, users still have the same methods and functions available. We're not exposing anything new.

As I understand it, if we decide that an existing plot like andrews_curves is not delegated to the backend, this implies that instead of getting the backend selected by the user, we will still select the matplotlib backend. Given that at least hvplot is already implementing andrews_curves, I personally don't see the point. If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change, what we'd do is simply make users' lives much harder, by adding extra complexity to pandas.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream (I guess that's one of the reasons?), maybe we can fall back to the matplotlib backend for anything that is missing in the selected backend?

About delegating any unknown kind of plot to the backend, I'm -1 on doing it right now. Surely it can make sense eventually. But I think having several plot kinds documented in pandas, and having extra ones that we don't document, feels a bit hacky. I think it can wait for the next version, after we have feedback on how having different backends works for users, and we have more time to discuss and analyze in detail.

datapythonista on 6 Jul 2019

If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change, what we'd do is simply make users' lives much harder, by adding extra complexity to pandas.

I don't think we would be making the user's life harder. Instead of importing it from pandas.plotting, if they want hvplot's version, they can simply import it from there. That is something not possible for the DataFrame.plot method, as that is defined on the object. For me that is the main reason for the plotting backend.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream

For me it is not about being nice or that implementing everything would be required (it is totally fine if a backend does not support all plotting types, IMO), but rather an unnecessary expansion of the plotting backend API, which also ties ourselves to it.
If we would restart pandas from scratch, I don't think those misc plotting types would be included. But with the plotting backend API we are in some way starting something new.

Any other opinions about this?

jorisvandenbossche on 16 Jul 2019

Agreed with @jorisvandenbossche.

Just to make sure this isn't lost, I think @jakevdp's suggestion to use setuptools' entry points is worth considering to solve the import order registration issue: https://github.com/pandas-dev/pandas/issues/26747#issuecomment-507415929

TomAugspurger on 16 Jul 2019

@jorisvandenbossche how would you change that in the code? Instead of getting the backend defined in the settings for those methods, get the matplotlib backend? I think this is wrong conceptually, but I'm ok with it if there is agreement. Anything that reverts the decoupling of the matplotlib code from the rest I'm -1.

Since you mention that in a pandas from scratch we wouldn't include those plots, should we deprecate them? I'm +1 on moving all the plots that are not methods of Series or DataFrame to a third-party package. Or if any is important enough to be kept, to move it to be called with .plot() as the others.

datapythonista on 17 Jul 2019

I would deprecate the non-standard plots in pandas and move them to an external package.

jreback on 17 Jul 2019

Joris is offline for a bit.

I think when we’ve discussed this in the past, his and my position on these is to just leave them untouched until they become a maintenance burden.

TomAugspurger on 17 Jul 2019

Just so we are on the same page, this is a summary of what we have, and my understanding of the state of the discussion:

Used as methods of Series and DataFrame (afaik we're all happy to keep them as they are, delegated to the selected backend):

PlotAccessor
boxplot_frame
boxplot_frame_groupby
hist_frame
hist_series

Other plots (under discussion whether they should be deprecated, delegated to the matplotlib backend, or delegated to the selected backend):

boxplot
scatter_matrix
radviz
andrews_curves
bootstrap_plot
parallel_coordinates
lag_plot
autocorrelation_plot
table

Other public stuff in pandas.plotting (under discussion too):

plot_params
register_matplotlib_converters
deregister_matplotlib_converters

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecate them in 0.25.

For the converters and the other stuff, what we have now is surely not correct, since register_matplotlib_converters delegates to the selected backend, which may not be matplotlib. The options that I guess we can consider are:

1. Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend
2. Move them from pandas.plotting to pandas.plotting.matplotlib (which would imply making the matplotlib backend public, so I wouldn't)
3. Leave them as they are, and delegate to the matplotlib backend instead of the selected backend (I see this more as a hack than a good design decision, I'd prefer to keep pandas.plotting agnostic of which backends exist)

datapythonista on 17 Jul 2019

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecate them in 0.25.

How do you find the "other plots" to be a maintenance burden? Looking at the history for the "misc" plots: https://github.com/pandas-dev/pandas/commits/0.24.x/pandas/plotting/_misc.py, we have ~10-15 commits since 2017. The majority are global cleanups applied to the entire codebase (so a small marginal burden). I only see 1-2 commits changing docs, and no commits changing functionality.

Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend

I don't think this would make sense. There are matplotlib-specific converters that we've written for matplotlib. Other backends won't have them. It probably shouldn't be part of the backend API.

TomAugspurger on 17 Jul 2019

I didn't mean those plots are a burden because of the amount of maintenance they've needed in the last months or years, but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

Regarding the converters, I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases. But it doesn't seem a problem if they don't, and those functions do nothing for some or all of the other backends. I'm also ok with option 2, but I don't find it as neat.

datapythonista on 17 Jul 2019

but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

They're already somewhat inconsistent with DataFrame.plot, though. The name "misc" implies that :) Does having a swappable backend make that any worse? To the extent that it's worth the churn on user code? I don't think so.

I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases.

I don't think so. The point of those converters is to teach matplotlib about pandas objects. Libraries implementing the backend won't have that problem, since they already depend on pandas.

TomAugspurger on 17 Jul 2019

Personally I think about it mainly in terms of managing complexity. Having a standard plotting API that is delegated to the backend via a single API is easy to understand, and to maintain. Users and maintainers just need to learn that there is a plot function with a kind argument, and that this will be executed in the selected backend.

Having a set of heterogeneous plots in the backend that, besides not following the same API, always use the Matplotlib backend rather than the one selected for the other plots, adds too much complexity for everyone IMHO.

And the cost of moving them seems small to me; my guess is that only a small proportion of our users even know about those plots. The ones who do will just need to install an extra conda package and use import pandas_plotting; pandas_plotting.andrews_curves(df) instead of pandas.plotting.andrews_curves(df).

To me it seems there's a lot to gain at a small cost, but of course it's just an opinion.

datapythonista on 17 Jul 2019

Can we document that the swappable backend is just for Series/DataFrame.plot? That seems like a pretty simple rule.

TomAugspurger on 17 Jul 2019

Feels like a hack that adds unnecessary complexity to me; I don't think explaining it in the documentation makes it less counter-intuitive.

But anyway, not a big deal. If that's the preferred option, this is how I'd implement it, at least the increase in code complexity is minimal: #27432

datapythonista on 17 Jul 2019

Looking more closely at this now: if I understand correctly, the way that the plotting backend will be set is using:

pd.set_option('plotting.backend', 'name_of_module')

My understanding, then, is that if I want to make the following work:

pd.set_option('plotting.backend', 'altair')

then I will need the top-level altair package to define all the functions in https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py. I would prefer not to pollute Altair's top-level namespace with all these additional APIs that are not meant to actually be used by Altair users. In fact, I would prefer for altair's pandas extension to live in a separate package, so it's not tied to the release cadence of Altair itself.

If I understand correctly, this means that there's no way for me to make pd.set_option('plotting.backend', 'altair') work correctly without hard-coding the altair package in pandas the way matplotlib is currently hard-coded, is that correct?

https://github.com/pandas-dev/pandas/blob/f1b9fc1fab93caa59aebcc738eed7813d9bd92ee/pandas/plotting/_core.py#L1550-L1551

If so, I would strongly advise rethinking the means by which this API is exposed in third-party packages.

My suggested solution would be to adopt an entrypoint-based framework that would let me, for example, create a package like altair_pandas that registers the altair entrypoint to implement the API. Otherwise users will forever be confused that pd.set_option('plotting.backend', 'altair') doesn't do what they expect.

jakevdp on 19 Jul 2019

Agreed. I think entry points are the way to go. I'll prototype something.

TomAugspurger on 19 Jul 2019

There was a point in time where what you say was mostly correct, but that's not the case anymore.

If you want pandas.options.plotting.backend = 'altair', in 0.25 you just need to have a function altair.plot(). At some point I thought it would be better to call the function pandas_plot instead of simply plot, so that it would be clearly pandas-specific in a backend module that contains other things, but in the end we didn't make the change.

If creating the plot function in the top level of altair is a problem, we can rename it in a future version, or you can also have altair.pandas.plot, but then users will have to set pandas.options.plotting.backend = 'altair.pandas'.
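A minimal sketch of the module-name lookup described above, using only the standard library rather than pandas itself. The module name fake_backend, its plot function, and the helper get_plot_backend are all hypothetical stand-ins for illustration; the only behaviour taken from the thread is "import the module named by the option and expect a top-level plot()".

```python
import importlib
import sys
import types

# Hypothetical stand-in for a third-party backend module. A real backend
# would be an installed package, not an object built at runtime.
fake_backend = types.ModuleType("fake_backend")


def _plot(data, kind="line", **kwargs):
    # A real backend would draw something; this one only reports the call.
    return "fake_backend drew a {} plot of {} points".format(kind, len(data))


fake_backend.plot = _plot
sys.modules["fake_backend"] = fake_backend


def get_plot_backend(name):
    """Import the module called `name` and check that it exposes plot()."""
    module = importlib.import_module(name)
    if not hasattr(module, "plot"):
        raise ValueError("{} does not define a top-level plot()".format(name))
    return module


backend = get_plot_backend("fake_backend")
result = backend.plot([1, 2, 3], kind="bar")
print(result)
```

With this scheme the option value and the importable module path are the same string, which is why a backend living at altair.pandas would have to be selected as 'altair.pandas'.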

You can surely change the option yourself once users do an import altair. And we could implement a registry of backends. But I think it'd be confusing for users if they set pandas.options.plotting.backend = 'altair' and it fails because they forgot to import altair first.

One last thing to consider is that we could end up with more than one pandas backend implemented for altair (or any other visualization library). So, for me, the backend name not being altair is not necessarily a bad thing.

datapythonista on 19 Jul 2019

Here's an entry-points based implementation

diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py
index 0610780ed..c8ac12901 100644
--- a/pandas/plotting/_core.py
+++ b/pandas/plotting/_core.py
@@ -1532,8 +1532,10 @@ class PlotAccessor(PandasObject):

         return self(kind="hexbin", x=x, y=y, C=C, **kwargs)

+_backends = {}

-def _get_plot_backend(backend=None):
+
+def _get_plot_backend(backend="matplotlib"):
     """
     Return the plotting backend to use (e.g. `pandas.plotting._matplotlib`).

@@ -1546,7 +1548,14 @@ def _get_plot_backend(backend=None):
     The backend is imported lazily, as matplotlib is a soft dependency, and
     pandas can be used without it being installed.
     """
-    backend_str = backend or pandas.get_option("plotting.backend")
-    if backend_str == "matplotlib":
-        backend_str = "pandas.plotting._matplotlib"
-    return importlib.import_module(backend_str)
+    import pkg_resources  # slow import. Delay
+    if backend in _backends:
+        return _backends[backend]
+
+    for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
+        _backends[entry_point.name] = entry_point.load()
+
+    try:
+        return _backends[backend]
+    except KeyError:
+        raise ValueError("No backend {}".format(backend))
diff --git a/setup.py b/setup.py
index 53e12da53..d2c6b18b8 100755
--- a/setup.py
+++ b/setup.py
@@ -830,5 +830,10 @@ setup(
             "hypothesis>=3.58",
         ]
     },
+    entry_points={
+        "pandas_plotting_backends": [
+            "matplotlib = pandas.plotting._matplotlib",
+        ],
+    },
     **setuptools_kwargs
 )

I think it's quite nice. 3rd party packages will modify their setup.py (or pyproject.toml) to include something like

entry_points={
    "pandas_plotting_backends": ["altair = pdvega._pandas_plotting_backend"]
}

I like that it breaks the tight coupling between naming and implementation.
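To make the decoupling concrete, here is a rough sketch of how a declared entry point resolves to a module. It swaps in the standard library's importlib.metadata.EntryPoint (Python 3.8+) for the pkg_resources call in the diff, and builds the entry point by hand instead of reading it from installed package metadata. The module name altair_pandas and its plot function are hypothetical; only the group name "pandas_plotting_backends" comes from the diff.

```python
import sys
import types
from importlib.metadata import EntryPoint

# Hypothetical backend package, faked in sys.modules so the sketch runs
# without installing anything.
altair_pandas = types.ModuleType("altair_pandas")
altair_pandas.plot = lambda data, kind="line", **kwargs: ("altair", kind)
sys.modules["altair_pandas"] = altair_pandas

# What an installed package's metadata would declare for the line
# "altair = altair_pandas" under the "pandas_plotting_backends" group.
declared = [
    EntryPoint(
        name="altair",
        value="altair_pandas",
        group="pandas_plotting_backends",
    ),
]

# Build the name -> module registry, as the loop in the diff does.
# load() imports the module named in `value` (and, if a ":attr" suffix
# is present, looks that attribute up on the imported module).
backends = {ep.name: ep.load() for ep in declared}

print(backends["altair"].plot([1, 2], kind="scatter"))
```

The user-facing name ("altair") and the module that implements it ("altair_pandas") are independent strings here, which is the decoupling the entry-point approach is meant to give.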

TomAugspurger on 19 Jul 2019

I haven't worked with entry points; are they like a global registry for the Python environment? Being new to them I don't love the idea, but I guess that would be a reasonable way to do it then.

I'd still like to have both options, so if the user does pandas.options.plotting.backend = 'my_own_project.my_custom_small_backend' it works, and doesn't require creating a package and setting entry points.
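One way the two options could coexist, sketched with the standard library only: look the name up in a registry first, and fall back to importing it as a module path. Everything here is an assumption made for illustration (the registry dict, the find_backend helper, and the fake modules); it is not the pandas implementation.

```python
import importlib
import sys
import types

# Hypothetical registry, standing in for whatever entry points provided.
_backends = {}


def find_backend(name):
    """Prefer a registered backend; otherwise treat `name` as a module path."""
    if name in _backends:
        return _backends[name]
    try:
        module = importlib.import_module(name)
    except ImportError as err:
        raise ValueError("No backend {}".format(name)) from err
    if not hasattr(module, "plot"):
        raise ValueError("{} does not define plot()".format(name))
    _backends[name] = module  # cache so later lookups skip the import
    return module


# A backend that was registered under a short name.
registered = types.ModuleType("registered_backend")
registered.plot = lambda data, **kwargs: "registered"
_backends["altair"] = registered

# A user's own module, never registered, reachable only by its import path.
custom = types.ModuleType("my_custom_small_backend")
custom.plot = lambda data, **kwargs: "custom"
sys.modules["my_custom_small_backend"] = custom

print(find_backend("altair").plot([1]))
print(find_backend("my_custom_small_backend").plot([1]))
```

The registry takes priority in this sketch, so a registered short name would shadow an importable module of the same name; the opposite order is equally possible and is a design choice left open in the discussion.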

datapythonista on 19 Jul 2019

I haven't worked with entry points; are they like a global registry for the Python environment?

I haven't used them either, but I think that's the idea. From what I understand, they're from setuptools (but packages like flit hook into them?). So they aren't part of the standard library, but setuptools is what everyone uses anyway.