Pandas: BUG: concat unwantedly sorts DataFrame column names if they differ

Created on 17 Aug 2013  ·  36 Comments  ·  Source: pandas-dev/pandas

When concat'ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they're identical across DataFrames, they don't get sorted.
This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order, as in SQL, would be: the columns from df1 (in the same order as in df1), followed by the columns unique to df2, i.e. less the common ones (in the same order as in df2). Example:

df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))

print "Cols unsorted:", concat([df4a,df4b])
# Cols unsorted:           C         B         D         A

print "Cols sorted", concat([df4a,df5])
# Cols sorted           A         B         C         D         E
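
For illustration, that SQL-style order can be recovered by hand right after the concat (a quick sketch using the frames above):

# desired order: df4a's columns first, then df5's extras in df5's order
wanted = list(df4a.columns) + [c for c in df5.columns if c not in df4a.columns]
print(concat([df4a, df5]).reindex(columns=wanted).columns)
# expected: Index(['C', 'B', 'D', 'A', 'E'], dtype='object')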
Labels: API Design, Reshaping


All 36 comments

Looking at this briefly I _think_ this stems from Index.intersection, whose docstring states:

Form the intersection of two Index objects. Sortedness of the result is not guaranteed

Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result...
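
For reference, the same sort is visible directly on Index objects (a quick sketch; Index.union is what produces the result columns for the default outer join):

left = pd.Index(['C', 'B', 'D', 'A'])
print(left.union(left))                                 # identical inputs are special-cased: order kept
print(left.union(pd.Index(['C', 'B', 'E', 'D', 'A'])))  # differing inputs come back lexsorted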

@smcinerney what order would you expect instead?

I found the auto sort a bit annoying too (well, I should say it depends on your purpose), because I was trying to concat a frame to an empty frame in a loop (like appending an element to a list). Then I realized my column order had changed. This also applies to the index if you are concatenating along axis=1.

In a case similar to that of @smcinerney, I expect the final order to be CBDAE. E shows up last because the order CBDA shows up first when concatenating.

Therefore I wrote a "hack" (kinda silly though)

# (rewritten here as a standalone function; the original relied on helper
# methods from the author's own module)
def concat_unsorted(frameList, axis=0, join='outer', join_axes=None, sort=False):
    result = pd.concat(frameList, axis=axis, join=join, join_axes=join_axes,
                       ignore_index=False, keys=None, levels=None, names=None,
                       verify_integrity=False)

    if join_axes or sort:
        return result

    # expand all original orders in each frame
    sourceOrder = []
    for frame in frameList:
        sourceOrder.extend(frame.columns if axis == 0 else frame.index)
    sortedOrder = list(result.columns if axis == 0 else result.index)

    # map each label in the (sorted) result back to its first occurrence
    # in the source frames
    positions = [sourceOrder.index(label) for label in sortedOrder]
    positionsSorted = sorted(positions)

    # emit the labels in order of first appearance
    unsortedOrder = [sortedOrder[positions.index(p)] for p in positionsSorted]

    if axis == 0:
        return result.reindex(columns=unsortedOrder)
    return result.reindex(index=unsortedOrder)

The function is included in my personal module called kungfu! Anyone can adopt the above algorithm, or have a look at my module at https://github.com/jerryzhujian9/kungfu

Finally, I greatly appreciate the work of the development team for this great module!

This behavior is indeed quite unexpected and I also stumbled over it.

>>> df = pd.DataFrame()
>>> df['b'] = [1,2,3]
>>> df['c'] = [1,2,3]
>>> df['a'] = [1,2,3]
>>> print(df)
   b  c  a
0  1  1  1
1  2  2  2
2  3  3  3

[3 rows x 3 columns]
>>> df2 = pd.DataFrame({'a':[4,5]})
>>> df3 = pd.concat([df, df2])

Naively one would expect that the order of columns is preserved. Instead the columns are sorted:

>>> print(df3)
   a   b   c
0  1   1   1
1  2   2   2
2  3   3   3
0  4 NaN NaN
1  5 NaN NaN

[5 rows x 3 columns]

This can be corrected by reindexing with the original columns as follows:

>>> df4 = df3.reindex_axis(df.columns, axis=1)
>>> print(df4)
    b   c  a
0   1   1  1
1   2   2  2
2   3   3  3
0 NaN NaN  4
1 NaN NaN  5

[5 rows x 3 columns]

Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know.
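
(A side note on the snippet above: reindex_axis was later deprecated; on current pandas the equivalent call is plain reindex:)

>>> df4 = df3.reindex(columns=df.columns)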

I've just come across this too.

new_data = pd.concat([churn_data, numerical_data])

Produced a DataFrame:

     churn  Var1  Var10  Var100  Var101 
0      -1   NaN    NaN     NaN     NaN     
1      -1   NaN    NaN     NaN     NaN

It would seem more natural for the numerical DataFrame to be concatenated without being sorted first!!

well, this is a bit of work to fix. but pull requests accepted!

Just stumbled upon this same issue when I was concatenating DataFrames. It's a little bit annoying if you don't know about this issue, but actually there is a quick remedy:

Say dfs is a list of DataFrames you want to concatenate; you can just take the original column order and feed it back in:

df = pd.concat(dfs, axis=0)
df = df[dfs[0].columns]
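
One caveat worth noting: indexing with dfs[0].columns drops any column that only appears in a later frame. A sketch that keeps every column, in first-seen order:

# collect columns in order of first appearance across all frames
seen = []
for frame in dfs:
    seen.extend(c for c in frame.columns if c not in seen)
df = pd.concat(dfs, axis=0)[seen]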

I believe append causes the same behavior, FYI

It's the default behaviour across the board. For example, if you apply a function f to a groupby() and f returns a varying number of columns, the concatenation taking place behind the scenes also auto-sorts the columns.

df.groupby(some_ts).apply(f)

Likely because the known order of the columns is open to interpretation.

However, this also happens for MultiIndices and all hierarchies in MultiIndices. So you can concat dataframes that agree on level0 columns and all bar one level1 columns, and all levels of the MultiIndices will be autosorted because of one mismatch within one level0 column. I don't imagine that is desirable.

I'd love to help, but unfortunately fixing this issue is beyond my ability. Thanks for the hard work all.
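
For reference, the MultiIndex case described above is easy to reproduce (a minimal sketch; the labels are made up):

import numpy as np
import pandas as pd

cols1 = pd.MultiIndex.from_tuples([('x', 'b'), ('x', 'a'), ('y', 'd')])
cols2 = pd.MultiIndex.from_tuples([('x', 'b'), ('x', 'a'), ('y', 'c')])
d1 = pd.DataFrame(np.random.randn(2, 3), columns=cols1)
d2 = pd.DataFrame(np.random.randn(2, 3), columns=cols2)

# one mismatch in level 1 ('d' vs 'c') and the whole MultiIndex comes
# back lexsorted on both levels
print(pd.concat([d1, d2]).columns.tolist())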

+1 for this feature

Agreed, +1. Unexpected sorting happens all the time for me.

+1, this was an unpleasant surprise!

+1, I hate having the columns sorted after every append.

+1 from me, as well.

Because even if I did want to manually re-order after a concat, when I try to print out the 60+ column names and positions in my dataframe:

for i, value in enumerate(df.columns):
    print(i, value)

All 60+ columns are output in alphabetical order, not their actual position in the data frame.

So that means that after every concat, I have to manually type out a list of 60 columns to reorder. Ouch.

While I'm here, does anyone have a way to print out column name and position that I'm missing?

+1 for this feature, just ran across the same deal myself.

@summerela Get the column index and then re-index your new dataframe using the original column index

# assuming you have two dataframes, `df_train` & `df_test` (with the same columns) 
# that you want to concatenate

# get the columns from one of them
all_columns = df_train.columns

# concatenate them
df_concat = pd.concat([df_train,
                       df_test])

# finally, re-index the new dataframe using the original column index
df_concat = df_concat.ix[:, all_columns]
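
(Side note: .ix has since been deprecated; on newer pandas the same re-ordering can be written with reindex:)

df_concat = df_concat.reindex(columns=all_columns)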

Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe but aren't sure how many columns will remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you can pass a relative index position, e.g. new_indices=[-1], and it will take care of the rest.

import numpy as np
import pandas as pd

def reindex_columns(dframe=None, columns=None, new_indices=None):
    """
    Reorders the columns of a dataframe as specified by
    `new_indices`. Values of `columns` should align with their
    respective values in `new_indices`.

    `dframe`: pandas dataframe.

    `columns`: list, pandas Index, or numpy array; columns to
    reindex.

    `new_indices`: list of integers; indices corresponding to where
    each column should be inserted during re-indexing. Negative
    values count from the end, as in normal Python indexing.
    """
    print("Re-indexing columns.")
    df = dframe.copy()
    try:
        # ensure parameters are of correct type and length
        assert isinstance(columns, (pd.Index, list, np.ndarray)), \
            "`columns` must be a pandas Index, list, or numpy array"

        assert isinstance(new_indices, list), \
            "`new_indices` must be of type `list`"

        assert len(columns) == len(new_indices), \
            "Length of `columns` and `new_indices` must be equal"

        # work on a copy so the caller's list is not mutated, and
        # replace negative positions with their absolute position
        # in the column index
        new_indices = [df.columns.get_loc(df.columns[idx]) if idx < 0 else idx
                       for idx in new_indices]

        # drop the columns that need to be re-indexed
        all_columns = df.columns.drop(columns)

        # re-insert them at the specified locations, from the lowest
        # target position upward so earlier insertions do not shift
        # later ones
        for idx, column in sorted(zip(new_indices, columns), key=lambda p: p[0]):
            all_columns = all_columns.insert(idx, column)

        # re-index the dataframe (`.ix` in the original; `.loc` now)
        df = df.loc[:, all_columns]

        print("Successfully re-indexed dataframe.")

    except Exception as e:
        print(e)
        print("Could not re-index columns. Something went wrong.")

    return df

Edit: Usage would look like the following:

# move 'Column_1' to the end, move 'Column_2' to the beginning
df = reindex_columns(dframe=df,
                     columns=['Column_1', 'Column_2'],
                     new_indices=[-1, 0])

I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn't work:

dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]

The columns still get sorted. Using lists as an intermediate step seemed to resolve things, though.

Edit: I spoke too soon..


Follow-up: fwiw, column order can be preserved with chained .join calls on individual objects:

df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained
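
With many frames, the chained form can be written generically (a sketch; join assumes the frames share no column names, otherwise lsuffix/rsuffix are needed):

from functools import reduce

frames = [df1, df2, df3]
combined = reduce(lambda left, right: left.join(right), frames)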

Could there be a parameter for controlling column order when creating a DataFrame? Something like order=False. Thanks a lot

Just ran into this while creating a dataframe from a dictionary. Totally surprised me; it was counterintuitive and defeated my whole purpose...

Column names should be for clarity, and placing related columns near each other is an organizational choice the user makes to maintain coherency.

@patricktokeeffe
Thanks for the pointer to join. Series objects don't have that method so I ended up writing a function:

def concat_fixed(ndframe_seq, **kwargs):
    """Like pd.concat but fixes the ordering problem.

    Converts Series objects to DataFrames to access the join method.
    Use kwargs to pass through to the repeated join calls.
    """
    indframe_seq = iter(ndframe_seq)
    # Use the first ndframe object as the base for the final table
    final_df = pd.DataFrame(next(indframe_seq))
    for dataframe in indframe_seq:
        if isinstance(dataframe, pd.Series):
            # Convert Series objects into DataFrames, since Series
            # objects do not have a join method
            dataframe = pd.DataFrame(dataframe)
        # Iteratively build the final table
        final_df = final_df.join(dataframe, **kwargs)
    return final_df

How's the efficiency on this?


@MikeTam1021

I'm not going to benchmark it at the moment, but I think it would be a function of the size of your ndframes and how many of them there are. It does create a new dataframe for each ndframe, so I imagine it is much less efficient than pd.concat.

It works fine for my purposes, but I'm using a small number of ndframes (around 10^1), each with a relatively small number of records (around 10^2).

My goal is to include all records from every dataframe while preserving order of those records, even if not all ndframes contain data for a given record.

I can't see why preserving column order (as much as possible) isn't the default behaviour of concat().

My workaround uses unique_everseen from the Itertools Recipes.

columns = list(unique_everseen([column for df in dfs for column in df.columns]))
df = pd.concat(dfs)[columns]
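
For completeness, unique_everseen is the recipe from the itertools docs; a trimmed version (without the key argument) looks like this:

def unique_everseen(iterable):
    """Yield unique elements, preserving order (itertools recipe, trimmed)."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element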

Any updates on the status of this thread? I am currently using version 0.22.0 and there still seems to be no proper solution. Procrastination seems to be quite an issue here...

I would also like to note that similar behaviour can be found when concatenating columns, i.e. axis=1, but only when passing the dataframes in a dictionary:

>>> df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
>>> df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
>>> df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3, 5))

>>> pd.concat([df4a, df5], axis=1).columns
Index(['C', 'B', 'D', 'A', 'C', 'B', 'E', 'D', 'A'], dtype='object')
>>> pd.concat({'df4a': df4a, 'df4b': df4b}, axis=1).columns.levels
FrozenList([['df4a', 'df4b'], ['C', 'B', 'D', 'A']])
>>> pd.concat({'df4a': df4a, 'df5': df5}, axis=1).columns.levels
FrozenList([['df4a', 'df5'], ['A', 'B', 'C', 'D', 'E']])
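
The per-frame order can be restored here too, by reindexing with the column tuples spelled out explicitly (a sketch using the frames above):

fixed = pd.concat({'df4a': df4a, 'df5': df5}, axis=1)
# rebuild the (key, column) pairs in each frame's own order
wanted = [(k, c) for k, frame in [('df4a', df4a), ('df5', df5)] for c in frame.columns]
fixed = fixed.reindex(columns=pd.MultiIndex.from_tuples(wanted))
print(fixed.columns.tolist())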

Any updates on the status of this thread?

Still open.

Procrastination seems to be quite an issue here...

Procrastination? We have a lot of open issues. If you want to ensure this gets fixed in the next release then your best bet is to put together a PR. Let us know if you need help getting started.

@jtratner: in case it wasn't obvious from my examples at the top, I'd expect the order to be:

  • shared columns, unsorted
  • columns unique to df1, unsorted (i.e. in order in which they occur in df1)
  • columns unique to df2, unsorted (i.e. in order in which they occur in df2)

This is what you get in other packages or languages, e.g. SQL. There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

Hey guys, 2 things. 1) Welcome to pandas! I suggest just using more python native types like dictionaries. Stop trying to turn python (or any language) into SQL. 2) This is not technically a bug. It is just an unwanted effect of the code. You can easily overcome it outside the context of the package, and that is what I would think is the correct answer unless someone here takes it upon themselves.

@MikeTam1021 please explain how to overcome this outside the context of the package. Thanks.

I’m pretty sure that’s exactly what people in this thread have been discussing. I see lots of good solutions above that should work.

@MikeTam1021 It's not about turning pandas into SQL (heaven forbid!), but I couldn't agree more with:

There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

Concatenating DataFrames should have the same effect as "writing them next to each other", and that implicit sort definitely violates the principle of least astonishment.

I agree. It shouldn't. It also assumes an order to the columns, which is SQLish, and not pure computer science. You should really know where your data is.

I hardly use pandas anymore after discovering this and many other issues. It has made me a better programmer.

+1 on this

This works for me:

# note: this assumes df1 and df2 have no columns in common
cols = list(df1) + list(df2)
df1 = pd.concat([df1, df2])
df1 = df1.loc[:, cols]

I have to bitch about how this patch was rolled out. You have simultaneously changed the function signature of concat AND introduced a warning about its usage, all within the same commit.

The problem is that we use pandas on multiple servers and cannot guarantee that all servers have the exact same version of pandas at all times. So now we have less-technical users seeing warnings from programs that never showed them before, uncertain whether the warning is a sign of a problem.

I can readily identify WHERE the warning is coming from, but I can't add either of the suggested options, because that would break the program on any server running an older version of pandas.

It would have been preferable to put the sorting capability into 0.23 and add the warning in some later version. I know it's a pain, but it's rather obnoxious to assume that users can immediately update all deployments to the latest code.

It sounds like you can just set a global filter for this warning and then drop it when everyone is upgraded.

Functionally, that's the same, right?
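
For what it's worth, a targeted filter would look roughly like this (the message text is quoted from the 0.23 warning and is an assumption on my part; match it against whatever your version emits):

import warnings

warnings.filterwarnings(
    "ignore",
    message="Sorting because non-concatenation axis is not aligned",
    category=FutureWarning,
)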


@TomAugspurger There are a multitude of ways that we on our side can deal with this. Certainly filtering warnings is one. It's not great, because the mechanics of warnings filters are a bit ugly...

  1. I would have to add the filter to multiple programs
  2. There is no great way to specify the specific warning to filter:

    • I can filter by module and lineno, but that isn't a stable reference,

    • I can filter by module and FutureWarning but then I wouldn't get any warnings at all from pandas and would be surprised by other changes,

    • or I can filter by your long multi-line message

  3. And then remember to take that filter out when everything is upgraded and it no longer matters.

In any case the deficiencies in the warnings module are certainly not something I can put at the foot of the pandas team.

Nor is it your fault that we have an older server we can't easily upgrade, so that would be the other thing I can do (just upgrade all the damn deployments). Ultimately, I recognize that I have to do that and that it is my responsibility to try and keep our deployments close together.

It just seems a bit bizarre to me that you were so concerned about a possible change in user visible end behavior that you added this sort option to what was previously an underspecified API, and yet have simultaneously thrown a warning at the programmer... both the warning and the proposed change in sort behavior constitute "user visible behavior" in my book, just of different severities.
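
For reference, once every deployment is on 0.23 or later, passing sort explicitly both silences the warning and pins the behavior (the sort parameter was added to concat in 0.23.0):

# pandas >= 0.23
df3 = pd.concat([df, df2], sort=False)  # keep first-seen column order, no lexsort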

I've answered a related question on SO.
