pandas.DataFrame.query to allow column name with space

Created on 28 Feb 2014  ·  41 Comments  ·  Source: pandas-dev/pandas

Being able to do something like this would be nice:

df.query('[col with space] < col')

I came across many external data files which have spaces in the column names. It would be nice to be able to do quick analysis on the data without first renaming the columns.
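For illustration (the frame here is invented), the naive form fails because the query string must parse as Python source, where a bare spaced name is simply invalid syntax:

```python
import pandas as pd

df = pd.DataFrame({"col with space": [1, 2, 3], "col": [2, 2, 2]})

try:
    df.query("col with space < col")
    raised = False
except Exception:  # the exact exception type may vary by pandas version
    raised = True
```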

API Design Indexing Reshaping

Most helpful comment

R and dplyr use backticks (`) to quote column names with spaces and special characters. For example:

df$`column with space`
df %>%
    mutate(`column with space` = 1)

I wonder how hard it would be to incorporate this into the pandas parser, so that functions like query and eval, or even the df.column dot syntax, can deal with more general column names?
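The syntax pandas eventually shipped (in 0.25, as noted later in this thread) mirrors the R convention. A minimal example:

```python
import pandas as pd

df = pd.DataFrame({"column with space": [0, 1, 2], "col": [1, 1, 1]})

# Backtick-quoted names work in query/eval on pandas >= 0.25.
result = df.query("`column with space` < col")
```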

All 41 comments

Hm. Many datasets _do_ have this issue, but consider the amount of code needed to rename columns versus the amount of code needed to parse that new syntax (or something similar).

Renaming columns is straightforward:

cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str, unicode)) else x)
df.columns = cols

This is well tested and easy to debug. For more complicated replacement you can use regular expressions.
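For instance, a regex-based cleanup (the column names here are invented for illustration) could be:

```python
import re

import pandas as pd

df = pd.DataFrame({"col with space": [1], "price ($)": [2]})

# Collapse every run of non-identifier characters into one underscore,
# then trim any underscores left at the ends.
df.columns = [re.sub(r"\W+", "_", c).strip("_") for c in df.columns]
```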

The things that go into parsing are markedly less straightforward:

  1. How would this tokenize? (This isn't so straightforward to me. There might be an easy way to do it, but it seems like you'd have to create a new token to disambiguate this syntax from list construction, and then map the whole thing to a valid Python identifier.)
  2. How would this parse? (Not too hard: it parses as a ColumnNode or some such object, which gets looked up as a column in the frame.)

Something that might be useful as a happy medium is a df = pd.clean_columns(df) function that renames your columns to valid Python identifiers so that you don't have to think too hard about it.
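Such a helper never landed in pandas under this name, but a minimal sketch of the idea might look like:

```python
import keyword
import re

import pandas as pd

def clean_columns(df):
    """Return a copy of df whose column names are valid Python identifiers."""
    def clean(name):
        name = re.sub(r"\W+", "_", str(name)).strip("_")
        if not name or name[0].isdigit() or keyword.iskeyword(name):
            name = "_" + name  # avoid empty names, leading digits, keywords
        return name
    return df.rename(columns={c: clean(c) for c in df.columns})

cleaned = clean_columns(pd.DataFrame({"2nd col": [1], "class": [2]}))
```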

I am not sure about the implementation. Maybe we can use normal brackets instead like

df.query('(col with space) < col')

Anyway, the clean_columns method sounds like a good idea to me. Maybe it could also be a keyword argument in the read_csv function.

@socheon Let's try to keep our discussion to a single issue (I tend to get off topic too!). If you think there should be a clean_columns argument to read_csv, please open another issue.

@jreback What do you think about a top-level clean_columns function?

I think if you put quotes around the column name it might work on master.

The fix for allowing & and | makes these be treated like single tokens.

It will be treated as a string, which is then turned into an internal temporary so that won't work.

df.query("'a column with a space' > 2") -> df.query("tmp_var_str_some_hex_value > 2")

Maybe allow the column to be referenced by its cleaned version?

df.query("column_with_space > 2")

I do this cleanup for autocompletion whenever it makes sense.

Column names with spaces, dots, brackets and other invalid characters may optionally be auto-replaced by equivalent valid characters, such as underscores. This is also very handy for accessing columns as members of a dataframe with dot syntax.
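A sketch of the clean-alias idea (the helper name and the use of Python's eval here are illustrative only; pandas' real evaluator works differently):

```python
import re

import pandas as pd

def query_clean(df, expr):
    # Expose each column under a cleaned alias so the expression can use it.
    aliases = {re.sub(r"\W+", "_", c): df[c] for c in df.columns}
    mask = eval(expr, {"__builtins__": {}}, aliases)  # sketch only
    return df[mask]

out = query_clean(pd.DataFrame({"a col": [1, 5]}), "a_col > 2")
```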

Hello,
I want to make a plea for a resolution of this issue.

It is a very important issue for us. There are cases where we cannot change the column names because they need to be preserved.

That pandas puts an arbitrary requirement on column names is, IMHO, a bad design decision and bad programming practice.

Sorry for complaining, but I really think that pandas should fix this issue properly. Please consider it. Thank you.

@dgua you are welcome to submit a pull request to fix this. Note that in reality .query is just a nice-to-have interface; it has very specific guarantees, meaning it's meant to parse like a query language, not to be a fully general interface.

In any event, the recommended 'main' way of indexing has always been:

In [10]: df = DataFrame({'foo bar': [1, 2, 3, 4]})

In [11]: df[df['foo bar'] > 2]
Out[11]: 
   foo bar
2        3
3        4

@jreback, I agree with you. However, we do have people who use .query extensively in their code because it's easier and more readable, so now we are in a conundrum: either change all that code or rename the columns, neither of which is desirable...

@dgua as I said, a pull-request from the community would get this done. I simply don't have time.

R and dplyr use backticks (`) to quote column names with spaces and special characters. For example:

df$`column with space`
df %>%
    mutate(`column with space` = 1)

I wonder how hard it would be to incorporate this into the pandas parser, so that functions like query and eval, or even the df.column dot syntax, can deal with more general column names?

I like @zhiruiwang's proposal, and judging by the upvotes, others do too.

So I have looked into the code a bit. Basically, pandas does some preprocessing on the expression and passes it to numexpr. Alongside that, a local dict containing, among other things, the names of the dataframe columns is passed along.

I was thinking of just altering the expression by replacing every space within backticks with something else (like "_", or something less common, to prevent name clashes) and removing the backticks. This creates a valid expression for numexpr. Then do the same to the spaces in column names when they are passed to the resolvers that eventually make up the local dict, so that the correct names can still be found by numexpr.

Maybe the code to do this could look like the following, but I have not tested it; I'd first like to hear what others think of the idea.

# Don't know if "_" is a good choice and don't know where to place this variable, 
# since it has to be constant in two different files and ideally is only defined once.
SEPERATOR_REPLACING_SPACES = "_"

# Replace spaces in variables surrounded by backticks:
# pandas/pandas/core/computation/expr.py 

import re

...

# new function
def _replace_spaces_backtickvariables(source):
    return re.sub(r'`(.*?)`', 
                  lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES), 
                  source)

...

# adjusted function
def _preparse(source, f=compose(_replace_locals, _replace_booleans,
                                _rewrite_assign), g=lambda x: x):
    ...
    g : callable
        This takes a source string and returns an altered one
    ...
    assert callable(g), 'g must be callable'
    source = g(source)

...

# adjusted class
class PandasExprVisitor(BaseExprVisitor):

    def __init__(self, env, engine, parser,
                 preparser=partial(_preparse,
                                   f=compose(_replace_locals, _replace_booleans),
                                   g=_replace_spaces_backtickvariables)):

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

EDIT: fixed _replace_spaces_backtickvariables (regex should've been lazy)
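As a standalone check of the backtick-stripping step proposed above (the same function, run on a sample expression):

```python
import re

SEPERATOR_REPLACING_SPACES = "_"  # spelling kept from the proposal above

def _replace_spaces_backtickvariables(source):
    # Lazily match each backtick-quoted span and join its words.
    return re.sub(r"`(.*?)`",
                  lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES),
                  source)

cleaned = _replace_spaces_backtickvariables("`col with space` < col")
```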

@zhiruiwang The df.column dot syntax won't be able to use this, since it's parsed by Python itself, but query and eval would work.

@jreback You seem to be the one who knows the most about this.

What do you think of my approach as explained in the previous comment?

Instead, we could also decide to solve it as dalejung proposed: allow "dirty" names to be referred to by their "clean" names ("this column name" can be referred to as "this_column_name" without the column actually changing its name) and not use the backtick encapsulation at all.

This would then only require a single line to be changed. (If I understand the code correctly; not tested.)

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

What do you think? Which approach do you think is best?

Then I can try to make a pull request for it.

This feature would be nice, but I work around it with commands like these:

Replace White Space

df.rename(columns={k: k.replace(' ', '_') for k in df.columns if ' ' in k}, inplace=True)

Starts with Numeric Value

df.rename(columns={k: '_' + k for k in df.columns if k[0].isdigit()}, inplace=True)

Sorry, but that’s unsatisfactory. How do I get back to the original columns? What if ‘_’ is also used as part of a column name?

Users shouldn’t have to do this and it’s a serious problem of pandas.


In SQL you just use square brackets, and that's nice.

@dgua do you have time to submit a PR? @zhiruiwang's suggestion of using backticks seems reasonable, if it can be implemented (we couldn't use them before, since they already have a meaning in Python 2).

@TomAugspurger I would have some time after my exams, and already took a look into the code for the implementation (see my previous comments). I still have some questions:

  • What should SEPERATOR_REPLACING_SPACES be? If we pick just "_" instead of a more complex replacement string, it might interfere with already-existing columns, but "_" has the extra feature that you can refer to a column both by `this column name` _and_ by this_column_name. (We could even drop the backticks feature.)

  • SEPERATOR_REPLACING_SPACES has to be consistent across two files. Should it be declared somewhere else and imported into both files to make sure they stay the same, or would a simple comment mentioning this suffice?

  • In the code I will alter, callable() is used. callable() is not available in 3.0-3.2, so should I also fix this at the same time?

I'm not familiar with this code, so you may be the expert here :)

In the code I will alter, callable() is used. callable() is not available in 3.0-3.2

Python 3.0 - 3.2? We require 3.5+

I think the backtick idea is rather good now, since in the eval or query context you wouldn't need to worry about how things are implemented. I wouldn't want to drop it.

For the plain Python context, the new names could have a prefix or suffix to prevent collisions (e.g. they all end with an underscore).

have a prefix or suffix to prevent collisions

@beojan Yes, but maybe you can see those "collisions" as a feature: you can refer to a column by `this column name` _and_ by this_column_name. I also think it would be a bit silly to have two columns named this column name and this_column_name, but you never know.

You can use this_column_name_.

If you really have two colliding columns, I don't see how that's a feature though.

Well, if I have a dataframe:

    "column name" "name"
1   4              5
2   2              1

With the feature implemented, without measures for colliding, I can now say:

df.query("column_name > 3")

And pandas would automatically refer to "column name" in this query. This was also suggested earlier by dalejung. You could then also leave out the support for backticks.

I also don't think you would see any dataframes _in the wild_ that look like:

    "column name" "name" "column_name"
1   3              5     6
2   2              1     9

In which the collisions would cause a problem.

So, in my view, it won't cause any collisions and gives an extra way to refer to the column.

I don't think we'd be interested in partially supporting this. The point of the PR would be to completely avoid ambiguity, so the last example should work.


@TomAugspurger Okay, that is clear.

Do we then go with a suffix like "_" to prevent collisions, or do we go with replacing the space in column names with a more complex string, so we can rule out even more accidental collisions?

This might work directly in the AST. I think these actually parse correctly, but we don't handle the reassembly correctly.

Hi everyone,

Is there a timeline for when this feature will be released?

I think it's done in master, given the commit mentioned above.

this will be in 0.25 - prob in a month or 2

Don't want to ruin the fun, but to prevent any disappointment: pandas 0.25 will only be available for Python 3. Also see https://pandas-docs.github.io/pandas-docs-travis/install.html#install-dropping-27

@jreback Thanks for the information. I saw that the 0.25.0 milestone is due for May 1.

@hwalinga No worries. This will be used in a Python 3-only codebase.

This would have been a good opportunity to allow column names that contain dots. Why was this not included?

@danielhrisca

I implemented this, and we didn't think of it. You can open a new issue to bring it up again. There might be a good reason, which I am not aware of but the maintainers are, why this is best not allowed.

I think I can implement it again. It won't have the same solution as the space handling, however. The reason is that the query string is parsed as Python source code; you have to apply workarounds if you want certain syntax to be interpreted differently, and a custom parser probably won't be written for this function.

@hwalinga Hi,
I have encountered the same problem, but in my case the column name contains a dot:

snv_df.query('Gene.refGene in ["MSH2","MSH3","MLH1","MLH3","MSH6","PMS2","PMS3"]')

I saw the changes you made in 0.25.0.
I have a suggestion: when I read the source code of ANNOVAR, a Perl program, I found that it uses hexadecimal escapes to represent special symbols, e.g.:

comsic=comsic\x3dxxxxxxxxxxxxxxxxxxx

A function could then map the special symbols between input and output when parsing.
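One way to read that suggestion (purely a sketch; not how pandas implemented anything): encode each disallowed character as a reversible _xHH_ escape so the name becomes a valid identifier:

```python
import re

def hex_escape(name):
    # Replace every non-identifier character with _xHH_ (its hex code).
    return re.sub(r"\W", lambda m: "_x%02x_" % ord(m.group(0)), name)

def hex_unescape(name):
    # Invert hex_escape.
    return re.sub(r"_x([0-9a-f]{2})_", lambda m: chr(int(m.group(1), 16)), name)

encoded = hex_escape("Gene.refGene")
```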

@zhaohongqiangsoliva

I don't think I understand what you are trying to say. Can you elaborate a bit more on the problem you want to solve?

@hwalinga
Sorry, my English is not good. My problem is that spaces in column names are now handled, but other symbols are still not, for example:

.      /           \     

and the change I propose is to use hex escapes.

I second this. ANY character should be allowed in column names.

Thanks.


@dgua @zhaohongqiangsoliva

The problem is that the query has to become a valid Python expression. Using hex for disallowed characters won't solve this problem by itself. Allowing spaces in the name is already based on hacking around the tokenize function (from tokenize import generate_tokens). I don't think it is impossible to allow more characters in the name, but it will be based on hacking around the tokenize function again. (Pandas won't roll its own Python parser.)
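Roughly the shape of that tokenize-level hack (a simplified sketch, not pandas' exact code): in Python 3 a backtick comes out of tokenize as an ERRORTOKEN rather than raising, so the tokens between a backtick pair can be merged into one underscore-joined name:

```python
import tokenize
from io import StringIO

def clean_backtick_names(source):
    """Merge backtick-quoted spans into single underscore-joined names."""
    result, inside, buffer = [], False, []
    for tok in tokenize.generate_tokens(StringIO(source).readline):
        tokval = tok.string
        if tokval == "`":
            if inside:  # closing backtick: emit the merged name
                result.append("_".join(buffer))
                buffer = []
            inside = not inside
        elif inside:
            buffer.append(tokval)
        else:
            result.append(tokval)
    # Drop empty NEWLINE/ENDMARKER token strings when reassembling.
    return " ".join(t for t in result if t.strip())

expr = clean_backtick_names("`col with space` > 2")
```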

If you really want this, you are of course free to open a new issue addressing it, and if you tag me in the issue, I will explain my solution to the maintainers.
