Data.table: Integration with magrittr

Created on 5 Jul 2015  ·  39 Comments  ·  Source: Rdatatable/data.table

This is a feature request following the discussion on the mailing list.

I think it would be useful to have something like this as a short-hand form:

DT[, a %<>% some.function] 

So far one has to type

DT[, a := a %>% some.function]

or without magrittr

DT[, a := some.function(a)]

This is particularly important if a is replaced with a variable that has a long name, which is then difficult to type and read. I think there are significant savings in (programmer) efficiency to be made here, especially with longish variable names.

All 39 comments

DT[, a := some.function(a)]

Works perfectly fine

But imho

Complicated_data_table_variable_name[, a_very_very_very_very_long_variable_name := some.function(a_very_very_very_very_long_variable_name)]

isn’t perfectly fine. I like the idea of adding this convenience function.

But maybe %:>% would be better than %<>%?

You shouldn't have such strange names in your data set. It is both inconvenient and hardly maintainable. Other than that, you can store the column name in some variable and then do:

shortname <- "a_very_very_very_very_long_variable_name"
DT[, (shortname) := some.function(get(shortname))]

You're right, but even with variables that have intermediate length I still find the magrittr syntax much more convenient to read and write. Anyway, this is just my personal opinion.

I find that it’s sometimes better to have long variable names in complex data sets to make it clear what is saved in a variable. It is a matter of personal preference. Convenience functions are by definition not required to perform a task; they just make it faster to code and often easier to understand. I have no doubt that this function would be of use to many users. But I also understand if the data.table devs don’t want to implement/maintain (too many) convenience functions, you have to draw the line somewhere ;)

For those of you who are subscribed to this thread, please disregard the last comment (now deleted). It was silly.

Building on @and3k's comment, I see some value in:

DT[, a %:=>% some.function]

Think that reads better (i.e. a := and %>% together). It's a 'happy pipe'? I'm a fan of efforts to reduce variable name repetition, as written here: http://stackoverflow.com/a/10758086/403310

The => part of the :=> operator has some extra meaning, maybe :=: ?

DT[, a %:=:% some.function]

or :=. which directly maps as := followed by . passed to fun

What extra meaning does => have? The > is nice because it conveys passing the LHS as an argument to RHS. Which is why Hadley changed from the original %.% to %>%.

My understanding was that a major part of motivation for moving to %>% was that it's much easier to type than %.% (I'm guessing a lot of the times trying to type %.% would accidentally result in %>%).

I mean _greater or equal_ operator.
And what about %:>% ? This would be easier to type than %:=>% or %:=:%.
and3k already mentioned that one above.

My vote is for %:>% or just :>.

The %'s are only there because R doesn't allow infix operators in the wild, right? Might as well keep the operators inside DT[] parsimonious.

Hadn't considered the typing aspect i.e. holding down shift for all characters in the operator is easier I assume. Makes sense.
:> doesn't parse unfortunately. What's inside [...] still has to be valid R syntax (all arguments are parsed always before being passed unevaluated to the function) so we can't make up new operators inside [...], still have to wrap with %'s.
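
The parsing point can be checked without data.table at all, since it only concerns R's parser. A minimal sketch in base R; `parses` is a hypothetical helper written for this illustration, and nothing in the strings is evaluated:

```r
# TRUE if `code` is syntactically valid R, FALSE otherwise (nothing is evaluated)
parses <- function(code) {
  tryCatch({ parse(text = code); TRUE }, error = function(e) FALSE)
}

print(parses("DT[, a := f(a)]"))  # `:=` is a token the R parser already accepts
print(parses("DT[, a %:>% f]"))   # anything wrapped in %...% is a valid infix operator
print(parses("DT[, a :> f]"))     # `:` followed by `>` is a syntax error
```

The first two calls parse and the third does not, which is why a new operator inside `[...]` has to be wrapped in `%`s.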
Ok then %:>% looks good to me as well. Not like it's a huge priority but it wouldn't be hard to implement and good to have discussed.

Thank, %:>% looks good to me.

Just curious, why does :> not parse while := parses inside [....]? := isn't valid R syntax as well, is it?

@my-R-help it is valid syntax, see the question "Why is := allowed as an infix operator?"

+1 I agree with the OP feature request and use of the magrittr syntax. It is the best and most obvious choice for several reasons.

  • Folks are already heavily using DT with magrittr
  • magrittr's syntax is ubiquitous at this point ... might even get its own rstudio hotkey ... and any other syntactical choice will likely end in confusion.

I strongly encourage you to not overthink this FR by introducing a new operator whose choice is just as arbitrary as magrittr's choice was.

I strongly encourage you to not overthink this FR by introducing a new operator whose choice is just as arbitrary as magrittr's choice was.

@ctbrown The proposal is for a pipe operator that does something different from the vanilla %>% from magrittr, a package that also has several other pipe operators. As long as it doesn't conflict with any of those, what's the problem?

I think the OP's FR was sufficiently clear, i.e. to specifically use %<>% as the combined forward-pipe-and-assignment operator. Presumably this is because magrittr already defines %<>% for precisely this purpose. Since magrittr seems to be the dominant pipe implementation and many people seem to be using %<>%, my students and colleagues among them, it does not make sense to introduce another operator for the exact same purpose. It makes much more sense to choose a syntax that aligns with what the community has been exposed to or has already adopted. See my point about ubiquity.

Let me ask you, what do you hope to gain by introducing another operator that performs the exact same function in a different context? I can't see any benefit. Any operator you choose will be as arbitrary as magrittr's. So does it not make sense to make the whole system less arbitrary by simply following magrittr's lead here rather than making still another arbitrary syntactic decision?

Does %<>% assign by reference? If not, then they are in fact _not_ doing the exact same thing.

Technically you are correct, magrittr's %<>% does not assign by reference, but this is beside the point. Within users' expectations, there is no difference. The assignment, whether by reference or by value, is an implementation issue, not an interface one. The OP suggested adopting the magrittr interface and was not necessarily suggesting the implementation. I see the merit in the OP's suggestion. See the reasons above. I do not see the rationale in adopting something like %:>% or anything as arbitrary. The merit of this has not been articulated. The %<>% operator already exists and is actively promoted by magrittr (12th most popular package according to METACRAN). As much as anything this seems to be the standard (within the R community). The nice thing about following established practice is reducing user confusion and the need for comprehensive documentation. You get: "Oh, this is the same as magrittr, I know this: it forward-pipes and does an assignment", instead of "what is this strange %:>%? Is that a new clown emoticon?"

Within users' expectations, there is no difference.

First, I do not think that is true and that you speak for all users; I am a user, for example. Second, if it is true, then these users should learn about the difference as they learn to use data.table.

Your "reasons above" do not hold water with me. It's not somehow going against magrittr to implement a non-overlapping pipe operator to do a distinct but related thing. To me -- and this is just my impression, just as much as everything you've been saying is yours -- this seems perfectly consistent with the "established practice" of magrittr (which I use almost as often as I use data.table).

It's perfectly possible that the use in this context would assign to multiple objects (columns) at once, which surely you would agree is quite distinct from %<>% ..? I mean

DT[, (cols) %:>% lapply(as.character) ]

And, besides modifying by reference and potentially modifying several things at once, we have the fact that we are modifying part of a thing (the data.table), which is quite different from %<>%.
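
For comparison, the multi-column update that this proposed spelling would abbreviate can already be written with `:=`, `.SD` and `.SDcols`. A small sketch, assuming data.table is installed (it is skipped otherwise); the table and the column names in `cols` are made up for illustration:

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- data.table(x = 1:2, y = c(0.5, 1.5), z = c("p", "q"))
  cols <- c("x", "y")
  # Current spelling: apply the function to each column named in `cols`, in place
  DT[, (cols) := lapply(.SD, as.character), .SDcols = cols]
  print(sapply(DT, class))
}
```

The column names appear twice here (once on the LHS, once in `.SDcols`), which is the repetition the shorthand is meant to remove.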

Anyways, since the developers have not shown any sign of doing this task any time soon (by marking this FR with a priority or milestone), how about revisiting this if it actually moves forward?

@ctbrown _by-reference_ is not just different in implementation; it needs to be differentiated from _by-value_ functions in the user interface. That's the whole point of the set* functions and the := operator in data.table: to clearly communicate to users what actually modifies the input of a function. It is hard to judge a standard within the R community after such a short while; R applications have been written for decades, and it is too early to declare a "new standard", which in the end is AFAIK about code formatting (nesting/unnesting), please correct me if I'm wrong. As I have said many times before, I find magrittr pipes really nice for interactive use when I want to present a chunk of code, but not really necessary when writing R packages, where the main focus is functionality. IMO if something can be modified in place, it has to have a different operator than the one that won't modify in place.

@franknarf1,

First, there was no claim as to speaking for ALL R users. That is a ridiculous assertion. The reference to "users' expectations" was to my own, presumably the OP's and several of my students who have already tried dt[ , var %<>% ... ] and asked why it does not work. Further, _forcing_ users to "learn about the difference as they learn to use data.table" is dangerous if DT can work as one might expect. This leads to bad software design.

Second, as you point out, it is up to the individual to accept or reject arguments in favor of following the OP's suggestion of adopting the magrittr syntax. There have been some arguments offered as to why this would be beneficial, but few cogent arguments offered as to why an alternative would be superior or even beneficial. There seems to be a minor argument that the operator should differ because the implementation is different, but that is a rather weak argument.

Additionally,

  • The %<>% could also be implemented to handle multiple arguments and would still make perfect sense within context while adhering to the OP's and others' expectations. The OP's suggestion did not really specify whether he wanted to modify multiple arguments.
  • Whether you are modifying part or whole is a nit -- you are always modifying part of a thing, i.e. an environment.
  • The action or inaction of the developers is another red herring and does not relate to the merits of the OP's proposal / request. This is a rather poor appeal-to-authority argument that has not really offered an opinion either way.

If there are arguments for/against the OP's suggestion, I would love to hear them. But the only thing I have heard is "because it is different". Maybe some will see this as valid, but weighed against the OP's original suggestion, the alternative does not seem better.

Everyone who has ever used data.table has had to learn at some time or another about using := (probably very soon after starting). Oh no, spooky! Why isn't it <- or =?

The answer to this is one of the first things anybody learns about using data.table. It's the topic of the second intro vignette.

%<>% vs. %<:>% (or whatever it may end up being) is exactly the same distinction. So the answer is covered by Matt here:

http://stackoverflow.com/questions/7033106/why-has-data-table-defined-rather-than-overloading

@jangorecki

First, most users do not need to know the difference between by-reference and by-value. It is not a prerequisite of using DT that you know this. Presumably, this is why the DT syntax is so close to DF. @mattdowle could have clearly designed DT with a purely functional interface. He didn't. Presumably, one of the reasons was that DT could function as a drop-in replacement.

With respect to the set* functions, these may indicate by reference, but it is curious to note they were not named set*ByRef which would have been more clear. The functions seem to exist mostly for performing an efficient operation, turning a DF into DT and setting a key. That they can be taken to indicate a by-reference operation seems secondary.

As to :=, I think I recall @mattdowle being asked at useR UCLA why he used := instead of =. IIRC, he said he couldn't use = and := was available. IIRC, he would have preferred using =.

WRT the standard in the R community -- notorious for its lack of standards -- magrittr is as good as it gets: ubiquitously used and discussed. The OP suggests interoperability with it would be a nice feature. I agree. If you have any doubts about this, take a look at its CRAN page. Developers are using magrittr in their own packages. Moreover, package writers are not the majority of R users. But this is really a digression from the topic.

The argument you offer falls under: "DT is different from magrittr since the assignment is by reference so the syntax is different". To which the response is still: the implementation is different, true, but the interface should be the same, since it is effectively the same operation for most users, conforms to user expectation, and its true operation can be inferred from context.

@mattdowle could have clearly designed DT with a purely functional interface. He didn't.

I'm glad he didn't. Locking into "purely functional" simply translates to dropping some important features that users can now use to write faster and more memory-efficient code. I have projects (e.g. anchormodeling) which would basically be impractical in a "purely functional" framework.

@jangorecki
I totally agree. But we are starting to digress from the original proposal to the merits of DT.

@MichaelChirico

Thanks for bringing a sense of enlightenment to the discussion. The references stray a bit from the original proposal, but they help illustrate the points in favor of the OP's proposal, namely:

  • The choice of operators is arbitrary. Matt Dowle tried several before :=. He first tried the obvious candidates that would have aligned more with the standard (<-, <<-); := was just stumbled upon because it was available and the alternatives made for ugly syntax. It was clear that he preferred existing operators to defining a new one.
  • The user will understand from context that the assignment occurs by reference, but it doesn't matter how it occurs.
  • Etc.

First, there was no claim as to speaking for ALL R users. That is a ridiculous assertion. The reference to "users' expectations" was to my own, presumably the OP's and several of my students

It is a counterproductive and distracting rhetorical device, I'd say, to refer to "users" when you really just mean yourself. You may also have noticed that the OP said "%:>% looks good to me."

The action or inaction of the developers is another red herring and does not relate to the merits of the OP's proposal / request. This is a rather poor appeal-to-authority argument that has not really offered an opinion either way.

It is not an appeal to authority since I am not arguing a point there. It is an appeal to you to calm down. This may never even be implemented, so can't you defer the fuss? I imagine it will be a trivial matter to switch the name of the function after it's implemented (if it ever is), and we'll have a better sense of what exact functionality we're looking at at that point.

As far as the substantive arguments go:

  • I think it very unlikely that magrittr is ever going to use %<>% to modify multiple objects on its LHS, like list(x, y) %<>% log, analogous to DT[, c("x", "y") := lapply(.SD, log), .SDcols = c("x","y")].
  • Similarly, yours is the first hint I've seen that %<>% might be used to modify part of an object like x[ 2:4 ] %<>% log, which is analogous to DT[ 2:4, x := log(x) ].

I look forward to seeing your FRs for these features on https://github.com/tidyverse/magrittr/issues and hope they go through, because I would certainly use that functionality.

@franknarf1,

Point taken; I had missed that the OP said that "%:>% looks good to me."

Notwithstanding, it is not just me. The OP suggested the magrittr syntax first. Presumably, he thought it a good idea despite conceding to an alternative later. I had also thought it a good idea; that is what brought me here, and this was prompted by several students who have tried it. Presumably, there are others. Dismissing this as a lone viewpoint is beside the point, anyhow.

Second, the argument was, in fact, an appeal to authority. It may also be "an appeal for me to calm down", though I am perfectly calm. In any event, the point seems off topic; it does not address the merits of the OP's suggestion. Also, the fact that this is very unlikely to be implemented does not seem to be relevant to the merits of the proposal.

I must further concede that you are correct: it will be trivial to change the name of the function once implemented. However, such a change could and likely will break any code that uses the feature. It makes perfect sense to spend time discussing the interface before implementing it rather than burdening users with an incompatible change later. It is unclear what useful purpose shutting down discussion would serve.

As to the substantive arguments, they seem to advocate more for increased functionality in magrittr than to address the proposed %<>% syntax. (On a personal note, I agree with you that the magrittr folks should implement your suggestions, especially the first. I am not sure I would use the second that much.) Regardless of the proposal to the magrittr folks, there is nothing inconsistent in DT adopting your enhancements and using the magrittr %<>% operator. And I have yet to read a cogent argument for the superiority of %:=% (or an alternative) over %<>%.

In an effort to get back on topic, I thought it might be useful to summarize the relevant arguments.

Arguments in favor of %:>%:

  • that a DT pipe-forward-assignment will implement assignment by-reference. This needs to be distinguished in the operators since it makes it clear to the users how assignment is implemented. This may avoid unnecessary confusion.
  • there may be some cases (e.g. multiple assignment, partial assignment) where DT may add additional features that are not (yet?) supported by magrittr. Consequently, it is best to distinguish between the operations.
  • %<>% is as arbitrary as any other choice, %:>% is more DT-like.

Arguments in favor of %<>%:

  • magrittr has already implemented a pipe-forward-assignment operator; it is widely used and, as much as anything, seems to be a standard. Following a standard, however implicit, means the burden on DT developers is reduced in the amount of documentation and explanation necessary to describe the new operator.
  • Users already use magrittr with DT in the RHS of :=. Further, some evidence suggests that users expect/have tried the magrittr %<>% syntax. Adopting it would conform to the expectations of these users.
  • Moving code to-from dplyr/magrittr <--> DT would be simpler and easier, since in some cases the syntax may be similar.
  • An increase in the number of operators leads to additional complexity. In this case the complexity is unneeded, since operators should specify the action and do not necessarily need to specify the implementation -- this is how generic methods work.
  • %:>% is as arbitrary as %<>%.

Moving code to-from dplyr/magrittr <--> DT would be simpler and easier, since in some cases the syntax may be similar.

You're absolutely missing how modification by reference would badly break this kind of ported code.
Objects that you knew would not change will suddenly change because the assignment method has changed. This is really why it is important to distinguish the operators (and to keep the possibility of assigning by value (copy) in the same line of code as well).

That's the same problem as when you copy a data.table vs. a data.frame (dt2 <- dt): suddenly you scratch your head about why your original dt has been updated when you only worked on the second.
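
The copy behaviour described here can be reproduced in a few lines. A minimal sketch, assuming data.table is installed (the data.table part is skipped otherwise); the object names are illustrative only:

```r
# data.frame: modifying the second name leaves the first untouched
df  <- data.frame(x = 1:3)
df2 <- df
df2$x <- df2$x * 10L     # copy-on-modify: df still holds 1, 2, 3

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt  <- data.table(x = 1:3)
  dt2 <- dt              # a second name for the same table, not a copy
  dt2[, x := x * 10L]    # updates the table that dt also refers to
  dt3 <- copy(dt)        # explicit copy
  dt3[, x := 0L]         # dt is unaffected by this one
  print(dt$x)            # reflects the update made through dt2
}
print(df$x)
```

After this runs, `df` is unchanged while `dt` carries the update made through `dt2`; only `copy()` gives an independent table.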

This precaution also invalidates your first point, as it calls for precise documentation of what the operator does; using a different operator will make it easier to find the correct documentation.

@tensibai,

Understood. Thus the "may be" part of the assertion.

Thanks for all your comments and feedback.

Just a minor thing (maybe I'm missing something): Does it in this specific case even make a difference whether it's assigning by reference or by value? What we want to do is to update a column (or several columns) inside the data.table. The user knows that the old column will get overwritten either way. There's no room for misunderstanding, is there? In contrast, this is very different than DTa=DTb vs. DTa=copy(DTb) (which I'm not talking about in this feature request), where we're dealing with the data.table itself, and where it does make a difference whether we assign by reference or by value.

@my-R-help,

Your intuition is correct.

It does not make a difference to the user how this operation is _implemented_. From users' perspective the results are the same -- values in the column are reassigned. There have been some arguments stating that there should be some differentiation, but there hasn't been a cogent explanation as to _why_.

Your proposal of adopting the %<>% syntax is a sound one. It correctly assumes the implementation is distinct from the _interface_, and since there is a popular and extant practice for performing the operation, it should be adopted. This, in fact, follows good software design practice.

(As a side note, I was a little disheartened when you stated, "Thank (SIC), %:>% looks good to me." and did not advocate more forcefully for your initial intuition and proposal. In any event, thanks for the proposal. It is brilliant whether it is implemented in DT or not.)

Thanks for your reply, Christopher.

Just to clarify, I personally have a preference for %<>% because it's more consistent with magrittr and a lot of people seem to be using it. However, if the data.table devs prefer another operator (e.g. %:>%), I can also live with that (although I personally prefer the magrittr way).

Maybe I should have phrased it that way. Sorry if it caused any confusion.

The user knows that the old column will get overwritten either way. There's no room for misunderstanding, is there?

It does not make a difference to the user how this operation is implemented. From users' perspective the results are the same -- values in the column are reassigned.

I still feel there's room for foot-gun with joins.

Having two operators with the same name whose side effects differ slightly is error prone and _will_ lead to confusion. I can't argue better than that, but there's a reason why R warns you when a package masks a base function or when loading a package masks another package's function.

In my opinion, it does make a difference for at least some users to have specific operators when the side effects will be different.

_Bonus_ searching for the operator you'll end up on the DT page explaining its caveats/limitations with no doubt instead of having two choices in the help.

Here we're talking about a language, not a user interface. While I agree that the end user of a piece of software shouldn't care about the implementation behind X button, I highly disagree a programmer should not care about the implementation behind a function.

Major objection: someone who thinks %<>% will behave the same as it does outside a DT will go crazy when it overwrites their DT columns unintentionally.

TL;DR: Programming is not a UX, you have to be specific about what you want, hence reusing well-known names should not happen.

@Tensibai,

The claim that the side effects are somehow different is dubious. In each case, a variable reassignment is being performed. They are both side effects. The implementation (by-ref or by-value) doesn't truly distinguish these, since the end states of both systems have changed in analogous ways.

Even if the side effects are different, the distinction is rather unimportant. This point has been raised repeatedly in the above discussion. If the distinction were important, it should be possible to provide an example where it would make a difference to the user. The lack of a counterexample, while not conclusive, is a strong indication that there is no distinction.

With respect to:

you have to be specific about what you want, hence reusing well-known names should not happen.

This is just wrong. Reuse of common, well-known names not only should happen, it is very common and is considered good programming practice. This is called polymorphism. It is perfectly acceptable to have methods with the same name that are implemented differently:

person.speak()
"hello, world"
dog.speak()
"woof"

speak was used in each case. Is this bad practice? No. In fact, if polymorphism were not adopted it would be a disaster; every function and method would need its own name. While this is a fairly generic example based on OO languages, R is no different. R has S3 methods and generic functions that work in similar ways.
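
The same speak example can be written with R's S3 dispatch. A minimal sketch in base R; the classes `person` and `dog` and the objects `alice` and `rex` are made up for illustration:

```r
# One generic, two methods: the caller uses the same name for both classes
speak <- function(x, ...) UseMethod("speak")
speak.person <- function(x, ...) "hello, world"
speak.dog    <- function(x, ...) "woof"

alice <- structure(list(name = "Alice"), class = "person")
rex   <- structure(list(name = "Rex"), class = "dog")

print(speak(alice))  # dispatches to speak.person
print(speak(rex))    # dispatches to speak.dog
```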

The suggestion that:

I highly disagree a programmer should not care about the implementation behind a function.

is similarly flawed and runs counter to most users' experience. Most programmers probably use hundreds of functions/methods. They do so without knowing their implementation details. The user _does_ need to know the input and the output/side effects for the functions to be useful, but how it gets there is most often irrelevant. Granted, users sometimes need to know details in order to tweak or debug a function, but it can be argued that this is the vast minority of cases. Consider a world where users had to know how each and every function worked at all levels. The cognitive load would be immense; programming anything of complexity would be an impossible task.

With respect to:

Bonus searching for the operator you'll end up on the DT page explaining its caveats/limitations with no doubt instead of having two choices in the help.

This is not a _Bonus_ but a liability, by a) introducing confusion (how is this different from the very popular magrittr package, exactly?) and b) creating the need for additional, unneeded documentation in the first place. If the magrittr syntax is adopted, DT devs can say: "go there and read their docs and vignette; DT supports what they are doing there." This cooperation and cross-package borrowing raises the value of DT, magrittr and the R ecosystem.

Lastly, it might be inferred from the comments about "user interface", "X button" and "UX" that there was a specific UI implied. That is simply not the case. And, while it is abundantly clear we are speaking about a language, it is erroneous to say that the language lacks an interface. The interface is its syntax and it is important.

To summarize, so the issue can be eventually resolved.

All we need is to handle the following translation.

DT[, a %<:>% fun] ## or "%:>%"

DT[, a := fun(a)]

Is that right?
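
For the simplest case, the translation asked about here could be sketched as a plain rewrite of the unevaluated call. This is only an illustration under stated assumptions, not how data.table does or would implement it: `rewrite_pipe_assign` is a hypothetical helper, `%:>%` is just one of the spellings floated in this thread, and only a bare symbol on the LHS is handled (the character and multi-column cases are exactly the open questions that follow). The data.table part is skipped if the package is not installed.

```r
# Hypothetical helper: turn the unevaluated call `a %:>% fun`
# into the call `a := fun(a)`. Only a bare symbol LHS is handled.
rewrite_pipe_assign <- function(expr) {
  stopifnot(is.call(expr), identical(expr[[1]], as.name("%:>%")))
  lhs <- expr[[2]]
  rhs <- expr[[3]]
  if (!is.name(lhs)) stop("only a bare column symbol is supported in this sketch")
  # Build the call `:=`(lhs, rhs(lhs)) without evaluating anything
  call(":=", lhs, as.call(list(rhs, lhs)))
}

j <- rewrite_pipe_assign(quote(a %:>% sqrt))
print(identical(j, quote(a := sqrt(a))))

# Optional: splice the rewritten call into a real data.table query
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- data.table(a = c(1, 4, 9))
  eval(bquote(DT[, .(j)]))   # same as DT[, a := sqrt(a)]
  print(DT$a)
}
```

The rewrite itself is short; the design questions are in what to do when the LHS is not a single symbol.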

How should it behave if a is not a symbol but a character variable?

DT[, "a" %<:>% fun]

DT[, "a" := fun(a)]   ## this?
DT[, "a" := fun("a")] ## or this?

What if its length is not 1?

DT[, c("a","b") %<:>% fun]

DT[, c("a","b") := fun(a, b)]
DT[, c("a","b") := fun("a","b")]
DT[, c("a","b") := lapply(list(a, b), fun)]
DT[, c("a","b") := lapply(c("a", "b"), fun)]

Personally speaking, I would close it as "won't fix" because it adds quite a lot of complexity and does not solve any new problem.
I see agreement on that, thus closing; we can always re-open if really needed.
