Data.table: Vignettes

Created on 11 Nov 2014  ·  54 Comments  ·  Source: Rdatatable/data.table

HTML vignette series:

Planned for v1.9.8

  • [ ] Quick tour of data.table
  • [x] [Keys and fast binary search based subset](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-keys-fast-subset.html)
  • [x] [Secondary indices and auto indexing](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-secondary-indices-and-auto-indexing.html)
  • [ ] Joins vignette. a) _joins_ vs _subsets_ -- extending binary search based subset to joins + conditional / non-equi joins, rolling and interval joins. b) by=.EACHI, join + update feature. c) Document i.col usage as filed in #1038. d) Also cover about performance/advantages from #1232.
  • [ ] Cover get() and mget(). E.g., http://stackoverflow.com/q/33785747/559784 (covered in #4304)
  • [ ] Add about on= argument rationale in FAQ (#1623).
  • [ ] FAQ 5.3 needs to mention that it's a _shallow_ copy that's done in order to restore over-allocation. Thanks to Jan for linking it in #1729.

Future releases

  • [ ] data.table internals, performance aspects and _expressiveness_
  • [ ] Reading multiple files (fread + rbindlist), ordering, ranking and set operations
  • [ ] IDateTime vignette
  • [ ] Document the difference between data.table() and data.frame() somewhere - relevant issues: #968, #877. Perhaps slightly more in detail in the FAQ.
  • [ ] coursera FAQ
  • [ ] Advanced data.table usage:

    • [ ] NSE

    • [ ] ...

  • [ ] Timings vignette (moving #520 here to get everything in one place, but not sure if we need it as a vignette since we've the Wiki with benchmarks/timings).
  • [ ] fread+fwrite vignette, include also Convenience features of fread wiki, also https://github.com/Rdatatable/data.table/issues/2855

Finished:

  • [x] [Introduction to data.table](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html) - data.table syntax, general form, subset rows in i, select / do in j and aggregations using by.
  • [x] [Reference Semantics](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reference-semantics.html) (_add/update/delete_ columns by reference, and see that we can combine with i and by in the same way as before)
  • [x] [Efficient reshaping using data.tables](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reshape.html)
  • [x] Link to this answer on SO on by=.EACHI until the vignette is done.

Minor:

  • [ ] Operations using integer64, and promoting it for _large integers_.

Notes (to update current vignettes based on feedback): Please let me know if I missed anything.

Introduction to data.table:

  • [x] order in i.
  • [x] Explain how to name columns in j while selecting/computing.
  • [x] Emphasise that _keyby_ is applied to the computed result, not to the original data.table.
  • [x] Mention new updates to .SDcols and cols in with=FALSE being able to select columns as colA:colB.

    Reference semantics:

  • [ ] Also explain all other relevant set* functions here.. (setnames, setcolorder etc..)

  • [ ] Mainly set.
  • [x] Explain in 1b) that the := operator section is just defining the ways to use it -- the example there doesn't run; it just shows the two different ways of using it -- following this comment.

    Keys and fast binary search based subsets:

  • [ ] Add an example of subset using integer/double keys.

  • [ ] Difference in "nomatch" default in binary search based subsets.
  • [ ] replacing NAs with binary search based subsets possible?

    FAQ (most appropriate here, I think).

  • [x] Update FAQ with issue on external pointer being NULL when reading an R object from file, for example, using readRDS(). Update this SO post.

  • [ ] Explain, with an example, over-allocating the data.table using alloc.col(): when to use it (when you need to create multiple columns by reference) and why. Update this SO post.
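As a sketch for that FAQ entry (toy data; the exact number of reserved slots depends on the `datatable.alloccol` option and version), over-allocation can be inspected with `truelength()` and extended with `alloc.col()`:

```r
library(data.table)

dt <- data.table(x = 1:5)
length(dt)       # 1 column in use
truelength(dt)   # more column-pointer slots are reserved than used

# reserve extra slots up front before adding many columns by reference
alloc.col(dt, 2048L)
truelength(dt)   # grown to accommodate the request
```

The spare slots are what allow `:=` to add columns without copying the whole table.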

All 54 comments

fread is at least worth a mention.
The points above relate mainly to data transformation; fread is more about data extraction, so it might be skipped in such a vignette, yet IMO such data.table capabilities are worth mentioning.

edit: which one are you going to use: Rnw or Rmd?

Agreed, and updated.

I'm curious about what makes a cold by faster than, say, tapply. One part of the answer is GForce, but what about user-written functions? I could not find anything about this. There's a nice post about pandas: http://wesmckinney.com/blog/?p=489
One could even compare it with sapply. For instance, suppose I start from a list of vectors. Is it ever worth appending all the vectors into one column of a data.table and using by instead of sapply?
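For reference, a minimal comparison (toy data, made-up names) of a grouped mean via data.table's by against base tapply; the by form with a simple j expression like mean(v) is the kind eligible for GForce optimisation:

```r
library(data.table)

set.seed(1)
dt <- data.table(g = sample(letters[1:3], 100, replace = TRUE),
                 v = rnorm(100))

# grouped mean with by (sorted output via keyby)
res_by <- dt[, .(mean_v = mean(v)), keyby = g]

# the same computation with base R's tapply
res_tapply <- tapply(dt$v, dt$g, mean)
```

Both return the per-group means in alphabetical group order, so the results line up one-to-one.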

@matthieugomez interesting question! Would be nice to cover this as well. Keep'em coming :-).

I would be interested to learn about IDateTime and some of the use cases for it.

@gsee updated.

Being new to R and data.table (since March), I would say that there needs to be a basic outcome-oriented introduction as opposed to the current function-oriented one. In other words, it is one thing to read what each parameter in data.table does, but they often make little sense without having a use-case in mind. While there are examples of output, many people need to go the other direction. That is, they know what output they need, but they don't know what function/parameter/setting is most appropriate to use. It would be helpful to have a simple recipe approach to get them started.

How do I create subsets of my data?
How do I do an operation on subsets of my data to create a new or updated data set?
How do I add a new column?
How do I delete a column?
How do I create a single variable?
How do I create multiple variables?
How do I do different operations on different subsets of my data? (.BY)
How do I use data.table in a function and pass in data.table names and columns on which to operate?
How do I do multiple sequential operations on the same data.table?
Can I select a subset of data and do an operation on it at the same time?
When do I need to be careful about creating/updating variables by reference?
How do I select one observation per group (first, last)?
How do I set a key and how is it different from setting an index?
Under what conditions does my key get deleted when I do an operation on my data.table?
Can I just use the regular "merge" syntax or do I need to use data.table syntax (Y[X])?
How do I collapse a list of lists into one big data.table? What if the columns are in different order?

There are probably a ton of other items all on SO that could be edited into a simple compilation of questions and answers.

@markdanese Thanks for your suggestions. These are all great to have, but probably as a separate wiki, as they're very specific to certain tasks. The objective of the vignette is to introduce the data.table syntax, illustrating how flexible and powerful it can be, so that you are able to do these tasks yourself.

I'm writing the vignettes now (as fast as I can), and the format is more or less in this fashion (Q&A) and explaining the answer with an example. Once I've the first vignette polished, I intend to post it here to get some feedback.. It'd be great to know what you think as well.

Thanks again.

Further extension of the idea of the wiki page: The FAQ and Code Fragments (Advanced) links listed on http://www.ats.ucla.edu/stat/r/ might be a useful resource for contrasting traditional tasks in R with data.table way. I did something like this in a blog post (http://vijaylulla.com/wp/2014/11/12/grouping-in-r-using-data-table/) to show it to my colleague. Sorry for the shameless self promotion.

I've finished the _Introduction to data.table_ vignette (see link on top). It'd be great to know what you think.

Thanks to @jangorecki and @brodieG for great feedback, and of course @mattdowle :-).

This is really great. Wish it existed a year ago when I started using data.table. A couple of small things below for your consideration:
You might want to mention that you can sort (via order) in i in your summary at the end. You might also want to mention this at the beginning. You could also mention that there are more sophisticated joins that can be done in i involving keys that are not covered. That allows you to mention the main functions of i so the reader can look for more advanced functions if they need them. And you can hyperlink to them later.

In the .SD section you write "that group" but it might be more clear to say "that group defined using by". This is also done a little later as well.

I might have missed it, but it would be good to be a little more clear that .SD with by essentially limits the data to the .SD columns and then creates a set of data.tables for each unique combination of the variables in the by. It then processes these data.tables in the order of the by variables using the function(s) from j. You could even mention that there are special symbols that allow users to access some of the indexes generated as part of that processing, but that these are beyond the scope of the introduction vignette.

Again, these are just suggestions. Your hard work (and Matt's) is greatly appreciated.

On Sat, Jan 17, 2015 at 6:22 PM, Mark Danese [email protected]
wrote:

This is really great. Wish it existed a year ago when I started using
data.table. A couple of small things below for your consideration:

Thank you.

You might want to mention that you can sort (via order) in i in your
summary at the end. You could also want to mention this at the beginning.

Oh snap! Great point. I should add "order(..)" at the very beginning, and
will add to summary as well.

You could also mention that there are more sophisticated joins that can be
done in i involving keys that are not covered. That allows you to mention
the main functions of i so the reader can look for more advanced
functions if they need them.

Right, will do.

And you can hyperlink to them later.

That, I'm not sure.. as these are meant to be pushed to CRAN, and as well
on the WIKI..

In the .SD section you write "that group" but it might be more clear to
say "that group defined using by". This is also done a little later as
well.

I thought I edited it out to "the current group", but apparently not.. "by
the current group, defined using by" - how does that sound?

I might have missed it, but it would be good to be a little more clear
that .SD with by essentially limits the data to the .SD columns and then
creates a set of data.tables for each unique combination of the variables
in the by. It then processes these data.tables in the order of the by
variables using the function(s) from j.

I think you missed it. It is right underneath the block quote where .SD is
explained (in section 2e). And it explains exactly what you mention here...

You could even mention that there are special symbols that allow users to
access some of the indexes generated as part of that processing, but that
these are beyond the scope of the introduction vignette.

Right.. that's the reason for not introducing other special symbols.

Again, these are just suggestions. Your hard work (and Matt's) is greatly
appreciated.

Great suggestions. I'll write back once I've the other vignettes uploaded.



That, I'm not sure.. as these are meant to be pushed to CRAN, and as well
on the WIKI..

AFAIK when you push a package to CRAN with Rmd files in the vignettes directory, they will be built automatically to check that the vignette build succeeds, but the package source on CRAN will contain the vignettes (html) already built by you, not the ones from the CRAN build/check.
CRAN is a good place for vignettes, as for many users it is the first place to look for docs/tutorials, so I think it is worth having them on CRAN.

And you can hyperlink to them later.

That, I'm not sure.. as these are meant to be pushed to CRAN, and as well on the WIKI..

Don't same-folder links work on CRAN? I haven't actually put anything up there, but this vignette links to several others in the same folder using relative links, and they work fine from R (obviously the link above is not from R, but if you install the package and run the vignettes, the links work).

Updated with _Reference Semantics_ vignette.

thanks again for doing all of this.

just one other suggestion on something to cover on a vignette -- using data.table inside your own function. not writing a package, but just trying to automate some common tasks. there are some tricks that I have not quite figured out. also if there is a post somewhere on this topic, a link would be appreciated.

finally, a vignette listing "useful" stack overflow posts might be helpful for topics you don't want to include in a vignette.

just some random thoughts.

Two thoughts :

  • Link to the vignettes in the wiki
  • In the reference semantics vignette, add how to use := with a quoted list expression (or just a quoted assignment). Maybe this deserves its own vignette, as the NSE (non standard evaluation) in data.table eases interactive use but requires that for using data.table in your own function or package you should know something about quote, eval, substitute and friends. Maybe just add something like dt[, do.call(":=", eval(my_quoted_list))] to the vignette and then create a vignette on NSE and its implications?
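As a sketch of the quoted-expression idea (the names here are made up), one pattern that works today is building the := call with quote() and eval()ing it in j, where the table's columns are visible:

```r
library(data.table)

dt <- data.table(x = 1:3)

# build the assignment ahead of time as an unevaluated call...
e <- quote(`:=`(y, x * 2L))

# ...then evaluate it inside [.data.table, where x is in scope
dt[, eval(e)]

dt$y  # 2 4 6
```

This is handy when a wrapper function needs to construct the assignment programmatically before handing it to data.table.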

Thanks.

  1. Have you seen this?
  2. That'll most likely be covered in a separate vignette. But no plans yet.

@arunsrinivasan Nope, I hadn't seen that, great! Another bookmark

Updated with _Keys and fast binary search based subset_ vignette.

Very nice. I love these vignettes. Just some quick comments for consideration.

What is the purpose of taking over row names if they are not used? Or are they used by the special operators in j (like .N, .I, etc.)? I think they are used by data.table, but just not as indices. I have always been confused by the purpose of forcing the numbered row names.

Why use unique in the first key when accessing only the second? If you don't, you get a lot of repeated rows in the output, right? Maybe obvious, but it might be helpful to say/show what happens if you don't.

Do all keys need to be quoted? Even numeric (integer) ones? Can you use a numeric as a key? Any things to watch out for?

What if your key column has NA in it? Can you search for those and replace them (as you did in your example where you replaced 24 with 0)?

It might help to explain that keyby applies to the _output_ data.table (ans in your example) and not the input data.table (flights in your example).

Can you pass a vector to the key? In other words, can you create airport <- c("LGA", "JFK", "EWR") and use airport directly in i in your example near the bottom? This might help set up the idea of passing a different data.table in for a merge.

Typo on "corresponding" ("correspondong"). One of the back ticks is missing in the vector scan section where you write "The row indices corresponding to origin == "LGA" anddest == “TPA”` are obtained using key based subset."

@markdanese regarding the

Why use unique in the first key when accessing only the second?

flights[.(unique(origin), "MIA")]

Not sure if you were asking for a better explanation or were just not aware of the more complex usage of a multiple-column key.
You cannot simply use binary search on dest when your key is c(origin, dest); you would need c(dest, origin) to binary search on dest alone. Using .(unique(origin), "MIA") still uses binary search, by providing all available values for the first column of the key and then the selective values for the second.
I've made an extension to use only selected columns from the key; looking at the simple example may also help you to understand. My extension is not ready to be a PR against data.table master, as the memory usage does not scale as well as it could if it were developed using internal data.table functions / combined with data.table's secondary key.
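A toy reproduction of the pattern Jan describes (not the flights data):

```r
library(data.table)

dt <- data.table(origin = c("JFK", "JFK", "LGA", "EWR"),
                 dest   = c("MIA", "TPA", "MIA", "MIA"))
setkey(dt, origin, dest)   # key is c(origin, dest)

# binary search on dest alone isn't possible with this key order;
# supplying all origin values makes the whole lookup key-based
res <- dt[.(unique(origin), "MIA")]
res$dest   # "MIA" for every matched row
```

Here every origin has a "MIA" row, so no NA rows appear; with the default nomatch, origins lacking "MIA" would produce NA rows.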

Can you use a numeric as a key?

You can use a numeric column as a key; it is mentioned in the Keys and their properties section.

Any things to watch out for?

Not sure, but setNumericRounding affects numeric keys; it might be worth mentioning in the vignette.

What if your key column has NA in it? Can you search for those and replace

Yes, is.na() is optimised to use binary search. Try data.table(a=c(1,NA_real_),b=c("a","b"),key="a")[.(NA_real_), .SD ,verbose=TRUE]

Also to @arunsrinivasan, the typo in:

find the matching vlaues in

Thanks Jan -- that is really helpful. I offered those questions as things that could briefly be mentioned in the vignette to help new users understand what is going on. They were things that came to mind (as a fairly new user) while reading the documentation. I can't really contribute to the code, so I am hoping to contribute by helping with the documentation.

On Fri, Jan 23, 2015 at 8:48 PM, Mark Danese [email protected]
wrote:

Very nice. I love these vignettes. Just some quick comments for
consideration.

What is the purpose of taking over row names if they are not used? Or are
they used by the special operators in j (like .N, .I, etc.)? I think they
are used by data.table, but just not as indices. I have always been
confused by the purpose of forcing the numbered row names.

Section 1a, just above Keys and their properties, has the answer to this.
data.tables _inherit_ from data.frames.

Why use unique in the first key when accessing only the second? If you
don't, you get a lot of repeated rows in the output, right? Maybe obvious,
but it might be helpful to say/show what happens if you don't.

Again, this is explained exactly underneath in "what's happening here?". I
even refer to the previous section where I lay the groundwork for
explaining this one.

Do all keys need to be quoted? Even numeric (integer) ones? Can you use a
numeric as a key? Any things to watch out for?

There's an example with integer columns on 2d. I thought that was
sufficient?

What if your key column has NA in it? Can you search for those and replace
them (as you did in your example where your replaced 24 with 0?

Good point. That's a difference with vector scan. Will try to add this.

It might help to explain that keyby applies to the _output_ data.table (
ans in your example) and not the input data.table (flights in your
example).

'keyby' was already discussed in the first vignette. But I'll see if this
can be added.

Can you pass a vector to the key? In other words, can you create airport
<- c("LGA", "JFK", "EWR") and use airport directly in i in your example near
the bottom? This might help set up the idea of passing a different
data.table in for a merge.

Content for next section. That is how we transition into joins.

Typo on "corresponding" ("correspondong"). One of the back ticks is
missing in the vector scan section where you writing "The row indices
corresponding to origin == "LGA" anddest == “TPA”` are obtained using key
based subset."

Thanks.



Great work on these vignettes!
My comments may be late or already covered:

  • I would like to see a variety of ways / examples of using dynamic rows and columns.
  • More extensive comparison on merge and joins.
  • [ ] Different / richer ways to use set. Also, it would be nice to see an explanation of why the following gives an error (see here):
for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))

Excellent functionality and vignette! Thanks Arun

On Tue, Jun 23, 2015, 21:02 Arun [email protected] wrote:

Added the Reshape vignette (https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reshape.html) to the Wiki (https://github.com/Rdatatable/data.table/wiki/Getting-started).



A man page for patterns would be good. Great vignette

Isn't reshape2 required to be loaded to use these commands? If so, then that should be mentioned. I really like the focus on "wide to long" and "long to wide". I absolutely hate the syntax of reshape2 (for example, I think "make_wide" is much clearer than "dcast"). For this reason, I would not write the section headers as "melting data.tables" and "casting data.tables". That only makes sense for people who are familiar with the reshape2 package. I might begin with headers that are more universal, as above ("long to wide").
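For readers new to the melt/cast vocabulary, a tiny sketch (made-up data) of the two directions:

```r
library(data.table)

wide <- data.table(id = 1:2, x_2014 = c(10, 20), x_2015 = c(30, 40))

# wide to long ("melting"): one row per id/year combination
long <- melt(wide, id.vars = "id",
             variable.name = "year", value.name = "x")

# long to wide ("casting"): back to one column per year
wide2 <- dcast(long, id ~ year, value.var = "x")
```

From v1.9.6 onwards these are data.table's own melt/dcast methods, so loading reshape2 separately is not needed.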

For what it is worth, I can't get the first line of the vignette to run using a fresh R session with just data.table loaded. I have no idea why (maybe mode should be "w" and not "wb"), but
DT = fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/data/melt_default.csv")
returns
Error in download.file(input, tt, mode = "wb") : unsupported URL scheme

As always, thanks for doing this. It is really useful.

@markdanese thanks for the excellent feedback.

  1. reshape2 won't be required from data.table v1.9.6. Updated this in the vignette as well.
  2. Added 'wide to long' and 'long to wide' to titles, and other places to avoid confusion to people who are new to this topic.
  3. https functionality in fread is implemented in the devel version. So you won't be able to run that code yet with v1.9.4. Either update, or wait a bit :-).

Thanks for your encouragement.

@jangorecki patterns() won't be exported. The usage will be expanded for [.data.table to be used for selecting columns, :=, .SDcols etc..

@arunsrinivasan still, a manual page for patterns may help, the same way there is one for :=. Just because many people (I think) use ?fun to understand the code they read.

In the _join_ vignette it may be worth adding the corresponding SQL versions of data.table joins, so it can be easier to pick up for database folks.
Examples of corresponding SQL statements can be found, for example, in the SO question How to join (merge) data frames (inner, outer, left, right)?.

Would also be cool to have some "Refugees" vignettes --

  • data.table for Stata users
  • data.table for SQL users
  • data.table for Matlab users
  • data.table for Python/pandas users
  • even data.table for dplyr users

etc. Like a quick-start guide, but oriented towards émigrés.

Added Secondary indices and auto indexing vignette. This should allow smooth transition from subsets to joins for the next vignette I'll work on.

@arunsrinivasan isn't it more appropriate not to use _secondary_ in relation to _indices_? It was used for _keys_, where it was important. Now it seems redundant once we switch to the _index_ naming.

@jangorecki I think "secondary" is useful for its relation to keys (primary), perhaps:

Secondary sorting

Is a better description?

but the word _index_ has already been used, and it looks nicer than _secondary sorting_ :)

So you would just name it "auto indexing"? IMO "secondary sorting and auto indexing" feels more informative

_auto_ can be somewhat misleading, as indexes should work both for _auto_-created indexes and for manually created ones; #1422 addresses a current limitation in that matter.

I see. I'm still missing your preferred alternative -- just "Indices"?

not perfect but preferred over _secondary indices_

I like this latest vignette a lot. My only thought was that it might be helpful to mention what types of operations cause the index to be dropped. From my testing, it seems pretty much anything that changes the number of rows, or any operation involving the indexed column.
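A quick check along those lines (toy data; the behaviour matches the testing described above):

```r
library(data.table)

dt <- data.table(a = c(3, 1, 2), b = letters[1:3])
setindex(dt, a)
indices(dt)        # "a"

# updating the indexed column by reference drops the index
dt[, a := a + 1]
indices(dt)        # NULL
```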

I thought the examples of "on" were really helpful.

@markdanese good point, will add.

Thank you for the updated vignettes with the release of v1.9.8.
The "Reference semantics" refers to the copy() function and its new capabilities to make shallow copies (especially inside functions, something that I am really interested in):

"However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette."

But the design vignette is missing and the link points to an old issue. The reference manual does not provide more information on copy() than the one provided in the vignette. The rest of the vignettes do not provide any information on copy.

Will this vignette become available soon?

+1 for internals vignette. I (and I guess a few others) am quite interested in contributing a bit on the C side of things, but am a bit intimidated by the (as it stands) 35k lines of C code... quite the learning curve to 'go it alone' -- an intro to internals could do wonders!

Wanted to chime in and ask if contributions to the vignette are accepted from non-code contributors (like me). I am particularly interested in contributing to the joins vignette as I had quite a bit of trouble with it initially and was guided to solutions from Arun's answers on Stackoverflow, and I'd like some guidance on how to do so, if allowed.

@arunsrinivasan I see that you have a point for an IDateTime vignette. Perhaps it could be included in the more general vignette suggested by @jangorecki: vignettes: timeseries - ordered observations?

In addition, I am preparing a first draft on some of the topics suggested by jan. Perhaps parts of it may be relevant for a join vignette as well? I'm happy to share if anyone may find it useful.

@zeomal such a contribution would be highly valuable and much appreciated!

@MichaelChirico, thank you. @Henrik-P, will your brief on normal joins be comprehensive - i.e. will your focus be more on timeseries? If not, I can start work on it - I haven't used rolling joins yet, so no knowledge there. :)

@zeomal Hopefully I will be able to upload the first draft soon, so you can have a look at it. In my draft, I provide a simple example of a "normal" join on a single variable, time, where there are non-matching rows. I use nomatch = NA. (maaaybe also a quick example with nomatch = NULL)

My idea was that this simple join could provide a context and a feeling for the problem, which I then treat more thoroughly in the following sections on rolling and non-equi joins et al.

Thanks a lot for your willingness to contribute!

I have a question about joining by reference while preparing the vignettes. X[Y, new_col := old_col] performs something similar to a traditional left join on X. However, if there are multiple matches for Y's keys in X, only the last (or first?) matching value is retained. Is this explicitly documented somewhere? I had tried searching for this back when I encountered it, but had to resort to my understanding of updating by reference for the reason. For a reproducible example,

> X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
> Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))
> X[Y, new_col := i.n, on = "a == b"]
   a m new_col
1: 1 a       y
2: 2 b    <NA>
3: 3 c    <NA>

# an ideal left join - expected behaviour per a new user, given below
# not possible because updating rows by reference isn't implemented
   a m new_col
1: 1 a       x
2: 1 a       y
3: 2 b    <NA>
4: 3 c    <NA>

This is expected behaviour, but isn't exactly straightforward for a new user. mult does not impact the output either. Any suggestions on how to document this? Add merge as a workaround for a proper left join?
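The merge() workaround on the same X and Y; all.x = TRUE keeps both matching rows for a == 1, which is the "ideal" left-join output sketched above:

```r
library(data.table)

X <- data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
Y <- data.table(b = c(1, 1, 4), n = c("x", "y", "z"))

# a proper left join: non-matching rows of X kept, duplicate matches expanded
res <- merge(X, Y, by.x = "a", by.y = "b", all.x = TRUE)
res
#    a m    n
# 1: 1 a    x
# 2: 1 a    y
# 3: 2 b <NA>
# 4: 3 c <NA>
```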

@zeomal please post future questions about the join vignette in issue #2181 instead; it seems a better place. It is documented in ?set.

@zeomal If you wish to check how brief my treatment on normal (equi) joins is, I just want to let you know that I posted a PR on a timeseries vignette.
