Pandas: Request for DataFrame.to_tsv() for writing tab delimited text

Created on 11 Jun 2015  ·  16 Comments  ·  Source: pandas-dev/pandas

I propose a function, which can be called on a DataFrame, named to_tsv or to_table. The function is the equivalent of to_csv() with the argument sep='\t'. While to_csv() contains the functionality to write tsv files, I find it annoying to always have to specify an additional argument. I prefer tsv files to csv files because tabs occur more rarely in data and therefore decrease the need for escaping. I also find the plain-text rendering more readable. I worry that the lack of a dedicated to_tsv() function encourages the use of csv over tsv. Currently read_table() defaults to tab separators, but there is no equivalent function for writing.
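For readers skimming the thread, here is a minimal sketch of the asymmetry described above, assuming a reasonably recent pandas install. The column names and values are made up for illustration; only to_csv, read_table, and the sep argument come from the discussion.

```python
import io

import pandas as pd

# Illustrative data only; not taken from the issue.
df = pd.DataFrame({"gene": ["BRCA1", "TP53"], "count": [12, 7]})

# Writing: to_csv already handles tabs, but only when sep is passed explicitly.
# With no path given, to_csv returns the text as a string.
tsv_text = df.to_csv(sep="\t", index=False)

# Reading: read_table defaults to a tab separator, so no extra argument is needed.
round_trip = pd.read_table(io.StringIO(tsv_text))

print(tsv_text)
print(round_trip)
```

The read side needs no extra argument, while the write side does, which is the gap the request is about.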

IO CSV

All 16 comments

In addition to being just to_csv(sep='\t'), a to_tsv function should consider changing the default quoting, since quoting is less necessary for tsv files.

The pandas API is already cluttered with an excess of rarely used convenience methods. I really don't think adding another one is a good idea.

I agree with @shoyer here. All functionality is there to do this within to_csv, and given we already have many methods, I think the reason to add a new one should be stronger than being able to provide other defaults.

I am closing this (we have too many open issues ..), but discussion can certainly continue if needed.

+1. As a practitioner, I would highly appreciate to_tab sugar.

I think the reason to add a new one should be stronger than being able to provide other defaults.

IMO convenience is worthy justification (people tend to write many text files, so to_csv has to constantly be supplemented with parameters).

However, my main motivation is a disdain for the CSV format. It pains me to see people still using CSV over TSV. Obviously Excel/database support has a role to play. But a project like pandas should strive to make the best practices the easiest to implement.

Though this is not a major issue for me currently, CSV is US/Commonwealth representation-centric and internationally unaware. Given the Pythonic philosophy of embracing UTF and internationalization, tab-separated should be preferred over CSV / semicolon-separated values.

While I can understand the sentiment put forward by @shoyer, I agree with @dhimmel. It is my experience that TSV is much more of a standard format for data analysis than CSV. There are many use cases where the TSV format is a requirement, whereas I am not familiar with any for the CSV format (there are a couple of examples of common usages here). TSV also has an advantage in that the raw text is easily readable, and it avoids the issues with quoting mentioned by @dhimmel.

I am only slightly opposed to adding to_tsv. In my experience (in the US) CSV is more common than TSV (at least as the name for the file format), but only slightly. The main virtue to_tsv has going for it is that the name makes it instantly clear what it does.

CSV and TSV are both well supported and widely used in data science. CSV is more of a legacy format, thus many backwards-focused projects default to CSV. However, I think forwards-focused projects should default to TSV, as it's better for data science. Since there is no default to_text_delimited_file output function in pandas, to_csv is the de facto default. Since most users don't care enough to manually specify sep='\t', pandas is contributing to the prevalence of CSVs over TSVs and delaying the rise of the superior format.

Please excuse my ignorance on the matter, but apart from being easier to read as a human (if and only if the column headers have roughly the same number of characters as their corresponding data, which is not always the case), what advantages does TSV provide over CSV? Honestly curious if there is a performance difference between the two. I use TSV right now, but honestly only because the data files I am working with came in that format, so I left them in the same format.

what advantages does TSV provide over CSV?

@Starkiller4011 tabs are a more natural separator for columnar data. They require less quoting, since values rarely contain tabs but often contain commas.

Honestly curious if there is a performance difference between the two

I'd expect the performance difference is trivial. However, like most things in data science, the real type of performance that matters is programmer efficiency. And I think TSVs are nicer to work with than CSVs.
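The quoting point above can be checked with a small sketch using Python's standard-library csv module and its default minimal quoting, which is also the documented default for pandas to_csv. The rows and the render helper are hypothetical, chosen only to show values that contain commas.

```python
import csv
import io

# Illustrative rows; the names contain commas but no tabs.
rows = [
    ["name", "city"],
    ["Smith, John", "Paris"],
    ["Doe, Jane", "Lima"],
]


def render(rows, delimiter):
    """Serialize rows with the given delimiter and default (minimal) quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


csv_text = render(rows, ",")   # comma-containing fields must be quoted
tsv_text = render(rows, "\t")  # the same fields pass through unquoted

print(csv_text)
print(tsv_text)
```

The comma-delimited output wraps each name in double quotes, while the tab-delimited output leaves the values untouched. A value containing a literal tab would need the same treatment in the TSV case.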

Not everyone agrees that tab separation is superior to csv -- I don't, for example.

As Python programmers, we know that whitespace isn't always preserved across different operations, like copying and pasting. Those of us who answer a lot of questions on SO, for example, regularly have to use sep="\s\s+" to parse text which people have dumped in a whitespace-separated format, and we have to hope they've put enough spaces between columns for that to work. If they were using commas, or semicolons, or pipes, or something, this wouldn't be a problem. (And I just thought of carets, which used to be used pretty widely in some fields.)

If we want to add a to_tsv alias to make some people happier, okay. But let's not pretend that TSV doesn't have its own headaches when you're working with it, and the only advantage I can think of is less quoting.
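The sep="\s\s+" workaround mentioned above can be sketched as follows, assuming pandas is installed. The pasted text is invented to mimic a table whose tabs were turned into runs of spaces; the regex separator needs the Python parsing engine.

```python
import io

import pandas as pd

# Hypothetical pasted text: column boundaries survive only as 2+ spaces.
pasted = (
    "name          city\n"
    "Ada Lovelace  London\n"
    "Grace Hopper  New York\n"
)

# Split on runs of two or more whitespace characters so that single spaces
# inside a value ("New York") do not start a new column.
parsed = pd.read_csv(io.StringIO(pasted), sep=r"\s\s+", engine="python")

print(parsed)
```

This only works while every column gap is at least two spaces wide, which is the fragility the comment describes.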

I think it's worth taking a step back and recognizing that a function like to_csv is kind of silly. The solution should be a more generic to_table function which requires a delimiter to be specified, and around which to_csv is just a convenience wrapper. R has this functionality in its write.table() function, which makes more sense.
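A rough sketch of that layering, assuming pandas is installed. The function names write_delimited, write_csv, and write_tsv are hypothetical and are not part of the pandas API; they simply delegate to the existing DataFrame.to_csv.

```python
import pandas as pd


def write_delimited(df, sep, path_or_buf=None, **kwargs):
    """Hypothetical generic writer: the caller must choose the delimiter."""
    return df.to_csv(path_or_buf, sep=sep, **kwargs)


def write_csv(df, path_or_buf=None, **kwargs):
    """Hypothetical convenience wrapper fixed to commas."""
    return write_delimited(df, ",", path_or_buf, **kwargs)


def write_tsv(df, path_or_buf=None, **kwargs):
    """Hypothetical convenience wrapper fixed to tabs."""
    return write_delimited(df, "\t", path_or_buf, **kwargs)


# Illustrative data only.
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

pipe_text = write_delimited(df, "|", index=False)
csv_text = write_csv(df, index=False)
tsv_text = write_tsv(df, index=False)

print(pipe_text)
print(tsv_text)
```

With this shape, the format-specific names carry no logic of their own, so adding another delimiter is a one-line wrapper rather than a new code path.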

For the record, I think both CSV and TSV are acceptable and good formats. They should both be supported. @dsm054 brings up some compelling advantages to non-whitespace delimiters.

A bigger issue in my opinion is using the .csv extension indiscriminately (e.g. when referring to TSVs). See discussion at https://github.com/pandas-dev/pandas/pull/14587. I agree with @stevekm that to_table should be a generic function where you should specify your delimiter, while to_csv or to_tsv should focus on those standards. Going about this in a backwards-compatible way would take some forethought. But at least pandas 2 should consider function names along the lines of readr.

Just starting to use pandas dataframes coming from R + tidyverse/readr and first thing I was negatively impressed by is the lack of consistent read/write methods like:

read_csv()/write_csv(): comma separated (CSV) files
read_tsv()/write_tsv(): tab separated files
read_delim()/write_delim(): general delimited files
read_fwf()/write_fwf(): fixed width files
read_table()/write_table(): tabular files where columns are separated by white-space.
read_log()/write_log(): web log files

In 20 years doing data science in genomics I never encountered a csv file, most data exists in tsv (or white-space delimited) format. Having to specify sep and quoting argument using df.to_csv() to write a tsv (or white-space delimited) file is inconvenient to say the least.

Having df.read_tsv() and df.to_tsv() for tab-delimited files, and df.read_table() and df.to_table() for white-space delimited files, would be very helpful for people coming to pandas from R.

As of pandas 0.24, read_table is now deprecated (see https://github.com/pandas-dev/pandas/issues/21948 / https://github.com/pandas-dev/pandas/pull/21954). Since I've been using read_table as a substitute for the lack of read_tsv, I am now getting many:

FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.

On the plus side, removing read_table does make it more straightforward to add both read_tsv and to_tsv functions, although is the tide turning against convenience functions as per https://github.com/pandas-dev/pandas/issues/18262?
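The replacement suggested by the FutureWarning above can be wrapped in a one-line helper. This is only a sketch, assuming pandas is installed: read_tsv is a hypothetical local name, not a pandas function, and the sample data is invented.

```python
import functools
import io

import pandas as pd

# Hypothetical helper: read_csv with the tab separator pre-filled.
read_tsv = functools.partial(pd.read_csv, sep="\t")

# Illustrative tab-delimited text standing in for a file on disk.
tsv_text = "sample\treads\nS1\t100\nS2\t250\n"
table = read_tsv(io.StringIO(tsv_text))

print(table)
```

Because functools.partial only pre-fills a keyword argument, every other read_csv option remains available to callers of the helper.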
