Pytorch: [RFC] Memory format (aka layout aka NHWC) support

Created on 10 Apr 2019 · 68 comments · Source: pytorch/pytorch

Problem statement

CNN operators utilize a canonical order of tensor dimensions and assign them semantic meaning. For the 2D case in PyTorch today, an input to torch.nn.Conv2d has to be a 4d tensor in NCHW order - (batch, channels, height, width).

For performance reasons, it's often beneficial to reorder dimensions so that the memory accessed by particular operations is laid out contiguously and locality is better utilized. The most common option is moving dimensions towards the end - NHWC. There can be even more complex memory formats that tile one dimension into blocks, e.g. MKL-DNN's nChw16c.

Example libraries utilizing it include:

  • cudnn has faster performance on Volta in NHWC
  • fbgemm and qnnpack don't support NCHW.
  • libxsmm does support NCHW but the performance penalty is something like 50% (IIRC).

The challenge is that transforming the dimension order itself is expensive, so in cases where multiple CNN operations are performed in a row (e.g. conv(relu(conv))) it's beneficial to transform to the different memory format once, carry out the operations, and reorder back.

Thus it's important to make PyTorch aware of different dimension orders and able to pass tensors with different memory formats between operations, both in eager and JIT mode. Furthermore, it's beneficial to have automatic JIT optimization passes that apply heuristics or search techniques to figure out whether changing the memory format is beneficial perf-wise and where in the model it makes sense to do it.

We strive to build an API capable of representing:

  • Tensors with different memory formats (at the beginning, just dimension order) in PyTorch, in both Eager and JIT. Blocked layouts are lower priority but still nice.
  • User-exposed APIs for querying and changing the memory format
  • Core CNN operations being able to handle input tensors with different memory formats and route to the corresponding faster implementations
  • Ability to infer and optimize memory formats in JIT passes

Terminology: the problem above is often referred to as “layout” (mxnet), “data_format” (tf), “image_format” (keras), or “order” (caffe2). We propose to use the name “memory format” or “memory_format” in PyTorch. The name “layout” is unfortunately taken in PyTorch with values 'strided' vs 'sparse_coo', so that naming option is not available.

Affected operators

At minimum, the following operators should be memory-format-aware. In addition to producing the correct result, they need to deliver the best performance from the underlying libraries AND preserve the memory format of their outputs in order to propagate explicitly specified user intent.

  • convolution
  • different kinds of pooling
  • batch norm, layer norm, instance norm (generally, whatever norms)
  • upsampling/interpolation
  • feature dropout
  • softmax, to a lesser degree - the dimension can be manually specified there, but efficient implementations are present only for the implicit NCHW layout
  • padding
  • element-wise (unary and binary) operations
  • constructors of tensors that inherit memory format, e.g. empty_like.

API and Behavior Changes

Define the concept of memory format in PyTorch:

  • Constants like torch.memory_format.channels_first. They don't have a specified type and can be arbitrary comparable objects (likely starting with an enum, but in the future they might be other objects to interop with the concept of named tensors)

    • Alternative: use torch.channels_first directly

  • Values are channels_first and channels_last (to allow for fewer constants)
  • For 1D images / 3D tensors the values mean NCW and NWC; for 2D images / 4D tensors - NCHW and NHWC; for 3D images / 5D tensors - NCDHW and NDHWC

Add the following methods to Tensor:

  • x.is_contiguous(torch.memory_format.channels_first)
  • x.to(memory_format=torch.memory_format.channels_first)

Note: there's no x.get_memory_format() function for now, only explicit checks - this allows a wider range of possible implementations. We might want to add it though.

The tensor's semantic layout always stays the same - NCHW! x.size() always returns (n,c,h,w)

Operations preserve memory format behavior:

  • convolution, pooling, etc. (see above) return output in the same memory format as the input and internally dispatch to the best implementation
  • unary element-wise operations preserve the same memory format and need to run as fast as on a contiguous tensor
  • binary element-wise operations provide some reasonable guarantees on preserving memory format (see the sketch after this list) - this can likely be defined more broadly, but the minimum is:

    • NHWC + scalar → NHWC

    • NHWC + column vector → NHWC

  • backward operations for core CNN ops preserve the same memory format as in the forward path (this might need to be enforced explicitly because incoming gradients for the output can be in a different memory format)
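The following sketch illustrates the intended guarantee for binary ops, using the stride-based is_nhwc_contiguous check from the strided-implementation section below. The comments describe the desired outcome under this proposal, not behavior that current builds already provide consistently:

import torch

def is_nhwc_contiguous(x):
    n, c, h, w = x.size()
    return x.stride() == (h * w * c, 1, w * c, c)

# build a tensor that is NHWC in memory but keeps logical NCHW sizes
x = torch.randn(2, 3, 4, 5).permute(0, 2, 3, 1).contiguous().permute(0, 3, 1, 2)

y = x + 1.0                      # NHWC + scalar
z = x + torch.randn(3, 1, 1)     # NHWC + a broadcast per-channel "column" vector

print(is_nhwc_contiguous(x))     # True
print(is_nhwc_contiguous(y))     # desired: True
print(is_nhwc_contiguous(z))     # desired: True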

Memory format is a property of a tensor that is preserved through serialization/deserialization (in case the tensor is a parameter).

Strided implementation

Tensors in PyTorch today have a concept of strides that specify how the logical tensor is laid out in memory. Specifically, each tensor has a strides vector of the same length as sizes. In order to look up an element at logical index (i0, i1, ..., ik), one takes the dot product with the strides and reads memory at offset + i0*stride0 + i1*stride1 + ... + ik*stridek. Contiguous tensors thus have strides that are reversed cumulative products of the sizes. For example, a 4D tensor with sizes (n,c,h,w) has strides (c*h*w, h*w, w, 1).
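As an illustration, the contiguous strides for a given sizes vector can be computed as reversed cumulative products (a small helper, not part of the proposal):

def contiguous_strides(sizes):
    # reversed cumulative products of sizes: the last dimension is densest (stride 1)
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return tuple(strides)

print(contiguous_strides((10, 3, 32, 32)))  # (3072, 1024, 32, 1), i.e. (c*h*w, h*w, w, 1)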

Strides can be used to represent different memory formats (those that are dimension re-orderings) physically while preserving the logical default NCHW order. This gives an effective definition of the memory format transformation:

# implementation of x.to(channels_last)
def to_mem_format_nhwc(x):
    return x.permute(0,2,3,1).contiguous().permute(0,3,1,2)

# implementation of x.to(channels_first)
def to_mem_format_nchw(x):
    return x.contiguous()

In NHWC format the strides vector is (c*h*w, 1, c*w, c). Thus the values in the memory buffer are laid out contiguously in NHWC order.
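For example, with sizes (10, 3, 32, 32) the conversion above produces the following sizes and strides:

import torch

n, c, h, w = 10, 3, 32, 32
x = torch.randn(n, c, h, w)
y = x.permute(0, 2, 3, 1).contiguous().permute(0, 3, 1, 2)  # to_mem_format_nhwc from above

print(y.size())    # torch.Size([10, 3, 32, 32]) - logical order stays NCHW
print(y.stride())  # (3072, 1, 96, 3) == (c*h*w, 1, c*w, c)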

Strides can be used for testing:

def is_nhwc_contiguous(x):
    return x.permute(0,2,3,1).is_contiguous()

# or alternatively
def is_nhwc_contiguous(x):
    n,c,h,w = x.size() # in any case the sizes remain in NCHW order
    return x.stride() == (c*h*w, 1, c*w, c)

def is_nchw_contiguous(x):
    return x.is_contiguous()


# operator implementations can just check contiguity and operate directly on the data pointer
def my_sample_op(x):
    if x.is_contiguous(nhwc):
        p = x.data_ptr()  # raw pointer; in practice this branch would live in C++
        # Open question: do we need to go to C++ here, or can we have an example in Python?
        n, c, h, w = x.size()
        # operate on `p` as it's guaranteed to be an (n, h, w, c) array
        y = my_nhwc_op(p)
        # Open question: do we need to convert the layout of y?
        return y
    else:
        # need to convert x to NHWC layout first
        x = x.permute(0, 2, 3, 1).contiguous()
        p = x.data_ptr()  # Open question: is this needed?
        y = my_nhwc_op(p)
        return y.permute(0, 3, 1, 2).contiguous()

Pros of this approach:

  • Utilizes existing PyTorch concept of strides without adding new top-level ideas or API parameters
  • Preserves logical behavior of tensor in canonical NCHW order
  • Works for arbitrary reordering of input dimensions
  • Existing serialization routines already preserve the strides of a tensor
  • Ability to reuse many existing operations across different memory layouts

Cons:

  • Calling .contiguous() is equivalent to switching to NCHW and may happen by accident, either from the user or inside one of the ops

    • Explicit audit of operators is needed to ensure they preserve memory format

  • Doesn't work for blocked / tiled formats - a different approach is needed

    • It's possible to consider adding them as first-class citizens in PyTorch, but it's a much bigger change

    • Alternative is to treat them as opaque handles, e.g. MKLDNN tensors

  • Performance characteristics of underlying implementations are less obvious to the end user

The biggest potential problem is unclear user intent. There's no way to distinguish whether the user really wanted a different memory format or the input tensor just happened to be strided this way. Specifically, it leads to a behavior change for existing operations - today convolution can only produce NCHW-contiguous tensors even if the input is arbitrarily strided; in the new world it might recognize the input as NHWC and thus return NHWC too. It doesn't change semantics but leads to hard-to-debug performance issues. A possible solution might be to tag tensors explicitly with a user-specified memory_format flag and only follow this annotation (in addition to strides).

To solve the above issue, the initial proposal is to introduce a “soft” memory format tag on the tensor that records the last to(memory_format) call done on the tensor. Operators would need to propagate this annotation to the outputs. The annotation is “soft”, so we won't hard-error on mismatching annotations but rather produce warnings in profiling mode.
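A minimal sketch of how such a soft tag could be propagated, assuming a hypothetical _memory_format_tag attribute used purely for illustration (no such attribute exists in PyTorch):

import warnings

def tag_memory_format(x, fmt):
    # hypothetical soft annotation recording the last requested memory format
    x._memory_format_tag = fmt          # torch.Tensor accepts ad-hoc Python attributes
    return x

def propagate_memory_format(inputs, output):
    # operators would copy the annotation from inputs to outputs; a mismatch is not
    # a hard error, it only produces a warning (e.g. in profiling mode)
    tags = {getattr(t, "_memory_format_tag", None) for t in inputs} - {None}
    if len(tags) == 1:
        output._memory_format_tag = tags.pop()
    elif len(tags) > 1:
        warnings.warn("mismatching memory format annotations: %s" % sorted(tags))
    return output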

Operator implementations

The signatures of existing operators don't change. Operators can do hard-coded dispatch inside the operator to route to a faster implementation. If an implementation is not available, round-tripping through a different memory format is possible. The alternative would be raising an error.

def maxpool(x: Tensor):
    if x.is_contiguous(torch.memory_format.channels_last):
        return max_pool_impl_nhwc(x)
    return max_pool_impl_default(x.contiguous())

It's preferred to use a single symbol like 'conv' to refer to the operator in JIT IR instead of creating separate operators like 'conv_nhwc'. The reason is simplicity and keeping the IR at the level of semantic representation.

Element-wise operations

We have to ensure that core operations, element-wise ops in particular, preserve memory format and are efficient.

Unary operations can be handled generically by verifying whether a block of memory is “dense” - i.e. whether the elements span an area without gaps and each memory location is used exactly once. This can be verified with a simple algorithm:

def is_dense_format(x):
    p = 1
    for s, d in sorted(zip(x.stride(), x.size())):
        if s != p:
            return False
        p *= d
    return True

def my_unary(x):
    if is_dense_format(x):
        return contig_memory_impl(x.data(), x.numel())
    return default_strided_impl(x)

# is_dense_format can be used in implementations of e.g. empty_like too

Performance tooling

For debugging performance we should add support to the profiler for:

  • seeing where in the program actual memory reorderings occur - i.e. track calls to .contiguous()
  • tracking which implementation is invoked
  • issue warnings on memory format changes in e.g. binary ops (where “soft” annotation is useful)

This functionality can be built into an on-demand profiling tool.

Autograd handling

It's logical to expect that the backward pass runs with the same memory format as the forward pass. That won't always happen automatically, as incoming gradients might be arbitrarily strided. Thus the forward pass has to explicitly recognize the memory format, store it in the autograd closure, and apply it to the grad tensor before the backward function.

Possible implementation:

def conv_backward(input, weight, grad_output, grad_weight, grad_input):
  if input.is_contiguous(torch.memory_format.channels_last):
    grad_output = grad_output.to(torch.memory_format.channels_last)
    return conv_backward_nhwc(...)
  else:
    grad_output = grad_output.contiguous()
    return conv_backward_nchw(...)

Representation in JIT

Current proposal is to have:

  • No first-class handling of memory format in type annotations just yet. Instead, we can maintain a lookaside map as needed for passes that manipulate memory format
  • An inference pass (similar to shape_inference) that produces per-Value format annotations
  • Memory format transformation passes (manual or automatic) that find where to(memory_format) calls need to be inserted for optimal performance

For enforcement purposes, we can also utilize statements like assert x.is_contiguous(channels_last).

Note: there's a question of where to store the information that a particular device has a preferred memory format combination (for example, qconv on x86 routes to fbgemm, which implements NHWC only). One option is to put it at the op registration level; however, memory format annotation feels more like side information. We can start by maintaining a global map somewhere in a JIT pass that records preferred memory formats and the associated heuristics. If it gets untidy, we can switch to a registration-based mechanism.

Beyond: blocked layouts

If we decide to add more complex packings of tensors, using a first-class PyTorch tensor for them might not be feasible because of the high implementation cost and complexity. Two alternatives are possible:

  • Opaque representations like custom C type bindings. This is the option to choose for inference packing, where there is more diversity in terms of perf optimizations
  • A first-class tensor type like MKLDNNTensor, with some (but not all) of the operations bound to this new type

Yet another alternative is to implement native support for blocking/tiling in core PyTorch Tensor class.

Named tensor relation

The existing proposal for NamedTensor is structured as a type-checking mechanism on tensors - at the moment it doesn't assign any semantic meaning to dimension names. Thus the only way to infer the meaning of an activation tensor is to continue using the predetermined NCHW format. This makes NamedTensor and the current proposal orthogonal.

If we're willing to hard-specify the meanings of some names (like “channels”, “width”), operators can utilize this information to route to a faster implementation. It'd be a semantic change though, as the input tensors would logically have NHWC (not NCHW as today) memory format.

Prior art

TensorFlow supports both NHWC and NCHW at the operator level via the data_format parameter; acceptable values are (“NHWC”, “NCHW”) for 4-d inputs, (“NDHWC”, “NCDHW”) for 5-d inputs, or channels_first / channels_last independent of input dimensionality. It is up to the user to set the parameter correctly, i.e. it is not tracked automatically by the tensor.
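For instance, the per-op flag looks roughly like this (a sketch; exact signatures vary between TensorFlow versions):

import tensorflow as tf

x = tf.random.normal([8, 32, 32, 3])    # NHWC input
w = tf.random.normal([3, 3, 3, 16])     # HWIO filter
y = tf.nn.conv2d(x, w, strides=1, padding="SAME", data_format="NHWC")
# the same op with data_format="NCHW" would expect an [8, 3, 32, 32] input;
# nothing on the tensor itself records which convention is in use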

In Caffe2 this parameter is called order rather than data_format, but it's still applied explicitly at the individual operator level.


Appendix: Other options considered

Litmus question: what does the following code return: tensor_in_nhwc_layout.size(1) - the number of channels (because the default is NCHW in PyTorch), or the height (because that's what sits at position 1 in NHWC layout)?

Based on this answer several options are possible:

  • Option A - Strides (presented above). Tensor layout is a completely internal representation. Implementation-wise, it's most conveniently done with strides.

    • .size(1) returns “channels”, but the internal memory is laid out differently

    • pro: doesn't change the code of the model; the model can still do dimension arithmetic directly. In fact, none of the public API changes

    • cons: in strides implementation many operators call .contiguous() and can accidentally revert the layout back

    • cons: From a user perspective, understanding the guarantees of what an op returns is paramount. This IMO eliminates strides-only approaches, because it becomes very difficult to understand the format your op's output will be returned in, and there's no API to say “ignore my strides, actually just return the NCHW-contiguous thing.” This is in addition to the limitations above.

  • Option B - Explicit NHWC tensor. The user explicitly manipulates a tensor that has a different dimension order, but the tensor itself doesn't know anything about it. We'd need some annotation at the operator level to figure out what the user expects.

    • .size(1) returns “height”

    • pro: no magic and very predictable

    • cons: changing a model from one layout to another becomes a complex operation that needs to track all accesses to .size() and .reshape() (or you need to make it explicit in the API?)

  • Option B' - Explicit NHWC tensor with a layout flag. Same as above, but we allow attaching an annotation to the tensor to mark its semantic layout, which ops consume in their implementations. There's no need for operator-level annotation then - an operator can dispatch based on the layout flag of the inputs.
  • Option C - Named Tensor. (https://docs.google.com/document/d/1ynu3wA2hcjwOtEng04N904gJjEbZWcINXO_ardX6hxc/edit#heading=h.2gbe5xpga3w9)

    • .size(1) returns “height”, but we ask people NOT to use this API and instead use .size('channel')

    • pro: very explicit and what the user wants

    • con: doesn't solve the transition problem; we'd need to force all code written with layout awareness to use named tensors. If not, the same problems as above apply

  • Option D - Layout is opaque tensor type. Treat NHWC as we treat MKLDNN or SparseTensor - separate tensor type with different DispatchID. It's like Option A but with different tradeoffs on default behavior - non-implemented ops would fail instead of reverting to NCHW.

    • .size(1) still returns “channels”

    • pro: no magic and explicit, separate dispatch allows ops to decide what they want

    • pro/con: all necessary operators need to be implemented for the different layout; if some op is missing, the user would get an explicit error that it's not supported

    • cons: we would probably need to ban many operations on it, e.g. views, because the expected results are hard to predict

internals mkldnn triaged

All 68 comments

There is one problem with empty_like; the currently defined semantics are that you drop all stride information, so, it's not possible to preserve layout and be BC.

@VitalyFedyunin is signed up to implement the .contiguous() and torch.memory_layout bits

One question - for a 4D tensor x with sizes (n, c, h, w)

x = torch.randn(n,c,h,w)
# x.size(): (n, c, h, w)
# x.stride(): (c*h*w, h*w, w, 1)

We have a weird permutation

y = x.permute(0, 3, 1, 2)
# y.size(): (n, w, c, h)
# y.stride(): (c*h*w, 1, h*w, w)

Now we check whether it is contiguous for NHWC format. Following your logic as below

def is_nhwc_contiguous(x):
    return x.permute(0,2,3,1).is_contiguous()

# or alternatively
def is_nhwc_contiguous(x):
    n,c,h,w = x.size() # in any case the sizes remain in NCHW order
    return x.stride() == (c*h*w, 1, c*w, c)

For both cases is_nhwc_contiguous(y) will return True?

This is correct. However we can't rely only on strides, as we want to avoid any conversions back and forth during copy, to, and similar operations.

What if strides have the same order as the memory format? Let's use a 4D tensor as an example. To describe a tensor, we have sizes, strides and stride_indexes:

sizes in (n, c, h, w)
strides in physical order, i.e.

  • strides of (n, c, h, w) if format is nchw
  • strides of (n, h, w, c) if format is nhwc.

stride_indexes maps strides to nchw size:

  • (0, 1, 2, 3) if format is nchw,
  • (0, 2, 3, 1) if format is nhwc.

For nchw format this is same as before. For nhwc, it will be similar.

def is_nhwc_contiguous(x):
     n,c,h,w = x.size()
     return x.stride() == (h*w*c, w*c, c, 1)

def is_nchw_contiguous(x):
    n,c,h,w = x.size()
    return x.stride() == (c*h*w, h*w, w, 1)

def is_nchw_format(x):
    return x.stride_index() == (0, 1, 2, 3) 

def is_nhwc_format(x):
    return x.stride_index == (0, 2, 3, 1)

def is_contiguous(x):
    if (is_nchw_format(x)):
        return is_nchw_contiguous(x)
    else if (is_nhwc_format(x)):
        return  is_nhwc_contiguous(x)
    else:
        warning_not_support()

# or, to use stride_index
def is_contiguous(x):
    si = x.stride_index
    return x.stride() == (
        x.size[si[1]] * x.size[si[2]] * x.size[si[3]],
        x.size[si[2]] * x.size[si[3]],
        x.size[si[3]],
        1)

This can also be extended to support blocked format. Use nChw16c as an example,

sizes: (n, c, h, w)
block_sizes: (n, c/16, h, w, 16)
strides: strides of (n, c/16, h, w, 16)
stride_indexes: (0, 1, 2, 3, 1)  # assume blocked dimension is always in dense (i.e. on the right side of major dimension)

More details can be further explored later on.

For OPs that accept only nchw contiguous tensors, there will be some work here.

Alternatively we can also change the prototype slightly, say

def is_contiguous(format=nchw):
    ...
def contiguous(format=nchw)
    ...

Thus by default, it assumes only nchw is contiguous. In this way you don't need to rewrite those OPs; the tensor will be reordered to nchw automatically.

We strive to build an API capable of representing:

  • Tensors with different memory formats (at the beginning, just dimension order) in PyTorch, in both Eager and JIT. Blocked layouts are lower priority but still nice.
  • User-exposed APIs for querying and changing the memory format
  • Core CNN operations being able to handle input tensors with different memory formats and route to the corresponding faster implementations
  • Ability to infer and optimize memory formats in JIT passes

Great proposal! Let me state my understanding to see if it's right (including the proposals for MKL-DNN format handling):

Suppose there were an implementation of this proposal as a "format" class. As long as it provides the querying and changing APIs as virtual methods, we could do the inheritance/extensions that fit MKL-DNN's complex formats - or other methods, as long as it provides a framework for handling formats, offloading those nitty-gritty details to us.

About the OP implementations: each OP could have a preferred format that maximizes its performance and a compatible format that is merely functional. Element-wise operators (or, more generally speaking, memory-bound OPs) are supposed to have no preference. An OP produces its result tensor with a "format" object; this format object guarantees query/changing semantics compatible with the default PyTorch expectations, and it can also handle specific formats across a series of optimized functions (like the conv2d(ReLU(conv2d)) case).

@uyongw I want to clarify a little more about your first example. You setup the example as, "I have a NCHW tensor, which I then transposed in a weird way (so now it looks like NWCH); now I want to know if it's NHWC contiguous." But that's the wrong way of looking at it. A better formulation is, "I have an NHWC tensor, which I then transposed into a NCHW tensor."

To put it differently, there is no intrinsic meaning to the physical dimensions of a tensor (when we ignore strides). We only give meaning to them when we consider how we reference them with respect to strides.

To describe a tensor, we have sizes, strides and stride_indexes

I do think stride_indexes is a convenient way to think about the problem, but it's strictly redundant with strides, because all you're saying is "Apply this (reverse?) permutation to strides, and then treat that as the true strides." @VitalyFedyunin and I were talking about how it might still be a good idea to cache this information in some way, because it is a pain to reconstruct the information from the strides themselves. But this is out of scope for this proposal.

Thus by default, it assumes only nchw is contiguous.

Yep, that's my reading of the plan.

@CaoZhongZ

Suppose there were an implementation of this proposal as a "format" class. As long as it provides the querying and changing APIs as virtual methods, we could do the inheritance/extensions that fit MKL-DNN's complex formats - or other methods, as long as it provides a framework for handling formats, offloading those nitty-gritty details to us.

I actually don't think that is an accurate description of the proposal. The memory layouts that this proposal supports are only layouts that can be expressed through strides. Anything that is inexpressible this way (e.g., block layout) won't work, and has to be supported by our more heavyweight "layout" mechanism.

To put it differently, there is no intrinsic meaning to the physical dimensions of a tensor (when we ignore strides). We only give meaning to them when we consider how we reference them with respect to strides.

Partly agree :-) But not on this specific problem. Say I already have an nhwc tensor. Then I permute it to nwhc. I want to further permute it to nhwc and then do a contiguous() call. But it's already nhwc contiguous. Isn't that confusing?

I do think stride_indexes is a convenient way to think about the problem, but it's strictly redundant with strides, because all you're saying is "Apply this (reverse?) permutation to strides, and then treat that as the true strides."

IMHO, it won't be redundant with strides if you have strides in nhwc (physical) order, because you need the right mapping to the sizes (logical). Otherwise there is no way to tell the real order.

BTW, there is a more straightforward approach using a reverse mapping. Say, for nchw it is (0, 1, 2, 3), and for nhwc it is (0, 3, 1, 2) instead of (0, 2, 3, 1). That is, the stride_index itself is always in NCHW order too. But the problem is, it cannot be extended to blocked formats like nChw16c or OIhw16i16o.

Blocked formats require a completely different set of operator implementations; for that reason, we prefer not to mix them with 'memory format', which is by definition supposed to be friendly to all existing operators and work with the same or better performance.

Partly agree :-) But not on this specific problem. Say I already have an nhwc tensor. Then I permute it to nwhc. I want to further permute it to nhwc and then do a contiguous() call. But it's already nhwc contiguous. Isn't that confusing?

It is hard to understand your example because you are using some terms colloquially and precision is needed. Here is how I am interpreting what you have said:

  • An "nhwc" tensor to be as per this proposal, "Tensor whose physical layout is NHWC, but is strided so that the logical layout is NCHW."
  • To "permute a (tensor whose logical layout is NCHW) tensor to (logical layout) NWHC" is to run y = x.permute(0, 2, 3, 1), since you are permuting the logical layout, not the physical layout. (I suspect this is not what you meant, because in your original post you mentioned the permutation x.permute(0, 3, 1, 2)
  • To then further permute a (logical layout) NWHC tensor to (logical layout) NHWC is to apply the permutation z = y.permute(0, 2, 3, 1). So now you have a tensor whose logical layout coincides with the physical layout. This means that if we ask z.contiguous() we will get true (and, confusingly, z.contiguous(memory_layout=NCHW) will be true too.) But it will NOT be NHWC contiguous.

I don't think this is the example you had in mind, in which case you will have to be more precise about what you mean by "permute".

IMHO, it won't be redundant with strides if you have strides in nhwc (physical) order, because you need the right mapping to the sizes (logical). Otherwise there is no way to tell the real order.

This is the crux of the proposal: we privilege NCHW as the logical layout, always. So if I have a 4D tensor that I know nothing about, I assume that its logical layout is NCHW. That removes the ambiguity. If you want to deal in tensors whose logical layout is not NCHW, I do think the API as stated makes life a bit hard for you.

@dzhulgakov

Operations preserve memory format behavior

If physical NHWC tensors can occur purely through strides, this is technically BC-breaking, unless you make them only preserve memory format when the memory format tag is present (but it sounds like you don't want this to have semantic meaning, so I am not sure what the proposal is currently suggesting.) I'm not sure if this actually breaks anyone's code in practice though.

If physical NHWC tensors can occur purely through strides, this is technically BC-breaking, unless you make them only preserve memory format when the memory format tag is present (but it sounds like you don't want this to have semantic meaning, so I am not sure what the proposal is currently suggesting.) I'm not sure if this actually breaks anyone's code in practice though.

Assuming we can make memory format 'sticky': an op over a memory-formatted tensor will produce a memory-formatted tensor. That will solve the BC problem.

However, we need to define the behavior of binary (or n-ary) operations when tensors have different memory formats.

@ezyang Oh I just found there is a typo in my above reply. (I am sorry for that. However the original example is still correct.) Let me restate it as below:

  1. I have a NCHW tensor (physically, contiguous).
  2. Then I permute it to NWHC (logically).
  3. I want to further permute it to NHWC with a contiguous() call followed.
  4. Use it as NHWC (physically).

But I got it NHWC contiguous already after step 2. Then I may skip step 3 and use it as NHWC directly in step 4. But this is surely not correct because the tensor's physical order does not change at all.

Blocked formats require a completely different set of operator implementations; for that reason, we prefer not to mix them with 'memory format', which is by definition supposed to be friendly to all existing operators and work with the same or better performance.

Yes we can enable NHWC as the first step. However I don't actually think blocked format is really something totally different. It can be naturally expressed (with some good abstraction). If there is a general format description, then others can just register new formats with arbitrary blocking/strides.

Moreover, if we have blocked support already, we don't need to create hidden constructs to run everything underneath, which creates an implicit world inside, and the conversion from/to between the two worlds may become an issue.

Anyway, it may be too early to think about blocked formats. But I would think that, if possible, it's better to make the design extensible.

But I got it NHWC contiguous already after step 2. Then I may skip step 3 and use it as NHWC directly in step 4. But this is surely not correct because the tensor's physical order does not change at all.

OK, I understand your example now. You may indeed stop at step 2 and use it as if it were an NCHW tensor; in which case, you will improperly interpret W as C, etc. This is definitely a downside with the stride-based implementation (@dzhulgakov, we should probably add this to the proposal). The proposal has some provision for this case:

To solve the above issue, the initial proposal is to introduce a “soft” memory format tag on the tensor that records the last to(memory_format) call done on the tensor. Operators would need to propagate this annotation to the outputs. The annotation is “soft”, so we won't hard-error on mismatching annotations but rather produce warnings in profiling mode.

The soft memory format tag would let you distinguish an NCHW tensor that you permuted from a tensor that is actually, physically, NHWC. But the soft tag in its current form is non-binding, so I'm not sure how useful it would actually be for this case.

Another way to solve the problem is with named tensors. With named tensors, we can use the names on the (logical) dimensions to figure out if we are viewing a tensor as NCHW (the assumed default) or something else.

However I don't actually think blocked format is really something totally different. It can be naturally expressed (with some good abstraction). If there is a general format description, then others can just register new formats with arbitrary blocking/strides.

There's more commentary on the topic here: https://github.com/pytorch/pytorch/issues/16038#issuecomment-454490374

@ezyang Thanks for the reply. Yes, a soft format tag may help. The concern is that it may not be flexible enough, since the dimension order can be arbitrary. Also, the tag itself is not computable. Named tensors have semantic meaning for each dimension, but I suspect they may need some more facilities to support this.

Personally I would think this can be solved by introducing a map from the strides order (physical) to the NCHW sizes order (logical). As I proposed above, for NCHW it is almost the same as the current design; for NHWC, sizes is still NCHW, and strides will be in (N, H, W, C) order. And we use stride_index = (0, 2, 3, 1) to specify the dimension index of each stride.

Moreover, the combination of strides and stride_index can be used to represent any tensor format. This may give others the flexibility to register new data formats.

@ezyang

Operations preserve memory format behavior

If physical NHWC tensors can occur purely through strides, this is technically BC-breaking, unless you make them only preserve memory format when the memory format tag is present (but it sounds like you don't want this to have semantic meaning, so I am not sure what the proposal is currently suggesting.) I'm not sure if this actually breaks anyone's code in practice though.

When arithmetic operations and threshold were moved to TensorIterator, that was technically BC-breaking (because the memory format of operands used to not be preserved, and TensorIterator preserves it). The status quo now is very inconsistent - threshold preserves layout, all other unary operations don't, torch.where does not, arithmetic operations preserve layout if both operands have the same layout, but default to "nchw" (i.e. a tensor that is contiguous in the current sense) if there is a mismatch; I'm not sure what happens for broadcasting.
You are also making a good point about empty_like and the like preserving layout not being BC. Perhaps it will also need a memory_format argument, like is_contiguous in the proposal:

x.is_contiguous(torch.memory_format.channels_first)

@ezyang @ngimel

There is one problem with empty_like; the currently defined semantics are that you drop all stride information, so, it's not possible to preserve layout and be BC.

You are also making a good point about empty_like and the like preserving layout being not BC.

If we don't rely on strides to express physical order, empty_like does not necessarily break BC. There are 3 kinds of dimension info in a tensor:

  • shape: sizes
  • logical order: order info recorded in strides (typically used to support transpose or permute)
  • physical order: NCHW or NHWC (can be expressed as stride_index as I proposed).

Currently the physical order is the same as shape/sizes, so we just drop the logical order in strides. If we decouple shape and physical order, we can also just drop the logical order but preserve shape and physical order for empty_like. That means both size() and stride_index() will be preserved, but stride() will be reset. In particular, empty_like of an NHWC tensor will return an NHWC contiguous tensor with the same shape info.

@uyongw I'm not sure it would be a good idea to change empty_like; right now its semantics match numpy's empty_like.

The status quo now is very inconsistent - threshold preserves layout, all other unary operations don't, torch.where does not, arithmetic operations preserve layout if both operands have the same layout, but default to "nchw" (i.e. a tensor that is contiguous in the current sense) if there is a mismatch; I'm not sure what happens for broadcasting.

@ngimel, yes, these are not very consistent right now. I think part of working out how to represent memory format is to get our operators into a consistent state.

@zou3519 numpy's empty_like that you linked has an order argument that defaults to "match the layout of prototype as closely as possible". That's not what empty_like in pytorch does currently (it returns an "nchw"-contiguous tensor, even if the prototype is discontiguous).
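Concretely, numpy's default (order='K') keeps an F-ordered prototype F-ordered, e.g.:

import numpy as np

a = np.asfortranarray(np.ones((2, 3)))   # column-major prototype
b = np.empty_like(a)                     # default order='K' matches the prototype's layout
print(b.flags['F_CONTIGUOUS'])           # True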

Oh, I see, I was reading that too quickly. In that case it would be nice to have our empty_like match numpy's as well and it would (probably?) be good to have for memory layout here as well

@zou3519 Yeah, what I am trying to say is to keep the current semantics (drop logical order as @ezyang and @ngimel mentioned) and at the same time preserve physical layout like numpy's defaults. Thus for an NCHW prototype the behavior will be the same as before. For an NHWC prototype the behavior will still be compatible, i.e., the new tensor will be NHWC contiguous, instead of the NCHW contiguous you'd get with the current implementation.

Two questions:

  • What happens if a NHWC tensor is added to a NCHW tensor?
  • What about addressing the disadvantage of (B) by creating methods like t.channel_dim() on a tensor that return the integer value indicating where the dimension is physically? This approach may even be required to allow other formats, like block formats, be chosen without network changes.

If we address the con of (B) with the last bullet point, then (B) seems preferable to me. It's intuitively clear and logical errors should be easy to detect. All existing ops can work on the tensor, too, since it looks like any other contiguous tensor. Ops that can understand semantics (analogous to the named tensor proposal) will perform as expected, too.

@zou3519 numpy's empty_like that you linked has an order argument that defaults to "match the layout of prototype as closely as possible". That's not what empty_like in pytorch does currently (it returns an "nchw"-contiguous tensor, even if the prototype is discontiguous).

We are planning to keep format in such cases (for memory formatted tensors)

What happens if a NHWC tensor is added to a NCHW tensor?
An operation with a memory-formatted tensor will return a memory-formatted tensor. If both tensors are memory formatted, the output format is determined by the first tensor.

Two things I would add:

We are planning to keep format in such cases (for memory formatted tensors)

We'd need to audit existing usages, because often operators will call empty_like and then assume they are NCHW contiguous. And I don't know how we'd deal with third party code. It seems like we'd need a different default than numpy if we want to preserve BC.

An operation with a memory-formatted tensor will return a memory-formatted tensor. If both tensors are memory formatted, the output format is determined by the first tensor.

I'd also add, if you really care what format your output comes in -- pass in an output tensor.

Agree on empty_like, there are quite a few cases where the result of empty_like/zeros_like etc. is assumed to be nchw-contiguous (physically contiguous, I should say; in many cases these are not image operations).
Passing an output tensor is not an option in most cases, because functions with the out kwarg are not differentiable.

Many of our problems come from the inconsistency of expected output layouts. We can't solve them all at once, but we can try to lock down the current state (at least for strides) and nail them down one by one. So here is the proposal.

Python API

Introduce new torch.memory_format

torch.memory_format.any # default value
torch.memory_format.preserve
torch.memory_format.contiguous # what most of the functions now behave as default
torch.memory_format.nchw # requires 4D tensor, contiguous memory
torch.memory_format.nhwc # requires 4D tensor, restrided/permuted memory

The tensor will require explicit memory format conversion

x = torch.zeros((10,3,32,32)) # NCHW
x.permute(0,2,3,1).is_contiguous(memory_format=torch.memory_format.nhwc) == False # because memory is still laid out as NCHW

To 'tag' them with specific format:

y = x.to(memory_format=torch.memory_format.nhwc)
y.is_contiguous(memory_format=torch.memory_format.nhwc) == True # We got new tensor with proper memory layout
y.is_contiguous() == False # Required for back compatibility
y.stride() == (3072, 1, 96, 3)

Now about empty_like and similar:

z = torch.empty_like(y) 
z.is_contiguous() == True # For BC

Because it is actually:

z = torch.empty_like(y, memory_format=torch.memory_format.any ) 

If we want to keep format:

z = torch.empty_like(y, memory_format=torch.memory_format.preserve) 
z.is_contiguous() == False 
z.is_contiguous(memory_format=torch.memory_format.nhwc) == True

Similarly:

z = torch.empty_like(y, memory_format=torch.memory_format.nhwc) 
z.is_contiguous() == False 
z.is_contiguous(memory_format=torch.memory_format.nhwc) == True

That means we can slowly define each function's memory_format default to match the current state of the world, classify them, and be mindful about how we change them in the future.

If you specify an out tensor, TensorOptions are currently ignored (in the best case they throw an exception if, for example, the passed device option mismatches the out tensor's device).

The memory format is supposed to be light, so any permutation will lose it.

x = torch.zeros((10,3,32,32), memory_format=torch.memory_format.nhwc)
x = x.permute(0,1,3,2).permute(0,1,3,2)
x.is_contiguous(memory_format=torch.memory_format.nhwc) == False (even if strides are similar)

Not sure about padding, will appreciate help here.

However, we can make x.to(memory_format=torch.memory_format.nhwc) 'tag' the tensor with the proper format and return self.

Multiprocessing

Will preserve memory format 'tag'

Block memory formats

The API above does not rely on dimensions/strides/sizes, which means we can extend the functionality in the future while keeping the same API.

Internal APIs

Operators would be able to branch based on memory format

if (self.memory_format(nhwc)) {
  // fast path
} else {
  // classic implementation
}

If we make memory_format a TensorOption, we can think about branching at the dispatch level (similarly to device, layout).

Small piece of feedback on @VitalyFedyunin's proposal - I think requiring 4D tensors here

torch.memory_format.nchw # requires 4D tensor, contiguous memory
torch.memory_format.nhwc # requires 4D tensor, restrided/permuted memory

is way too restrictive (because we also want to handle 1D and 3D in addition to 2D), and channels_first/channels_last from the original proposal were more accommodating for this purpose.

Agree, we need better naming. channels_first sounds almost right except batch goes first =)

I like your latest proposal. Would the handling of .contiguous() change? Would you require .contiguous(memory_format=<...>)? If so, and a lot of ops simply call .contiguous(), they could still be formatting the memory improperly. Many operations today also allocate outputs as empty_like(), which would have the same effect. Would the plan be to update these to detect the memory format of the inputs and make the correct contiguous and empty_like calls?

As of right now, our users (and all libraries) expect .contiguous() to return a memory-contiguous tensor with strides in descending order.

We can't break this contract. However, the good news is: as soon as we support the memory_format option, the JIT will be able to understand when it is more efficient to call .contiguous(memory_format=...) instead of the classic format.

@VitalyFedyunin Do we assume that operations like below are not allowed?

x = torch.zeros(10,3,32,32)
# x is in nchw (default)
# x.size() is [10,3,32,32]
# x.stride() is [3*32*32, 32*32, 32,1]
x = x.permute(0,2,3,1)
# At this point 
# x.size() is [10,32,32,3], size is not in nchw order
# x.stride() is [3*32*32, 32,1,32*32]

# How can this be supported?
y = x.to(memory_format=torch.memory_format.nhwc)

One more variant would be:

x = torch.zeros(10,3,32,32)
# `x` is in nchw (default)
# x.size() is [10,3,32,32]
# x.stride() is [3*32*32, 32*32, 32,1]
x = x.permute(0,2,3,1)
x=x.contiguous()
# At this point 
# x.size() is [10,32,32,3], size is not in nchw order
# x.stride() is [32*32*3, 32*3,3,1]

# How can this be supported?
y = x.to(memory_format=torch.memory_format.nhwc)

@raghuramank100 - why would a user call .permute(0,2,3,1) in the first place? All tensors in this proposal have a semantic size of (n,c,h,w), meaning that size(1) returns channels. That's what PT's standard library assumes today and what it'd assume in this proposal too. So one would likely never call .permute at all.

Could a context manager be useful to allow the user to override the memory format of tensors allocated within the manager's scope to a specific format?

with torch.memory_format(torch.memory_format.nhwc):
    # a will be allocated with the context managed memory format   
    a = torch.randn(...)

# b will be allocated matching some assumed default format
b = torch.randn(...)

I don't like the idea of context manager, as it will loosen up control of memory_format.

For example:

with torch.memory_format(torch.channels_last):
  x = torch.randn(10,3,32,32) # this one is NHWC
  y = torch.randn(10,10) # this one is not

Whereas an explicit memory_format makes it clear:

x = torch.randn(10,3,32,32).to(memory_format=torch.channels_last) # this one is NHWC
y = torch.randn(10,10).to(memory_format=torch.channels_last) # This errors out as dim == 2

If necessary we can add syntax to allow:

x = torch.randn(10,3,32,32, memory_format=torch.channels_last)

@raghuramank100 there is no need to permute.

y = x.to(memory_format=torch.channels_last)

Will do all dirty work for you, keeping dims order the same as in x.

So:

x = torch.randn(10, 3, 32, 32)
nhwc = x.to(memory_format=torch.channels_last)
self.assertFalse(nhwc.is_contiguous())
self.assertTrue(nhwc.is_contiguous(memory_format=torch.channels_last))
self.assertEqual(nhwc, x)

And you can keep addressing nhwc in this format

nhwc[N][C][H][W]

@VitalyFedyunin That makes sense.

From a user's point of view, the naming of the method (if it stays like this) seems misleading to me, as "to" is already the recommended way of transferring a Tensor to a different device.

Also, what about something like NumPy's functions for converting C-order and F-order arrays?

numpy.asfortranarray()
numpy.ascontiguousarray()

One can easily imagine something like:

torch.randn(32, 3, 64, 64).to(device).as_nhwc()

@VitalyFedyunin: I understand that the conversion to a different memory_format eliminates the need for users to permute manually. However, once this functionality is available in torch, what would happen if users called the functions in the sequence I outlined above? We should at least have a warning/error message stating that the layout transformation failed.

@VitalyFedyunin: I understand that the conversion to a different memory_format eliminates the need for users to permute manually. However, once this functionality is available in torch, what would happen if users called the functions in the sequence I outlined above? We should at least have a warning/error message stating that the layout transformation failed.

This is going to be possible only when we implement named tensors. Because right now:

x = torch.zeros(10,10,10,10)
x = x.permute(0,2,3,1)

Nobody can tell me if I just created nchw or nhwc.

Perhaps I misunderstood the original proposal, but isn't the recorded memory format tag supposed to disambiguate this situation?

@VitalyFedyunin Makes sense, we need to make sure that this is communicated to end users when this API stabilizes.

@dzhulgakov @VitalyFedyunin After reviewing #19975, I have some new concerns about the recorded memory format tag in tensor. My basic problem is, how are we to decide if operations should preserve memory tag? Originally, I had thought that only "alternative layout aware" operators would need to have these smarts. But looking at Vitaly's patch, I think some core operators are also going to need adjusting as well. For example, consider x[0]; if x was previously an NHWC tensor, then I should get out a HWC tensor after doing this. I'm fairly sure that Vitaly's patch doesn't handle this correctly, and I bet that would be very confusing to users. Perhaps the only operators that are affected are those that muck about with strides (in which case, there aren't too many of them and we can manually audit them), but it seems like a thing we ought to do. What do you think?

Wait, tensors still stay indexed in the order: 0th dim N; 1st dim C; 2nd dim H; 3rd dim W. So x[0] returns a tensor with 0th dim C; 1st dim H; 2nd dim W - regardless of whether x had channels_first or channels_last memory layout.

Otherwise memory_format just makes no sense and we would only need to permute the tensor.

My point is that the memory format tag isn't preserved. If the input tensor was tagged channels_last, the new tensor is tagged any

cc @zou3519, the layout propagation logic here reminds me a lot of named dimension propagation in the named tensor work.

I'm still catching up on this proposal. But @ezyang we could keep track of the layout propagation logic by propagating a per-dimension flag (or name) and then it would be equivalent to having named tensors with name conventions

It would be neat if we could line up the memory tag logic and the named tensor logic exactly, even if we have them as two separate implementation paths in the beginning.

Phase 1

Expand the functionality of two tensor functions, .is_contiguous and .contiguous (both the Python and C++ APIs).

Note: We had several complaints about .to(memory_format) function, and decided not to support it.

  1. .contiguous now supports an optional keyword-only argument - memory_format, which can be either torch.contiguous_format or torch.channels_last.

    • Using torch.contiguous_format will preserve existing .contiguous() behavior.

    • Calling x.contiguous(memory_format=torch.channels_last) returns a new tensor which maintains the same semantic layout (NCHW) but has a different memory allocation pattern.

      x.contiguous(memory_format=torch.channels_last) expects input tensor to be 3d, 4d or 5d; and fails otherwise.

  2. .is_contiguous now supports an optional keyword-only argument - memory_format, which can be either torch.contiguous_format or torch.channels_last.

    • x.is_contiguous(memory_format=torch.contiguous_format) preserves same functionality as x.is_contiguous() and remains unchanged.

    • x.is_contiguous(memory_format=torch.channels_last) returns True if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or the analogous 3d/5d) format.

Note: By the end of phase one, x.is_contiguous(memory_format=torch.channels_last) will calculate the state of the Tensor on every call. This functionality is going to be updated later. A usage sketch follows.
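Putting phase one together, the intended usage looks like this (a sketch using the names proposed above):

import torch

x = torch.randn(10, 3, 32, 32)                        # NCHW, default contiguous
y = x.contiguous(memory_format=torch.channels_last)   # same semantic layout, NHWC in memory

print(y.size())                                            # torch.Size([10, 3, 32, 32])
print(y.is_contiguous())                                   # False (BC-preserving)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(x.is_contiguous(memory_format=torch.contiguous_format))  # True, same as x.is_contiguous()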

Phase 2

Preserve memory format for specific operations:

  1. Unary element-wise operators preserve channels_last memory format.

    a = torch.randn(N,C,H,W)
    b = a.contiguous(memory_format=torch.channels_last)
    c = b.sin()
    c.is_contiguous(memory_format=torch.channels_last) == True
    
  2. Binary element-wise operators ( add, sub, mul, div) preserve channels_last memory format.

    a = torch.randn(N,C,H,W)
    b = a.contiguous(memory_format=torch.channels_last)
    c = b * torch.randn(H,W)
    c.is_contiguous(memory_format=torch.channels_last) == True
    
  3. Any operations over sizes, strides and dims order reset memory format.

    a = torch.randn(N,C,H,W)
    b = a.contiguous(memory_format=torch.channels_last)
    c = b.permute(0,2,3,1).permute(0,3,1,2)
    c.is_contiguous(memory_format=torch.channels_last) == False
    

Remains undecided

  1. Result of the reshape (and similar) operation, if the output is 'channels_last' eligible

    import torch
    a = torch.randn(N,C,H,W)
    b = a.contiguous(memory_format=torch.channels_last)
    c = b.reshape(N,C,-1)
    c.is_contiguous(memory_format=torch.channels_last) # ?
    

    Note: Currently memory_format not preserved

  2. Result of the NHWC + NCHW operation. Is it NHWC ?

    Note: Currently NHWC + NCHW -> NHWC and NCHW + NHWC -> NHWC

What about operations like cat/split? It will be useful for them to preserve the memory format.

@ezyang - regarding indexing, I think we should stop somewhere. Different memory layouts are not fully transparent and some ops should be allowed to disregard them. I'd argue that x[0] should be allowed to erase the tag, including x[0].unsqueeze(0).

As Raghu mentioned, cat/split should preserve the tag if possible though, as it's quite a common usage. I think the general rule of thumb should be that as long as an operation doesn't change rank or reorder axes weirdly, we should preserve the tag. If the rank changes - all bets are off.

I agree in some cases we will lose the tag. But I would disagree about x[0]. That to me seems like a very common way to go from NCHW to CHW.

After several conversations about how confusing it is to have Tensors carry (or not carry) the channels_last 'tag', we decided to take the risk of introducing a BC-breaking change and auto-promote tensors to the channels_last format.

What does it mean to the API:

Any 3d,4d,5d tensors with strides like N,1,H,[W,[D]] will automatically get channels_last memory format.

To make it work, we will take special precautions to guarantee that operators on channels_last tensors that output channels_last tensors will have at least similar performance to operators on contiguous tensors.

In the worst-case scenario:
1) Users can call .contiguous() on the output.
2) We will write the auto-promoting code in such a manner that it would be nearly trivial to change this behavior.

Side effects of such auto promotion are:

import torch
x = torch.randn(10,16,16,3).permute(0,3,1,2) 
x.is_contiguous(memory_format=torch.channels_last) == True

On the other hand, it can solve this case (after light modifications):

import torch
x = torch.randn(10,3,16,16).contiguous(memory_format=torch.channels_last)
x = x[0].unsqueeze(0)
x.is_contiguous(memory_format=torch.channels_last) == True

From Slack conversations, per @ezyang's request:

Natalia Gimelshein [2:19 PM]
So I take it there would be no concept of tag.

import torch
#batch = 10, channels = 4, spatial dimensions = 16
x = torch.randn(10,16,16,4).permute(0,3,1,2)
x.is_contiguous(memory_format=torch.channels_last) == True
y = torch.randn(10,16,16,2).permute(0,3,1,2)
x1,x2 = x.chunk(2, dim=1) #chunk along channels dimension, no longer contiguous
x1.is_contiguous(memory_format=torch.channels_last) == False #right? So, if a tensor like this comes into e.g. convolution, what am I supposed to do with it? Did it want to be NHWC? Did it want to be nchw?
z=y+x1 #y is channels_last, x1 is something, what is the z layout?

Vitaly Fedyunin [8:23 AM]
z is going to be channels_last

Vitaly Fedyunin [8:25 AM]
x1 is not channels_last in any of the proposed variants (unless we change the chunk function to not return views), so convolution will convert it to contiguous (channels_first) format and return a contiguous output as well

Vitaly Fedyunin [9:12 AM]
@ngimel thank you for the feedback. I think we can come up with a more meaningful definition of channels_last to cover most of the cases where view-like operations are involved. Will keep you in the loop.

Natalia Gimelshein [9:36 AM]
replied to a thread:
So it seems to be a problem, no? Chunking across the channels dimension is a relatively common thing, e.g. in inception-like networks. So if the chunked tensor is channels-first, the convolution output will be channels-first (which is intuitive behaviour, and most likely what the user wants); if the chunked tensor is channels-last, then the convolution output will once again be channels-first?

Natalia Gimelshein [9:39 AM]
replied to a thread:
But only due to non-commutative addition behavior and y being first argument and channels last, right? What would be the result for x1+y? Do we have layout propagation rules for binary operations somewhere?

Vitaly Fedyunin [10:44 AM]
1) Yes, it is a problem we are going to solve with an alternative proposal. I'm running some tests now and will write it down this week (in a day or two).
2) x1+y should also produce channels_last, otherwise it is confusing; and yes, we will have the layout propagation rules written down.

I think the observation I made to @VitalyFedyunin when we chatted about this in person (but I don't think I remembered to write down anywhere) is that there is a degree of freedom in convolution: when it gets an argument whose memory layout doesn't match any that it knows how to implement efficiently, which layout should it contiguify to? For BC reasons, contiguifying to channels-first is required, but we've made an arbitrary decision here - arguably you could contiguify to channels-last too. Perhaps we should have some sort of thread-local toggle which says what the defaults are?

But it seems like there are a lot of details here to thrash out, and I am not sure if it works out in the end.

So the haziness of convolution (and other layout-aware operators, for that matter - e.g. upsampling, which I've recently looked at, starts by calling .contiguous() on the input, so what is that supposed to mean?) was the primary reason for introducing the tag, IIRC.

Yeah, so I'm OK with cracking open the tag design again, but then we have to seriously solve the problem of how to propagate these tags, even when you lose layout (as would have been the case with chunking on channels). I am much more fond of making "current layout" some sort of context manager than making it data dependent.

Excerpts from ngimel's message of 2019-06-19 12:43:45 -0700:

So the haziness of convolution (and other layout-aware operators, for that matter - e.g. upsampling, which I've recently looked at, starts by calling .contiguous() on the input, so what is that supposed to mean?) was the primary reason for introducing the tag, IIRC.

BTW why do we have to create a new concept instead of just sticking to layout? I don't think that sparse representations have a well defined concept of a layout like "channels_last", so we don't need to represent a product of memory_formats * layouts (layouts refers to the current usage), but only memory_format + layouts meaning that it should be fine to use the same argument as we used to? For me it's both shorter, nicer, and will let us avoid extending signatures of factories to a thousand arguments.

The layout option was considered (check the appendix), but we found it would lead to lots of code duplication as well as disallow auto-converting tensors to a different memory_format on the fly.

After all, memory_format is a way to stride a tensor and to easily pick optimized kernels and outputs; it is a property of a strided tensor, not a completely different class.

In some sense sparse layouts are also a way to easily pick optimized kernels for arrays that are mostly zero 😄 Can you elaborate on the "disallow auto-converting tensors to a different memory_format on the fly" part please?

This might be a naive question, but why is PyTorch considering this API versus just exposing an option to use NHWC in the ops themselves, which would directly call the underlying CuDNN kernel where available?

It seems like for a common use case (mixing image ops like conv and pooling with LM architectures) this would be an easy solution. As a developer, all I want is a Conv2d(..., nhwc=True). Is there some reason why this doesn't make sense?

@rewonc we have considered a similar approach (adding an option to operators instead of deriving the kernel from the striding), and found it hard to apply for the following reasons:

  • This approach would require the kernel to restride a contiguous tensor in order to apply the NHWC kernel.
  • The next operator would have to restride the input again (back to contiguous) unless it also has the nhwc=True option.
  • To have NHWC across the network, every single operator would need an nhwc=True option.

PS. If you are concerned about cuDNN Ex functions, we are looking to expose cudnn_batch_norm_nhwc and similar operators.

Hi @VitalyFedyunin, we saw that named tensors are supported in PyTorch 1.3. Can that solve (or partially solve) the concerns about NHWC (or even blocked) format support? Is there any plan to move NHWC support forward based on named tensors?

We are moving ahead with channels-last support; I will publish a roadmap this week, here and in the Slack channels. We are not considering adding blocked formats any time soon (as it would require rewriting ALL operators).

Thanks. That’ll be good!

Tracking tasks and progress inside of https://github.com/pytorch/pytorch/issues/28619

