Many non-experts are using the following code http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow?answertab=votes#tab-top.
It would be nice to have an official batch norm layer given its importance in training DNNs.
I'm working on some parts of that.
I think something is wrong with this layer. During training everything is fine and the loss decreases nicely, but in testing I get zero accuracy.
By the way, in testing, when I use is_training=False, I get zero accuracy.
I know batch normalization behaves differently in the train and test phases, as described in "How does batch normalization behave differently at training time and test time?" on Quora. I think this implementation is unclear.
Same here, I have experienced some unexpected behavior with is_training=False. What is the correct way to change this flag? I am currently using a tf.cond because it does not take tf.placeholders by itself.
@pawni You have to use a Python boolean for is_training. It cannot be a tf.cond.
@ppwwyyxx well, I am doing tf.cond(placeholder, batch_norm(.., is_training = True), batch_norm(.., is_training = False)), or is one just supposed to do a batch_norm(.., is_training=variable) and change that outside of the graph when needed?
Oh, I thought you were doing batch_norm(.., is_training=tf.cond(placeholder)), which is incorrect.
Your current way might have problems as well. You'll need to double-check that the two batch_norm ops you created share the same scope, otherwise they won't share the underlying mean/variance statistics.
To do this, the reuse argument might help, but I'm not sure because I use my own version of the bn layer.
I am using the same scope and reuse=True. It seems to work sometimes but I am not too sure. It would be great if the layer could be added to the documentation with a short explanation of how best to handle the change from training to test.
@sguada FYI
Currently batch_norm requires a Python boolean, but we are working on adding the option of passing a Tensor.
@pawni If you don't want to worry about updating moving_mean and moving_variance, set updates_collections=None to make sure they are updated in place; otherwise you need to make sure the update_ops added to tf.GraphKeys.UPDATE_OPS are run during training.
I think TensorFlow needs two high-level methods that change the model state, something like Torch's "change model state" methods. I think it is very straightforward.
Is there a small script with a very simple NN that shows the proper way of using this "official" BN layer? I'd really appreciate it.
Sorry if this is a little repetitive, but it seems the API talks about BN with a different interface: https://www.tensorflow.org/versions/r0.9/api_docs/python/nn.html#batch_normalization
Is that not the official way to use BN? I am confused about how to use it: the SO answer seems to be outdated, and then there is a layer at a different link in the API. How exactly does one do this? I am unclear whether to go to SO or ask here.
Sorry for the spamming, but what is wrong with just using something like this:
def standard_batch_norm(l, x, n_out, phase_train, scope='BN'):
    """
    Batch normalization on feedforward maps.
    Args:
        l: string, suffix appended to the scope and variable names
        x: Vector
        n_out: integer, depth of input maps
        phase_train: boolean tf.Variable, true indicates training phase
        scope: string, variable scope
    Return:
        normed: batch-normalized maps
    """
    with tf.variable_scope(scope+l):
        #beta = tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float64 ), name='beta', trainable=True, dtype=tf.float64 )
        #gamma = tf.Variable(tf.constant(1.0, shape=[n_out],dtype=tf.float64 ), name='gamma', trainable=True, dtype=tf.float64 )
        init_beta = tf.constant(0.0, shape=[n_out], dtype=tf.float64)
        init_gamma = tf.constant(1.0, shape=[n_out],dtype=tf.float64)
        beta = tf.get_variable(name='beta'+l, dtype=tf.float64, initializer=init_beta, regularizer=None, trainable=True)
        gamma = tf.get_variable(name='gamma'+l, dtype=tf.float64, initializer=init_gamma, regularizer=None, trainable=True)
        batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)
        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)
        mean, var = tf.cond(phase_train, mean_var_with_update, lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed
Then it's simple to tell TensorFlow which one to use with a feed dictionary, as in:
feed_dict = {x: Xminibatch, y_: Yminibatch, phase_train: True}
sess.run(fetches=[merged,train_step], feed_dict=feed_dict)
Since it's unclear if the implementation will change, I wanted to give a suggestion (note it's easy to extend to convolutions and so on; I just didn't paste that code).
@pawni @ppwwyyxx did you guys decide whether you had to set reuse to True to solve the scoping issue?
@brando90 currently I am doing something like:
def BatchNorm(inputT, is_training=True, scope=None):
    return tf.cond(isTraining,
                   lambda: batch_norm(inputT, is_training=True,
                                      center=False, updates_collections=None, scope=scope),
                   lambda: batch_norm(inputT, is_training=False,
                                      updates_collections=None, center=False, scope=scope, reuse = True))
However, I think that #3265 would basically want to implement it like this. A reference could be the dropout implementation here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L433-L435
When updates_collections=None, the updates happen in place and it is easier to use a tf.cond() to allow is_training to be a Tensor. It is a bit more complicated when the updates are delayed and the update_ops are run later.
I will try to get the first part in soon.
@brando90 @pawni his code works well, but it has to be changed as below:
def BatchNorm(inputT, is_training=True, scope=None):
    # Note: is_training is tf.placeholder(tf.bool) type
    return tf.cond(is_training,
                   lambda: batch_norm(inputT, is_training=True,
                                      center=False, updates_collections=None, scope=scope),
                   lambda: batch_norm(inputT, is_training=False,
                                      updates_collections=None, center=False, scope=scope, reuse = True))
And when running at training or test time:
# when training
sess.run([opt, loss], feed_dict={x: bx, y: by, is_training: True})
# when test
sess.run([opt, loss], feed_dict={x: bx, y: by, is_training: False})
This code works, but as #3265 says, it would be great if tf.contrib.layers.batch_norm accepted the is_training variable as a tf.placeholder.
@nmhkahn @pawni thanks for the code snippets. They were very useful in adding batch normalization to my convolutional network. Training seems to work very well. Testing does not. In some versions of the code, training accuracies are much higher than testing accuracies, which probably means I am not sharing batch normalization parameters. In other versions of the code I get "ValueError: Variable conv1/beta already exists, disallowed. Did you mean to set reuse=True in VarScope?", which seems to indicate that I am trying to relearn the parameter when I was trying to reuse it.
Can someone provide an example of how to call the "def BatchNorm" function during training and testing so that variable sharing happens correctly?
Thanks for any help.
UPDATE July 25, 2016:
@nmhkahn @pawni thanks for your comments. After taking a closer look at the code in contrib I realized what my problem was. During training and testing we are either updating or reusing four variables (beta, gamma, moving_mean and moving_variance). To make those unique I had to set a scope per layer. I did it like this:
conv1 = tf.nn.relu(batch_norm_layer(conv2d_stride2_valid(data, W_conv1) + b_conv1, train_phase, scope="conv1"))
where batch_norm_layer is similar to the examples from @nmhkahn @pawni, conv2d_stride2_valid is just a def to define a convolutional layer, and W_conv1 and b_conv1 are variables holding the weights and biases. I could probably remove the bias term because we are using batch normalization.
The net is working well now. After plotting accuracies in training and test mode, I noticed that the testing accuracies start climbing after the training accuracies. In retrospect it makes sense, since we are collecting dataset statistics for testing. But it appeared as if I was doing something wrong during my initial tests. Thanks for your comments and for making batch normalization available to the community.
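To make the two scope errors quoted in this thread concrete ("already exists" and "does not exist"), here is a toy, pure-Python model of scope-based variable sharing. It is only a sketch and not TensorFlow code: ToyVariableStore and its get_variable method are hypothetical names that merely mimic the behaviour described above, under the assumption that a variable is identified by "scope/name".

```python
class ToyVariableStore:
    """Hypothetical stand-in for scope-based variable sharing (not a TensorFlow API)."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, scope, name, initial_value, reuse=False):
        key = scope + "/" + name
        if reuse:
            # Reusing a variable that was never created is an error.
            if key not in self._vars:
                raise ValueError("Variable %s does not exist" % key)
            return self._vars[key]
        # Creating the same variable twice without reuse is an error.
        if key in self._vars:
            raise ValueError("Variable %s already exists" % key)
        self._vars[key] = [initial_value]  # one-element list acts as a mutable cell
        return self._vars[key]


store = ToyVariableStore()
# The training branch creates the variable; the inference branch reuses the same cell.
beta_train = store.get_variable("conv1", "beta", 0.0)
beta_test = store.get_variable("conv1", "beta", 0.0, reuse=True)
beta_train[0] = 0.5  # an update made during training is visible at test time
# A second layer needs its own scope, otherwise creation collides with conv1.
beta_conv2 = store.get_variable("conv2", "beta", 0.0)
```

In this toy model, giving each layer its own scope is what keeps the four batch norm variables of one layer from colliding with those of another, while reuse within a scope is what lets the training and test branches see the same values.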
@nmhkahn how is it different from pawni's suggestion?
@brando90 I had a small error in my version, which was fixed by nmhkahn (changing isTraining to is_training).
@diegoAtAlpine I found the same problems, not sure why this is the case though. However, the ValueError should be resolved by the code snippet. I'm not sure what you want to see regarding how to call it, as nmhkahn's example seems to do the job?
@nmhkahn @pawni when you do:
sess.run([opt, loss], feed_dict={x: bx, y: by, is_training: True})
doesn't that mean that you're using is_training as a placeholder? People have commented that they want is_training to be a placeholder, but that's what I had for my version of it:
def batch_norm_layer(x,train_phase,scope_bn):
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          is_training=True,
                          reuse=None, # is this right?
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              is_training=False,
                              reuse=True, # is this right?
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
is that not correct?
I have already extended tf.contrib.layers.batch_norm to allow passing a Tensor or a Placeholder for is_training. It will be merged in TF contrib soon.
Now available in
https://github.com/tensorflow/tensorflow/commit/9da5fc8e6425cabd61fc36f0dcc1823a093d5c1d#diff-94bbcef0ec8a5cdef55f705e99c2b2ed
Is it just me, or does adding this BN layer noticeably slow down training of a single epoch?
@brando90 It slows down training for me as well but I think that this is expected as it needs to calculate some statistics. And your version looks good to me.
BatchNorm is currently very slow (because of all the statistics computed), but they are working on adding a cudnn batchnorm op as said here.
@nmhkahn quick question. When you wrote (for testing):
sess.run([opt, loss], feed_dict={x: bx, y: by, is_training: False})
in theory, can bx and by be any data set? i.e. it can still be the training set even though we are not training? (i.e. just to track the train error)
@brando90 you're right.
I am also confused regarding is_training and reuse flags. I have created a program following the CIFAR example, where my code is structured as in CIFAR:
And I am running it in a multi-gpu fashion (for training).
So I have one script for training (similar to cifar10_multigpu.py) and one for testing (similar to cifar10_eval.py).
So
for ii in xrange(2): # Num of GPU
    with tf.device('/gpu:%d' % ii):
        with tf.name_scope('device_%d' % ii) as scope:
            data_batch, label_batch = factory.GetShuffleBatch(batch_size)
            unnormalized_logits = factory.MyModel(dataBatch=data_batch, numClasses=numClasses,
                                                  isTraining=True)
            # More stuff happening
            tf.get_variable_scope().reuse_variables()
The inference happens with the function MyModel. (below is an example of the function, in reality i use more layers and neurons).
def MyModel(data_batch, num_classes, feature_dim):
    # Hidden Layer 1
    with tf.variable_scope('hidden1') as scope:
        weights = variable_on_cpu('weights',[feature_dim, 256], tf.truncated_normal_initializer(stddev=0.04))
        biases = variable_on_cpu('biases', [256], tf.constant_initializer(0.001))
        hidden1 = tf.nn.relu(tf.matmul(data_batch, weights) + biases, name=scope.name)
    # Hidden Layer 2
    with tf.variable_scope('hidden2') as scope:
        weights = variable_on_cpu('weights',[256, 256], tf.truncated_normal_initializer(stddev=0.04))
        biases = variable_on_cpu('biases', [256], tf.constant_initializer(0.001))
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases, name=scope.name)
    # output, unnormalized softmax
    with tf.variable_scope('softmax_unnorm') as scope:
        # 1.0 (not 1) avoids integer division yielding stddev=0 under Python 2
        weights = variable_on_cpu('weights', [256, num_classes], tf.truncated_normal_initializer(stddev=1.0/num_classes))
        biases = variable_on_cpu('biases', [num_classes], tf.constant_initializer(0.0))
        softmax_un = tf.add(tf.matmul(hidden2, weights), biases, name=scope.name)
    return softmax_un
I want to perform batch normalization. So when I did:
def MyModel(data_batch, num_classes, feature_dim, isTraining):
    with tf.variable_scope('bnormalization') as scope:
        norm_data_batch = tcl.batch_norm(inputs=data_batch, epsilon=0.0001, is_training=isTraining,
                                         reuse=True, scope=scope)
    # Hidden Layer 1
    with tf.variable_scope('hidden1') as scope:
        weights = variable_on_cpu('weights',[feature_dim, 256], tf.truncated_normal_initializer(stddev=0.04))
        biases = variable_on_cpu('biases', [256], tf.constant_initializer(0.001))
        hidden1 = tf.nn.relu(tf.matmul(norm_data_batch, weights) + biases, name=scope.name)
I got the following error in the training phase:
Variable bnormalization/beta does not exist, disallowed. Did you mean to set reuse=None in VarScope?
From what I've been reading in this thread, in the training phase I should be using reuse=None. Have I got this part correct? If this is true, then since I am using two GPUs, should I use reuse=None on the first GPU and reuse=True on the second? Or, since I am calling tf.get_variable_scope().reuse_variables(), does it take care of itself?
Finally, in the testing phase, should I have is_training=False and reuse=True?
Any help is greatly appreciated.
Now tf.contrib.layers.batch_norm accepts a Tensor, Variable or Placeholder as is_training
https://github.com/tensorflow/tensorflow/commit/9da5fc8e6425cabd61fc36f0dcc1823a093d5c1d#diff-94bbcef0ec8a5cdef55f705e99c2b2ed
Is it normal that batch normalization makes my experiments worse? I tried it on a 2-layer NN based on the MNIST beginner tutorial and I consistently get worse results when BN is present: with BN (one run with scale and center trained and the other not) accuracy is 0.8423 and 0.8221, and without BN accuracy is 0.9477.
My script is here: https://github.com/brando90/tensor_flow_experiments/blob/master/tf_tutorials/beginner_tutorial_MNIST_BN.py
Has anyone experienced these problems, or is BN just like this and I need to do something else to make it work?
The latest version of tf.contrib.layers.batch_norm now accepts a placeholder for is_training, so there is no need to do it yourself.
But what is important is that you either pass updates_collections=None, so that moving_mean and moving_variance are updated in place, or gather the update_ops and make sure they are run.
I would like to encourage you to use tf.contrib.layers or tf.contrib.slim to build your model.
slim = tf.contrib.slim
def build_NN_two_hidden_layers(x, is_training):
    batch_norm_params = {'is_training': is_training, 'decay': 0.9, 'updates_collections': None}
    with slim.arg_scope([slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.constant_initializer(0.1),
                        normalizer_fn=slim.batch_norm,
                        normalizer_params=batch_norm_params):
        net = slim.fully_connected(x, 50, scope='A1')
        net = slim.fully_connected(net, 49, scope='A2')
        y = slim.fully_connected(net, 10, activation_fn=tf.nn.softmax, normalizer_fn=None, scope='A3')
    return y
@sguada I changed my old one, where I manually told it to train or not (based on a tf.cond), and now it seems the accuracy is up to ~95% again. Why did I need to change updates_collections to None? Do you mind explaining to me why that gave such a big accuracy difference? It seems like a non-trivial change (should None be its default value then, if it matters so much?). Thanks! :)
Also, I noticed you said it accepts a placeholder and I didn't need to do it manually. However, when I passed a placeholder for is_training it said
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use the logical TensorFlow ops to test the value of a tensor.
and pointed to the batch_norm code. Maybe it would be nice to show how this placeholder should be used, because it seems I don't understand how it's supposed to be used. Thanks! :)
@brando90
The relevant part of the code is here L227-256.
As you will notice, there is a with ops.control_dependencies statement that forces the updates. I believe that for the code to be usable "right out of the box" the default should be None.
As for my comment above 1122, I figured out that tf.get_variable_scope().reuse_variables() takes care of the issue, so in the training phase the reuse argument of batch_norm should be None. It has to do with the variable_op_scope statement (read its documentation in TensorFlow).
Use of batch_norm with tf.placeholder
x = tf.placeholder(tf.float32, [None, 784])
is_training = tf.placeholder(tf.bool, [], name='is_training')
y = build_NN_two_hidden_layers(x, is_training)
# For training
sess.run(y, {is_training: True, x: train_data})
# For eval
sess.run(y, {is_training: False, x: eval_data})
The problem before was that you were not updating the moving_mean and moving_variance after each step; when updates_collections is None, it forces the updates as part of the computation.
However, when a network has many batch_norm layers, it is more efficient to collect all the update ops and run them together, so each layer doesn't need to wait for the update to finish.
y = build_model_with_batch_norm(x, is_training)
update_ops = tf.group(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
sess.run([y, update_ops])
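The difference between in-place and collected updates can be sketched without TensorFlow. The toy below is only an illustration under stated assumptions: ToyMovingStat, update_op and forward are hypothetical names, and a plain Python list plays the role of the update collection. It shows the trap discussed above: if the queued updates are never run, the moving statistics stay at their initial values.

```python
class ToyMovingStat:
    """Zero-initialized exponential moving average (illustration only)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = 0.0

    def update_op(self, observation):
        """Return a deferred update; nothing changes until it is called."""
        def run():
            self.value = self.decay * self.value + (1.0 - self.decay) * observation
        return run


update_ops = []  # plays the role of an update collection
layer_stats = [ToyMovingStat(), ToyMovingStat()]


def forward(batch_means):
    """Queue one deferred update per layer, as a delayed-update layer would."""
    for stat, mean in zip(layer_stats, batch_means):
        update_ops.append(stat.update_op(mean))


forward([2.0, 4.0])
values_before = [s.value for s in layer_stats]  # updates queued but not run yet
for op in update_ops:  # the equivalent of running the collected update ops
    op()
values_after = [s.value for s in layer_stats]
```

Here values_before is still all zeros, and only after the loop over update_ops do the statistics move towards the observed batch means.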
Has there been any progress made with speeding up batch norm?
I was trying to use batch norm with a 2-layer densely connected NN (with ReLU units) on the flattened MNIST data set for the task of auto-encoding, and I keep getting a NaN error. Does anyone know why this might be? Is this even possible with BN? It seems fishy; it could be my learning setup, learning rate, etc., but I'd assume it shouldn't be, because BN should be fairly robust to this.
@sguada I do not understand the right way of using batch_norm, especially concerning the flag updates_collections. If I understood correctly, if the flag is None the network is not efficient, so I should leave updates_collections=tf.GraphKeys.UPDATE_OPS and then collect all the batch_norm updates and run them together.
You collect the batch_norm updates by doing: update_ops = tf.group(tf.get_collection(tf.GraphKeys.UPDATE_OPS)).
I have many different models that use different batch_norm layers, so this wouldn't work, right?
#model 1
y1 = build_model_with_batch_norm(x, is_training)
update_ops1 = tf.group(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
sess.run([y1, update_ops1])
#model 2
y2 = build_model_with_batch_norm(x, is_training)
update_ops2 = tf.group(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
sess.run([y2, update_ops2])
Could you explain this part with a bit more details? Thank you very much.
Just put them in separate collection keys:
# While building your 1st model...
tf.contrib.layers.batch_norm(..., updates_collections="updates-model1")
# same for 2nd model with key "updates-model2"
#model 1
y1 = build_model_with_batch_norm(x, is_training)
update_ops1 = tf.group(tf.get_collection("updates-model1"))
sess.run([y1, update_ops1])
#model 2
y2 = build_model_with_batch_norm(x, is_training)
update_ops2 = tf.group(tf.get_collection("updates-model2"))
sess.run([y2, update_ops2])
Nevertheless, the documentation seems to be outdated. It says to do the following:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(update_ops)
    total_loss = control_flow_ops.with_dependencies([updates], total_loss)
But:
EDIT:
The documentation should be updated to something like this:
from tensorflow.python import control_flow_ops
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.tuple(update_ops)
    total_loss = control_flow_ops.with_dependencies(updates, total_loss)
EDIT 2:
After doing some runs on my network, I have to say that I cannot see any performance difference between using _updates_collections=None_ and manually fetching _tf.GraphKeys.UPDATE_OPS_ during graph construction, even with heavy use of batch normalization (in total, my _tf.get_collection(tf.GraphKeys.UPDATE_OPS)_ returns 140 update ops, all of them BN ops).
Edit: It is hard to say if my results are correct, but the whole network does indeed seem to be 1.5x faster. As far as I know, BN statistics are so far calculated on the CPU, not the GPU.
Can any of you see any performance benefits as well? Please share your results :)
Coming back to the performance issue, does the current batch norm layer benefit at all from GPU usage? Has anyone experienced benefits from GPUs with this batch norm implementation?
Sorry for the spam, but the documentation doesn't really explain how to use this BN with convolution (maybe should be provided somewhere?). In short how does it figure out that it should apply and learn the same parameters per feature (rather than per activation)?
(Is there at least a code snippet to do this?)
The slim batch_norm wrapper computes one set of statistics per entry of the last dimension of your input tensor, i.e. it reduces over all the other dimensions. So if it's a 2D input tensor coming from a fully connected layer, it normalizes over batch, and thus performs per-activation normalization. If it's a 4D tensor coming from a convolution, it will normalize over the first three dimensions (batch, height, width), and thus perform per-feature normalization. @sguada maybe worth being a bit more descriptive about this.
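As a rough sketch of the rule described above (one set of statistics per entry of the last dimension, reducing over everything else), assuming a channels-last layout such as [batch, height, width, channels]: the helper names batch_norm_reduction_axes and per_feature_mean are hypothetical and written in plain Python, not taken from the library.

```python
def batch_norm_reduction_axes(input_rank):
    """Axes averaged over when every axis except the last one is reduced."""
    if input_rank < 2:
        raise ValueError("expected rank >= 2, got %d" % input_rank)
    return list(range(input_rank - 1))


def per_feature_mean(rows):
    """Mean of each last-axis feature over a list of feature vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


# 2D input [batch, features]: reduce over axis 0 only (per-activation statistics).
axes_fc = batch_norm_reduction_axes(2)
# 4D input [batch, height, width, channels]: reduce over axes 0, 1 and 2.
axes_conv = batch_norm_reduction_axes(4)

# A 4D activation flattened to (batch*height*width) rows of 2 channels each:
# every channel gets a single mean shared by all spatial positions.
pixels = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
channel_means = per_feature_mean(pixels)
```

With this rule no extra configuration is needed for convolutions: the number of learned beta/gamma entries equals the size of the last dimension in both cases.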
@nmhkahn Regarding your code snippet, may I ask why reuse is set to None when is_training=True? Wouldn't that cause the scaling parameter gamma and the offset parameter beta to be re-initialized in every training step? I thought that in the original paper, beta and gamma are "learned along with the original model parameters". To do that, shouldn't they be initialized only once and then reused in all training steps?
tf.cond(is_training,
        lambda: batch_norm(inputT, is_training=True, updates_collections=None, scope=scope),
        lambda: batch_norm(inputT, is_training=False, updates_collections=None, scope=scope, reuse = True))
I greatly appreciate the work that the TF team has put in here to make batch_norm available and effective. From my searching, this thread is the best resource for how to use it. There are many different problems and ideas flying around here, and it's difficult to figure out the consensus advice for the simplest standard case of how to use the batch_norm layer. I think there'd be a lot of value in expanding the documentation to specify the exact recommended usage.
My best attempt to figure that out brought me to the following code:
is_training_ph = tf.placeholder(tf.bool)
...
with tf.variable_scope('bn_test_layer') as vs:
    layer_output = tf.cond(is_training_ph,
        lambda: tf.contrib.layers.batch_norm(layer_input, is_training=True, center=True, scale=True, activation_fn=tf.nn.relu, updates_collections=None, scope=vs),
        lambda: tf.contrib.layers.batch_norm(layer_input, is_training=False, center=True, scale=True, activation_fn=tf.nn.relu, updates_collections=None, scope=vs, reuse=True))
Then I set is_training_ph to True for training and False for testing. This doesn't work for me. The model trains fine, but the test performance is terrible. In contrast, if I maintain is_training_ph=True for test time, it works great. Thus, I'm guessing I still have a scope issue so that it's not finding the proper existing variables.
@davek44 I'm using the same code framework that you are using and I observed the same thing: when I use is_training=True during the training phase and is_training=False for the validation and/or testing phase, the model trains well, as the paper described (the model converges faster and I was able to use a larger learning rate), but the testing performance is terrible. If I keep is_training=True all the time, the model trains the same as without the batch norm layer. I haven't figured out what I did wrong; I'm planning to use TensorBoard to monitor the parameters. Would you please post an update if you diagnose the cause of this behavior?
tf.contrib.layers.batch_norm can take a tensor as is_training, so there is no need to do anything special.
is_training_ph = tf.placeholder(tf.bool)
outputs = tf.contrib.layers.batch_norm(layer_input, is_training=is_training_ph, center=True, scale=True, activation_fn=tf.nn.relu, updates_collections=None, scope='batch_norm')
I see the same poor test performance with that code.
Without more details it is impossible to know; my guess is that you only trained for a few iterations, so the moving_mean and moving_variance haven't converged yet.
You can change the batch_size during test to see how the performance degrades as you make your batch smaller.
I had exactly the same problem either with tf.slim batchnorm or with tf.cond and input is_training as a placeholder.
In the former case, when investigating the trained model, I found out that the moving mean and moving variance consist of all zeros.
In the latter case, the moving mean and variance look more reasonable (with different values), but if I use is_training=False in test time, the performance is also really bad. Using is_training=True, it works better but I think it only uses the moving mean and variance inside the test batch.
@nmduc @davek44 I wrote some code to track the moving mean and moving variance computed in tf.contrib.layers.batch_norm during training and testing. I found out that the value of decay matters a lot (they use exponential decay to compute moving average and moving variance): with a decay setting closer to 1.0 (i.e. decay=.999), the moving mean drops to a value closer to 0. I did 2 test runs with the exact same code but different decay settings in tf.contrib.layers.batch_norm, and my validation/test accuracies seemed more reasonable.
The test run results with decay=0.9
The test run results with decay=0.999
(decay=0.999 is the default setting in tf.contrib.layers.batch_norm)
(It also seems like a larger decay value would require the model to train longer before the validation accuracy changes.)
Yup that fixed it. Thanks for sharing your analysis @zhongyuk!
I encourage the developers to consider making decay=0.9 the default. Even 0.99 doesn't work well for me. That's the default value in Torch's implementation, too; see the momentum parameter in https://github.com/torch/nn/blob/master/BatchNormalization.lua
@zhongyuk Thanks a lot for sharing. It works for me now.
This seems important. @sguada we should consider the right course of action here before 1.0. In the short term, can one of the interested parties send me a PR documenting the fact that decay might have to be significantly lowered when experiencing poor eval performance? I am pretty sure I've never had to tweak that parameter, but it might be a side effect of the distributed setting.
We could change the default to 0.9 or document better its impact in smaller datasets or few updates.
@vincentvanhoucke in our distributed setting we usually do millions of updates so it is ok, however in other cases like the one here which does only a few hundreds of updates it makes a big difference:
For example using decay=0.999 has a 0.36 bias after 1000 updates, but that bias goes down to 0.000045 after 10000 updates and to 0.0 after 50000 updates.
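The numbers above are consistent with a simple model, sketched here under two assumptions that are mine rather than stated in the thread: the moving average starts at zero, and the batch statistic is constant. In that case the fraction of the true value still missing after n updates is decay**n. The function names ema_zero_init_bias and simulate_ema are hypothetical.

```python
def ema_zero_init_bias(decay, num_updates):
    """Fraction of the true value still missing from a zero-initialized EMA."""
    return decay ** num_updates


def simulate_ema(decay, num_updates, true_value=1.0):
    """Run ema = decay * ema + (1 - decay) * x on a constant input x."""
    ema = 0.0
    for _ in range(num_updates):
        ema = decay * ema + (1.0 - decay) * true_value
    return ema


# Closed form versus step-by-step simulation for the update counts quoted above.
for n in (1000, 10000, 50000):
    closed_form = ema_zero_init_bias(0.999, n)
    simulated = 1.0 - simulate_ema(0.999, n)
    print(n, closed_form, simulated)
```

Real batch statistics fluctuate from step to step, so this only describes the systematic part of the error, not the noise around it.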
Just wanted to note that I also have the problem of poor test performance, specifically using small batch sizes (anything smaller than 10 instead of the 200 I used for training diminishes test accuracy). I've used a tf.placeholder to switch between testing/training mode.
It's great that this batch normalization layer works for better training convergence, but if you can't apply the model in production, there isn't much of a point in using it. Can anyone confirm good test performance with small or single data samples using this batch norm layer?
I can confirm that test performance is good when using is_training=False with small batches and even with batch_size=1, since it does not use statistics from the batch but the statistics learnt during training. You just need to make sure that the statistics have converged; with the default decay=0.999 that implies at least 50k updates.
To follow up on the TF developer's confirmation, I tracked the convergence of the statistics with two different decay settings (and training batch_size=1). With decay=0.99, the statistics converge (bias<0.001) after 550~600 steps of learning/updates. With decay=0.9, the statistics converge (bias<0.001) within 100 steps of learning/updates.
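For choosing a decay value to match a planned number of training steps, the relation can be inverted. This is a minimal sketch under an idealization that is an assumption here: a zero-initialized moving average fed a constant value, so the remaining relative error after n updates is decay**n. The helper name updates_needed is hypothetical, and step counts measured on real statistics, like the ones reported above, need not match this estimate exactly.

```python
import math


def updates_needed(decay, tolerance):
    """Smallest n with decay ** n <= tolerance (zero-initialized EMA, constant input)."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be in (0, 1)")
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must be in (0, 1)")
    return int(math.ceil(math.log(tolerance) / math.log(decay)))


# Compare how the required number of updates grows as decay approaches 1.
for decay in (0.9, 0.99, 0.999):
    print(decay, updates_needed(decay, 0.001))
```

The required number of updates grows roughly tenfold each time another 9 is appended to the decay, which is why a short training run pairs better with a smaller decay.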
@sguada thanks, does that also mean the output is actually independent of the batch size? because I'm noticing very slight changes with big impact on my accuracy (maybe my definition of performance is just more easily affected by this slight change). To be precise, all values in my 128 dimensional output tensor increase such that the total vector length scales almost linearly with the batch size. Per value this isn't that much of a difference, but has a big impact when computing vector distances in latent spaces.
@zhongyuk thanks, I've run about 5k updates with decay=0.9, so it should have converged, and testing performance using large batch sizes is fine. But even if it hadn't, would that result in a difference between training and testing? I'd be seeing bad performance during both training and testing if it hadn't converged, right?
I will investigate some more and see if I can reproduce the issue on another task. Thanks for the quick feed back so far!
@dominikandreas If your poor testing performance is caused by statistics not converging, you'd see reasonably good training performance but bad testing performance. Because during training, the batch normalization is done using the training batch statistics only. However, during testing time, it's using the moving average statistics of all the training batches to normalize the input tensor.
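The train/test asymmetry described above can be reproduced in a few lines of plain Python. This is only a toy under stated assumptions: ToyBatchNorm is a hypothetical single-feature layer without gamma/beta, its moving mean starts at 0 and its moving variance at 1, and the moving statistics are updated in place on every training call.

```python
import math


class ToyBatchNorm:
    """Single-feature batch norm without gamma/beta (illustration only)."""

    def __init__(self, decay=0.9, epsilon=1e-3):
        self.decay = decay
        self.epsilon = epsilon
        self.moving_mean = 0.0      # zero-initialized
        self.moving_variance = 1.0  # one-initialized

    def __call__(self, batch, is_training):
        if is_training:
            # Training: normalize with the statistics of the current batch.
            mean = sum(batch) / len(batch)
            variance = sum((v - mean) ** 2 for v in batch) / len(batch)
            # In-place update of the moving statistics.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_variance = (self.decay * self.moving_variance
                                    + (1 - self.decay) * variance)
        else:
            # Inference: normalize with the accumulated moving statistics.
            mean, variance = self.moving_mean, self.moving_variance
        scale = 1.0 / math.sqrt(variance + self.epsilon)
        return [(v - mean) * scale for v in batch]


batch = [8.0, 10.0, 12.0]  # true mean 10, far from the initial moving mean 0
bn = ToyBatchNorm(decay=0.9)
train_out = bn(batch, is_training=True)        # centred: batch statistics are used
early_test_out = bn(batch, is_training=False)  # off-centre: only one update so far
for _ in range(200):
    bn(batch, is_training=True)
late_test_out = bn(batch, is_training=False)   # matches train_out once converged
```

The training output is well behaved from the first step, while the inference output only becomes comparable after the moving statistics have caught up, which matches the "good training, bad testing" symptom reported in this thread.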
I found an error in my code; batch normalization is working fine now :-) Thanks for your support.
Hi @zhongyuk , how did you keep track of the moving mean and variance?
Thanks!
@rogertrullo Generally I set up TensorBoard to track the moving mean and variance. Other than that, I also tried fetching the statistics through tf.get_variable("moving_mean") within the scope during training and inference to monitor the bias.
Hi,
I have the same problem as others described: I get good training results, but validation/testing is bad after using batch_norm.
I use the function like this:
conv_normed1 = tf.contrib.layers.batch_norm(conv1 + block1_layer3_1_biases, updates_collections=None, scale=True, decay=batch_norm_decay, center=True, is_training=is_training )
The decay value is 0.9.
Do I need to set the reuse flag?
I will be glad for any help.
I have been using batch_norm as described in this thread (with a tf.bool for training; and ops.GraphKeys.UPDATE_OPS) and everything works.
When saving and restoring using:
saver = tf.train.Saver()
it works,
but when saving using:
saver = tf.train.Saver(tf.trainable_variables() + [global_step])
so that I can save storage space (by not saving the gradients etc)
on restore there is an error:
"uninitialized value unpool4/convc/bn/moving_mean"
Obviously this is because moving_mean (and I suppose moving_variance) hasn't been saved for any of the layers. As I have lots of them (nested in many layers), what is the most efficient way of adding them to the list of values to be saved? Also, given that these are trainable variables, why are they not added to the trainable_variables collection?
@mshunshin moving mean and variance are not trainable variables: there are no gradients coming to them, they are just accumulating statistics across minibatches of examples.
To save/restore them, you can use tf.global_variables()
For me, things started to work when I used this wrapper:
def batch_norm_wrapper(x, phase, decay, scope, reuse):
    with tf.variable_scope(scope, reuse=reuse):
        normed = tf.contrib.layers.batch_norm(x, center=True, scale=True, decay=decay, is_training=phase, scope='bn',updates_collections=None, reuse=reuse)
    return normed
The whole use of scopes and reuse is not clear in this thread, in my opinion.
Many thanks. With tf.global_variables() the save files are much larger as I think it includes the gradients; in the end I used:
saver = tf.train.Saver([x for x in tf.global_variables() if 'Adam' not in x.name])
and because the session manager init doesn't initialise them properly:
sess.run(tf.variables_initializer([x for x in tf.global_variables() if 'Adam' in x.name]))
(Using tf.train.AdamOptimizer)
You can also use tf.model_variables() which contains the variables of the model, i.e. moving_mean
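As a rough illustration of the name-based filtering discussed above, here is a plain-Python sketch that works on variable names only, so it runs without TensorFlow. The function name and the names in the list are hypothetical; in a real graph the names would come from something like [v.name for v in tf.global_variables()].

```python
def names_to_save(all_names, optimizer_tags=("Adam",)):
    """Keep every variable name that does not look like an optimizer slot."""
    return [name for name in all_names
            if not any(tag in name for tag in optimizer_tags)]


# Hypothetical variable names, for illustration only.
example_names = [
    "conv1/weights:0",
    "conv1/bn/beta:0",
    "conv1/bn/moving_mean:0",      # not trainable, but needed at test time
    "conv1/bn/moving_variance:0",  # not trainable, but needed at test time
    "conv1/weights/Adam:0",        # optimizer slot, safe to drop from the checkpoint
    "conv1/weights/Adam_1:0",      # optimizer slot, safe to drop from the checkpoint
    "global_step:0",
]

kept = names_to_save(example_names)
print(kept)
```

The point of the sketch is only that the moving statistics survive the filter while the optimizer slots do not; a filter on tf.trainable_variables() alone would drop the moving statistics as well.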
@sguada Sorry to trouble you, but would it be possible to add an example to readme.md showing how to use slim.batch_norm combined with slim.conv2d/slim.fully_connected?
I'm using slim.batch_norm, but I get good training performance and poor validation/test performance. I think it must be due to improper use of reuse or scope or some other parameters. Though there are many issues on batch normalization, it's hard to find a complete code snippet showing how to use it, especially how to pass different parameters in different phases.
Say, in my mnist_bn code, I controlled dependencies using tf.GraphKeys.UPDATE_OPS and set up is_training as a placeholder. But validation performance is still poor if I feed {is_training: False}.
I would greatly appreciate it if there's an official and complete (which means training, validating, testing are all included) batch normalization example.
Thank you in advance!
Hi,
You need to set a different scope each time you use batch norm, and pass it the reuse argument according to the training/test phase (True when testing, False when training). That works for me.
@ishaybee Thanks for your help. I've found my problem = = It's due to the cold start of moving_mean/moving_variance.
Since I hadn't trained for enough steps, the estimated moving mean/variance was not yet stable. The result: the model performs pretty well on training mini-batches (at the beginning the loss goes down quickly), but validation performance is erratic (because the estimated population mean/variance are not stable enough).
When I trained the model longer, validation accuracy improved, too.
Another important thing: be sure to use slim.learning.create_train_op to create the train op. Do not use the native tf.train.GradientDescentOptimizer(0.1).minimize(loss).
So the answer is, I'm using batch normalization correctly, but I haven't fully understood its dynamics during training.
================
What's more:
@soloice, notice how, in the comment above, the following parameter is passed in to the layer for the batch_norm call:
batch_norm_params = {'is_training': is_training, 'decay': 0.9, 'updates_collections': None}
Without updates_collections set to None (so that mean updates are done in place inside BatchNorm), I wouldn't expect the surrounding layer (e.g. conv2d) to somehow execute the tf.GraphKeys.UPDATE_OPS that the BatchNorm layer needs to update its running mean and therefore be able to run on test data later.
Or you may try to run UPDATE_OPS yourself explicitly, like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    cross_entropy = control_flow_ops.with_dependencies([updates], cross_entropy)
Update - I found that I quoted exactly your code and you do use UPDATE_OPS.
As for the "cold start": as you can see above in the discussion, decreasing the BatchNorm running-average decay (an input param) from the default 0.999 to something like 0.95 can speed up start-up.
@pavelbulanov It's very kind of you to help me with this! I'll try a smaller value of decay to see how this helps.
================
Update: using a small decay (say, 0.9 or 0.95) does help a lot. Validation loss goes down very quickly when I set decay to 0.9. However, the drawback of a small decay is that its effective range is small: the result is dominated by a few recent samples, so it's not a good estimate of the population mean/variance. One needs to balance a quick start (small decay) against a longer effective range (large decay).
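To make the trade-off concrete, here is a minimal plain-Python sketch of the moving-average update new = decay * old + (1 - decay) * batch_value, applied to a constant batch statistic. The function name, the target value 5.0 and the zero initial value are assumptions chosen for illustration; it is not the TensorFlow implementation.

```python
def ema_trace(target, decay, steps, init=0.0):
    """Return the moving-average value after each of `steps` updates toward a constant `target`."""
    value = init
    trace = []
    for _ in range(steps):
        # Same form of update the thread describes for moving_mean / moving_variance.
        value = decay * value + (1.0 - decay) * target
        trace.append(value)
    return trace


# After 100 steps, compare how far each moving average still is from the true value.
for decay in (0.9, 0.999):
    final = ema_trace(target=5.0, decay=decay, steps=100)[-1]
    print("decay=%s -> estimate after 100 steps: %.4f (true value 5.0)" % (decay, final))
```

After n updates the estimate equals target * (1 - decay**n), so with decay=0.999 roughly 90% of the initial value is still present after 100 steps, which matches the "cold start" behavior reported above.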
Hi,
I tried to implement a batch normalisation layer with the help of the suggestions in this issue, but I still have a >70% error in validation and testing... I do have a lower decay for non-training calls...
Here is my code:
def BatchNorm(inputT, is_training=False, scope=None):
    return tf.cond(
        is_training,
        lambda: tf.contrib.layers.batch_norm(inputT, is_training=True, reuse=None, decay=0.999, epsilon=1e-5, center=True, scale=True, updates_collections=None, scope=scope),
        lambda: tf.contrib.layers.batch_norm(inputT, is_training=False, reuse=True, decay=0.900, epsilon=1e-5, center=True, scale=True, updates_collections=None, scope=scope)
    )
Thank you in advance.
@Alexivia It seems that you are using two different batch normalization layers? You should use only one BN layer (of course, with a different is_training parameter).
Thank you for your advice @soloice.
I have now tried with just different is_training and reuse parameters:
lambda: tf.contrib.layers.batch_norm(inputT, is_training=True, reuse=None, decay=0.9, epsilon=1e-5, center=True, scale=True, updates_collections=None, scope=scope),
lambda: tf.contrib.layers.batch_norm(inputT, is_training=False, reuse=True, decay=0.9, epsilon=1e-5, center=True, scale=True, updates_collections=None, scope=scope)
I still don't get good validation and testing results... >70%...
Hi,
Please see my wrapper above.
I think you should use "with tf.variable_scope(scope, reuse=reuse):".
Hi @ishaybee,
I followed your advice, now my code is:
def BatchNorm(inputT, is_training=False, reuse=True, scope=None):
    with tf.variable_scope(scope, reuse=reuse):
        return tf.contrib.layers.batch_norm(inputT, is_training=is_training, reuse=reuse, scope=scope, updates_collections=None, decay=0.9, center=True, scale=True)
and I feed is_training and reuse through the feed_dict, but now I get the error ValueError("The reuse parameter must be True or False or None.")
try to feed reuse as a python variable (input of the model) and as placeholder.
I tried that, and now it stopped complaining about the value... but I think that the placeholder value is not being used, because I see no change if I force values to the batch_norm function, and in TensorBoard it's not connected to the graph... (see attached image)
My code is like this now:
# Batch Normalisation wrapper
def BatchNorm(inputT, is_training=False, reuse=None, scope=None):
    with tf.variable_scope(scope):
        return tf.contrib.layers.batch_norm(inputT, is_training=is_training, reuse=reuse, scope=scope, updates_collections=None, decay=0.9, center=True, scale=True)
# Model definition
def model(data, train=False, is_training=False, reuse=None):
    # 1st conv layer
    with tf.name_scope('conv1') as scope:
        conv = tf.nn.conv2d(
        <...>
        norm = BatchNorm(pool, is_training=is_training, reuse=reuse, scope=scope)
# Training
feed_dict = {train_data_node: batch_data,
             train_labels_node: batch_labels,
             is_training: True,
             reuse: None}
# Run the optimizer to update weights.
sess.run(optimizer, feed_dict=feed_dict)
# Validation
batch_predictions = sess.run(eval_prediction, feed_dict={eval_data: data[-EVAL_BATCH_SIZE:, ...], is_training: False, reuse: True})
Although is_training can be a placeholder, reuse has to be a bool; it cannot be a tensor or a placeholder.
I'm not sure what you are trying to do; in most cases using static values solves the problem. For example, this pattern works well:
def model(data, is_training=False, reuse=None, scope='my_model'):
    # Define a variable scope to contain all the variables of your model
    # (the `values` argument expects a list of tensors)
    with tf.variable_scope(scope, 'model', [data], reuse=reuse):
        # 1 layer
        net = tf.contrib.layers.conv2d(data, ....)
        ....
        # Pass is_training by keyword: the second positional argument of batch_norm is decay
        net = tf.contrib.layers.batch_norm(net, is_training=is_training)
    return net
train_outputs = model(train_data, is_training=True)
eval_outputs = model(eval_data, is_training=False, reuse=True)
eval_predictions = sess.run(eval_outputs, feed_dict={eval_data: data[-EVAL_BATCH_SIZE:, ...]})
Unless you need to change the behavior of the model dynamically, you don't need to use a placeholder for is_training. The trick is to build the model twice, but sharing the variables the second time.
Thank you @sguada! After applying your suggestions, I finally got it to work!
It would be helpful if the API 1.0 documentation reflected that you need to manually add update ops to the graph. Being a newer tf user, I found that my test error was crazy and then had to spend a fair amount of time debugging my graph until I realized that batch normalization was the problem. Then I had to spend more time figuring out that by default the variables tracking the moments don't update unless you use a contrib function for optimization. Since in 1.0 there is no option to set updates_collections to None, there is no indicator from the documentation that this might even be an issue. Additionally, it seems like it might make sense to have a parameter to add the control flow dependencies to the op that runs in the training case.
@danrsc Exactly. The usage of BN layer is quite confusing. I suggested to add documents or a complete official tutorial on batch normalization, but unfortunately got no response = =
Completely agree. I think BN usage is very tricky and the documentation is currently beyond inadequate. This ought to be fixed for such a commonly used layer.
Reopening for visibility of the documentation issues.
@sguada assigning to you for triaging. Might be worth getting a tech writer on the case.
Just got confused by this problem last week and wasted 3 days of training... Hope the docs can be fixed soon, and an official batch normalization example can be added in the API docs.
@sguada I noticed that you said "tf.contrib.layers.batch_norm can take tensor as is_training, so not need to do anything especial".
However, the comment in the code is:
# If is_training doesn't have a constant value, because it is a Tensor,
# a Variable or Placeholder then is_training_value will be None and
# needs_moments will be true.
Does it mean that needs_moments will be true even in the test phase if I set is_training as a placeholder?
As far as I know, the moments are not needed while testing.
So if is_training is a Variable or a Placeholder, it means it can change, so the graph to compute the moments is needed, and the layer builds it.
Then at run time, depending on the value being True or False, it would use the batch moments or the moving_mean and moving_variance.
So during testing you would set the value to False and the moments won't be used.
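The switch described above can be sketched in plain Python for a single feature. This is only an illustration of the train/test logic under simplifying assumptions (one scalar feature, no gamma/beta, a hypothetical class name TinyBatchNorm); it is not how tf.contrib.layers.batch_norm is implemented.

```python
import math


class TinyBatchNorm(object):
    """Single-feature batch norm: batch moments in training, moving averages otherwise."""

    def __init__(self, decay=0.9, epsilon=1e-3):
        self.decay = decay
        self.epsilon = epsilon
        self.moving_mean = 0.0       # assumed initial value
        self.moving_variance = 1.0   # assumed initial value

    def __call__(self, batch, is_training):
        if is_training:
            # Use the moments of the current batch ...
            mean = sum(batch) / len(batch)
            variance = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ... and fold them into the moving averages.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_variance = (self.decay * self.moving_variance
                                    + (1 - self.decay) * variance)
        else:
            # At test time the batch moments are ignored and nothing is updated.
            mean, variance = self.moving_mean, self.moving_variance
        inv_std = 1.0 / math.sqrt(variance + self.epsilon)
        return [(x - mean) * inv_std for x in batch]


bn = TinyBatchNorm(decay=0.9)
train_out = bn([1.0, 2.0, 3.0], is_training=True)
stats_after_train = (bn.moving_mean, bn.moving_variance)
test_out = bn([10.0, 20.0], is_training=False)
print(train_out, test_out, stats_after_train)
```

With only one training step the moving averages are still far from the data statistics, so the test output is poorly normalized; this is the same cold-start effect discussed earlier in the thread.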
@sguada @brando90
def batch_norm_layer(self, x, train_phase, scope_bn):
    bn_train = batch_norm(x, decay=0.9, center=False, scale=True,
                          updates_collections=None,
                          is_training=True,
                          reuse=None,
                          variables_collections=[UPDATE_OPS_COLLECTION],
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.9, center=False, scale=True,
                              updates_collections=None,
                              is_training=False,
                              reuse=True,
                              variables_collections=[UPDATE_OPS_COLLECTION],
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
I build batchnorm like this; however, the moving mean and moving variance are updated during test, and I cannot find the reason.
I tried creating two models like @sguada said, however, my model where is_training=False just crashes.
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key fully_connected_5/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key fully_connected_6/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key fully_connected_7/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key fully_connected_6/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key fully_connected_7/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key history_embeddings_1 not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key global_step_1 not found in checkpoint
I feel like maybe there should be a concrete example of how to do a batch norm with a fully connected net, as well as with CNNs. Sucks that I've trained models for days expecting things to work before seeing that everyone trying to use this feature going crazy.
Interestingly enough, it takes a zillion years to get the model restored after training with batch_norm as well. Will most likely wait until TF 2.0 to try something like this again.
@MisayaZ you don't need to create two batch_norm layers; you can just pass train_phase (assuming it is a tf.bool) to batch_norm. Also, you are passing UPDATE_OPS_COLLECTION as variables_collections, which changes which collections the variables are added to.
The following should work:
z = batch_norm(x, decay=0.9, center=False, scale=True, updates_collections=None,
is_training=train_phase, scope=scope_bn)
@OktayGardener I'm not sure what model you are trying to create; it seems that the variables are not saved in your checkpoint.
batch_norm also works with fully_connected layers.
slim = tf.contrib.slim
def model(data, is_training=False, reuse=None, scope='my_model'):
    # Define a variable scope to contain all the variables of your model
    # (the `values` argument expects a list of tensors)
    with tf.variable_scope(scope, 'model', [data], reuse=reuse):
        # Configure arguments of fully_connected layers
        with slim.arg_scope([slim.fully_connected],
                            activation_fn=tf.nn.relu,
                            normalizer_fn=slim.batch_norm):
            # Configure arguments of batch_norm layers
            with slim.arg_scope([slim.batch_norm],
                                decay=0.9,  # Adjust decay to the number of iterations
                                updates_collections=None,  # Make sure updates happen automatically
                                is_training=is_training):  # Switch behavior from training to non-training
                net = slim.fully_connected(data, 100, scope='fc1')
                net = slim.fully_connected(net, 200, scope='fc2')
                ....
                # Don't use activation_fn nor batch_norm in the last layer
                net = slim.fully_connected(net, 10, activation_fn=None, normalizer_fn=None, scope='fc10')
    return net
@sguada Thanks, I built a network with batchnorm implemented as you mentioned above:
z = batch_norm(x, decay=0.9, center=False, scale=True, updates_collections=None,
is_training=train_phase, scope=scope_bn)
The speed is slow. I used the TensorFlow benchmark to get the computation time, as below:
I tensorflow/core/util/stat_summarizer.cc:392] ============================== Top by Computation Time ==============================
I tensorflow/core/util/stat_summarizer.cc:392] [node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [Name]
I tensorflow/core/util/stat_summarizer.cc:392] Conv2D 106.164 51.354 51.004 23.145% 23.145% 692.224 conv8/Conv2D
I tensorflow/core/util/stat_summarizer.cc:392] Conv2D 85.187 19.115 19.283 8.750% 31.896% 692.224 conv7/Conv2D
I tensorflow/core/util/stat_summarizer.cc:392] SquaredDifference 11.967 15.105 14.331 6.503% 38.399% 11075.584 conv1/batch_norm/moments/sufficient_statistics/SquaredDifference
I tensorflow/core/util/stat_summarizer.cc:392] Mul 11.970 14.162 13.495 6.124% 44.523% 11075.584 conv1/batch_norm/batchnorm/mul_1
I tensorflow/core/util/stat_summarizer.cc:392] Conv2D 3.948 8.170 7.986 3.624% 48.146% 11075.584 conv1/Conv2D
I tensorflow/core/util/stat_summarizer.cc:392] Sub 11.960 10.176 7.943 3.604% 51.751% 11075.584 conv1/batch_norm/moments/sufficient_statistics/Sub
I tensorflow/core/util/stat_summarizer.cc:392] SquaredDifference 45.570 5.908 7.177 3.257% 55.007% 5537.792 conv2/batch_norm/moments/sufficient_statistics/SquaredDifference
I tensorflow/core/util/stat_summarizer.cc:392] Mul 45.574 7.755 6.902 3.132% 58.140% 5537.792 conv2/batch_norm/batchnorm/mul_1
I tensorflow/core/util/stat_summarizer.cc:392] Conv2D 40.692 5.408 4.845 2.199% 60.338% 5537.792 conv2/Conv2D
I tensorflow/core/util/stat_summarizer.cc:392] Sub 45.563 6.067 4.784 2.171% 62.509% 5537.792 con
I don't understand why some ops under moments are executed during test and cost a lot of time, such as conv1/batch_norm/moments/sufficient_statistics/SquaredDifference.
The moments are not needed in test, so why are some ops under moments executed?
Hi,
Using the above batch_norm layer in contrib.layers, I'm getting nan as an output for the validation graph while the train graph runs seamlessly. Is there anything that I might be missing?
I'm using:
def batchnormlayer(inputs, numout, train_model):
    with tf.variable_scope("batch_norm") as scope_bn:
        epsilon = 1e-3
        return tf.contrib.layers.batch_norm(inputs, decay=0.9, updates_collections=None,
                                            scale=True, scope=scope_bn,
                                            is_training=train_model, epsilon=epsilon,
                                            fused=True, reuse=scope_bn.reuse)
Thanks
As a follow up, I'm reusing 16 layers of batch_norm.
However, I found that reusing 4 layers works.
I've just been noticing that if I kill the tensorflow process and restart it, my error gets worse for a few epochs (i.e. worse than it should be at the last checkpoint). I also observe that if I remove batch_norm, this problem goes away. After looking at the code for a while, I think this may be because the values of the variables are not restored from the shadow variables as they would be if the ExponentialMovingAverages class were used to manage the moving averages. This also means that if I use a separate process to evaluate, I'm getting whatever the last value of the variable was and not the moving average. Am I interpreting this correctly and is this the intended behavior? It seems like you want the shadow variable values to be restored...
I caught the problem: the moving variance in my case goes negative after some iterations.
The output of the tensor Model/clip_logits/batch_norm/moving_variance:0, present in tf.model_variables(), is
Moving variance (shape = (101,)) =
[ 214.70379639 95.36338043 0.57885742 189.49542236 102.72473145
137.14886475 286.57333374 111.06427002 154.98750305 167.75219727
207.83955383 211.14007568 158.23495483 171.61665344 116.81361389
115.77380371 43.59399796 137.75064087 181.75245667 161.37339783
215.21934509 92.88521576 191.23846436 336.3946228 259.85919189
299.47039795 186.23222351 165.19311523 262.82446289 170.11567688
233.56843567 209.35050964 115.96807861 154.34109497 295.5770874
123.6055603 295.76187134 296.88583374 240.88217163 247.32983398
87.15661621 217.69897461 133.00698853 -4.80375671 344.77462769
291.50601196 117.77174377 265.83712769 207.90093994 194.186203
220.21418762 178.03738403 115.27571869 196.62184143 228.8089447
191.53205872 331.36807251 151.55435181 197.2951355 179.67504883
181.09727478 90.09922791 173.30133057 102.6836853 160.9434967
236.59512329 168.05305481 403.36340332 41.14326096 185.93409729
130.57434082 266.31509399 101.44387817 163.88059998 290.25015259
244.52597046 229.86647034 158.14352417 202.68774414 187.78227234
248.78218079 126.0978241 171.41891479 274.40740967 119.84254456
202.53045654 200.20608521 214.04730225 111.53284454 222.03184509
244.81187439 172.23052979 187.09806824 194.62802124 255.26345825
293.63598633 307.91036987 210.86982727 308.88919067 144.94792175
229.69013977]
As you can see, there's a negative variance for one of the dimensions. How is this even possible?
P.S. The batch norm layer is used just after the last fully connected layer of the network and before softmax.
@raghavgoyal14 are you using it with fused=True? Had a similar problem and it went away when I used the fused version
@abred: Yes, I used fused=True, same problem.
@sguada Hi, sguada, I have a problem.
The definition of contrib.layers.batch_norm in tensorflow:
def batch_norm(inputs,
decay=0.999,
center=True,
scale=False,
epsilon=0.001,
activation_fn=None,
param_initializers=None,
param_regularizers=None,
updates_collections=ops.GraphKeys.UPDATE_OPS,
is_training=True,
reuse=None,
variables_collections=None,
outputs_collections=None,
trainable=True,
batch_weights=None,
fused=False,
data_format=DATA_FORMAT_NHWC,
zero_debias_moving_mean=False,
scope=None,
renorm=False,
renorm_clipping=None,
renorm_decay=0.99):
scale: If True, multiply by gamma. If False, gamma is
not used. When the next layer is linear (also e.g. nn.relu), this can be
disabled since the scaling can be done by the next layer.
If I use tf.contrib.layers.batch_norm(input, scale=False), does "scale=False" mean that gamma is zero in "y = gamma*x+beta" while training? Thank you very much.
When scale=False, gamma is a constant 1.
@ppwwyyxx Thank you very much for your help. I use tf.contrib.layers.batch_norm(input, scale=False) in TensorFlow, and now I am converting the TensorFlow batchnorm to Caffe. How should I set the params of BatchNormLayer and ScaleLayer in Caffe?
Thank you very much.
@MisayaZ I was having the same behavior using Batchnorm with a placeholder for "is_training". I see in the trace that the moments are being calculated even at test time, so I decided to go into the source code and I found this:
# If `is_training` doesn't have a constant value, because it is a `Tensor`,
# a `Variable` or `Placeholder` then is_training_value will be None and
# `needs_moments` will be true.
is_training_value = utils.constant_value(is_training)
need_moments = is_training_value is None or is_training_value
if need_moments:
    # here it defines the moments
It looks like when "is_training" is a variable or a placeholder, the moments get defined and are also calculated at runtime, even when you set the placeholder to "False". I would have preferred to leave it as a placeholder, because that way I can do periodic testing during training without redefining the graph, but I decided to use it as a constant and define different behaviors for train vs test, and now the moments are not calculated at test time.
@tano297 Thank you. I now also use 'is_training' as a constant. Leaving it as a placeholder and doing periodic testing will change the values of the moving mean and moving variance. The inference time will also be longer, because it will calculate the mean and variance of the inputs and update the moving mean and moving variance. The right way to do testing is to define different behaviors for train and test, as you mentioned.
@tano297 @MisayaZ
but doesn't the "smart_cond" in
is_training_value = utils.constant_value(is_training)
need_updates = is_training_value is None or is_training_value
if need_updates:
    ...
    outputs = utils.smart_cond(is_training, _force_updates, no_updates)
make sure that the updates are only calculated and applied if is_training evaluates to True?
@abred Yes indeed, but you are referring to line 391, where it does the update of the moving average within _fused_batch_norm():
# If `is_training` doesn't have a constant value, because it is a `Tensor`,
# a `Variable` or `Placeholder` then is_training_value will be None and
# `need_updates` will be true.
is_training_value = utils.constant_value(is_training)
need_updates = is_training_value is None or is_training_value
if need_updates:
    ...
    outputs = utils.smart_cond(is_training, _force_updates, no_updates)
    ...
I am talking about line 753 within batch_norm():
# If `is_training` doesn't have a constant value, because it is a `Tensor`,
# a `Variable` or `Placeholder` then is_training_value will be None and
# `needs_moments` will be true.
is_training_value = utils.constant_value(is_training)
need_moments = is_training_value is None or is_training_value
if need_moments:
    ...
    mean, variance = utils.smart_cond(is_training,
                                      _force_updates,
                                      moving_vars_fn)
    ...
The smart condition in that case (as far as I can tell) decides whether or not to update the moving averages, but the moments still get calculated.
@tano297 you're right about that, I was in the wrong place, but still:
lines 755-770 calculate the moments, but the moments are only used in _force_updates, which is only executed if is_training evaluates to True, aren't they?
And thus
mean, variance = utils.smart_cond(is_training, _force_updates, moving_vars_fn)
should be equivalent to line 804:
mean, variance = moving_mean, moving_variance
if is_training evaluates to False, and thus the "moments" part of the graph is never used and shouldn't be executed.
But I haven't tested, so I might be wrong about that :)
@tano297 @abred you're right. The moving mean and moving variance are changed when I use batchnorm like this:
def batch_norm_layer(self, x, train_phase, scope_bn):
    bn_train = batch_norm(x, decay=0.9, center=False, scale=True,
                          updates_collections=None,
                          is_training=True,
                          reuse=None,
                          variables_collections=[UPDATE_OPS_COLLECTION],
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.9, center=False, scale=True,
                              updates_collections=None,
                              is_training=False,
                              reuse=True,
                              variables_collections=[UPDATE_OPS_COLLECTION],
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
If you use like following:
z = batch_norm(x, decay=0.9, center=False, scale=True, updates_collections=None,
is_training=train_phase, scope=scope_bn)
The moving mean and moving variance will not be changed during test, but the speed is very slow.
Hi @zhongyuk ,
I also ran into the problem that I get good results when using is_training=True for both training and inference, but bad results when setting is_training=False during inference (worse than when using is_training=True). According to your analysis, if I understand correctly, simply setting decay=0.9 in BN can solve this problem. Am I right?
BTW, do I need to retrain the model using decay=0.9 from scratch? Or resuming training from the checkpoint (i.e., trained when decay=0.999) is also ok?
Thanks!
@nmduc @davek44
Hi, I also met the problem that I could get good results when using is_training=True for both training and inference, but get bad results when setting is_training=False during inference (worse than the case using is_training=True). Have you guys solved this problem? Thanks!
@tyshiwo I just set decay=0.9 for batch_norm and it works well so far.
I was confused after all these comments on how to properly use batch norm, so here is what I have. Please correct me if I'm wrong.
batch_norm = tf.contrib.layers.batch_norm(conv,
                                          center=True,
                                          scale=True,
                                          reuse=phase_train_py,
                                          scope='bn',
                                          is_training=is_training)
where phase_train_py is a Python boolean variable and is_training is a placeholder taking a boolean value. I guess using tf.cond is wrong; otherwise, why would the function come with boolean parameters? In other words, if tf.cond were the way to go, we would have one batch_norm function for training and another one for testing. So the developers allow us to change these boolean variables in order to change the behavior of the function. What I am doing is: setting phase_train_py to False and is_training to True while training, and the opposite while testing. Since we can only change tensors or placeholders with sess.run, I change phase_train_py intentionally before running the graph. Ex:
if condition:
    phase_train_py = False
    sess.run(to_run_list, feed_dict={phase_train: True})
else:
    phase_train_py = True
    sess.run(to_run_list, feed_dict={phase_train: False})
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
MAYBE YOU NEED READ THIS
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
It seems there are still problems with TF v1.3. I'm sure I took care of the following details, but I still failed to use the official tf.contrib.layers.batch_norm with is_training=False during evaluation (when I keep is_training=True unchanged during evaluation, it is OK):
1. decay: the exponential moving average is actually an alpha filter in signal processing, and the time to converge is approximately 1/(1-decay) training steps. For decay=0.999, you need 1/0.001=1000 steps to converge. So set a decay appropriate for your number of training steps.
2. Set updates_collections=None if you don't want to add control dependencies of the update op to train_op.
3. Set reuse to an appropriate value.
It seems the only way to use the official batch_norm is to build two graphs, one for training and one for evaluation, with is_training=True and is_training=False, respectively. In this way, you don't need to switch dynamically between training and evaluation. But this is a stupid way, since you need to build more than one graph.
Finally, I wrote a moving average myself, and I found it worked! It is as follows (based on code from the web and modified by me):
import numpy as np
import tensorflow as tf

def bn_layer(x, scope, is_training, epsilon=0.001, decay=0.99, reuse=None):
    """
    Performs a batch normalization layer
    Args:
        x: input tensor
        scope: scope name
        is_training: python boolean value
        epsilon: the variance epsilon - a small float number to avoid dividing by 0
        decay: the moving average decay
    Returns:
        The ops of a batch normalization layer
    """
    with tf.variable_scope(scope, reuse=reuse):
        shape = x.get_shape().as_list()
        # gamma: a trainable scale factor
        gamma = tf.get_variable("gamma", shape[-1], initializer=tf.constant_initializer(1.0), trainable=True)
        # beta: a trainable shift value
        beta = tf.get_variable("beta", shape[-1], initializer=tf.constant_initializer(0.0), trainable=True)
        moving_avg = tf.get_variable("moving_avg", shape[-1], initializer=tf.constant_initializer(0.0), trainable=False)
        moving_var = tf.get_variable("moving_var", shape[-1], initializer=tf.constant_initializer(1.0), trainable=False)
        if is_training:
            # tf.nn.moments == Calculate the mean and the variance of the tensor x
            avg, var = tf.nn.moments(x, np.arange(len(shape)-1), keep_dims=True)
            avg = tf.reshape(avg, [avg.shape.as_list()[-1]])
            var = tf.reshape(var, [var.shape.as_list()[-1]])
            #update_moving_avg = moving_averages.assign_moving_average(moving_avg, avg, decay)
            update_moving_avg = tf.assign(moving_avg, moving_avg*decay+avg*(1-decay))
            #update_moving_var = moving_averages.assign_moving_average(moving_var, var, decay)
            update_moving_var = tf.assign(moving_var, moving_var*decay+var*(1-decay))
            control_inputs = [update_moving_avg, update_moving_var]
        else:
            avg = moving_avg
            var = moving_var
            control_inputs = []
        with tf.control_dependencies(control_inputs):
            output = tf.nn.batch_normalization(x, avg, var, offset=beta, scale=gamma, variance_epsilon=epsilon)
    return output

def bn_layer_top(x, scope, is_training, epsilon=0.001, decay=0.99):
    """
    Returns a batch normalization layer that automatically switches between train and test phases based on the
    tensor is_training
    Args:
        x: input tensor
        scope: scope name
        is_training: boolean tensor or variable
        epsilon: epsilon parameter - see bn_layer
        decay: decay parameter - see bn_layer
    Returns:
        The correct batch normalization layer based on the value of is_training
    """
    #assert isinstance(is_training, (ops.Tensor, variables.Variable)) and is_training.dtype == tf.bool
    return tf.cond(
        is_training,
        lambda: bn_layer(x=x, scope=scope, epsilon=epsilon, decay=decay, is_training=True, reuse=None),
        lambda: bn_layer(x=x, scope=scope, epsilon=epsilon, decay=decay, is_training=False, reuse=True),
    )
Just use the bn_layer_top function when building the graph; the is_training parameter is a tf.placeholder. Then you are free to switch the placeholder to True during training and False during evaluation, with feed_dict.
Hope it helps the community.
When you use slim.batch_norm, be sure to use "slim.learning.create_train_op" instead of "tf.train.GradientDescentOptimizer(lr).minimize(loss)" or another optimizer. Try it and see if it works!
@vincentvanhoucke You wrote in another post in this thread:
The slim batch_norm wrapper normalizes over the last dimension of your input tensor. So if it's a 2D input tensor coming from a fully connected layer, it normalizes over batch, and thus performs per-activation normalization. If it's a 4D tensor coming from a convolution, it will normalize over the three first dimensions (batch, width, depth), and thus perform per-feature normalization. @sguada maybe worth being a bit more descriptive about this.
By "slim batch_norm wrapper", do you mean the function tf.contrib.layers.batch_norm? If so, I would suggest adding this information to the documentation text of this function, so that it is very clear that this function performs batch normalization exactly as described in the paper, for both FC layers and Conv2D layers. At the moment there is only the text "Can be used as a normalizer function for conv2d and fully_connected.", where it is not clear whether this relates to the normalization axis topic.
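A small plain-Python sketch may help pin down what the quoted description means: one mean per entry of the last axis, averaged over every other axis. The helper names and the tiny nested-list inputs are made up for illustration; real code would use tensors rather than nested lists.

```python
def rows_of_features(tensor):
    """Flatten a nested list so that each row is one innermost (feature) vector."""
    if not isinstance(tensor[0], list):
        return [tensor]
    rows = []
    for sub in tensor:
        rows.extend(rows_of_features(sub))
    return rows


def per_feature_mean(tensor):
    """One mean per entry of the last axis, averaged over all other axes."""
    rows = rows_of_features(tensor)
    return [sum(row[c] for row in rows) / len(rows) for c in range(len(rows[0]))]


# 2D input (batch=2, features=2): the average runs over the batch axis only.
fc_activations = [[1.0, 10.0],
                  [3.0, 30.0]]

# 4D input (batch=1, 2x1 spatial positions, channels=2): the average runs over
# batch and both spatial axes, giving one mean per channel.
conv_activations = [[[[1.0, 10.0]],
                     [[3.0, 30.0]]]]

print(per_feature_mean(fc_activations))
print(per_feature_mean(conv_activations))
```

Both calls return one value per feature/channel, which is the per-activation (fully connected) and per-feature-map (convolution) behavior described in the quote.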
@ZahlGraf I'll happily consider a PR that clarifies the documentation. We've been at this for so long that I no longer have a good sense of what's obvious or not, and would welcome clarifying documentation for someone with a fresh perspective on the topic.
@vincentvanhoucke
I created a PR with a more detailed description, mainly based on your statement in this thread:
https://github.com/tensorflow/tensorflow/pull/15653
Please remove the assignee, as this issue is inviting external contributions. Otherwise, remove the contributions welcome label. Thank you.
Closing this bug since the original request to add a batch norm layer has been addressed. Some of the more recent issues with documentation seem to have their own PRs
If you see any issue with batch_norm, please either ask a question on StackOverflow or open another issue.
Most helpful comment
There is now a batch_norm layer: https://github.com/tensorflow/tensorflow/blob/b826b79718e3e93148c3545e7aa3f90891744cc0/tensorflow/contrib/layers/python/layers/layers.py#L100