TensorFlow: How to compile TensorFlow using SSE4.1, SSE4.2, and AVX.

Created on 3 Mar 2017  ·  44 Comments  ·  Source: tensorflow/tensorflow

Just got TensorFlow running. Now I'm running into this error.

Currently using OS X Yosemite; I installed TensorFlow using pip3 through Anaconda, with Python 3.5.

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

So since Anaconda has a special set of commands, how do you get TensorFlow to run with SSE4.1, SSE4.2, and AVX via the Anaconda command system? I am really confused about how to go about this.

Most helpful comment

These aren't errors, just warnings saying that if you build TensorFlow from source, it can be faster on your machine.

SO question about this: http://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions
TensorFlow guide to build from source: https://www.tensorflow.org/install/install_sources

All 44 comments


Just as @Carmezim stated, these are simply warning messages.
For each of your programs, you will only see them once.
And just like the warnings say, you should only compile TF with these flags if you need TF to be faster.

You can follow our guide to install TensorFlow from sources to compile TF with support for SIMD instruction sets.

Ok, thanks. I get it.

Is there a way we can silence this?

The only way to silence these warning messages is to build from source, using the --config=opt option.
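
For reference, the typical invocation after running ./configure looks like this (a sketch; the exact target has varied a little across TF versions):

bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package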

A sort of "workaround" (albeit imperfect) that redirects the messages on Unix/Linux/OSX:
python myscript.py 2>/dev/null

@CGTheLegend @ocampesato you can use the TF environment variable TF_CPP_MIN_LOG_LEVEL, which works as follows:

  • It defaults to 0, displaying all logs
  • To filter out INFO logs, set it to 1
  • To additionally filter out WARNING logs, set it to 2
  • To additionally filter out ERROR logs, set it to 3

So you can do the following to silence the warnings:

import os
# Must be set before TensorFlow is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
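
The same variable also works from the shell, so you can set it per run without touching the script (using the myscript.py name from the workaround above):

TF_CPP_MIN_LOG_LEVEL=2 python myscript.py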

@gunan @mrry I've seen many folks interested in silencing the warnings, would there be interest in adding this kind of info to the docs?

I installed from the TensorFlow install guide and also got this warning.

pip3 install --upgrade tensorflow

@jadeydi Instead of compiling from source, pip just installs a prebuilt binary, so you'll still get these warnings.

I just compiled TensorFlow with support for SSE4.1, SSE4.2, AVX, AVX2, and FMA. The build is available here: https://github.com/lakshayg/tensorflow-build. I hope this is useful.

Hi @lakshayg, thanks for sharing. You might want to check https://github.com/yaroslavvb/tensorflow-community-wheels

Approximately how much faster is the build compared to the standard pip install tensorflow-gpu on Ubuntu? Is it only faster for CPU computations, or is there any benefit to GPU computations?

http://www.anandtech.com/show/2362/5

This came up on Google and has some decent technical details.

test is a DivX encode using VirtualDub 1.7.6 and DivX 6.7. SSE4 comes in if you choose to enable a new full search algorithm for motion estimation, which is accelerated by two SSE4 instructions: MPSADBW and PHMINPOSUW. The idea is that motion estimation (figuring out what will happen in subsequent frames of video) requires a lot of computation of sums of absolute differences, as well as finding the minimum values of the results of those computations. The SSE2 instruction PSADBW can compute two sums of differences from a pair of 16B unsigned integers; the SSE4 instruction MPSADBW can do eight.

...

On our QX9650, the full search with SSE4 enabled runs about 45% faster than with SSE2 only

Not sure what functions TensorFlow is using, but it might be worth the effort.

Sorry, but this is a ridiculous thing to have output in all TF scripts by default. Most people probably aren't compiling TF from source, nor do they want to.

@TomAshley303, this is pretty awesome info to get! I don't plan to recompile from source. I don't want to. But the info tells me what to do if my model becomes big and slow and needs a performance boost. It's usually cheaper to recompile with extensions than to buy new hardware, given that good walkthroughs (which we do have) minimize the labour cost of recompiling (CPU time doesn't matter; it can run overnight).

I went through the process... It was straightforward and took no time at all. Not your usual CMake C++ kinda nightmare.

I have a small bash script to compile TF under macOS/Linux. It dynamically detects CPU features and passes them as build parameters. I was thinking of creating a PR but didn't find a folder with scripts (helpers) for local builds, only ci_build. If it makes sense, I will do it.

gist
https://gist.github.com/venik/9ba962c8b301b0e21f99884cbd35082f
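
The core idea of the script, as a simplified Linux-only sketch (the actual gist is more thorough):

# Sketch: map /proc/cpuinfo flags to the matching gcc -m options
COPTS=""
for f in $(grep -m1 '^flags' /proc/cpuinfo); do
  case "$f" in
    sse4_1) COPTS="$COPTS --copt=-msse4.1" ;;
    sse4_2) COPTS="$COPTS --copt=-msse4.2" ;;
    avx)    COPTS="$COPTS --copt=-mavx" ;;
    avx2)   COPTS="$COPTS --copt=-mavx2" ;;
    fma)    COPTS="$COPTS --copt=-mfma" ;;
  esac
done
bazel build $COPTS //tensorflow/tools/pip_package:build_pip_package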

A note to @gunan

I've encountered this issue when I was installing TensorFlow for the first time. Now I am having to figure out how to resolve it again because I'm installing TensorFlow on a new machine. It's a pain in the neck, and the documentation you've provided is not clear at all.

The fact that I have to do it on my end is ridiculous and infuriating. It's no good making something available from pip/pip3 if it then just throws warnings at you all day.

At the very least, you should edit https://www.tensorflow.org/install/install_sources and explicitly explain how to compile it with SSE / AVX

The solution that worked for me: enter "-mavx -msse4.1 -msse4.2" when prompted for optimization flags during the configuration process (when you run ./configure).

Is it that hard to add this to your installation instructions?
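
For context, the ./configure prompt in question looks roughly like this (the wording varies between TF versions); the flags above go at that prompt:

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -mavx -msse4.1 -msse4.2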

Thank you; following @Carmezim's answer, I got the CPU speed-up version based on AVX and SSE. I've tested Faster R-CNN (ResNet-101) on an Intel CPU; it speeds up by about 30%, which is truly useful.

You can silence the warnings.
Just add these lines at the top:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf

As mentioned here: https://stackoverflow.com/a/44984610

You could also easily add a user variable in the system environment variables: TF_CPP_MIN_LOG_LEVEL with value 2. Then restart your IDE.
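
For example, from a Windows command prompt (setx stores a persistent user variable; only newly started processes see it):

setx TF_CPP_MIN_LOG_LEVEL 2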

@mikalyoung Improvements for GPU computations cannot be expected, since those instruction sets are CPU-only; they enable vectorized CPU operations.
So if you compare two programs running (ideally) 100% on GPUs, one on a TensorFlow build compiled with SIMD support and one without, you should get the same results in terms of speed (and hopefully numerically as well).

I C:\tf_jenkinshome\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

As you can see, I get the warning on my system as well, but I don't understand the 'I' at the start of the warning. Can someone help me with that?

"I" there just is a shorthand for "INFO". The other letters you can see there are E for error, or F for fatal.

So I installed using conda. If I now wish to compile from source to take advantage of the speed boost, do I need to do anything to remove my conda install of TensorFlow? Or is it in its own little container, so I can separately compile from source?

I had installed DeepSpeech and also a DeepSpeech server. Went to start the server and got an error message - "2018-01-17 08:21:49.120154: F tensorflow/core/platform/cpu_feature_guard.cc:35] The TensorFlow library was compiled to use AVX2 instructions, but these aren't available on your machine.
Aborted (core dumped)"

Apparently I need to compile TensorFlow on the same computer. Is there a list somewhere to match Kubuntu 17.10.1 and an HP ProBook 4330S, please?

Why are there no Windows compiles? I am having the same issues, but instead of muting the warnings I would like to use my GPU. I do not have an Nvidia graphics card; I have an AMD one. What do I do?

These are not merely warnings, as it kills the process on my test boxes. Since I also use AMD GPUs, I spun up a DigitalOcean TensorFlow box to give this a go, but it seems there is no GPU support there either, and it's failing miserably.

# Job id 0

Loading hparams from /home/science/tf-demo/models/nmt-chatbot/model/hparams

saving hparams to /home/science/tf-demo/models/nmt-chatbot/model/hparams
saving hparams to /home/science/tf-demo/models/nmt-chatbot/model/best_bleu/hparams
attention=scaled_luong
attention_architecture=standard
batch_size=128
beam_width=10
best_bleu=0
best_bleu_dir=/home/science/tf-demo/models/nmt-chatbot/model/best_bleu
check_special_token=True
colocate_gradients_with_ops=True
decay_factor=1.0
decay_steps=10000
dev_prefix=/home/science/tf-demo/models/nmt-chatbot/data/tst2012
dropout=0.2
encoder_type=bi
eos=
epoch_step=0
forget_bias=1.0
infer_batch_size=32
init_op=uniform
init_weight=0.1
learning_rate=0.001
learning_rate_decay_scheme=
length_penalty_weight=1.0
log_device_placement=False
max_gradient_norm=5.0
max_train=0
metrics=['bleu']
num_buckets=5
num_embeddings_partitions=0
num_gpus=1
num_layers=2
num_residual_layers=0
num_train_steps=500000
num_translations_per_input=10
num_units=512
optimizer=adam
out_dir=/home/science/tf-demo/models/nmt-chatbot/model
output_attention=True
override_loaded_hparams=True
pass_hidden_state=True
random_seed=None
residual=False
share_vocab=False
sos=
source_reverse=False
src=from
src_max_len=50
src_max_len_infer=None
src_vocab_file=/home/science/tf-demo/models/nmt-chatbot/data/vocab.from
src_vocab_size=15003
start_decay_step=0
steps_per_external_eval=None
steps_per_stats=100
subword_option=
test_prefix=/home/science/tf-demo/models/nmt-chatbot/data/tst2013
tgt=to
tgt_max_len=50
tgt_max_len_infer=None
tgt_vocab_file=/home/science/tf-demo/models/nmt-chatbot/data/vocab.to
tgt_vocab_size=15003
time_major=True
train_prefix=/home/science/tf-demo/models/nmt-chatbot/data/train
unit_type=lstm
vocab_prefix=/home/science/tf-demo/models/nmt-chatbot/data/vocab
warmup_scheme=t2t
warmup_steps=0

creating train graph ...

num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
decay_scheme=, start_decay_step=0, decay_steps 10000, decay_factor 1

Trainable variables

embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003), /device:GPU:0

creating eval graph ...

num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0

Trainable variables

embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003), /device:GPU:0

creating infer graph ...

num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0

Trainable variables

embeddings/encoder/embedding_encoder:0, (15003, 512),
embeddings/decoder/embedding_decoder:0, (15003, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 15003),

log_file=/home/science/tf-demo/models/nmt-chatbot/model/log_1519669184

2018-02-26 18:19:44.862736: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Killed

What command needs to be run, and where and how do I run these commands? Please tell me; I desperately need help.

But does it mean that the system is not using the GPU for the process?

Well, you need to resolve this if you are running TensorFlow in an acceleration environment, such as using k-fold with the KerasClassifier.
To resolve it you will need to build TensorFlow from source, just as everyone recommends.
To build TensorFlow from source you will need the following tools:

  1. Install git on your machine if you haven't done so already; on an Ubuntu machine just type "sudo apt-get install git".
  2. Install bazel. It is highly recommended to use the custom APT repository. Follow the instructions at https://docs.bazel.build/versions/master/install-ubuntu.html to install bazel.
  3. Install the following Python dependencies (numpy, dev, and wheel) using the command below:
    sudo apt-get install python-numpy python-dev python-pip python-wheel
  4. Once you have all the dependencies installed, clone the TensorFlow GitHub repository to your local drive:
    git clone https://github.com/tensorflow/tensorflow
  5. cd into the cloned tensorflow directory and run the configure script:
    cd tensorflow
    ./configure

Just follow the instructions on the screen to complete the TensorFlow installation (the build commands that follow ./configure are sketched below).
I would highly recommend updating your machine once TensorFlow is installed:
sudo apt-get update

Good luck and enjoy...
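
For completeness, the build-and-install commands after ./configure finishes look roughly like this (a sketch; /tmp/tensorflow_pkg is just an arbitrary output directory):

bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl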

Just chiming in on this thread to say that you shouldn't just silence these warnings: I'm getting about 43% faster training time by building from source, so I think it's worth the effort.

How do I install TensorFlow using this file: "tensorflow-1.6.0-cp36-cp36m-win_amd64.whl"?

@anozele pip3 install --upgrade *path to wheel file*

@gunan --config=opt is not enough; you should also add, e.g., --copt="-msse4.2" when you build TensorFlow from source.
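
Combined, a build invocation would then look something like this (a sketch):

bazel build --config=opt --copt=-msse4.2 //tensorflow/tools/pip_package:build_pip_package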

According to Intel (https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide), if you use the Intel-built TensorFlow you can ignore those warnings, since all available instruction sets are used by the MKL backend. Can anyone from TensorFlow confirm this?
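
For reference, Intel distributes that build as a separate pip package, so trying it is just (per the linked guide):

pip install intel-tensorflow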

These aren't errors, just warnings saying that if you build TensorFlow from source, it can be faster on your machine.

SO question about this: http://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions
TensorFlow guide to build from source: https://www.tensorflow.org/install/install_sources

However, for me it was not faster than building without the -mfma -mavx -msse flags: https://stackoverflow.com/questions/57197854/fma-avx-sse-flags-did-not-bring-me-good-performance

Hi. Sorry if I'm beating a dead horse. Just wondering: why is the default pip wheel not a binary compiled with advanced instructions?

Hi. Sorry if I'm beating a dead horse. Just wondering: why is the default pip wheel not a binary compiled with advanced instructions?

This is because old CPU architectures don't support the advanced instruction sets. See Wikipedia for the detailed list of CPUs supporting AVX, AVX2, or AVX-512. If the default pip binary were compiled with these instruction sets, TensorFlow could not run on older CPUs.
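
If you're unsure what your own CPU supports, here is a quick Linux-only check (a sketch reading /proc/cpuinfo; nothing TensorFlow-specific):

import re

# Flags of the first CPU entry in /proc/cpuinfo
with open('/proc/cpuinfo') as f:
    flags = set(re.search(r'^flags\s*:\s*(.*)$', f.read(), re.M).group(1).split())

# These flag names correspond to SSE4.1, SSE4.2, AVX, AVX2 and FMA
print({name: name in flags for name in ('sse4_1', 'sse4_2', 'avx', 'avx2', 'fma')})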

But does it mean that the system is not using the GPU for the process?

Nope, it shows up even if you are using a GPU. If you haven't silenced the messages, you should also see TensorFlow loading your GPU device in the command prompt.
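
You can also confirm it from Python; in the TF 1.x API that's (tf.config.list_physical_devices('GPU') is the TF 2.x equivalent):

import tensorflow as tf
print(tf.test.is_gpu_available())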

Check this repo:

https://github.com/fo40225/tensorflow-windows-wheel

He has compiled almost all versions of TF with SSE and AVX!

This article was a good tutorial on how to build from source, including the flags:
https://medium.com/@pierreontech/setup-a-high-performance-conda-tensorflow-environment-976995158cb1

Try forcing the inclusion of the appropriate extensions using additional bazel options like --copt=-mavx --copt=-msse4.1 --copt=-msse4.2
