Tensorflow: How to initialize embeddings layer within Estimator API?

Created on 12 Jan 2018  ·  3Comments  ·  Source: tensorflow/tensorflow

I'm trying to use existing embeddings within tensorflow model, the size of embedding is greater than 2Gb and this makes my original try of doing this unsuccessful:

embedding_var = tf.get_variable(
        "embeddings", 
        shape=GLOVE_MATRIX.shape, 
        initializer=tf.constant_initializer(np.array(GLOVE_MATRIX))
)

Which gave me this error:

Cannot create a tensor proto whose content is larger than 2GB.

I'm using AWS SageMaker, which based on the Estimator API, and the actual running of the graph in session happens behind the scene, so I'm not sure how to initialize some placeholders for embedding given that. Would be helpful if someone will be able to share the way how to do such initialization in term of EstimatorAPI.


Please go to Stack Overflow for help and support:

https://stackoverflow.com/questions/tagged/tensorflow

If you open a GitHub issue, here is our policy:

  1. It must be a bug or a feature request.
  2. The form below must be filled out.
  3. It shouldn't be a TensorBoard issue. Those go here.

Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

awaiting response bug

Most helpful comment

looks like there the right way to initialize variables with embeddings would be to use tf.train.Scaffold. Here is more information regarding this on stackoverflow

All 3 comments

I think this would normally be a "send to StackOverflow" (standard response appended below) kind of issue, but the 2GB limit seems like it's within range of a bug or a feature request.

@martinwicke @ispirmustafa any suggestions?

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!

I think it's related to graph size limit. using constant_initializer embeds the GLOVE_MATRIX into the graph which increases the graph size.
Could you please try to use non constant initializer?

looks like there the right way to initialize variables with embeddings would be to use tf.train.Scaffold. Here is more information regarding this on stackoverflow

Was this page helpful?
0 / 5 - 0 ratings