## How to set specific gpu in bert? - tensorflow

ResourceExhaustedError (see above for traceback):
OOM when allocating tensor of shape [768] and type float [[node
bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Initializer/zeros
(defined at /home/zyl/souhu/bert/optimization.py:122) =
Const_class=["loc:#bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Assign"],
dtype=DT_FLOAT, value=Tensor, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
How can I make BERT run on GPU 1 (or another specific GPU) instead?

The easiest way to choose which GPUs are used is to set the CUDA_VISIBLE_DEVICES environment variable. The selected GPU will still show up as GPU:0 inside TensorFlow, but it will be a different physical device.
If you are using BERT from within Python (which is a rather painful way), you can wrap the code that creates the BERT graph in a device block:

    with tf.device('/device:GPU:1'):
        model = modeling.BertModel(...)
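The environment-variable approach can also be applied from Python itself. The following is a minimal sketch, assuming the variable is set before TensorFlow is imported (CUDA reads it when the process first initializes the GPUs). The helper name `select_gpus` is made up for illustration.

```python
import os


def select_gpus(indices):
    """Expose only the given physical GPUs to this process (hypothetical helper)."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only physical GPU 1; TensorFlow will then see it as GPU:0.
visible = select_gpus([1])

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

The same effect can be had from a shell by prefixing the command that launches training with `CUDA_VISIBLE_DEVICES=1`.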

## Related

### tf.datasets input_fn getting error after 1 epoch

So I am trying to switch to an input_fn() using tf.datasets as described in this question. While I have been able to get superior steps/sec using tf.datasets with the input_fn() below, I appear to run into an error after 1 epoch when running this experiment on GCMLE. Consider this input_fn():

    def input_fn(...):
        files = tf.data.Dataset.list_files(filenames).shuffle(num_shards)
        dataset = files.apply(tf.contrib.data.parallel_interleave(
            lambda filename: tf.data.TextLineDataset(filename).skip(1),
            cycle_length=num_shards))
        dataset = dataset.apply(tf.contrib.data.map_and_batch(
            lambda row: parse_csv_dataset(row, hparams=hparams),
            batch_size=batch_size,
            num_parallel_batches=multiprocessing.cpu_count()))
        dataset = dataset.prefetch(1)
        if shuffle:
            dataset = dataset.shuffle(buffer_size=10000)
        dataset = dataset.repeat(num_epochs)
        iterator = dataset.make_initializable_iterator()
        features = iterator.get_next()
        tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
        labels = {key: features.pop(key) for key in LABEL_COLUMNS}
        return features, labels

I receive the following error on GCMLE:

    InvalidArgumentError (see above for traceback): Inputs to operation loss/sparse_softmax_cross_entropy_loss/num_present/Select of type Select must have the same size and shape. Input 0: [74] != input 1: [110]
    [[Node: loss/sparse_softmax_cross_entropy_loss/num_present/Select = Select[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss/sparse_softmax_cross_entropy_loss/num_present/Equal, loss/sparse_softmax_cross_entropy_loss/num_present/zeros_like, loss/sparse_softmax_cross_entropy_loss/num_present/ones_like)]]
    [[Node: global_step/add/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3099_global_step/add", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This implies that there is a shape mismatch (Input 0: [74] != input 1: [110]). However, my old queue-based input_fn() works fine on the exact same data, so I do not believe it is an issue with the underlying data. The error occurs at what I believe to be the end of the epoch (the num_steps at which the GCMLE job fails is right around num_train_examples/batch_size), so I am guessing that the final batch is not equal to the batch_size of 110 (as it shows up in the error) and instead contains only 74 examples.

Can anybody confirm that this is the error? Assuming that it is, is there some other flag that I need to set so that the last batch can be something other than the specified batch size of 110?

For what it's worth, I have replicated this behavior with two different datasets (each trains for multiple epochs with the old queue-based input_fn, and gets hung up at the end of the first epoch with the tf.datasets input_fn).

As Robbie suggests in the other answer, it looks like your old implementation used fixed batch sizes throughout (presumably using an API like tf.train.batch() or one of its wrappers with the default argument of allow_smaller_final_batch=False), and the default behavior of batching in tf.data (via tf.data.Dataset.batch() and tf.contrib.data.map_and_batch()) is to include the smaller final batch.

The bug is most likely in the model_fn. Without seeing that function, it is difficult to guess, but I suspect that there is either an explicit (and incorrect) assertion of a tensor's shape via Tensor.set_shape() (possibly in library code) or a bug in the implementation of tf.losses.sparse_softmax_cross_entropy().

First, I am assuming that the features and labels tensors returned from input_fn() have statically unknown batch size. Can you confirm that by printing the features and labels objects, and ensuring that their reported Tensor.shape properties have None for the 0th dimension?

Next, locate the call to tf.losses.sparse_softmax_cross_entropy() in your model_fn. Print the object that is passed as the weights argument to this function, which should be a tf.Tensor, and locate its static shape. Given the error you are seeing, I suspect it will have a shape like (110,), where 110 is your specified batch size. If that is the case, there is a bug in model_fn that incorrectly asserts that the shape of the weights is a full batch, when it might not be. (If that is not the case, then there's a bug in tf.losses.sparse_softmax_cross_entropy()! Please open a GitHub issue with an example that enables us to reproduce the problem.)

Aside: Why would this explain the bug? The code that calls the failing tf.where() op looks like this (edited for readability):

    num_present = tf.where(
        tf.equal(weights, 0.0),   # This input is shape [74]
        tf.zeros_like(weights),   # This input is shape [110]
        tf.ones_like(weights)     # This input is probably [110]
    )

This flavor of tf.where() op (named "Select" in the error message for historical reasons) requires that all three inputs have the same size. Superficially, tf.equal(weights, 0.0), tf.ones_like(weights), and tf.zeros_like(weights) all have the same shape, which is the shape of weights. However, if the static shape (the result of Tensor.shape) differs from the dynamic shape, then the behavior is undefined.

What actually happens? In this particular case, let's say the static shape of weights is [110], but the dynamic shape is [74]. The static shape of our three arguments to tf.where() will be [110]. The implementation of tf.equal() doesn't care that there's a mismatch, so its dynamic shape will be [74]. The implementations of tf.zeros_like() and tf.ones_like() use an optimization that ignores the dynamic shape when the static shape is fully defined, and so their dynamic shapes will be [110], causing the error you are seeing.

The proper fix is to locate the code that is asserting a fixed batch size in your model_fn, and remove it. The optimization and evaluation logic in TensorFlow is robust to variable batch sizes, and this will ensure that all of your data is used in the training and evaluation processes.

A less desirable short-term fix would be to drop the small batch at the end of the data. There are a couple of options here:

- Drop some data randomly at the end of each epoch:
  - With TF 1.8 or later, pass drop_remainder=True to tf.contrib.data.map_and_batch().
  - With TF 1.7 or earlier, use dataset = dataset.filter(lambda features: tf.equal(tf.shape(features[LABEL_COLUMNS[0]])[0], batch_size)) after the map_and_batch.
- Drop the very last batch of data: move the dataset.repeat(NUM_EPOCHS) before the map_and_batch() and then apply one of the two fixes mentioned above.
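To make the trade-off between these options concrete, here is a small pure-Python sketch that only counts batch sizes. It does not call TensorFlow. The function name `batch_sizes`, its parameters, and the dataset size of 404 examples are assumptions chosen so that the last batch has 74 examples, as in the error message.

```python
def batch_sizes(num_examples, batch_size, num_epochs,
                drop_remainder=False, repeat_first=False):
    """Return the size of every batch produced over all epochs (toy model)."""
    if repeat_first:
        # repeat() before batching: one long stream, batches may span epochs.
        streams = [num_examples * num_epochs]
    else:
        # batching before repeat(): each epoch is batched on its own.
        streams = [num_examples] * num_epochs
    sizes = []
    for n in streams:
        full, rest = divmod(n, batch_size)
        sizes.extend([batch_size] * full)
        if rest and not drop_remainder:
            sizes.append(rest)
    return sizes


# Hypothetical numbers: 404 examples, batch size 110, 2 epochs.
default = batch_sizes(404, 110, 2)
per_epoch_drop = batch_sizes(404, 110, 2, drop_remainder=True)
final_drop = batch_sizes(404, 110, 2, drop_remainder=True, repeat_first=True)

print(default)              # a short batch closes each epoch
print(sum(per_epoch_drop))  # examples actually used when dropping per epoch
print(sum(final_drop))      # examples used when only the very last batch is dropped
```

Under these assumptions, dropping the remainder in every epoch discards 74 examples per epoch, while repeating first discards only the tail of the whole run.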

It seems that some operation in your graph (from the error message, likely sparse_softmax_cross_entropy_loss) is expecting a fixed batch size. It may be your code (not part of the input_fn) that is enforcing this (e.g. passing batch_size as the shape of some tensor that is used in an op), or it may be one of the TF libraries.

This is not always a problem per se. However, the documented behavior of tf.data.Dataset.batch is:

> NOTE: If the number of elements (N) in this dataset is not an exact multiple of batch_size, the final batch contain smaller tensors with shape N % batch_size in the batch dimension. If your program depends on the batches having the same shape, consider using the tf.contrib.data.batch_and_drop_remainder transformation instead.

As currently written, your (non-input_fn) code falls into the category of depending on every batch having the same shape.

Your options are to track down where the code is passing through a static batch size, or to "drop the remainder". I believe the former is preferable, but more work.

If you choose the latter, note that you are not actually using tf.data.Dataset.batch, but rather tf.contrib.data.map_and_batch, which accepts a drop_remainder parameter.

### tensorflow summary ops can assign to gpu

Here is part of my code:

    with tf.Graph().as_default(), tf.device('/cpu:0'):
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0))
        writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
        with tf.device('/gpu:0'):
            tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
            summary_op = tf.summary.merge_all()

When I run it, I get the following error:

    InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'learning_rate': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
    [[Node: learning_rate = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](learning_rate/tags, learning_rate/values)]]

If I move these 2 ops into the tf.device("/cpu:0") device scope, it works:

    tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
    summary_op = tf.summary.merge_all()

I googled it, and there are many suggestions about using "allow_soft_placement=True". But I think that solution basically just changes the device scope automatically. So my question is: why can't these 2 ops be assigned to the GPU? Is there any documentation I can look at to figure out which ops can or cannot be assigned to a GPU? Any suggestion is welcome.

You can't assign a summary operation to a GPU because it would be meaningless. In short, a GPU executes parallel operations. A summary is nothing but a file to which you append new lines every time you write to it. It's a sequential operation that has nothing in common with the operations that GPUs are designed to perform.

Your error says it all: "Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available." That operation (in the TensorFlow version you're using) has no GPU implementation and thus must be sent to a CPU device.

### How to restore a saved variable in tensorflow?

I am trying to restore a saved variable in tensorflow. It seems to be very, very complicated. I use the alexnet implementation from http://www.cs.toronto.edu/~guerzhoy/tf_alexnet/. In a Python file, alexnet.py, I define the variable:

    conv5W = tf.Variable(net_data["conv5"][0], name='conv5w')

Then I fine-tune the model and I see that some of its values have changed. I save the fine-tuned model by typing:

    saver = tf.train.Saver()
    saver.save(sess, "modelname.ckpt")

After that, I open a new ipython console and run:

    from alexnet import *
    sess = tf.InteractiveSession()
    new_saver = tf.train.import_meta_graph("modelname.ckpt.meta")
    new_saver.restore(sess, "modelname.ckpt")

After that, when I try to retrieve the values of the variables with:

    conv5W.eval(session=sess)

it yields:

    FailedPreconditionError: Attempting to use uninitialized value conv5w
    [[Node: conv5w/_98 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_4_conv5w", _device="/job:localhost/replica:0/task:0/gpu:0"](conv5w)]]
    [[Node: conv5w/_99 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_4_conv5w", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

On the other hand, if I initialize the variables with:

    init = tf.initialize_all_variables()
    sess.run([init])

this time it yields the initial values in net_data["conv5"][0], not the fine-tuned ones.

Restoring from the meta graph prepares the graph, not the data. Restoring data requires adding, at training time, the values you want to restore to collection objects, and reloading these collections at restore time. The official tutorial shows how (in fact, there is another way, see below).

Another way would be to restore the graph (tf.train.write_graph and tf.import_graph_def), then restore all variables from a checkpoint. The official tutorials seem to lean more toward this checkpoint approach (see link above). The meta graph is aimed more at distributed processing, which requires more work and care.

Eric has answered most of your points. I faced a similar problem, and a simple workaround is:

1. Re-load either the entire graph or import its meta graph (the former is recommended if you are a newbie). You still haven't run the restore function at this point.
2. Start your session and initialize all variables.
3. Restore the checkpoint (using tf.train.Saver).

The issue in your case is that when you run tf.initialize_all_variables() after restoring, TensorFlow resets the variables to their initial values and you lose your fine-tuned weights.
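Why the order of initializing and restoring matters can be seen in a toy model written in plain Python. This is only an illustration: `ToyVariableStore` is a hypothetical stand-in for a session's variable values, not a TensorFlow API, and the numbers are made up.

```python
class ToyVariableStore:
    """Hypothetical stand-in for the variable values held by a session."""

    def __init__(self, initial_values):
        self._initial = dict(initial_values)
        self.values = {}  # empty means "uninitialized"

    def initialize_all(self):
        # Mimics running the init op: every variable gets its initial value.
        self.values = dict(self._initial)

    def restore(self, checkpoint):
        # Mimics restoring a checkpoint: saved values overwrite current ones.
        self.values.update(checkpoint)


initial = {"conv5w": 0.10}     # made-up value standing in for net_data
checkpoint = {"conv5w": 0.42}  # made-up fine-tuned value

# Recommended order: initialize first, then restore.
good = ToyVariableStore(initial)
good.initialize_all()
good.restore(checkpoint)

# Problematic order: restore first, then initialize (weights are reset).
bad = ToyVariableStore(initial)
bad.restore(checkpoint)
bad.initialize_all()

print(good.values["conv5w"], bad.values["conv5w"])
```

In the second ordering the restored value is overwritten, which matches the symptom described in the question.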

### Why does setting an initialization value prevent placing a variable on a GPU in TensorFlow?

I get an exception when I try to run the following very simple TensorFlow code, although I virtually copied it from the documentation:

    import tensorflow as tf
    with tf.device("/gpu:0"):
        x = tf.Variable(0, name="x")
    sess = tf.Session()
    sess.run(x.initializer)  # Bombs!

The exception is:

    tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'x': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

If I change the variable's initial value to tf.zeros([1]) instead, everything works fine:

    import tensorflow as tf
    with tf.device("/gpu:0"):
        x = tf.Variable(tf.zeros([1]), name="x")
    sess = tf.Session()
    sess.run(x.initializer)  # Works fine

Any idea what's going on?

This error arises because tf.Variable(0, ...) defines a variable of element type tf.int32, and there is no kernel that implements int32 variables on GPU in the standard TensorFlow distribution. When you use tf.Variable(tf.zeros([1])), you're defining a variable of element type tf.float32, which is supported on GPU. The story of tf.int32 on GPUs in TensorFlow is a long one. While it's technically easy to support integer operations running on a GPU, our experience has been that most integer operations actually take place on the metadata of tensors, and this metadata lives on the CPU, so it's more efficient to operate on it there. As a short-term workaround, several kernel registrations for int32 on GPUs were removed. However, if these would be useful for your models, it would be possible to add them as custom ops. Source: In TensorFlow 0.10, the Variable-related kernels are registered using the TF_CALL_GPU_NUMBER_TYPES() macro. The current "GPU number types" are tf.float16, tf.float32, and tf.float64.

### RNN model running out of memory in TensorFlow

I implemented a Sequence to Sequence model using the rnn.rnn helper in TensorFlow:

    with tf.variable_scope("rnn") as scope, tf.device("/gpu:0"):
        cell = tf.nn.rnn_cell.BasicLSTMCell(4096)
        lstm = tf.nn.rnn_cell.MultiRNNCell([cell] * 2)
        _, cell = rnn.rnn(lstm, input_vectors, dtype=tf.float32)
        tf.get_variable_scope().reuse_variables()
        lstm_outputs, _ = rnn.rnn(lstm, output_vectors, initial_state=cell)

The model is running out of memory on a Titan X with 16 GB of memory while allocating gradients for the LSTM cells:

    W tensorflow/core/kernels/matmul_op.cc:158] Resource exhausted: OOM when allocating tensor with shape[8192,16384]
    W tensorflow/core/common_runtime/executor.cc:1102] 0x2b42f00 Compute status: Resource exhausted: OOM when allocating tensor with shape[8192,16384]
    [[Node: gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/concat, gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/add_grad/tuple/control_dependency)]]

If I reduce the length of the input and output sequences to 4 or less, the model runs without a problem. This indicates to me that TF is trying to allocate the gradients for all time steps at the same time. Is there a way of avoiding this?

The function tf.gradients, as well as the minimize method of the optimizers, allows you to set a parameter called aggregation_method. The default value is ADD_N. This method constructs the graph in such a way that all gradients need to be computed at the same time. There are two other undocumented methods, tf.AggregationMethod.EXPERIMENTAL_TREE and tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N, which do not have this requirement.
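The difference between the two strategies can be illustrated with a toy model in plain Python. This is not how TensorFlow implements aggregation. It only mirrors the description above: summing everything at once needs every per-step gradient alive simultaneously, while accumulating needs only a running total plus the current gradient. All function names and the fake gradient are assumptions made for the sketch.

```python
def add_n_style(grad_fn, steps):
    """Materialize every per-step gradient first, then add them in one go."""
    pending = [grad_fn(t) for t in range(steps)]
    peak_buffers = len(pending)  # all gradients are alive at the same time
    total = [sum(parts) for parts in zip(*pending)]
    return total, peak_buffers


def accumulate_style(grad_fn, steps):
    """Fold each gradient into a running total as soon as it is available."""
    total = list(grad_fn(0))
    peak_buffers = 1
    for t in range(1, steps):
        grad = grad_fn(t)
        peak_buffers = max(peak_buffers, 2)  # running total + current gradient
        total = [a + b for a, b in zip(total, grad)]
    return total, peak_buffers


def fake_grad(t):
    # Hypothetical gradient of a 3-element weight vector at time step t.
    return [float(t), 1.0, 2.0 * t]


steps = 8
sum_a, peak_a = add_n_style(fake_grad, steps)
sum_b, peak_b = accumulate_style(fake_grad, steps)
print(sum_a == sum_b, peak_a, peak_b)
```

Both strategies give the same sum; only the number of buffers held at once differs. In TF 1.x the corresponding switch would be the aggregation_method argument mentioned above, for example passing tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N to the optimizer's minimize method.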