Distributed training with tensorflow doesn't work with Keras 3.7 and above #21172

Open
DarrenR96 opened this issue Apr 16, 2025 · 4 comments
Labels: keras-team-review-pending (Pending review by a Keras team member.), type:Bug

Comments

@DarrenR96

It seems that distributed training with the TensorFlow backend doesn't work with Keras 3.7 and above.

https://stackoverflow.com/questions/79285532/multi-gpu-training-in-tensorflow-results-in-nans?noredirect=1#comment140127186_79285532

@dhantule
Contributor

Hi @DarrenR96, thanks for reporting this.
Could you please refer to this issue and test your code with keras-nightly?
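
When retesting against keras-nightly, it helps to confirm which packages are actually installed in the environment before rerunning. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names listed are simply the ones mentioned in this thread.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from this thread; adjust for your own setup.
    report = installed_versions(["tensorflow", "keras", "keras-nightly"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed versions into a follow-up comment makes it easier to tell which Keras build a given result came from.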

@willianck

willianck commented Apr 30, 2025

I also hit a problem with the Keras version when running the following dummy example for distributed training. I am currently using tensorflow[and-cuda] 2.18.0, which installs Keras 3.9.2:

import tensorflow as tf
import keras

def create_dataset():
    float_data = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
    string_data = tf.constant([["foo", "bar"], ["baz", "qux"]], dtype=tf.string)
    labels = tf.constant([[1], [0]], dtype=tf.float32)
    
    dataset = tf.data.Dataset.from_tensor_slices(((float_data, string_data), labels))
    return dataset

def create_model():
    input_float = keras.Input(shape=(2,), dtype=tf.float32, name='float_input')
    input_string = keras.Input(shape=(2,), dtype=tf.string, name='string_input')
    
    string_lookup = keras.layers.StringLookup(vocabulary=["foo", "bar", "baz", "qux"], name='string_lookup')
    string_embedding = string_lookup(input_string)
    
    concatenated = keras.layers.Concatenate(name='concatenate')([input_float, string_embedding])
    
    dense = keras.layers.Dense(10, activation='relu', name='dense_1')(concatenated)
    output = keras.layers.Dense(1, activation='sigmoid', name='output')(dense)
    
    model = keras.Model(inputs=[input_float, input_string], outputs=output, name='simple_model')
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

def main():

    print("Multiple GPUs strategy")

    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

    n_gpu: int = len(tf.config.list_physical_devices('GPU'))
    n_replicas: int = strategy.num_replicas_in_sync
 
    print(f'GPU: {n_gpu}')
    print(f'Replicas: {n_replicas}')
    
    dataset = create_dataset()
    
    with strategy.scope():
        model = create_model()
        
    model.fit(dataset.batch(2), epochs=5)

if __name__ == "__main__":
    main()

I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: float, double, int32, uint8, int16, int8, complex64, int64, qint8, quint8, qint32, bfloat16, qint16, quint16, uint16, complex128, half, uint32, uint64, variant
        ; NodeDef: {{node AddN}}; Op<name=AddN; signature=inputs:N*T -> sum:T; attr=N:int,min=1; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, DT_INT8, DT_COMPLEX64, DT_INT64, DT_QINT8, DT_QUINT8, DT_QINT32, DT_BFLOAT16, DT_QINT16, DT_QUINT16, DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64, DT_VARIANT]; is_commutative=true; is_aggregate=true> [Op:AddN] name: 

I was able to make it work by downgrading to tensorflow[and-cuda]==2.17.0 and keras==3.4.1.
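
For anyone pinning versions as a workaround, a small guard like the one below can flag an environment that falls in the range reported in this thread. This is only a sketch: the 3.7 threshold comes from the issue title, not from a confirmed root cause, and `is_reported_affected` is a hypothetical helper name. Versions between 3.4.1 and 3.7 are not covered by any report here, so the function treats them as not flagged.

```python
import re
from typing import Tuple

# Threshold taken from the issue title ("Keras 3.7 and above").
# It is an assumption based on user reports, not a confirmed boundary.
REPORTED_FIRST_BAD: Tuple[int, int] = (3, 7)


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.9.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_reported_affected(keras_version: str) -> bool:
    """Return True if the version is in the range reported as failing."""
    # Tuple comparison orders (3, 10) after (3, 7), unlike string comparison.
    return major_minor(keras_version) >= REPORTED_FIRST_BAD


if __name__ == "__main__":
    for candidate in ["3.4.1", "3.9.2", "3.10.0"]:
        print(candidate, is_reported_affected(candidate))
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would sort "3.10" before "3.7".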

A similar issue was posted on the tensorflow repo, but it was closed, seemingly without being fixed.

@coldhearti

I ran into this issue after upgrading to 3.10. I am running mirrored strategy with 4 GPUs, and the loss goes to NaN almost immediately.
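
To report exactly how early the loss diverges, the per-step losses can be logged and scanned for the first NaN. The sketch below assumes the losses have already been collected into a plain Python sequence (for example from a custom callback); `first_nan_step` is a hypothetical helper, not part of Keras. Keras also ships a `keras.callbacks.TerminateOnNaN` callback that stops training once a NaN loss appears, which may be useful alongside this.

```python
import math
from typing import Iterable, Optional


def first_nan_step(losses: Iterable[float]) -> Optional[int]:
    """Return the zero-based index of the first NaN loss, or None if there is none."""
    for step, loss in enumerate(losses):
        if math.isnan(loss):
            return step
    return None


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from this issue.
    logged = [0.71, 0.69, float("nan"), float("nan")]
    print(first_nan_step(logged))
```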

@dhantule added the keras-team-review-pending label May 28, 2025
@divyashreepathihalli removed the keras-team-review-pending label May 29, 2025
@divyashreepathihalli
Collaborator

Tagging @amitsrivastava78 to take a look

@divyashreepathihalli added the keras-team-review-pending label May 30, 2025