Of course fit and fit_generator will rarely be equivalent on large datasets. This will happen with any DL framework, not just Keras.
The real issue is that deep learning only behaves as expected when training and testing data come from similar distributions. That is almost always the case for small datasets, but in large ones the two distributions can diverge quite dramatically.
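One rough way to check whether training and testing data come from similar distributions is to compare the mean of a feature in each split, measured in units of the training standard deviation. The sketch below is only an illustration under stated assumptions: the helper name standardized_mean_shift and the synthetic Gaussian data are made up for this example and are not part of Keras or any other framework.

```python
import random
import statistics


def standardized_mean_shift(train_col, test_col):
    """Absolute difference of the two means, in units of the training std dev.

    Hypothetical helper for illustration only; not a Keras API.
    """
    spread = statistics.pstdev(train_col)
    if spread == 0:
        raise ValueError("training column is constant, shift is undefined")
    return abs(statistics.fmean(train_col) - statistics.fmean(test_col)) / spread


# Synthetic data (an assumption for the demo): one feature, three samples.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
test_similar = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # same distribution
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]  # mean moved by 1.5

shift_similar = standardized_mean_shift(train, test_similar)
shift_shifted = standardized_mean_shift(train, test_shifted)

print(f"similar split: {shift_similar:.3f}")
print(f"shifted split: {shift_shifted:.3f}")
```

A value near zero suggests the two splits agree on that feature, while a value around one or more means the test data sits roughly a full standard deviation away from what the model was trained on. This only looks at one feature's mean, so it can miss other kinds of divergence.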