Evaluating Hyperparameter sweeps with W&B

Sweeps let us try many different hyperparameter configurations with our model and see which combination performs best.

We know how sweeps work on a fundamental level. Now let's use them with a real model and check the results.


Start out by installing the experiment tracking library and setting up your free W&B account:

  • pip install wandb – Install the W&B library
  • import wandb – Import the wandb library
# WandB – Install the W&B library
%pip install wandb -q
import wandb
Explore The Simpsons Dataset

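import glob
import cv2
import matplotlib.pyplot as plt

# Preview the first 15 images from the Kaggle test set
characters = glob.glob('simpsons-dataset/kaggle_simpson_testset/kaggle_simpson_testset/**')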
plt.subplots_adjust(wspace=0, hspace=0.1)
i = 0
for character in characters[:15]:
    img = cv2.imread(character)
    img = cv2.resize(img, (250, 250))
    plt.subplot(3, 5, i+1)
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    i += 1

# Define the labels for the Simpsons characters we're detecting
character_names = {0: 'abraham_grampa_simpson', 1: 'apu_nahasapeemapetilon', 2: 'bart_simpson', 
        3: 'charles_montgomery_burns', 4: 'chief_wiggum', 5: 'comic_book_guy', 6: 'edna_krabappel', 
        7: 'homer_simpson', 8: 'kent_brockman', 9: 'krusty_the_clown', 10: 'lenny_leonard', 11:'lisa_simpson',
        12: 'marge_simpson', 13: 'mayor_quimby',14:'milhouse_van_houten', 15: 'moe_szyslak', 
        16: 'ned_flanders', 17: 'nelson_muntz', 18: 'principal_skinner', 19: 'sideshow_bob'}
import os
import imageio
import numpy as np

img_size = 64
num_classes = 20
data_dir = "simpsons-dataset/simpsons_dataset/simpsons_dataset"

# Load training data, recording each image's integer label
X_train = []
y_train = []
for label, name in character_names.items():
    list_images = os.listdir(data_dir+'/'+name)
    for image_name in list_images:
        image = imageio.imread(data_dir+'/'+name+'/'+image_name)
        X_train.append(cv2.resize(image, (img_size,img_size)))
        y_train.append(label)
X_train = np.array(X_train)
y_train = np.array(y_train)

# Hold out the last 100 samples for validation
X_test = X_train[-100:] 
y_test = y_train[-100:]

X_train = X_train[:-100] 
y_train = y_train[:-100]

# Normalize the data
X_train = X_train / 255.0
X_test = X_test / 255.0

# One-hot encode the integer labels (np_utils comes from keras.utils in older Keras releases)
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)
len(X_train), len(y_train), len(X_test), len(y_test)
(19448, 19448, 100, 100)
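For reference, here is a minimal pure-Python sketch of what one-hot encoding does to the labels, and how argmax turns a row back into a label (the plotting code relies on that). The helper names `one_hot` and `decode` are hypothetical stand-ins for illustration, not Keras functions.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


def decode(row):
    """Recover the integer label: the index of the largest entry (argmax)."""
    return max(range(len(row)), key=row.__getitem__)


encoded = one_hot([0, 2, 19], num_classes=20)
decoded = [decode(row) for row in encoded]
print(decoded)
```

Each encoded row has exactly one non-zero entry, so decoding with argmax is lossless.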
plt.subplots_adjust(wspace=0, hspace=0.1)
p = 1
for i in range(0, len(X_train), len(X_train)//14):
    img = X_train[i]
    label = character_names[y_train[i].argmax(0)]
    img = cv2.resize(img, (250, 250))
    plt.subplot(3, 5, p)
    plt.imshow(img)
    plt.title(label)
    p += 1

Run A Sweep

I ran a 32-run hyperparameter sweep in Weights & Biases; you can view the report here:

Here you can see the Bayesian search gradually improving its prediction of which combination of hyperparameters to try next.

Here are all the hyperparameters laid out in an (interactive) visualisation.

Filtering to the hyperparameters that yielded >80% accuracy (with >90% highlighted) shows which ranges are worth focusing on if further sweeps are required.
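For readers who want to set up something similar, here is a rough sketch of what a Bayesian sweep configuration can look like. The parameter names and ranges are assumptions made up for illustration; the actual search space of this sweep is not shown in this notebook.

```python
# Hypothetical sweep configuration: parameter names and ranges are illustrative only.
sweep_config = {
    "method": "bayes",  # Bayesian search, the strategy described above
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.0001, "max": 0.01},
        "dropout": {"values": [0.2, 0.3, 0.4, 0.5]},
        "batch_size": {"values": [32, 64, 128]},
    },
}

# With a logged-in W&B account the sweep could then be registered and run.
# `train` is a hypothetical training function; count mirrors the 32 runs above.
# sweep_id = wandb.sweep(sweep_config, project="simpsons")
# wandb.agent(sweep_id, function=train, count=32)
```

The metric name should match whatever the training code logs, otherwise the Bayesian search has nothing to optimise.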

Retrieve the best model from the W&B sweep

We'll rank the runs in the sweep by the val_accuracy in their summary and fetch the saved model from the top run.

entity = 'sweep'
project = 'simpsons'
sweep_id = "uqg7jmld"

api = wandb.Api()
sweep = api.sweep(entity + "/" + project + "/" + sweep_id)
runs = sorted(sweep.runs, key=lambda run: run.summary.get("val_accuracy", 0), reverse=True)
val_acc = runs[0].summary.get("val_accuracy", 0)
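print(f"Best run {runs[0].name} with {val_acc * 100}% validation accuracy")

# Download the saved model file from the best run
runs[0].file("model-best.h5").download(replace=True)
print("Best model saved to model-best.h5")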
Best run gallant-sweep-26 with 99.00000095367432% validation accuracy
Best model saved to model-best.h5
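The ranking step is worth a closer look: using `.get("val_accuracy", 0)` means runs that crashed before logging the metric sort to the bottom instead of raising an error. Here is a small sketch using plain dictionaries as stand-ins for run objects; the run names and values are made up.

```python
# Stand-ins for run summaries; names and values are made up for illustration.
runs = [
    {"name": "run-a", "summary": {"val_accuracy": 0.91}},
    {"name": "run-b", "summary": {}},  # crashed before logging a metric
    {"name": "run-c", "summary": {"val_accuracy": 0.97}},
]

# Missing metrics default to 0, so incomplete runs end up last.
ranked = sorted(runs, key=lambda run: run["summary"].get("val_accuracy", 0), reverse=True)
best = ranked[0]
print(f"Best run {best['name']} with {best['summary']['val_accuracy']:.2%} validation accuracy")
```

Formatting with `:.2%` also avoids the long float tail you get from multiplying a float32 accuracy by 100.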

Load the model

import tensorflow as tf

# Recreate the exact same model, including its weights and the optimizer
model = tf.keras.models.load_model('model-best.h5')

# Show the model architecture
model.summary()

Model: "sequential"
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 64, 64, 32)        896       
batch_normalization (BatchNo (None, 64, 64, 32)        128       
dropout (Dropout)            (None, 64, 64, 32)        0         
conv2d_1 (Conv2D)            (None, 64, 64, 64)        18496     
batch_normalization_1 (Batch (None, 64, 64, 64)        256       
max_pooling2d (MaxPooling2D) (None, 32, 32, 64)        0         
conv2d_2 (Conv2D)            (None, 32, 32, 128)       73856     
batch_normalization_2 (Batch (None, 32, 32, 128)       512       
dropout_1 (Dropout)          (None, 32, 32, 128)       0         
conv2d_3 (Conv2D)            (None, 32, 32, 128)       147584    
batch_normalization_3 (Batch (None, 32, 32, 128)       512       
max_pooling2d_1 (MaxPooling2 (None, 16, 16, 128)       0         
conv2d_4 (Conv2D)            (None, 16, 16, 256)       295168    
batch_normalization_4 (Batch (None, 16, 16, 256)       1024      
dropout_2 (Dropout)          (None, 16, 16, 256)       0         
conv2d_5 (Conv2D)            (None, 16, 16, 256)       590080    
batch_normalization_5 (Batch (None, 16, 16, 256)       1024      
max_pooling2d_2 (MaxPooling2 (None, 8, 8, 256)         0         
conv2d_6 (Conv2D)            (None, 8, 8, 512)         1180160   
batch_normalization_6 (Batch (None, 8, 8, 512)         2048      
dropout_3 (Dropout)          (None, 8, 8, 512)         0         
conv2d_7 (Conv2D)            (None, 8, 8, 512)         2359808   
batch_normalization_7 (Batch (None, 8, 8, 512)         2048      
max_pooling2d_3 (MaxPooling2 (None, 4, 4, 512)         0         
conv2d_8 (Conv2D)            (None, 4, 4, 1024)        4719616   
batch_normalization_8 (Batch (None, 4, 4, 1024)        4096      
dropout_4 (Dropout)          (None, 4, 4, 1024)        0         
conv2d_9 (Conv2D)            (None, 4, 4, 1024)        9438208   
batch_normalization_9 (Batch (None, 4, 4, 1024)        4096      
max_pooling2d_4 (MaxPooling2 (None, 2, 2, 1024)        0         
flatten (Flatten)            (None, 4096)              0         
dense (Dense)                (None, 512)               2097664   
batch_normalization_10 (Batc (None, 512)               2048      
dropout_5 (Dropout)          (None, 512)               0         
dense_1 (Dense)              (None, 20)                10260     
Total params: 20,949,588
Trainable params: 20,940,692
Non-trainable params: 8,896
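The parameter counts in the summary can be checked by hand. The sketch below assumes 3x3 convolution kernels and RGB input, which is inferred from the counts themselves (896 = (3*3*3 + 1) * 32) and not stated in the notebook; the helper functions are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a Dense layer."""
    return (in_units + 1) * out_units


first_conv = conv2d_params(3, 3, 32)      # conv2d
second_conv = conv2d_params(3, 32, 64)    # conv2d_1
first_dense = dense_params(4096, 512)     # dense
output_dense = dense_params(512, 20)      # dense_1

# Each BatchNormalization layer has 4 values per channel; the moving mean
# and moving variance (2 per channel) are the non-trainable ones.
bn_channels = [32, 64, 128, 128, 256, 256, 512, 512, 1024, 1024, 512]
non_trainable = sum(2 * c for c in bn_channels)

print(first_conv, second_conv, first_dense, output_dense, non_trainable)
```

These reproduce the 896, 18496, 2097664, 10260 and 8,896 figures in the summary above.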

Make some predictions with the model

font = cv2.FONT_HERSHEY_SIMPLEX  # assumption: the original font definition is not shown

def get_prediction(x, y):

  # Resize image and normalize it
  pic = cv2.resize(x, (64, 64)).astype('float32')
  if pic.max() > 1.: pic = pic / 255.
  # Get predictions for the character
  prediction = model.predict(pic.reshape(1, 64, 64, 3))[0]

  # Get true name of the character
  character = character_names[y]
  name = character.split('_')[0].title()
  # Format predictions to string to overlay on image
  text = sorted(['{:s} : {:.1f}%'.format(character_names[k].split('_')[0].title(), 100*v) for k,v in enumerate(prediction)], 
      key=lambda x:float(x.split(':')[1].split('%')[0]), reverse=True)[:3]

  # Upscale original image (expecting a 0-255 range here)
  img = cv2.resize(x, (352, 352))
  if np.issubdtype(img.dtype, np.floating): img = (img * 255).astype('uint8')
  # Create background to overlay text on
  cv2.rectangle(img, (0,260),(215,352),(255,255,255), -1)
  # Add text to image
  cv2.putText(img, 'Name : %s' % name, (10, 280), font, 0.7,(73,79,183), 2, cv2.LINE_AA)
  for k, t in enumerate(text):
      color = (10,100,10) if name in t else (80,0,0)
      cv2.putText(img, t, (10, 300+k*18), font, 0.65, color, 2, cv2.LINE_AA)
  title = "%s: %s" % (name, text[0])

  return img, title
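The top-3 text is built by formatting every prediction to a string and then parsing the percentage back out to sort. A possibly simpler alternative, sketched here with a hypothetical `top_k` helper and made-up probabilities, is to sort the raw probabilities first and format afterwards.

```python
def top_k(probabilities, class_names, k=3):
    """Return the k most likely (short name, percent) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(class_names[i].split('_')[0].title(), 100 * p) for i, p in ranked[:k]]


names = {0: 'bart_simpson', 1: 'homer_simpson', 2: 'lisa_simpson', 3: 'marge_simpson'}
probs = [0.10, 0.65, 0.05, 0.20]  # made-up softmax output

top = top_k(probs, names)
for short_name, percent in top:
    print('{:s} : {:.1f}%'.format(short_name, percent))
```

The output format matches the strings used in get_prediction, so this could slot in without changing the overlay code.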

Visualisation of performance against the validation data

This should be very good, since this data was used during training to measure each model against the others. The best run had 99% validation accuracy, so anything less than that here would point to a problem.

plt.subplots_adjust(wspace=0, hspace=0.1)
p = 1
for i in range(0, len(X_test), len(X_test) // 9):
    plt.subplot(2, 5, p)
    p += 1

    x = X_test[i]
    y = y_test[i].argmax()

    (img, label) = get_prediction(x, y)

    plt.imshow( img )
    plt.title( label )

The predictions are all great. Very confident and correct.

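But there is a problem here. While the predictions are all great, it looks like the validation data contains only pictures of Sideshow Bob. The training images were loaded one character at a time and the last 100 were held out, so every validation image belongs to the final class, sideshow_bob.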
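The cause is easy to reproduce. The loading loop appends images one character at a time, so slicing off the last 100 samples can only ever pick up the final class. A quick sketch with made-up class sizes (not the real dataset counts):

```python
from collections import Counter

# Made-up class sizes standing in for the real dataset; labels are appended
# class by class, the same way the loading loop builds them.
class_sizes = {0: 300, 1: 250, 2: 400}
labels = [label for label, size in class_sizes.items() for _ in range(size)]

holdout = labels[-100:]   # same slicing as X_test / y_test
counts = Counter(holdout)
print(counts)             # only the last class shows up
```

On the real arrays, something like `np.bincount(y_test.argmax(1), minlength=num_classes)` should give the same kind of per-class count.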

This needs to be fixed, but first: how will the model perform against a dataset it has never seen before?

Visualize predictions against the Kaggle test set

This dataset was never used during training and so provides a view of the model performance under exam conditions.

def predict_test():
  predicted_images = []
  for i in range(20):
    character = character_names[i]
    # Read in a character image from the test dataset
    image = cv2.imread(np.random.choice([k for k in glob.glob('simpsons-dataset/kaggle_simpson_testset/kaggle_simpson_testset/*.*') if character in k]))
    # print(image.shape)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    (img, title) = get_prediction(image, i)
    predicted_images.append((img, title))     

  return predicted_images
predicted = predict_test()

plt.subplots_adjust(wspace=0, hspace=0)
p = 1
for i in range(0, len(predicted)):
    img = predicted[i][0]
    label = predicted[i][1]
    img = cv2.resize(img, (250, 250))
    plt.subplot(4, 5, p)
    plt.imshow(img)
    plt.title(label)
    p += 1

Not bad considering the flaw in our methodology. I've run this a few times now and we occasionally miss one or two in each batch of 20. It's probably around 90% at a guess.

Let's check the whole Kaggle test set.

Evaluate entire test set

testset = glob.glob('simpsons-dataset/kaggle_simpson_testset/kaggle_simpson_testset/*.jpg')
img_size = 64
x_testset = []
y_testset = []

for i in range(len(testset)):
    path = testset[i]
    image = imageio.imread(path)
    image = cv2.resize(image, (img_size,img_size))
    if image.shape != (img_size, img_size, 3): continue

    filename = path.split('testset/')[-1]
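    names = [k for k, v in character_names.items() if v in filename]
    if not names: continue

    # Keep the image and the label matched from its filename
    x_testset.append(image)
    y_testset.append(names[0])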

x_testset = np.array(x_testset)
y_testset = np.array(y_testset)

# Normalise image data
x_testset = x_testset / 255

print('Making model predictions for %d images' % len(x_testset))
%time prediction = model.predict(x_testset)
predictions = prediction.argmax(1)

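print('\nCompare first 24 predictions:')
print(predictions[:24])  # order assumed: predicted labels first, then true labels
print(y_testset[:24])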

arr = predictions == y_testset
print('\nAccuracy', np.sum(arr), '/', len(arr))
print(np.sum(arr) / len(arr) * 100, '%')
Making model predictions for 938 images
CPU times: user 1min, sys: 222 ms, total: 1min 1s
Wall time: 31.1 s

Compare first 24 predictions:
[11 17 10 19  4 15  9 17  4 19 15  8  3  0  1 17 14 17 11  8  4  1 15  6]
[11 17 10 19  4 15  9 10  4  0 15  8  3  0  1 10 14 17 11  8  4  1 15  6]

Accuracy 806 / 938
85.9275053304904 %

Confusion Matrix

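from sklearn.metrics import confusion_matrix

# plot_conf_matrix is a plotting helper whose definition is not shown here
confusion = confusion_matrix(y_testset, predictions)
labels = list(character_names.values())
plot_conf_matrix(confusion, labels, "Confusion Matrix for best model in W&B sweep")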
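In case the matrix itself is unfamiliar: each row is a true label, each column a predicted label, and the diagonal holds the correct predictions. Here is a tiny pure-Python sketch of the same counting, using made-up labels and hypothetical helper names, plus the per-class recall you can read off each row.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


y_true = [0, 0, 1, 1, 2, 2, 2]   # made-up labels
y_pred = [0, 1, 1, 1, 2, 0, 2]   # made-up predictions
matrix = confusion_counts(y_true, y_pred, num_classes=3)
recall = per_class_recall(matrix)
print(matrix)
print(recall)
```

Reading the rows this way shows which characters the model confuses with each other, which a single accuracy figure hides.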


About 86% accuracy (806 / 938) across the test images.

Not bad, but could be improved.

We have some paths to make this better:

  1. Correct the flaw in the validation data.
  2. Run a broad sweep, then a narrow one.
  3. Evaluate again.
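For the first step, one option is a shuffled, stratified hold-out, so that every character contributes to the validation set instead of only the last one loaded. This is a rough sketch, not what the notebook does today: `stratified_split` is a hypothetical helper, and the class sizes and 10% fraction are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.1, seed=0):
    """Return (train_indices, val_indices); each class gives roughly
    holdout_fraction of its samples (at least one) to validation."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = max(1, int(len(indices) * holdout_fraction))
        val_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx


labels = [0] * 300 + [1] * 250 + [2] * 400   # made-up, sorted by class
train_idx, val_idx = stratified_split(labels)
print(sorted({labels[i] for i in val_idx}))
```

With NumPy arrays the index lists can be used directly, e.g. `X_train[val_idx]`. scikit-learn's `train_test_split` with its `stratify` argument is an existing alternative to writing this by hand.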