Higgs Dataset Signal/Background Detection Model
This blog post summarizes a mini-project on the HIGGS dataset from the UCI Machine Learning Repository. The task is to classify each example as Higgs signal or background noise.
- Downloading the Dataset:
- Processing the Data:
- MODEL 1: Dataset with 100,000 labelled examples
- MODEL 2: Dataset with 500,000 labelled examples
- MODEL 3: Dataset with 1 Million labelled examples
- MODEL 4: Dataset with 1 Million Examples using Normalized Data
- References:
The HIGGS dataset is one of the largest datasets of labelled examples indicating whether an event is an actual Higgs signal or background. This is a classification problem: distinguish between a signal process, which produces Higgs bosons, and a background process, which does not.
Source of the dataset: Daniel Whiteson daniel '@' uci.edu, Assistant Professor, Physics & Astronomy, Univ. of California Irvine
Downloading the Dataset:
First, we are going to download the dataset from the UCI Machine Learning Repository. Since it comes as a .gz file (gzip-compressed), we decompress it to get the data ready for processing.
Data Set Information:
The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
!gzip -d HIGGS.csv.gz
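pandas can also read a gzip-compressed CSV directly, which makes the separate decompression step optional and lets us preview a few rows before loading the full file. The sketch below is a minimal illustration on a tiny stand-in file; the helper name `preview_gzip_csv`, the file name, and the three demo columns are hypothetical and not part of the original notebook.

```python
import gzip
import os
import tempfile

import pandas as pd


def preview_gzip_csv(path, column_names, n_rows=5):
    """Read only the first n_rows of a headerless, gzip-compressed CSV."""
    # compression="gzip" is spelled out here; pandas would also infer it
    # from the ".gz" file extension.
    return pd.read_csv(path, header=None, names=column_names,
                       compression="gzip", nrows=n_rows)


# Tiny stand-in file (hypothetical data, NOT the real HIGGS.csv.gz).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(demo_path, "wt") as f:
    f.write("1.0,0.5,0.1\n0.0,0.7,0.2\n1.0,0.9,0.3\n")

preview = preview_gzip_csv(demo_path, ["labels", "feat_a", "feat_b"], n_rows=2)
print(preview)
```

With the real file, the same call would take the path to HIGGS.csv.gz and the full column list, assuming the download succeeded.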
Processing the Data:
import pandas as pd
columns = ["labels","lepton pT", "lepton eta", "lepton phi", "missing energy magnitude", "missing energy phi", "jet 1 pt", "jet 1 eta", "jet 1 phi", "jet 1 b-tag", "jet 2 pt", "jet 2 eta", "jet 2 phi", "jet 2 b-tag", "jet 3 pt", "jet 3 eta", "jet 3 phi", "jet 3 b-tag", "jet 4 pt", "jet 4 eta", "jet 4 phi", "jet 4 b-tag", "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]
higgs_df = pd.read_csv("/content/HIGGS.csv", header= None, names = columns)
higgs_df
y = higgs_df["labels"]
y
X = higgs_df.drop(columns = 'labels')
X
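The data set description says the first 21 features are low-level kinematic measurements and the last seven are high-level features derived from them. If we later want to train on each group separately, a positional split is enough. This is only a sketch: the helper name `split_feature_groups` and the synthetic stand-in for X are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

N_LOW_LEVEL = 21   # kinematic features, per the data set description
N_HIGH_LEVEL = 7   # derived features, per the data set description


def split_feature_groups(features):
    """Split a feature DataFrame (label column already dropped) by position."""
    if features.shape[1] != N_LOW_LEVEL + N_HIGH_LEVEL:
        raise ValueError("expected 28 feature columns")
    low_level = features.iloc[:, :N_LOW_LEVEL]
    high_level = features.iloc[:, N_LOW_LEVEL:]
    return low_level, high_level


# Synthetic stand-in for X (hypothetical values and column names).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(4, 28)),
                      columns=[f"f{i}" for i in range(28)])
low, high = split_feature_groups(demo_X)
print(low.shape, high.shape)
```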
MODEL 1: Dataset with 100,000 labelled examples
Instead of using the entire dataset, which is very large, we first try 100,000 examples to get a sense of how our model is likely to behave. Rather than taking the rows in their original order, we use the sample method from pandas to randomly draw 100k examples from the dataset. We set the seed value to 42 so the results can be reproduced later.
higgs_df_subset =higgs_df.sample(n = 100000, random_state = 42)
higgs_df_subset
higgs_df_subset.to_csv('higgs_dataset_subset.csv', index=False)
- We save the randomly sampled 100k examples to a CSV file for future use.
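Two properties of the sampling step are worth checking: that a fixed seed really returns the same rows, and that the sample keeps roughly the same signal/background balance as the full data. The sketch below checks both on a synthetic stand-in DataFrame; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for higgs_df (hypothetical values).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "labels": rng.integers(0, 2, size=1000).astype(float),
    "feature": rng.normal(size=1000),
})

sample_a = demo_df.sample(n=100, random_state=42)
sample_b = demo_df.sample(n=100, random_state=42)

# Same seed, same rows: the two samples are identical.
same_rows = sample_a.equals(sample_b)

# Fraction of each class in the sample, to compare against the full data.
sample_balance = sample_a["labels"].value_counts(normalize=True)
full_balance = demo_df["labels"].value_counts(normalize=True)
print(same_rows)
print(sample_balance, full_balance, sep="\n")
```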
X= higgs_df_subset.drop(columns = "labels")
X
y = higgs_df_subset["labels"]
y
X.shape
- Here we split the labelled dataset for supervised learning: the labels go into a separate variable 'y'.
- We then drop the label column and keep the remaining columns in 'X', which serves as the feature set fed into our model.
Splitting the data into Train and Test sets
We use sklearn's train_test_split method to split the dataset into a training set (used to train our neural network) and a test set (on which we measure accuracy and other metrics). Because the same held-out set is also passed to Keras as validation data during training, it doubles as the validation set here: it measures how well the network generalizes to data it has not seen before. We hold out 10 percent of the dataset for testing and keep the rest for training, since neural networks need a lot of data to train well.
import sklearn
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = sklearn.model_selection.train_test_split(X,y, test_size=0.1, random_state=42, shuffle=True)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
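The split above uses one held-out set for both validation and final testing. If we wanted a test set that the training loop never sees, a three-way split is one option. The sketch below is an assumption on our part, not what the notebook does: the helper name `three_way_split`, the 10%/10% fractions, and the use of `stratify` to keep the class balance are all illustrative choices, shown on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Split into train/validation/test sets, stratified on the labels."""
    # Use absolute row counts so both held-out sets have predictable sizes.
    n_val = int(round(val_fraction * len(X)))
    n_test = int(round(test_fraction * len(X)))
    # First carve off the test set, then the validation set from the rest.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed, stratify=y_rest)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(1000, 28))
demo_y = rng.integers(0, 2, size=1000)
X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(demo_X, demo_y)
print(len(X_tr), len(X_va), len(X_te))
```

The validation set would then go to `validation_data` in `fit`, and the test set would only be used once, for the final `evaluate` call.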
- We import some helper functions to help us evaluate and visualize our model's results.
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
from helper_functions import confusion_matrix, plot_loss_curves, create_tensorboard_callback
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_1 = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_1.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(),
metrics = ['accuracy'])
history_1 = higgs_model_1.fit(X_train, y_train, epochs = 5, validation_data = (X_test,y_test), callbacks =[create_tensorboard_callback(dir_name = "higgs_noise_detection_model", experiment_name='HIGGS_MODEL_NOISE_DETECTION_100K')])
plot_loss_curves(history_1)
The training accuracy of our baseline model is 68.69%.
The validation accuracy of our baseline model is 68.07%.
Although these metrics are not very good, the two numbers are close, so the model generalizes what it learns from the training data to the validation data (data the model has not seen before). Our model is not overfitting.
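The overfitting check above comes down to the difference between the final training and validation accuracy. A small helper can compute it from a history dictionary. This is a sketch: the function name `generalization_gap` is hypothetical, and it assumes the dictionary uses the keys "accuracy" and "val_accuracy", as Keras does when the model is compiled with metrics = ['accuracy'] and validation data is supplied.

```python
def generalization_gap(history_dict, metric="accuracy"):
    """Return (train - validation) for the final epoch, in percentage points."""
    train_final = history_dict[metric][-1]
    val_final = history_dict["val_" + metric][-1]
    return 100.0 * (train_final - val_final)


# Final-epoch values reported in the text; earlier epochs are omitted.
reported = {"accuracy": [0.6869], "val_accuracy": [0.6807]}
gap = generalization_gap(reported)
print(f"gap: {gap:.2f} percentage points")
```

In the notebook the same function could be called as `generalization_gap(history_1.history)`. A gap of well under one percentage point is consistent with the claim that the model is not overfitting.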
tf.keras.models.save_model(higgs_model_1,filepath = "/content/drive/MyDrive/higgs_model_100K" )
MODEL 2: Dataset with 500,000 labelled examples
Next we scale up to 500,000 examples, still short of the entire dataset, which is very large. We use the sample method from pandas to randomly draw 500k examples from the dataset, with the seed value set to 42 so the results can be reproduced later.
higgs_df_subset_500k =higgs_df.sample(n = 500000, random_state = 42)
higgs_df_subset_500k
higgs_df_subset_500k.to_csv('higgs_dataset_subset_500k.csv', index=False)
X_1= higgs_df_subset_500k.drop(columns = "labels")
y_1 =higgs_df_subset_500k["labels"]
import sklearn
from sklearn.model_selection import train_test_split
X_train_1,X_test_1, y_train_1, y_test_1 = sklearn.model_selection.train_test_split(X_1,y_1, test_size=0.1, random_state=42, shuffle=True)
X_train_1.shape, X_test_1.shape, y_train_1.shape, y_test_1.shape
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_1_500k = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_1_500k.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(),
metrics = ['accuracy'])
history_1_500k = higgs_model_1_500k.fit(X_train_1, y_train_1, epochs = 5, validation_data = (X_test_1,y_test_1), callbacks =[create_tensorboard_callback(dir_name = "higgs_noise_detection_model", experiment_name='HIGGS_MODEL_NOISE_DETECTION_WITH_500K')])
higgs_model_1_500k.evaluate(X_test_1, y_test_1)
plot_loss_curves(history_1_500k)
tf.keras.models.save_model(higgs_model_1_500k,filepath = "/content/drive/MyDrive/higgs_model_500K" )
MODEL 3: Dataset with 1 Million labelled examples
We now increase the subset to 1,000,000 examples, still short of the entire dataset, which is very large. Rather than taking the rows in their original order, we use the sample method from pandas to randomly draw 1 million examples from the dataset, with the seed value set to 42 so the results can be reproduced later.
higgs_df_subset_1M =higgs_df.sample(n = 1000000, random_state = 42)
higgs_df_subset_1M
higgs_df_subset_1M.to_csv('higgs_dataset_subset_1M.csv', index=False)
X_2 = higgs_df_subset_1M.drop(columns = "labels")
y_2 = higgs_df_subset_1M["labels"]
import sklearn
from sklearn.model_selection import train_test_split
X_train_2,X_test_2, y_train_2, y_test_2 = sklearn.model_selection.train_test_split(X_2,y_2, test_size=0.1, random_state=42, shuffle=True)
X_train_2.shape, X_test_2.shape, y_train_2.shape, y_test_2.shape
higgs_model_1.evaluate(X_test, y_test)
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_1_1M = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_1_1M.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(),
metrics = ['accuracy'])
history_1_1M = higgs_model_1_1M.fit(X_train_2, y_train_2, epochs = 5,validation_data = (X_test_2,y_test_2), callbacks =[create_tensorboard_callback(dir_name = "higgs_noise_detection_model", experiment_name='HIGGS_MODEL_NOISE_DETECTION_WITH_1M')])
higgs_model_1_1M.evaluate(X_test_2, y_test_2)
plot_loss_curves(history_1_1M)
tf.keras.models.save_model(higgs_model_1_1M,filepath = "/content/drive/MyDrive/higgs_model_1M" )
from sklearn.metrics import confusion_matrix
import numpy as np
y_pred = higgs_model_1_1M.predict(X_test_2)
y_pred_rev = y_pred > 0.7
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
y_test_2.shape
from helper_functions import make_confusion_matrix
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=False)
y_pred = higgs_model_1_1M.predict(X_test_2)
y_pred_rev = y_pred > 0.85
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=True)
y_pred = higgs_model_1_1M.predict(X_test_2)
y_pred_rev = y_pred > 0.9
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=True)
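The three confusion matrices above differ only in the decision threshold (0.7, 0.85, 0.9). Raising the threshold trades recall for precision on the signal class, and a small loop makes that trade-off easier to compare than three separate plots. The sketch below is an illustration on synthetic scores; the helper name `sweep_thresholds` and the demo data are hypothetical and the printed numbers say nothing about the real model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score


def sweep_thresholds(y_true, y_prob, thresholds=(0.7, 0.85, 0.9)):
    """Return {threshold: (precision, recall)} for the positive (signal) class."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_prob = np.asarray(y_prob).ravel()  # flatten (n, 1) model output to (n,)
    results = {}
    for t in thresholds:
        y_hat = (y_prob > t).astype(int)
        precision = precision_score(y_true, y_hat, zero_division=0)
        recall = recall_score(y_true, y_hat, zero_division=0)
        results[t] = (precision, recall)
    return results


# Synthetic stand-in scores (hypothetical; not the model's real predictions).
rng = np.random.default_rng(0)
demo_true = rng.integers(0, 2, size=2000)
demo_prob = np.clip(0.5 * demo_true + rng.uniform(0.0, 0.6, size=2000), 0, 1)
sweep = sweep_thresholds(demo_true, demo_prob.reshape(-1, 1))
for t, (p, r) in sweep.items():
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

With the notebook's variables, the call would be `sweep_thresholds(y_test_2, y_pred)`, where `y_pred` is the raw sigmoid output before thresholding.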
!tensorboard dev upload --logdir ./higgs_noise_detection_model \
--name "Higgs Dataset Background Noise detection model" \
--description " A background noise detection model for higgs data set containing signals from LHC runs" \
--one_shot # Exits the uploader once it's finished uploading
MODEL 4: Dataset with 1 Million Examples using Normalized Data
import pandas as pd
import numpy as np
higgs_df_subset_1M = pd.read_csv("/content/drive/MyDrive/Higgs_Dataset_Subset/higgs_dataset_subset_1M.csv")
higgs_df_subset_1M
X_2 = higgs_df_subset_1M.drop(columns = "labels")
y_2 = higgs_df_subset_1M["labels"]
import sklearn
from sklearn.model_selection import train_test_split
X_train_2,X_test_2, y_train_2, y_test_2 = sklearn.model_selection.train_test_split(X_2,y_2, test_size=0.1, random_state=42, shuffle=True)
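from sklearn.preprocessing import normalize
# normalize() with norm='l2' rescales each row (sample) to unit length;
# it does not standardize the individual feature columns.
X_train_normalize = normalize(X_train_2, norm='l2')
X_train_normalize
X_test_normalize = normalize(X_test_2, norm='l2')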
X_test_normalize
X_train_normalize.shape
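Note that `normalize` with `norm='l2'` works row by row: each example is rescaled to unit length, and the relative scale of the feature columns is left as it is. A common alternative is per-feature standardization with `StandardScaler`, fitted on the training set only. The sketch below contrasts the two on synthetic data; swapping in `StandardScaler` is our own suggestion for comparison and not what the notebook does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic stand-in features with very different column scales (hypothetical).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(500, 3))
demo_test = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(100, 3))

# Row-wise L2 normalization: every sample ends up with unit length.
row_normalized = normalize(demo_train, norm="l2")
row_norms = np.linalg.norm(row_normalized, axis=1)

# Column-wise standardization: fit on the training set only, then reuse
# the same mean and standard deviation for the test set.
scaler = StandardScaler().fit(demo_train)
train_scaled = scaler.transform(demo_train)
test_scaled = scaler.transform(demo_test)

print(row_norms[:3])
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step.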
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_1_normalize_1M = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_1_normalize_1M.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(),
metrics = ['accuracy'])
history_1_normalize_1M = higgs_model_1_normalize_1M.fit(X_train_normalize, y_train_2, epochs = 5, validation_data = (X_test_normalize,y_test_2))
# Predict on the normalized test set, matching the inputs the model was trained on
y_pred = higgs_model_1_normalize_1M.predict(X_test_normalize)
y_pred_rev = y_pred > 0.9
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
from helper_functions import make_confusion_matrix
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=True)
from helper_functions import plot_loss_curves
plot_loss_curves(history_1_normalize_1M)
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_2_normalize_1M = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_2_normalize_1M.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
metrics = ['accuracy'])
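# Train the deeper model (higgs_model_2_normalize_1M) on the normalized inputs
history_2_normalize_1M = higgs_model_2_normalize_1M.fit(X_train_normalize, y_train_2, epochs = 5, validation_data = (X_test_normalize,y_test_2))
# Predict on the normalized test set, matching the training inputs
y_pred = higgs_model_2_normalize_1M.predict(X_test_normalize)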
y_pred_rev = y_pred > 0.6
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=True)
tf.keras.models.save_model(higgs_model_2_normalize_1M,filepath = "/content/drive/MyDrive/higgs_model_2_normalize_1M" )
from helper_functions import create_tensorboard_callback
import tensorflow as tf
tf.random.set_seed(42)
higgs_model_3_normalize_1M = tf.keras.Sequential([
tf.keras.layers.Dense(300, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_3_normalize_1M.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.05),
metrics = ['accuracy'])
history_3_normalize_1M = higgs_model_3_normalize_1M.fit(X_train_normalize, y_train_2, epochs = 5, validation_data = (X_test_normalize,y_test_2),callbacks =[create_tensorboard_callback(dir_name = "higgs_noise_detection_model_with_300_units", experiment_name='HIGGS_MODEL_NOISE_DETECTION_WITH_1M_300HIDDENUNITS')])
y_pred = higgs_model_3_normalize_1M.predict(X_test_normalize)
y_pred_rev = y_pred > 0.9
y_pred_revised = np.array(y_pred_rev)
y_pred_revised = tf.squeeze(y_pred_revised)
y_pred_revised.shape
make_confusion_matrix(y_true= y_test_2, y_pred=y_pred_revised, classes=None, figsize=(10, 10), text_size=15, norm=False, savefig=True)
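# Assumes these cells refer to the second normalized model defined above
higgs_model_2_normalize_1M.evaluate(X_test_normalize, y_test_2)
plot_loss_curves(history_2_normalize_1M)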
tf.random.set_seed(42)
higgs_model_3 = tf.keras.Sequential([
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(100, activation = 'relu'),
tf.keras.layers.Dense(10, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
])
higgs_model_3.compile(loss = tf.keras.losses.BinaryCrossentropy(),
optimizer = tf.keras.optimizers.Adam(),
metrics = ['accuracy'])
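# Labels and validation data come from the same 1M normalized split as the inputs
history_3 = higgs_model_3.fit(X_train_normalize, y_train_2, epochs = 5, validation_data = (X_test_normalize,y_test_2))
higgs_model_3.evaluate(X_test_normalize, y_test_2)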
plot_loss_curves(history_3)