Correct way to use a calibrated classifier with a pipeline

2 years ago

#276933

Maths12

I train a model as follows:

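import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# split off a test set, then a validation set (used later for calibration)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=random_state_split_data)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=random_state_split_data)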

under = RandomUnderSampler(sampling_strategy=0.2)
X_train,y_train = under.fit_resample(X_train,y_train)

# define pipeline
selector = RFE(estimator=RandomForestClassifier(), n_features_to_select=100)
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
model = XGBClassifier(objective='binary:logistic', n_jobs=29, use_label_encoder=False, random_state=42)
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', selector), ('clf', model)])

I then run a grid search on this pipeline:

gridsearch = GridSearchCV(pipe, param_grid, cv=3, verbose=1,n_jobs=-1)
gridsearch.fit(X_train, y_train)

My result is the best estimator:

best_est = gridsearch.best_estimator_

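I then carry out calibration. First I transform the validation and test sets with every pipeline step except the final classifier: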

X_validation_calibrate = pd.DataFrame(best_est[:-1].transform(X_validation),columns=features_cols)
X_test_calibrate = pd.DataFrame(best_est[:-1].transform(X_test),columns=features_cols)

I pass these to the calibration step, for example:

sig_clf = CalibratedClassifierCV(best_est['clf'], method="sigmoid", cv="prefit")
iso_clf = CalibratedClassifierCV(best_est['clf'], method="isotonic", cv="prefit")

sig_clf.fit(X_validation_calibrate, y_validation)
iso_clf.fit(X_validation_calibrate, y_validation)
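
How the two calibrators were compared is not shown here. One possible check, sketched below, is the Brier score on the held-out test split. Everything in the sketch is an assumption made for illustration: the data is synthetic, a LogisticRegression stands in for the XGBClassifier so the example runs without xgboost, and calibrate_prefit is a hypothetical helper that uses FrozenEstimator on newer scikit-learn releases and falls back to cv="prefit" on older ones.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the data from the question).
X, y = make_classification(n_samples=900, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=0
)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.5, random_state=0
)

# Stand-in for the already-fitted best_est['clf'] (assumption).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def calibrate_prefit(fitted_clf, X_cal, y_cal, method):
    """Hypothetical helper: calibrate an already-fitted classifier."""
    try:
        # Newer scikit-learn releases wrap the fitted model instead of cv="prefit".
        from sklearn.frozen import FrozenEstimator
        calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_clf), method=method)
    except ImportError:
        # Older releases: same call as in the question.
        calibrated = CalibratedClassifierCV(fitted_clf, method=method, cv="prefit")
    return calibrated.fit(X_cal, y_cal)


# Lower Brier score means better-calibrated probabilities on the test split.
brier_scores = {
    "uncalibrated": brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1])
}
for method in ("sigmoid", "isotonic"):
    calibrated = calibrate_prefit(clf, X_validation, y_validation, method)
    brier_scores[method] = brier_score_loss(
        y_test, calibrated.predict_proba(X_test)[:, 1]
    )

for name, score in brier_scores.items():
    print(f"{name}: {score:.4f}")
```

The calibrators are fitted on the validation split only and scored on the test split, so the comparison is made on data that neither the model nor the calibrator has seen.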

My sig_clf had the best calibration, so I would like to use it rather than best_est['clf']. However, sig_clf above wraps only the model, not the preprocessing steps. When I come to make predictions on other datasets, e.g. 'newdata', does the following make sense?

test1 = best_est[:-1].transform(newdata)
predictions_new = sig_clf.predict_proba(test1)

Above, I use every step of the pipeline except the final classifier to transform an external dataset called 'newdata', then I apply the calibrated sigmoid model to the transformed dataset to get the final calibrated predictions. Is this correct?
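
To make the two-step prediction concrete, here is a minimal sketch that also rebuilds a single pipeline from the fitted preprocessing steps plus the calibrated model and checks that both routes give the same probabilities. It is an illustration under stated assumptions, not the exact setup above: the data is synthetic, a LogisticRegression stands in for the XGBClassifier, the preprocessing is reduced to an imputer and VarianceThreshold, and the names calibrated_pipe and predictions_pipe are hypothetical.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the data from the question).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, stratify=y, test_size=0.5, random_state=0
)
X_validation, newdata, y_validation, _ = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.5, random_state=0
)

# Stand-in for gridsearch.best_estimator_ (assumption: simplified steps).
best_est = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("var", VarianceThreshold()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

preprocess = best_est[:-1]      # fitted preprocessing steps only
fitted_clf = best_est["clf"]    # fitted final classifier

# Calibrate the fitted classifier on transformed validation data.
try:
    # Newer scikit-learn releases wrap the fitted model instead of cv="prefit".
    from sklearn.frozen import FrozenEstimator
    sig_clf = CalibratedClassifierCV(FrozenEstimator(fitted_clf), method="sigmoid")
except ImportError:
    sig_clf = CalibratedClassifierCV(fitted_clf, method="sigmoid", cv="prefit")
sig_clf.fit(preprocess.transform(X_validation), y_validation)

# Route 1: transform first, then predict with the calibrated model.
predictions_new = sig_clf.predict_proba(preprocess.transform(newdata))

# Route 2: reuse the fitted steps in one pipeline (do not call fit on it again).
calibrated_pipe = Pipeline(
    steps=list(preprocess.steps) + [("calibrated_clf", sig_clf)]
)
predictions_pipe = calibrated_pipe.predict_proba(newdata)

print(np.allclose(predictions_new, predictions_pipe))
```

The combined pipeline is only a convenience for prediction. Calling fit on it would refit the preprocessing steps, so it should be used for predict and predict_proba only.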

scikit-learn

pipeline

probability

xgboost

calibration

0 Answers
