1 year ago
#276933
Maths12
Correct way to use a calibrated classifier with a pipeline
I train a model as follows:
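import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=random_state_split_data)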
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=random_state_split_data)
under = RandomUnderSampler(sampling_strategy=0.2)
X_train, y_train = under.fit_resample(X_train, y_train)
# define pipeline
selector = RFE(estimator=RandomForestClassifier(), n_features_to_select=100)
numeric_transformer = Pipeline(steps=[('imputer',SimpleImputer(missing_values=np.nan,strategy='constant', fill_value=0))])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
model = XGBClassifier(objective='binary:logistic',n_jobs=29,use_label_encoder=False,random_state = 42)
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', selector), ('clf', model)])
I then run a grid search on this pipeline:
gridsearch = GridSearchCV(pipe, param_grid, cv=3, verbose=1,n_jobs=-1)
gridsearch.fit(X_train, y_train)
I take the best estimator from the search:
best_est = gridsearch.best_estimator_
I then carry out calibration. First I transform the validation and test sets with every pipeline step except the classifier:
X_validation_calibrate = pd.DataFrame(best_est[:-1].transform(X_validation),columns=features_cols)
X_test_calibrate = pd.DataFrame(best_est[:-1].transform(X_test),columns=features_cols)
I then fit the calibrators on the transformed validation set, for example:
sig_clf = CalibratedClassifierCV(best_est['clf'], method="sigmoid", cv="prefit")
iso_clf = CalibratedClassifierCV(best_est['clf'], method="isotonic", cv="prefit")
sig_clf.fit(X_validation_calibrate, y_validation)
iso_clf.fit(X_validation_calibrate, y_validation)
sig_clf had the best calibration, so I would like to use it instead of best_est['clf']. However, sig_clf wraps only the model, not the preprocessing steps. When I make predictions on another dataset, e.g. newdata, does the following make sense?
test1 = best_est[:-1].transform(newdata)
predictions_new = sig_clf.predict_proba(test1)
Above, I use every step of the pipeline except the classifier to transform an external dataset called newdata, then apply the sigmoid-calibrated model to the transformed data to get the final calibrated predictions. Is this correct?
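One way to sanity-check the approach described above is to compare it with calibrating the whole fitted pipeline on raw validation data. The sketch below is only an illustration under stated assumptions: the data is synthetic, GradientBoostingClassifier stands in for XGBClassifier, the pipeline is shortened, and calibrate_prefit is a hypothetical helper (it uses FrozenEstimator where available, since cv="prefit" is deprecated in recent scikit-learn releases).

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

try:
    # Newer scikit-learn: wrap the fitted estimator so it is not refit.
    from sklearn.frozen import FrozenEstimator

    def calibrate_prefit(fitted, X_cal, y_cal, method="sigmoid"):
        # Hypothetical helper: calibrate an already-fitted estimator.
        cal = CalibratedClassifierCV(FrozenEstimator(fitted), method=method)
        return cal.fit(X_cal, y_cal)
except ImportError:
    def calibrate_prefit(fitted, X_cal, y_cal, method="sigmoid"):
        # Older scikit-learn: fall back to cv="prefit".
        cal = CalibratedClassifierCV(fitted, method=method, cv="prefit")
        return cal.fit(X_cal, y_cal)

# Synthetic stand-in data (assumption, not the data from the question).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, stratify=y, test_size=0.4, random_state=0)
X_val, X_new, y_val, _ = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.5, random_state=0)

# Shortened pipeline; GradientBoostingClassifier replaces XGBClassifier here.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("var", VarianceThreshold()),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# Option A (as in the question): transform, then calibrate the bare model.
cal_a = calibrate_prefit(pipe["clf"], pipe[:-1].transform(X_val), y_val)
proba_a = cal_a.predict_proba(pipe[:-1].transform(X_new))

# Option B: calibrate the whole fitted pipeline on raw validation data.
cal_b = calibrate_prefit(pipe, X_val, y_val)
proba_b = cal_b.predict_proba(X_new)

# Both routes should produce the same calibrated probabilities.
print(np.allclose(proba_a, proba_b))
```

In this sketch the two routes differ only in where the preprocessing is applied. Option B keeps preprocessing and calibration in a single object, which may be easier to persist and reuse on new data.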
scikit-learn
pipeline
probability
xgboost
calibration
0 Answers