Mario
Feature importance without labels for time-series data with a large number of columns/features
I have a sample time-series dataset of shape (23, 14291), which is a pivot table of hourly counts (24 hrs) for some users. I'm trying to filter out the columns/features that don't have a time-series nature, so that only meaningful features remain. I have already tried the PCA method to keep the features that carry most of the data variance, and a correlation matrix to exclude highly correlated columns/features.
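Roughly, the two filters I applied look like this (a minimal sketch; the 0.95 variance and correlation thresholds are just placeholder values I picked):

import numpy as np
from sklearn.decomposition import PCA

# --- PCA: keep enough components to explain 95% of the variance (placeholder threshold) ---
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(df3.values)

# --- Correlation filter: drop one column of each highly correlated pair ---
corr = df3.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df3_filtered = df3.drop(columns=to_drop)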
Now I want to experiment with feature importance using some regressors, based on this post, but so far it has been unsuccessful.
I have tried the following:
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Time-ordered split: no shuffling for time-series data
trainingSet, testSet = train_test_split(df3,
                                        test_size=0.2,
                                        random_state=42,
                                        shuffle=False)

# Use one column as a (pseudo-)target and the remaining columns as features
target_col = df3.columns[1]
feature_names = [str(c) for c in df3.columns if c != target_col]

X_train = trainingSet.drop(columns=[target_col]).values
y_train = trainingSet[target_col].values.astype('float32')
X_test = testSet.drop(columns=[target_col]).values
y_test = testSet[target_col].values.astype('float32')

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

params = {"objective": "reg:squarederror",  # "reg:linear" is deprecated
          "colsample_bytree": 0.3,
          "learning_rate": 0.1,
          "max_depth": 5,
          "alpha": 10}
num_round = 2

model_xgb_1user = xgb.train(params, dtrain, num_round)
pred_test_xgb_1user = model_xgb_1user.predict(dtest)

# I also tried the sklearn wrapper with a multi-output regressor, without success:
# from sklearn.multioutput import MultiOutputRegressor
# from xgboost import XGBRegressor
# model = MultiOutputRegressor(XGBRegressor(n_estimators=100)).fit(X_train, y_train)

# Feature importance from the trained Booster (get_fscore() counts splits per feature)
importance = pd.DataFrame(list(model_xgb_1user.get_fscore().items()),
                          columns=['feature', 'importance']).sort_values('importance',
                                                                         ascending=False)

plt.barh(importance['feature'], importance['importance'])
plt.xlabel("Xgboost Feature Importance")
plt.show()
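For comparison, I understand the sklearn-style wrapper exposes feature_importances_ directly, which is what my plotting code above was reaching for (a minimal sketch reusing X_train, y_train and feature_names from above; the hyperparameters are just placeholders):

from xgboost import XGBRegressor

# sklearn-style interface: fit on features/target, then read feature_importances_
reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5)
reg.fit(X_train, y_train)

top = reg.feature_importances_.argsort()[-20:]  # indices of the 20 most important features
plt.barh([feature_names[i] for i in top], reg.feature_importances_[top])
plt.xlabel("XGBRegressor feature importance")
plt.show()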
I'm not sure how to handle this without a label when using regressors. I also read the post Xgboost Feature Importance Computed in 3 Ways with Python, but I couldn't manage to pass a time-series dataset through it to get feature importance.
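For context, the permutation-importance variant from that post is the one I couldn't adapt; as far as I can tell it still needs a target column, which is exactly what I don't have (a minimal sketch assuming the fitted reg and the X_test / y_test split from above):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the test score drops
perm = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=42)
perm_sorted_idx = perm.importances_mean.argsort()
for i in perm_sorted_idx[::-1][:20]:  # top 20 features
    print(feature_names[i], perm.importances_mean[i])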
python
feature-engineering
xgbregressor