Posted July 23, 2024 at 1:00 pm
After feature engineering, assessing feature importance is a key step before you backtest a strategy. Boruta-Shap is a viable algorithm for that purpose. However, it can take a long time to run on large datasets. This article presents an implementation of the algorithm that uses CPU parallelism and the GPU to make it run faster. The code is implemented with the XGBoost library for GPU-based model fitting and the concurrent.futures library for CPU parallelism.
We will cover:
- What the Boruta-Shap algorithm is and how it works
- The benefits of Boruta-Shap
- A CPU- and GPU-based implementation of the algorithm to speed it up
- How to use the selected features before backtesting a strategy
A good technique for feature selection, especially in machine learning and data science applications, is the Boruta-Shap algorithm. Boruta-Shap combines the Boruta feature selection procedure with Shapley values to enhance the assessment of feature importance.
The Boruta-Shap algorithm works in the following way:
1. Create "shadow" features by shuffling the values of each original feature, breaking their relationship with the target.
2. Train a tree-based model on the combined set of original and shadow features.
3. Compute Shapley values to measure each feature's contribution to the model's predictions.
4. Count a "hit" for every original feature whose Shapley importance exceeds that of the best shadow feature.
5. Repeat the above over multiple trials and use a binomial test to accept each feature, reject it, or mark it as tentative.
The Boruta-Shap algorithm has the following benefits:
- It provides a statistically grounded accept/reject decision for each feature, based on a binomial test rather than an arbitrary importance cutoff.
- Shapley values give consistent, model-aware importance scores that account for feature interactions.
- Comparing each feature against shuffled "shadow" features makes the selection robust to noise.
Industry expert and renowned author Dr. Ernest Chan talks about financial data science and feature engineering in a clip available on the QuantInsti blog: https://blog.quantinsti.com/boruta-shap-gpu-python/
Despite Boruta-Shap’s strength, its computational cost can be high, particularly for large datasets with many features. To address this, I’ve included Boruta-Shap code that uses the CPU and GPU in tandem to expedite the algorithm’s execution. Cool, right?
This approach drastically cuts computation time by distributing the workload effectively and exploiting the parallel processing power of both the CPU and the GPU.
Let’s dissect the code. Depending on the number of cores available in your CPU, the code groups the trials into buckets, and the trials within each bucket run in parallel, as the sketch below illustrates. We use a modified version of the code provided by Moosa Ali (2022), who implements the CPU-based algorithm.
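Before diving in, here is a minimal sketch of the bucketing idea, assuming a placeholder run_trial function that stands in for the real per-trial work (it is not part of the article's code). The article's version below handles the leftover trials in a separate final bucket; this sketch folds them into the slicing:

from concurrent import futures

def run_trial(trial):
    # Placeholder for the real per-trial work (model fit + Shap values)
    print(f"running trial {trial}")

def run_in_buckets(function, trials, workers):
    # Walk over the trial indices in buckets of size "workers"
    trial_ids = list(range(trials))
    for start in range(0, trials, workers):
        bucket = trial_ids[start:start + workers]
        # Run each bucket's trials in parallel threads
        with futures.ThreadPoolExecutor(len(bucket)) as executor:
            list(executor.map(function, bucket))

run_in_buckets(run_trial, trials=10, workers=4)

Threads pay off here because XGBoost and SHAP do most of their work in native code that releases Python's GIL, so the trials within a bucket can genuinely overlap.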
Let’s code!
# Import the necessary libraries
import scipy as sp
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRFClassifier
from xgboost import XGBRFRegressor
from sklearn.preprocessing import LabelEncoder
from concurrent import futures
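If any of these imports fail, the packages can typically be installed with pip. This command is an assumption based on the imports above, not part of the original article:

pip install numpy pandas scipy shap xgboost scikit-learn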
The following function computes the minimum number of hits required for an input feature to be accepted as a selected feature, based on the binomial probability mass function (pmf) and a significance level. It iterates over the pmf, accumulating probabilities until the cumulative probability exceeds the significance level.
# Set the minimum number of hits as the threshold to accept an input feature as a selected feature
def get_tail_items(pmf, significance_level=0.05):
    # Set total to zero
    total = 0
    # Loop over the probability mass function
    for i, x in enumerate(pmf):
        # Increment the total variable with the probability "x" of i hits
        total += x
        # If total is higher than the significance level
        if total >= significance_level:
            # Stop accumulating
            break
    # Return i
    return i
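As a quick sanity check, consider 20 trials at a 5% significance level. The binomial(20, 0.5) probabilities for 0 through 5 hits sum to roughly 0.021, and adding the probability of 6 hits (about 0.037) pushes the cumulative total past 0.05, so the function should return 6:

from scipy import stats

# Binomial pmf over 20 trials with a 0.5 probability of a hit per trial
pmf = [stats.binom.pmf(x, 20, .5) for x in range(20 + 1)]
# Prints 6: the cumulative probability first exceeds 0.05 at 6 hits
print(get_tail_items(pmf, significance_level=0.05))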
The next function selects features based on the number of hits they receive during the trials. It categorizes features into two zones:
- Green zone: features whose hit count is high enough to be accepted.
- Blue zone: features whose hit count is inconclusive; these are kept as tentative.
# Select features from n number of trials
def choose_features(feature_hits, TRIALS, thresh):
    # Define the green zone threshold
    green_zone_thresh = TRIALS - thresh
    # Define the blue zone upper threshold
    blue_zone_upper = green_zone_thresh
    # Define the blue zone lower threshold
    blue_zone_lower = thresh
    # A feature is green whenever its hits reach the green zone threshold
    green_zone = [key for key, value in feature_hits.items() if value >= green_zone_thresh]
    # A feature is blue whenever its hits fall between the blue zone lower and upper thresholds
    blue_zone = [key for key, value in feature_hits.items() if (value >= blue_zone_lower and value < blue_zone_upper)]
    return green_zone, blue_zone
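To see the zoning in action, here is a hypothetical run with 20 trials and a hit threshold of 6 (the value derived above); the feature names and hit counts are made up for illustration:

# Hypothetical hit counts after 20 trials
feature_hits = {'rsi': 18, 'macd': 9, 'volume': 2}
green, blue = choose_features(feature_hits, TRIALS=20, thresh=6)
# green -> ['rsi'] (18 >= 14 hits); blue -> ['macd'] (6 <= 9 < 14); 'volume' is rejected
print(green, blue)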
The last function is the main one: it implements the Boruta-Shap algorithm. It takes input data X and target variable y, along with the optional parameters trials, workers, significance_level, and seed.
Find below what the function does:
- Sets the random seed and checks that X and y have the same number of samples.
- Initializes a hit counter for every feature and builds the names of the shuffled (shadow) columns.
- Splits the data 80/20 into train and test sets and label-encodes the target.
- Defines an XGBoost random-forest classifier that trains on the GPU (tree_method='gpu_hist').
- Defines features_hits_func, which, for one trial, shuffles the features, fits the model on the original plus shuffled features, computes Shapley values on the test data, and adds a hit to every feature that beats the best shuffled feature.
- Defines multithreading_loop, which splits the trials into buckets sized to the number of workers and runs each bucket's trials in parallel threads.
- Runs all trials, computes the binomial pmf and the hit threshold, and returns the green features, falling back to the blue features and finally to all features.
def boruta_shap_algorithm(X, y, trials=20, workers=2, significance_level=0.05, seed=2024):
    # Set the seed
    np.random.seed(seed)
    # Assert that the number of samples of both datasets match
    assert X.shape[0] == y.shape[0], "X and y dimensions don't coincide"
    # Set a dictionary to save the number of hits for each feature
    features_hits = {feature: 0 for feature in X.columns}
    # Create the names of all the shuffled features
    shuffled_col_names = [str(column + '_shuffle') for column in X.columns]
    # Set the train and test X data
    X_train, X_test = X.iloc[:int(0.8*len(X))], X.iloc[int(0.8*len(X)):]
    # Set the label encoder object
    le = LabelEncoder()
    # Transform the y series to a prediction feature useful for the machine-learning model
    label_encoded = le.fit_transform(y)
    # Transform the encoded label into a Pandas series
    y = pd.Series(data=label_encoded, index=y.index, name='y')
    # Set the y train data
    y_train = y.iloc[:int(0.8*len(y))]
    # Define the model
    classifier = XGBRFClassifier(n_estimators=100, subsample=1, colsample_bynode=1,
                                 tree_method='gpu_hist', random_state=seed)

    # Define a function to compute the feature hits for a single trial
    def features_hits_func(trial):
        # Set the seed for the trial
        np.random.seed(seed + trial)
        # Set the X train data for the shuffled features
        X_shuffle_train = X_train.apply(np.random.permutation)
        # Set the names for the X train shuffled features
        X_shuffle_train.columns = shuffled_col_names
        # Set the X test data for the shuffled features
        X_shuffle_test = X_test.apply(np.random.permutation)
        # Set the names for the X test shuffled features
        X_shuffle_test.columns = shuffled_col_names
        # Set the whole input features for the Boruta-Shap algorithm training
        X_boruta_train = pd.concat([X_train, X_shuffle_train], axis=1)
        # Set the whole input features for the Boruta-Shap algorithm test data
        X_boruta_test = pd.concat([X_test, X_shuffle_test], axis=1)
        # Fit the model
        model = classifier.fit(X_boruta_train, y_train)
        # Set the explainer object
        explainer = shap.TreeExplainer(model)
        # Get the Shap values for each feature
        shap_values = explainer.shap_values(X_boruta_test)
        # Set the mean of each feature's absolute Shap values
        features_importance = np.array(np.abs(shap_values).mean(0))
        # Set a dataframe with the above features' importance
        features_importance_df = pd.DataFrame(data=features_importance, index=X_boruta_test.columns, columns=['Values'])
        # Subset the feature importance dataframe with the non-shuffled features
        feature_imp_X = features_importance_df.iloc[:len(X.columns)]
        # Subset the feature importance dataframe with the shuffled features
        feature_imp_shuffled = features_importance_df.iloc[len(X.columns):]
        # Add one hit whenever a feature beats the best Shap value of all the shuffled features
        for feature in feature_imp_X.index:
            features_hits[feature] += int(feature_imp_X.loc[feature, 'Values'] > feature_imp_shuffled['Values'].max())

    # Define a function to run multiple trials in parallel buckets sized to the number of CPU workers
    def multithreading_loop(function, params_list):
        # Set the number of full buckets as per the number of workers
        num_lists = int(np.floor(len(params_list) / workers))
        # Set the params list that fills the full buckets
        params_list_for_loop = params_list[:int(num_lists*workers)]
        # If there are leftover trials that don't fill a whole bucket
        if len(params_list) > int(num_lists*workers):
            # Create the last params list with the remaining trials
            last_params_list = params_list[int(num_lists*workers):]
        # For each bucket of trials
        for i in range(0, num_lists):
            # Use the number of workers for the futures library executor
            with futures.ThreadPoolExecutor(workers) as executor:
                # Run the features_hits_func function to compute the hits in parallel
                list(executor.map(function, params_list_for_loop[int(workers*i):int(workers*(i+1))]))
        # Once the above is done, run the remaining trials in parallel
        if len(params_list) > int(num_lists*workers):
            # Use one thread per remaining trial for the executor
            with futures.ThreadPoolExecutor(len(last_params_list)) as executor:
                # Run the features_hits_func function to compute the hits in parallel
                list(executor.map(function, last_params_list))

    # Set the range of trial numbers as a list
    trials_list = [*range(trials)]
    # Run the loop to compute the trials in parallel buckets
    multithreading_loop(features_hits_func, trials_list)
    # Calculate the probability mass function: the binomial distribution over "trials" trials
    pmf = [sp.stats.binom.pmf(x, trials, .5) for x in range(trials + 1)]
    # Set the minimum number of hits as the threshold to classify an input feature as a selected feature
    thresh = get_tail_items(pmf, significance_level)
    # Green are the accepted features, blue are the tentative features
    green, blue = choose_features(features_hits, trials, thresh)
    # If there are green features
    if len(green) != 0:
        # Return the green features
        return green
    # If there aren't green features
    else:
        # If there are blue features
        if len(blue) != 0:
            # Return the blue features
            return blue
        # If there aren't blue features
        else:
            # Return all the features
            return X.columns.tolist()
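Here is a hypothetical end-to-end call on synthetic data, just to show the expected inputs and output; the column names, sizes, and data-generating process are made up. One caveat: the classifier inside the function uses tree_method='gpu_hist', which requires a CUDA-enabled build of XGBoost; on XGBoost 2.0 and later this parameter is deprecated in favour of tree_method='hist' combined with device='cuda', so you may need to adjust that line for your version.

import numpy as np
import pandas as pd

# Synthetic dataset: 1,000 observations, 5 candidate features, binary target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=['f1', 'f2', 'f3', 'f4', 'f5'])
# Make the target depend on f1 and f2 only, so those two should be selected
y = pd.Series((X['f1'] + X['f2'] + rng.normal(scale=0.1, size=1000) > 0).astype(int), index=X.index)

# Run 20 trials, parallelized in buckets of 4 CPU workers
selected = boruta_shap_algorithm(X, y, trials=20, workers=4)
# Expected to include 'f1' and 'f2'
print(selected)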
References
You have learned how to implement the Boruta-Shap algorithm using both the CPU and the GPU. You’ll see a great difference compared with using only the CPU once your dataframe has many observations. Moreover, the more threads and cores you have, the better the parallelism and the faster the loop will run.
What’s next, you may ask.
Well, you can use the above code to get the feature importance before you backtest a strategy. We suggest you use the Boruta-Shap algorithm before you optimize a strategy’s parameters. You can find the source file below.
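As a sketch of where this fits in a research pipeline (the backtesting step is a placeholder, not part of the article's code):

# Keep only the features that survived Boruta-Shap
selected = boruta_shap_algorithm(X, y, trials=20, workers=4)
X_selected = X[selected]
# ...then optimize the strategy's parameters and backtest using X_selected only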
If you want to learn more about machine learning, check out this learning track! You’ll learn the basics of machine learning in finance.
Now that you’ve grasped the power of Boruta-Shap for identifying key features, you might be wondering how to put it into practice for real-world problems. Here’s where things get exciting! The Machine Learning & Deep Learning for Trading course by Quantra helps you learn these techniques for building advanced trading strategies. You’ll not only learn the theory behind Boruta-Shap but also gain hands-on experience implementing it to select the most impactful features for your own trading algorithms.
It’s the perfect next step to turn your newfound knowledge into action!
Happy Learning!
Originally posted on QuantInsti blog.
Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.
This material is from QuantInsti and is being posted with its permission. The views expressed in this material are solely those of the author and/or QuantInsti and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.