Learn which R packages and data sets you need by reviewing Part I, Part II, Part III, Part IV, Part V, Part VI and Part VII of this series.
That’s enough exploratory data analysis. I could also add the PACF plots and a few more exploratory charts, but I will move on to generating the financial time series features using the tsfeatures package.
In the code below I take a random sample of 5 groups (using the whole data set takes too long to calculate the time series features) and then apply all the functions in the tsfeatures package to each asset's time series. This is done by mapping over each asset's data and computing the time series features.
This section takes some time to compute (especially on the whole sample), so I already saved the results as a CSV. From here on I just load the pre-computed time series features from that file.
################# Generate Time Series Features ######################
# Packages used in this section (see the earlier parts of the series for the full setup).
library(dplyr)
library(tidyr)
library(purrr)
library(magrittr)
library(tsfeatures)
# I create some time series features from the package "tsfeatures". There are 40+ functions in the "tsfeatures" package
# which can generate approximately 106 time series features.
# If memory is an issue, only a few of the features can be created; in that case uncomment the 'sample' line below
# to randomly sample 20 of the functions from the "tsfeatures" package.
# We could also add in technical indicators from the "PerformanceAnalytics" or "TTR" packages (I omit these
# here, however creating 'functions2 <- ls("package:TTR")' and adding it to the 'summarise' command will work.)
functions <- ls("package:tsfeatures")[1:42]
# functions <- sample(functions, 20)
Stats <- df %>%
  group_by(row_id, class) %>%
  nest() %>%
  ungroup() %>%
  sample_n(5) %>%                      # random sample of 5 assets
  unnest() %>%
  nest(-row_id, -class) %>%
  group_by(row_id, class) %T>%
  {options(warn = -1)} %>%             # silence warnings for the feature step
  summarise(Statistics = map(data, ~ data.frame(
    bind_cols(
      tsfeatures(.x$value, functions))))) %>%
  unnest(Statistics)
# I saved the whole dataset as "Stats"; next I split it between training and test.
Stats <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")
Note: Again, bad practice by me. I simply named the data frame Stats; it consists of only the time series features. This still only refers to the train_val.csv data and not the test.csv data.
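The compute-once, load-later pattern described above can be sketched in base R as follows. This is only an illustration: `load_or_compute` and `slow_features` are hypothetical names, the toy values are made up, and the cache path is a temporary file rather than the author's CSV.

```r
# Hypothetical helper: run the slow step once, then reuse the saved CSV.
load_or_compute <- function(path, compute_fn) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  result <- compute_fn()
  # row.names = FALSE avoids writing row names, which read.csv would
  # otherwise load back as an extra column named X.
  write.csv(result, path, row.names = FALSE)
  result
}

# Toy stand-in for the slow tsfeatures step (made-up values).
slow_features <- function() {
  data.frame(row_id = 1:3, class = c(0, 1, 0), entropy = c(0.98, 0.97, 0.99))
}

cache_file <- file.path(tempdir(), "ts_features_cache.csv")
stats_first  <- load_or_compute(cache_file, slow_features)  # computes and writes
stats_second <- load_or_compute(cache_file, slow_features)  # reads the CSV
```

If the file was written with the default `row.names = TRUE`, as the X column in the table below suggests, the later `select(-X)` step is what removes that column again.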
The training data, after computing the time series features, is shown below. Each asset has now been collapsed from ~260 daily observations down to 1 single observation of time series features.
Recall that the goal here is to classify synthetic vs real time series, not to predict the next day's price. For each asset I have a single observation, and based on this I can train a classification algorithm to distinguish between real and synthetic time series.
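To make the "one row per asset" idea concrete, here is a minimal base R sketch on made-up data. The three features in `simple_features` are simple stand-ins chosen for illustration; they are not the features that tsfeatures computes, and `df_toy` is a hypothetical data frame with the same row_id, class and value columns as the post's data.

```r
set.seed(1)

# Toy long-format data: 3 assets with 260 daily values each (random numbers).
df_toy <- data.frame(
  row_id = rep(1:3, each = 260),
  class  = rep(c(0, 1, 0), each = 260),
  value  = rnorm(3 * 260)
)

# Stand-in features; the post calls tsfeatures() at this step instead.
simple_features <- function(x) {
  data.frame(
    mean_value = mean(x),
    sd_value   = sd(x),
    acf1       = acf(x, lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation
  )
}

# Split by asset and reduce each series to a single row of features.
per_asset <- lapply(split(df_toy, df_toy$row_id), function(d) {
  cbind(row_id = d$row_id[1], class = d$class[1], simple_features(d$value))
})
features_toy <- do.call(rbind, per_asset)

dim(features_toy)  # 3 rows (one per asset), 5 columns
```

The 780 rows of the long table become 3 rows, which is the same reduction the post applies to the full data set.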
How the training data looks:
X | row_id | class | ac_9_ac_9 | acf_features_x_acf1 | acf_features_x_acf10 | acf_features_diff1_acf1 | acf_features_diff1_acf10 | acf_features_diff2_acf1 | acf_features_diff2_acf10 | ARCH.LM | autocorr_features_embed2_incircle_1 | autocorr_features_embed2_incircle_2 | autocorr_features_ac_9 | autocorr_features_firstmin_ac | autocorr_features_trev_num | autocorr_features_motiftwo_entro3 | autocorr_features_walker_propcross | binarize_mean_binarize_mean | binarize_mean_NA | compengine_embed2_incircle_1 | compengine_embed2_incircle_2 | compengine_ac_9 | compengine_firstmin_ac | compengine_trev_num | compengine_motiftwo_entro3 | compengine_walker_propcross | compengine_localsimple_mean1 | compengine_localsimple_lfitac | compengine_sampen_first | compengine_std1st_der | compengine_spreadrandomlocal_meantaul_50 | compengine_spreadrandomlocal_meantaul_ac2 | compengine_histogram_mode_10 | compengine_outlierinclude_mdrmd | compengine_fluctanal_prop_r1 | crossing_points | dist_features_histogram_mode_10 | dist_features_outlierinclude_mdrmd | embed2_incircle | entropy | firstmin_ac | firstzero_ac | flat_spots | fluctanal_prop_r1_fluctanal_prop_r1 | arch_acf | garch_acf | arch_r2 | garch_r2 | histogram_mode | alpha | beta | hurst | hw_parameters_hw_parameters | hw_parameters_NA | localsimple_taures | lumpiness | max_kl_shift | time_kl_shift | max_level_shift | time_level_shift | max_var_shift | time_var_shift | motiftwo_entro3 | nonlinearity | outlierinclude_mdrmd | x_pacf5 | diff1x_pacf5 | diff2x_pacf5 | pred_features_localsimple_mean1 | pred_features_localsimple_lfitac | pred_features_sampen_first | sampen_first_sampen_first | sampenc | scal_features_fluctanal_prop_r1 | spreadrandomlocal_meantaul | stability | station_features_std1st_der | station_features_spreadrandomlocal_meantaul_50 | station_features_spreadrandomlocal_meantaul_ac2 | std1st_der_std1st_der | nperiods | seasonal_period | trend | spike | linearity | curvature | e_acf1 | e_acf10 | trev_num | tsfeatures_frequency | tsfeatures_nperiods | tsfeatures_seasonal_period | tsfeatures_trend | tsfeatures_spike | tsfeatures_linearity | tsfeatures_curvature | tsfeatures_e_acf1 | tsfeatures_e_acf10 | tsfeatures_entropy | tsfeatures_x_acf1 | tsfeatures_x_acf10 | tsfeatures_diff1_acf1 | tsfeatures_diff1_acf10 | tsfeatures_diff2_acf1 | tsfeatures_diff2_acf10 | unitroot_kpss | unitroot_pp | walker_propcross |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | -0.0675275 | 0.0097094 | 0.0526897 | -0.5005299 | 0.3297018 | -0.6772403 | 0.6124739 | 0.0627825 | 0.3929961 | 0.6147860 | -0.0675275 | 1 | 0.1208750 | 2.071663 | 0.5405405 | 1 | 1 | 0.3929961 | 0.6147860 | -0.0675275 | 1 | 0.1208750 | 2.071663 | 0.5405405 | 1 | 1 | 1.788841 | 1.408737 | 1.68 | 1.43 | -0.25 | -0.2865385 | 0.1627907 | 132 | -0.25 | -0.2865385 | 0.3929961 | 0.9840151 | 1 | 3 | 4 | 0.1627907 | 0.0652585 | 0.0154406 | 0.0627825 | 0.0253367 | -0.25 | 0.0013330 | 0.0013330 | 0.5000458 | NA | NA | 1 | 0.3556536 | 1.783636 | 103 | 1.297736 | 97 | 2.819828 | 46 | 2.071663 | 0.0752319 | -0.2865385 | 0.0108653 | 0.4457792 | 1.0525222 | 1 | 1 | 1.788841 | 1.788841 | 1.788841 | 0.1627907 | 1.76 | 0.0562693 | 1.408737 | 1.74 | 1.36 | 1.408737 | 0 | 1 | 0.0043052 | 0.0000261 | 0.8421403 | -0.7069160 | 0.0052389 | 0.0588324 | 0.1208750 | 1 | 0 | 1 | 0.0043052 | 0.0000261 | 0.8421403 | -0.7069160 | 0.0052389 | 0.0588324 | 0.9840151 | 0.0097094 | 0.0526897 | -0.5005299 | 0.3297018 | -0.6772403 | 0.6124739 | 0.0993829 | -249.7732 | 0.5405405 |
2 | 2 | 0 | -0.0421577 | -0.0075902 | 0.0387481 | -0.5171529 | 0.3129147 | -0.6727897 | 0.5379301 | 0.0558032 | 0.4285714 | 0.6563707 | -0.0421577 | 1 | -0.4765229 | 2.077581 | 0.5019305 | 1 | 1 | 0.4285714 | 0.6563707 | -0.0421577 | 1 | -0.4765229 | 2.077581 | 0.5019305 | 1 | 1 | 1.780390 | 1.419266 | 1.95 | 1.00 | 0.50 | 0.2615385 | 0.1627907 | 123 | 0.50 | 0.2615385 | 0.4285714 | 0.9864332 | 1 | 1 | 4 | 0.1627907 | 0.0664358 | 0.0657859 | 0.0558032 | 0.0554355 | 0.50 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 0.4636768 | 1.733008 | 247 | 1.311861 | 141 | 2.625772 | 221 | 2.077581 | 0.0273335 | 0.2615385 | 0.0256032 | 0.4606850 | 1.0171377 | 1 | 1 | 1.780390 | 1.780390 | 1.780390 | 0.1627907 | 2.05 | 0.0892206 | 1.419266 | 2.12 | 1.00 | 1.419266 | 0 | 1 | 0.0177460 | 0.0000399 | 0.9249561 | 0.7665407 | -0.0218053 | 0.0411861 | -0.4765229 | 1 | 0 | 1 | 0.0177460 | 0.0000399 | 0.9249561 | 0.7665407 | -0.0218053 | 0.0411861 | 0.9864332 | -0.0075902 | 0.0387481 | -0.5171529 | 0.3129147 | -0.6727897 | 0.5379301 | 0.0414599 | -256.0485 | 0.5019305 |
3 | 3 | 1 | 0.0099598 | -0.0405929 | 0.0449036 | -0.5026683 | 0.3471209 | -0.6718885 | 0.6109006 | 0.0325470 | 0.4671815 | 0.7065637 | 0.0099598 | 1 | -0.8755173 | 2.069233 | 0.5328185 | 1 | 0 | 0.4671815 | 0.7065637 | 0.0099598 | 1 | -0.8755173 | 2.069233 | 0.5328185 | 1 | 1 | 1.706841 | 1.443315 | 1.38 | 1.00 | -0.50 | -0.2538462 | 0.1395349 | 132 | -0.50 | -0.2538462 | 0.4671815 | 0.9868568 | 1 | 1 | 6 | 0.1395349 | 0.0388513 | 0.0039162 | 0.0325470 | 0.0041902 | -0.50 | 0.0014557 | 0.0014557 | 0.5000458 | NA | NA | 1 | 1.2670493 | 7.746711 | 95 | 1.403784 | 87 | 5.235499 | 84 | 2.069233 | 0.2436499 | -0.2538462 | 0.0223069 | 0.5356408 | 0.9954919 | 1 | 1 | 1.706841 | 1.706841 | 1.706841 | 0.1395349 | 1.42 | 0.0716499 | 1.443315 | 1.42 | 1.00 | 1.443315 | 0 | 1 | 0.0141368 | 0.0000929 | 0.8414359 | -0.0259311 | -0.0547484 | 0.0492987 | -0.8755173 | 1 | 0 | 1 | 0.0141368 | 0.0000929 | 0.8414359 | -0.0259311 | -0.0547484 | 0.0492987 | 0.9868568 | -0.0405929 | 0.0449036 | -0.5026683 | 0.3471209 | -0.6718885 | 0.6109006 | 0.0775698 | -258.1295 | 0.5328185 |
4 | 4 | 0 | -0.0428748 | -0.0443619 | 0.0615867 | -0.4571442 | 0.3184053 | -0.5906478 | 0.4361178 | 0.1275576 | 0.4555985 | 0.7027027 | -0.0428748 | 2 | -0.9943808 | 2.068744 | 0.4903475 | 0 | 0 | 0.4555985 | 0.7027027 | -0.0428748 | 2 | -0.9943808 | 2.068744 | 0.4903475 | 1 | 1 | 1.660825 | 1.445807 | 1.24 | 1.00 | 0.25 | 0.0153846 | 0.1395349 | 127 | 0.25 | 0.0153846 | 0.4555985 | 0.9790521 | 2 | 1 | 7 | 0.1395349 | 0.0694296 | 0.0112709 | 0.0579144 | 0.0123884 | 0.25 | 0.0480021 | 0.0001000 | 0.5000458 | NA | NA | 1 | 1.0068624 | 4.994753 | 132 | 1.258758 | 173 | 5.886911 | 156 | 2.068744 | 0.3840091 | 0.0153846 | 0.0503205 | 0.5402603 | 1.1070217 | 1 | 1 | 1.660825 | 1.660825 | 1.660825 | 0.1395349 | 1.10 | 0.1065111 | 1.445807 | 1.14 | 1.00 | 1.445807 | 0 | 1 | 0.0283540 | 0.0000482 | -1.2297854 | 0.2921899 | -0.0728152 | 0.0752389 | -0.9943808 | 1 | 0 | 1 | 0.0283540 | 0.0000482 | -1.2297854 | 0.2921899 | -0.0728152 | 0.0752389 | 0.9790521 | -0.0443619 | 0.0615867 | -0.4571442 | 0.3184053 | -0.5906478 | 0.4361178 | 0.2129633 | -262.0781 | 0.4903475 |
5 | 5 | 0 | 0.0259312 | -0.2447835 | 0.1469130 | -0.5810073 | 0.4796508 | -0.6799229 | 0.6232529 | 0.2014861 | 0.6563707 | 0.7992278 | 0.0259312 | 1 | -0.7167079 | 2.059764 | 0.5289575 | 1 | 0 | 0.6563707 | 0.7992278 | 0.0259312 | 1 | -0.7167079 | 2.059764 | 0.5289575 | 1 | 1 | 1.347789 | 1.580825 | 1.08 | 0.98 | -0.50 | 0.7961538 | 0.1627907 | 133 | -0.50 | 0.7961538 | 0.6563707 | 0.9723766 | 1 | 1 | 9 | 0.1627907 | 0.2718058 | 0.2229375 | 0.1765130 | 0.1330761 | -0.50 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 2.8846415 | 11.474426 | 80 | 1.772392 | 229 | 8.468236 | 236 | 2.059764 | 0.2143595 | 0.7961538 | 0.1008392 | 0.7538746 | 1.2926800 | 1 | 1 | 1.347789 | 1.347789 | 1.347789 | 0.1627907 | 1.08 | 0.0797924 | 1.580825 | 1.06 | 0.98 | 1.580825 | 0 | 1 | 0.0121072 | 0.0001568 | -0.5488436 | 0.2255538 | -0.2599764 | 0.1558209 | -0.7167079 | 1 | 0 | 1 | 0.0121072 | 0.0001568 | -0.5488436 | 0.2255538 | -0.2599764 | 0.1558209 | 0.9723766 | -0.2447835 | 0.1469130 | -0.5810073 | 0.4796508 | -0.6799229 | 0.6232529 | 0.1506344 | -323.5672 | 0.5289575 |
6 | 6 | 0 | -0.0761166 | 0.0468556 | 0.0858348 | -0.5253131 | 0.3438031 | -0.6901570 | 0.6130725 | 0.0432628 | 0.4352941 | 0.6627451 | -0.0761166 | 1 | 0.0898648 | 2.068914 | 0.5250965 | 1 | 1 | 0.4352941 | 0.6627451 | -0.0761166 | 1 | 0.0898648 | 2.068914 | 0.5250965 | 1 | 1 | 1.751575 | 1.381854 | 2.69 | 1.71 | -0.25 | -0.0846154 | 0.3488372 | 134 | -0.25 | -0.0846154 | 0.4352941 | 0.9806218 | 1 | 5 | 5 | 0.3488372 | 0.0500806 | 0.0502154 | 0.0627968 | 0.0620877 | -0.25 | 0.0286244 | 0.0001000 | 0.5188805 | NA | NA | 1 | 0.2189481 | 3.145763 | 141 | 1.447883 | 80 | 2.077936 | 84 | 2.068914 | 0.0137733 | -0.0846154 | 0.0172321 | 0.4345976 | 1.0881798 | 1 | 1 | 1.751575 | 1.751575 | 1.751575 | 0.3488372 | 2.61 | 0.1479673 | 1.381854 | 2.63 | 1.81 | 1.381854 | 0 | 1 | 0.0077481 | 0.0000329 | -0.5473782 | 0.4505809 | 0.0410068 | 0.0873468 | 0.0898648 | 1 | 0 | 1 | 0.0077481 | 0.0000329 | -0.5473782 | 0.4505809 | 0.0410068 | 0.0873468 | 0.9806218 | 0.0468556 | 0.0858348 | -0.5253131 | 0.3438031 | -0.6901570 | 0.6130725 | 0.0259414 | -262.3484 | 0.5250965 |
## [1] 12000 109
The data still has 12,000 rows, now with 109 columns: 106 time series features created with the tsfeatures package, plus X, row_id and class. That is, we have 6,000 synthetic and 6,000 real financial time series. The original data had roughly 12,000 * ~260 = 3,120,000 rows, but applying tsfeatures collapsed the ~260 observations down to 1 single observation for each asset.
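A few quick sanity checks can confirm these numbers on a feature table. The sketch below runs them on a made-up stand-in (`stats_toy` is hypothetical, with only two feature columns); on the real Stats table the same checks should report 12,000 rows, 106 feature columns and 6,000 rows per class, if the figures above hold.

```r
# Toy stand-in for Stats: 6 assets, balanced classes, 2 feature columns.
stats_toy <- data.frame(
  X       = 1:6,
  row_id  = 1:6,
  class   = rep(c(0, 1), times = 3),
  entropy = runif(6),
  hurst   = runif(6)
)

# Columns that identify or label an asset rather than describe it.
id_cols      <- c("X", "row_id", "class")
feature_cols <- setdiff(names(stats_toy), id_cols)

dim(stats_toy)          # rows = assets, columns = ids + features
length(feature_cols)    # number of model inputs
table(stats_toy$class)  # class balance

# Each asset should appear exactly once after the collapse.
stopifnot(!anyDuplicated(stats_toy$row_id))
```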
I collapsed this problem down from a time series forecasting problem to a pure classification problem. Next I split the data into a training set and a validation set, and then into x_train, y_train, etc.
I split the df/Stats data set into a training set with 75% of the observations and an in-sample test (validation) set with the remaining 25% of the observations.
######################################################################
################# Train an XGBoost model on the TS Features ##########
#Stats <- Stats %>%
# select_if(~sum(!is.na(.)) > 0)
# Split the training set up between train and a small validation set
smp_size <- floor(0.75 * nrow(Stats))
#set.seed(123)
train_ind <- sample(seq_len(nrow(Stats)), size = smp_size)
train <- Stats[train_ind, ]
val <- Stats[-train_ind, ]
# We have 106 time series features for the model to learn from.
x_train <- train %>%
  ungroup() %>%
  select(-class, -row_id, -X) %>%
  as.matrix()
x_val <- val %>%
  ungroup() %>%
  select(-class, -row_id, -X) %>%
  as.matrix()
y_train <- train %>%
  ungroup() %>%
  pull(class)
y_val <- val %>%
  ungroup() %>%
  pull(class)
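The split above draws rows at random with the seed commented out, so each run produces a different partition. As a hedged alternative, the base R sketch below fixes the seed, drops columns that are entirely NA (a base R version of the commented-out select_if step) and samples 75% within each class. The stratification is my addition for illustration, not the author's method, and `stats_toy` is a made-up stand-in for Stats.

```r
set.seed(123)  # fixing the seed makes the split repeatable

# Toy stand-in for Stats (made-up values); one column is entirely NA.
stats_toy <- data.frame(
  X = 1:40, row_id = 1:40, class = rep(c(0, 1), each = 20),
  entropy = runif(40), hurst = runif(40), all_missing = NA_real_
)

# Drop columns with no observed values.
keep      <- colSums(!is.na(stats_toy)) > 0
stats_toy <- stats_toy[, keep, drop = FALSE]

# Stratified 75/25 split: sample 75% of the row indices within each class.
rows_by_class <- split(seq_len(nrow(stats_toy)), stats_toy$class)
train_ind <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, floor(0.75 * length(idx)))
}))
train <- stats_toy[train_ind, ]
val   <- stats_toy[-train_ind, ]

# Separate inputs (feature matrix) from the label.
id_cols <- c("X", "row_id", "class")
x_train <- as.matrix(train[, setdiff(names(train), id_cols)])
y_train <- train$class
x_val   <- as.matrix(val[, setdiff(names(val), id_cols)])
y_val   <- val$class

table(y_train)  # 15 rows of each class
table(y_val)    # 5 rows of each class
```

With 6,000 rows per class the plain random split is already close to balanced, so stratifying mainly matters for small samples like this toy one.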
Stay tuned for the next installment to find out how the training X (input variables) data looks.
Visit Matthew Smith – R Blog to download the complete R code and see additional details featured in this tutorial: https://lf0.com/post/synth-real-time-series/financial-time-series/
Disclosure: Interactive Brokers
Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.
This material is from Matthew Smith - R Blog and is being posted with its permission. The views expressed in this material are solely those of the author and/or Matthew Smith - R Blog and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.