{"id":203181,"date":"2024-03-07T11:30:18","date_gmt":"2024-03-07T16:30:18","guid":{"rendered":"https:\/\/ibkrcampus.com\/?p=203181"},"modified":"2024-03-07T11:33:16","modified_gmt":"2024-03-07T16:33:16","slug":"clean-transform-optimize-the-power-of-data-preprocessing-part-ii","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/","title":{"rendered":"Clean, Transform, Optimize: The Power of Data Preprocessing &#8211; Part II"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>See <a href=\"\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-i\/\">Part I<\/a> for an intro to data preprocessing.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-preprocessing-with-python-for-different-dataset-types\">Data preprocessing with Python for different dataset types<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that you know the different dataset errors, we can go ahead with learning how to use Python for preprocessing such a dataset.<strong><a href=\"https:\/\/bdataanalytics.biomedcentral.com\/articles\/10.1186\/s41044-016-0014-0\">\u207d\u00b2\u207e<\/a><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us learn about these dataset errors:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Missing values in a dataset<\/li>\n\n\n\n<li>Outliers in a dataset<\/li>\n\n\n\n<li>Overfitting in a dataset<\/li>\n\n\n\n<li>Data with no numerical values in a dataset<\/li>\n\n\n\n<li>Different date formats in a dataset<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"missing-values-in-a-dataset\">Missing values in a dataset<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Missing values are a common problem while dealing with data! 
Values can go missing for various reasons, such as human error or mechanical failure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Data cleansing is an important step before you even begin the algorithmic trading process, which begins with historical data analysis to make the prediction model as accurate as possible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You then build the trading strategy on this prediction model. Leaving missing values in the dataset can therefore produce faulty predictions, which in turn lead to a flawed strategy and poor trading results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are three techniques for handling missing values:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Dropping<\/li>\n\n\n\n<li>Numerical imputation<\/li>\n\n\n\n<li>Categorical imputation<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Dropping<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Dropping is the most common way to handle missing values. Rows, or entire columns, that contain missing values are removed so that they do not cause errors in the data analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Some tools drop rows or columns with missing values automatically, which reduces the size of the training set. Dropping can therefore reduce model performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A simple remedy for the reduced training size is imputation, which we discuss further below. If you do drop data, you can define a threshold for it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The threshold can be any fraction of the data, for example 50%, 60% or 70%. 
Let us take 60% in our example, which means that columns (features) and rows with less than 60% missing values are kept for training, while those with 60% or more missing values are dropped.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The following Python code performs the dropping:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">#Threshold: fraction of missing values at which a column or row is dropped\nthreshold = 0.6\n\n#Dropping columns in which 60% or more of the values are missing\ndata = data[data.columns[data.isnull().mean() &lt; threshold]]\n\n#Dropping rows in which 60% or more of the values are missing\ndata = data.loc[data.isnull().mean(axis=1) &lt; threshold]<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/273e6b316678810e685bd567e468814a#file-dropping-py\">Dropping.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With the code above, the columns and rows that reach the threshold are dropped, and the machine learning model learns from the remaining data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Numerical imputation<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Imputation means replacing the missing values with a value that makes sense. 
Numerical imputation is applied to numerical data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, if there is a tabular dataset with the number of stocks, commodities and derivatives traded in a month as the columns, it is better to replace a missing value with \u201c0\u201d than to leave it as it is.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With numerical imputation, the data size is preserved, so predictive models such as linear regression have more data to learn from and can predict more accurately.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most linear regression implementations cannot be fitted on data that contains missing values, so these have to be filled in or removed first. The missing values can also be replaced with the median of the column, since the median is not sensitive to outliers, unlike the column average.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The Python code for numerical imputation is as follows:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">#For filling all the missing values with 0\ndata = data.fillna(0)\n\n#For replacing missing values with the median of each numeric column\ndata = data.fillna(data.median(numeric_only=True))<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/2a5ce99125d16624de1c62f44b2bf007#file-numerical-imputation-py\">Numerical imputation.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Categorical imputation<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This technique replaces the missing values in a column with the value that occurs most often in that column. 
However, if no value occurs frequently enough to dominate the others, it is best to fill the missing entries in as \u201cNaN\u201d.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The following Python code can be used here:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">#Categorical imputation: fill missing values with the most frequent value\ndata['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/df804e8ff360ce19681371849d3eb410#file-categorical-imputation-py\">Categorical imputation.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"outliers-in-a-dataset\">Outliers in a dataset<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">An outlier is a value that differs significantly from the other values and lies far from their mean. 
Outliers are usually caused by systematic errors or flaws.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The following Python code identifies and removes outliers with the&nbsp;<a href=\"https:\/\/quantra.quantinsti.com\/glossary\/Standard-Deviation\">standard deviation<\/a>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\n\n#For identifying the outliers with the standard deviation method\n#Assumed limits: mean plus or minus 3 standard deviations of 1-D numeric data\nmean, std = np.mean(data), np.std(data)\nlower, upper = mean - 3 * std, mean + 3 * std\noutliers = [x for x in data if x &lt; lower or x &gt; upper]\nprint('Identified outliers: %d' % len(outliers))\n\n#Remove outliers\noutliers_removed = [x for x in data if x &gt;= lower and x &lt;= upper]\nprint('Non-outlier observations: %d' % len(outliers_removed))<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/06abdf4f7c4bc20d6e20e418cfb77e85#file-identify-and-remove-py\">Identify and remove.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the code above, \u201clower\u201d and \u201cupper\u201d are the lower and upper limits for the dataset, set here (as an assumption) to three standard deviations below and above the mean.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"overfitting-in-a-dataset\">Overfitting in a dataset<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In both machine learning and statistics,&nbsp;<a href=\"https:\/\/blog.quantinsti.com\/machine-learning-basics\/\">overfitting<\/a>&nbsp;occurs when the model fits the training data too closely, typically because the model is too complex.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An overfitted model learns the detail and noise in the training data to such an extent that its performance on new data\/test data suffers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The overfitting problem can be 
solved by decreasing the number of features\/inputs or by increasing the number of training examples, so that the machine learning algorithm generalises better.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The most common solution to overfitting is regularisation. Binning is one technique that helps regularise the data, although some information is lost every time values are grouped into bins.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, in the case of numerical binning, the data can be as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Stock value<\/strong><\/td><td><strong>Bin<\/strong><\/td><\/tr><tr><td>100-250<\/td><td>Lowest<\/td><\/tr><tr><td>251-400<\/td><td>Mid<\/td><\/tr><tr><td>401-500<\/td><td>High<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the Python code for binning:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\ndata['bin'] = pd.cut(data['value'], bins=[100,250,400,500], labels=[\"Lowest\", \"Mid\", \"High\"], include_lowest=True)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/bb93428bf25b6860ef458482c5d3b22a#file-binning-py\">Binning.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Your output should look something like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">   value     bin\n0    102  Lowest\n1    300     Mid\n2    107  Lowest\n3    470    High<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" 
id=\"data-with-no-numerical-values-in-a-dataset\">Data with no numerical values in a dataset<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When a dataset contains non-numerical (categorical) values, most machine learning models cannot learn from them directly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most machine learning models can only handle numerical values, so it is best to spread each category into its own column holding the binary value \u201c0\u201d or \u201c1\u201d. This technique is known as one-hot encoding.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This technique starts from an existing grouped (categorical) column. For instance, the table below contains one such column:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Infected&nbsp;<\/strong><\/td><td><strong>Covid variants<\/strong><\/td><\/tr><tr><td>2<\/td><td>Delta<\/td><\/tr><tr><td>4<\/td><td>Lambda<\/td><\/tr><tr><td>5<\/td><td>Omicron<\/td><\/tr><tr><td>6<\/td><td>Lambda<\/td><\/tr><tr><td>4<\/td><td>Delta<\/td><\/tr><tr><td>3<\/td><td>Omicron<\/td><\/tr><tr><td>5<\/td><td>Omicron<\/td><\/tr><tr><td>4<\/td><td>Lambda&nbsp;<\/td><\/tr><tr><td>2<\/td><td>Delta<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now, the grouped data above can be encoded with the binary numbers \u201c0\u201d and \u201c1\u201d using the one-hot encoding technique. 
This technique converts the categorical data into a numerical format in the following manner:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Infected&nbsp;<\/strong><\/td><td><strong>Delta<\/strong><\/td><td><strong>Lambda<\/strong><\/td><td><strong>Omicron<\/strong><\/td><\/tr><tr><td>2<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>4<\/td><td>0<\/td><td>1<\/td><td>0<\/td><\/tr><tr><td>5<\/td><td>0<\/td><td>0<\/td><td>1<\/td><\/tr><tr><td>6<\/td><td>0<\/td><td>1<\/td><td>0<\/td><\/tr><tr><td>4<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>3<\/td><td>0<\/td><td>0<\/td><td>1<\/td><\/tr><tr><td>5<\/td><td>0<\/td><td>0<\/td><td>1<\/td><\/tr><tr><td>4<\/td><td>0<\/td><td>1<\/td><td>0<\/td><\/tr><tr><td>2<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This makes grouped data easier to handle, because the machine learning model can work directly with the encoded (numerical) information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem with the approach<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One-hot encoding creates one column per category, so the number of columns grows with the number of categories in the dataset fed to the machine learning model. If there are 2000 categories, this technique will create 2000 columns, which is a lot of information to feed to the model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Solution<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To solve this problem, we can apply the target encoding technique instead, which means calculating the \u201cmean\u201d of the target values within each predictor category and using that mean for every row with the same category under the predictor column. 
This converts the categorical column into a numeric column, which is our main aim.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us understand this with the same example as above, but this time we will use the \u201cmean\u201d of the values that share a category across all the rows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Python, we can use the following code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\n\n#Convert data into numerical values with mean\nInfected = [2, 4, 5, 6, 4, 3]\nPredictor = ['Delta', 'Lambda', 'Omicron', 'Lambda', 'Delta', 'Omicron']\nInfected_df = pd.DataFrame(data={'Infected':Infected, 'Predictor':Predictor})\nmeans = Infected_df.groupby('Predictor')['Infected'].mean()\nInfected_df['Predictor_encoded'] = Infected_df['Predictor'].map(means)\nInfected_df<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/c2416193e9ab6bebada455f46ddace1b#file-data-with-no-numerical-values-py\">Data with no numerical values.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Output:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Infected&nbsp;<\/td><td>Predictor<\/td><td>Predictor_encoded&nbsp;<\/td><\/tr><tr><td>2<\/td><td>Delta<\/td><td>3<\/td><\/tr><tr><td>4<\/td><td>Lambda<\/td><td>5<\/td><\/tr><tr><td>5<\/td><td>Omicron<\/td><td>4<\/td><\/tr><tr><td>6<\/td><td>Lambda<\/td><td>5<\/td><\/tr><tr><td>4<\/td><td>Delta<\/td><td>3<\/td><\/tr><tr><td>3<\/td><td>Omicron<\/td><td>4<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In the output above, the Predictor column depicts the Covid variants and the Predictor_encoded column 
depicts the \u201cmean\u201d of the Infected values for the same category of Covid variants, which makes (2+4)\/2 = 3 the mean value for Delta, (4+6)\/2 = 5 the mean value for Lambda and so on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Hence, the machine learning model can be fed a single numeric feature for each predictor category.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"different-date-formats-in-a-dataset\">Different date formats in a dataset<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Dates can appear in different formats, such as \u201c25-12-2021\u201d or \u201c25th December 2021\u201d. Unless they are converted to a single representation, it is difficult for the machine learning model to interpret all the formats.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With such a dataset, you can preprocess the data by decomposing each date into three separate columns for its parts: Year, Month and Day.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Python, the preprocessing of the data with different columns for the date will look like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">#Convert to datetime object\ndf['Date'] = pd.to_datetime(df['Date'])\n\n#Decomposition\ndf['Year'] = df['Date'].dt.year\ndf['Month'] = df['Date'].dt.month\ndf['Day'] = df['Date'].dt.day\ndf[['Year','Month','Day']].head()<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gist.github.com\/quantra-go-algo\/ee83c1077463b213f9463d6933a5edce#file-decomposing-date-py\">Decomposing date.py\u00a0<\/a>hosted with \u2764 by\u00a0<a href=\"https:\/\/github.com\/\">GitHub<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Output:<\/p>\n\n\n\n<figure 
class=\"wp-block-table\"><table><tbody><tr><td><strong>Year<\/strong><\/td><td><strong>Month<\/strong><\/td><td><strong>Day<\/strong><\/td><\/tr><tr><td>2019<\/td><td>1<\/td><td>5<\/td><\/tr><tr><td>2019<\/td><td>3<\/td><td>8<\/td><\/tr><tr><td>2019<\/td><td>3<\/td><td>3<\/td><\/tr><tr><td>2019<\/td><td>1<\/td><td>27<\/td><\/tr><tr><td>2019<\/td><td>2<\/td><td>8<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In the output above, each date is represented by numerical columns. Because the date is decomposed into parts such as Year, Month and Day, the machine learning model is able to learn from it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The entire data cleaning process described above is also known as data wrangling.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the field of machine learning, effective data preprocessing in Python is crucial for enhancing the quality and reliability of the input data, ultimately improving the performance of the model during training and inference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Author:\u00a0Chainika Thakar<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Stay tuned for Part III to learn about Data cleaning vs data preprocessing.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Originally posted on\u00a0<a href=\"https:\/\/blog.quantinsti.com\/data-preprocessing\/\">QuantInsti<\/a>\u00a0blog.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data cleansing is an important step before you even begin the algorithmic trading process, which begins with historical data analysis to make the prediction model as accurate as 
possible.<\/p>\n","protected":false},"author":186,"featured_media":185725,"comment_status":"open","ping_status":"closed","sticky":true,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[339,343,349,338,341],"tags":[16746,7617,806,865,852,595],"contributors-categories":[13654],"class_list":["post-203181","post","type-post","status-publish","format-standard","has-post-thumbnail","category-data-science","category-programing-languages","category-python-development","category-ibkr-quant-news","category-quant-development","tag-data-cleaning","tag-data-preprocessing","tag-data-science","tag-github","tag-machine-learning","tag-python","contributors-categories-quantinsti"],"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v28.0) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Clean, Transform, Optimize: The Power of Data Preprocessing &#8211; Part II<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/203181\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Clean, Transform, Optimize: The Power of Data Preprocessing - Part II\" \/>\n<meta property=\"og:description\" content=\"Data cleansing is an important step before you even begin the algorithmic trading process, which begins with historical data analysis to make the prediction model as accurate as possible.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/\" \/>\n<meta 
property=\"og:site_name\" content=\"IBKR Campus US\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-07T16:30:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-03-07T16:33:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Contributor Author\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Contributor Author\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\n\t    \"@context\": \"https:\\\/\\\/schema.org\",\n\t    \"@graph\": [\n\t        {\n\t            \"@type\": \"NewsArticle\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#article\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/\"\n\t            },\n\t            \"author\": {\n\t                \"name\": \"Contributor Author\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/e823e46b42ca381080387e794318a485\"\n\t            },\n\t            \"headline\": \"Clean, Transform, Optimize: The Power of Data Preprocessing &#8211; Part II\",\n\t            \"datePublished\": \"2024-03-07T16:30:18+00:00\",\n\t            \"dateModified\": 
\"2024-03-07T16:33:16+00:00\",\n\t            \"mainEntityOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/\"\n\t            },\n\t            \"wordCount\": 1518,\n\t            \"commentCount\": 0,\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-yellow-background.jpg\",\n\t            \"keywords\": [\n\t                \"Data Cleaning\",\n\t                \"Data Preprocessing\",\n\t                \"Data Science\",\n\t                \"GitHub\",\n\t                \"Machine Learning\",\n\t                \"Python\"\n\t            ],\n\t            \"articleSection\": [\n\t                \"Data Science\",\n\t                \"Programming Languages\",\n\t                \"Python Development\",\n\t                \"Quant\",\n\t                \"Quant Development\"\n\t            ],\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"CommentAction\",\n\t                    \"name\": \"Comment\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#respond\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"WebPage\",\n\t            \"@id\": 
\"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/\",\n\t            \"name\": \"Clean, Transform, Optimize: The Power of Data Preprocessing - Part II | IBKR Campus US\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\"\n\t            },\n\t            \"primaryImageOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#primaryimage\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-yellow-background.jpg\",\n\t            \"datePublished\": \"2024-03-07T16:30:18+00:00\",\n\t            \"dateModified\": \"2024-03-07T16:33:16+00:00\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"ReadAction\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"ImageObject\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"@id\": 
\"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\\\/#primaryimage\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-yellow-background.jpg\",\n\t            \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-yellow-background.jpg\",\n\t            \"width\": 1000,\n\t            \"height\": 563,\n\t            \"caption\": \"Python Quant\"\n\t        },\n\t        {\n\t            \"@type\": \"WebSite\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"name\": \"IBKR Campus US\",\n\t            \"description\": \"Financial Education from Interactive Brokers\",\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"SearchAction\",\n\t                    \"target\": {\n\t                        \"@type\": \"EntryPoint\",\n\t                        \"urlTemplate\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/?s={search_term_string}\"\n\t                    },\n\t                    \"query-input\": {\n\t                        \"@type\": \"PropertyValueSpecification\",\n\t                        \"valueRequired\": true,\n\t                        \"valueName\": \"search_term_string\"\n\t                    }\n\t                }\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"Organization\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\",\n\t            \"name\": \"Interactive Brokers\",\n\t            
\"alternateName\": \"IBKR\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"logo\": {\n\t                \"@type\": \"ImageObject\",\n\t                \"inLanguage\": \"en-US\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\",\n\t                \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"width\": 669,\n\t                \"height\": 669,\n\t                \"caption\": \"Interactive Brokers\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\"\n\t            },\n\t            \"publishingPrinciples\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/about-ibkr-campus\\\/\",\n\t            \"ethicsPolicy\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/cyber-security-notice\\\/\"\n\t        },\n\t        {\n\t            \"@type\": \"Person\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/e823e46b42ca381080387e794318a485\",\n\t            \"name\": \"Contributor Author\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/author\\\/contributor-author\\\/\"\n\t        }\n\t    ]\n\t}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Clean, Transform, Optimize: The Power of Data Preprocessing &#8211; Part II","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/203181\/","og_locale":"en_US","og_type":"article","og_title":"Clean, Transform, Optimize: The Power of Data Preprocessing - Part II","og_description":"Data cleansing is an important step before you even begin the algorithmic trading process, which begins with historical data analysis to make the prediction model as accurate as possible.","og_url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/","og_site_name":"IBKR Campus US","article_published_time":"2024-03-07T16:30:18+00:00","article_modified_time":"2024-03-07T16:33:16+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","type":"image\/jpeg"}],"author":"Contributor Author","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Contributor Author","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#article","isPartOf":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/"},"author":{"name":"Contributor Author","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/e823e46b42ca381080387e794318a485"},"headline":"Clean, Transform, Optimize: The Power of Data Preprocessing &#8211; Part II","datePublished":"2024-03-07T16:30:18+00:00","dateModified":"2024-03-07T16:33:16+00:00","mainEntityOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/"},"wordCount":1518,"commentCount":0,"publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","keywords":["Data Cleaning","Data Preprocessing","Data Science","GitHub","Machine Learning","Python"],"articleSection":["Data Science","Programming Languages","Python Development","Quant","Quant 
Development"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/","url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/","name":"Clean, Transform, Optimize: The Power of Data Preprocessing - Part II | IBKR Campus US","isPartOf":{"@id":"https:\/\/ibkrcampus.com\/campus\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#primaryimage"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","datePublished":"2024-03-07T16:30:18+00:00","dateModified":"2024-03-07T16:33:16+00:00","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/clean-transform-optimize-the-power-of-data-preprocessing-part-ii\/#primaryimage","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","width":1000,"height":563,"caption":"Python 
Quant"},{"@type":"WebSite","@id":"https:\/\/ibkrcampus.com\/campus\/#website","url":"https:\/\/ibkrcampus.com\/campus\/","name":"IBKR Campus US","description":"Financial Education from Interactive Brokers","publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ibkrcampus.com\/campus\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ibkrcampus.com\/campus\/#organization","name":"Interactive Brokers","alternateName":"IBKR","url":"https:\/\/ibkrcampus.com\/campus\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","width":669,"height":669,"caption":"Interactive Brokers"},"image":{"@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/"},"publishingPrinciples":"https:\/\/www.interactivebrokers.com\/campus\/about-ibkr-campus\/","ethicsPolicy":"https:\/\/www.interactivebrokers.com\/campus\/cyber-security-notice\/"},{"@type":"Person","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/e823e46b42ca381080387e794318a485","name":"Contributor 
Author","url":"https:\/\/www.interactivebrokers.com\/campus\/author\/contributor-author\/"}]}},"jetpack_featured_media_url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-yellow-background.jpg","_links":{"self":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/203181","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/users\/186"}],"replies":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/comments?post=203181"}],"version-history":[{"count":0,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/203181\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media\/185725"}],"wp:attachment":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media?parent=203181"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/categories?post=203181"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/tags?post=203181"},{"taxonomy":"contributors-categories","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/contributors-categories?post=203181"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}