{"id":182412,"date":"2022-11-22T15:40:00","date_gmt":"2022-11-22T20:40:00","guid":{"rendered":"https:\/\/ibkrcampus.com\/traders-insight\/sklearn-an-introduction-guide-to-machine-learning\/"},"modified":"2023-02-13T17:09:41","modified_gmt":"2023-02-13T22:09:41","slug":"sklearn-an-introduction-guide-to-machine-learning","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/","title":{"rendered":"Sklearn \u2013 An Introduction Guide to Machine Learning"},"content":{"rendered":"\n<p><em>The article &#8220;<a href=\"https:\/\/algotrading101.com\/learn\/sklearn-guide\/\">Sklearn \u2013 An Introduction Guide to Machine Learning<\/a>&#8221; first appeared on AlgoTrading101 Blog.<\/em><\/p>\n\n\n\n<p><strong><em>Excerpt<\/em><\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-is-sklearn\">What is Sklearn?<\/h2>\n\n\n\n<p>Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.<\/p>\n\n\n\n<p>It is also one of the most used machine learning libraries and is built on top of SciPy.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-is-sklearn-used-for\">What is Sklearn used for?<\/h2>\n\n\n\n<p>The Sklearn Library is mainly used for modeling data and it provides efficient tools that are easy to use for any kind of predictive data analysis.<\/p>\n\n\n\n<p>The main use cases of this library can be categorized into 6 categories which are the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Preprocessing<\/li>\n\n\n\n<li>Regression<\/li>\n\n\n\n<li>Classification<\/li>\n\n\n\n<li>Clustering<\/li>\n\n\n\n<li>Model Selection<\/li>\n\n\n\n<li>Dimensionality Reduction<\/li>\n<\/ul>\n\n\n\n<p>As this article is mainly aimed at beginners, we will stick to the core concepts of each category and explore some of its most popular features and algorithms.<\/p>\n\n\n\n<p>Advanced readers can use 
this article as a refresher on some of the main use cases and intuitions behind popular sklearn features that most ML practitioners couldn\u2019t live without.<\/p>\n\n\n\n<p>Each category will be explained in a beginner-friendly and illustrative way followed by the most used models, the intuition behind them, and hands-on experience. But first, we need to set up our sklearn library.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-to-download-sklearn-for-python\">How to download Sklearn for Python?<\/h2>\n\n\n\n<p>Sklearn can be installed with the&nbsp;<code>pip install&nbsp;<\/code>command as shown below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pip install -U scikit-learn<\/code><\/pre>\n\n\n\n<p>Sklearn developers strongly advise using a virtual environment (venv) or a conda environment when working with the library as it helps to avoid potential conflicts with other packages.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to pick the best Sklearn model?<\/h2>\n\n\n\n<p>When it comes to picking the best Sklearn model, many factors come into play, ranging from experience and data to the problem scope and the math behind each algorithm.<\/p>\n\n\n\n<p>Sometimes all chosen algorithms can have similar results and, depending on the problem setting, you will need to pick the one that is the fastest or the one that generalizes the best on big data.<\/p>\n\n\n\n<p>It may happen that none of your promising models performs well enough and that you will simply need to combine multiple models (e.g. 
ensemble), make your own custom-made model, or go for a deep learning approach.<\/p>\n\n\n\n<p>As picking the right model is one of the foundations of your problem solving, it is wise to read up on as many models and their uses as you can.<\/p>\n\n\n\n<p>As model selection would be an article, or even a book, in itself, I\u2019ll only provide some rough guidelines in the form of questions that you\u2019ll need to ask yourself when deciding which model to deploy.<\/p>\n\n\n\n<p><strong>How much data do you have?<\/strong><\/p>\n\n\n\n<p>Some models are better on smaller datasets while others require more data and tend to generalize better on larger datasets (e.g. SGD Regressor vs Lasso Regression).<\/p>\n\n\n\n<p><strong>What are the main characteristics of your data?<\/strong><\/p>\n\n\n\n<p>Is your data linear, quadratic, or all over the place? What do your distributions look like? Is your data made up of numbers or strings? Is the data labeled?<\/p>\n\n\n\n<p><strong>What kind of a problem are you solving?<\/strong><br><br>Are you trying to predict which cat will push the most jars off the table, whether that is a dog or a cat, or which breeds make up a group of dogs?<\/p>\n\n\n\n<p>All of these questions have different approaches and solutions. 
Later in the article we will therefore explore the three main problem types:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a>regression<\/a><\/li>\n\n\n\n<li><a>classification<\/a><\/li>\n\n\n\n<li><a>clustering<\/a><\/li>\n<\/ul>\n\n\n\n<p><a><strong>How do your models perform when compared against each other?<\/strong><\/a><\/p>\n\n\n\n<p>You will see that scikit-learn comes equipped with functions that allow us to inspect each model on several characteristics and compare it to the other ones.<\/p>\n\n\n\n<p>Take note that scikit-learn has created a good <a href=\"https:\/\/scikit-learn.org\/stable\/tutorial\/machine_learning_map\/index.html\">algorithm cheat-sheet<\/a> that aids you in your model selection and I\u2019d advise having it near you at those troubling times.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn preprocessing \u2013 Prepare the data for analysis<\/h2>\n\n\n\n<p>When you think of data you probably have in mind a ginormous Excel spreadsheet full of rows and columns with numbers in them. In reality, data can come in a plethora of formats like images, videos and audio.<\/p>\n\n\n\n<p>The main job of data preprocessing is to turn this data into a format that our algorithm can read. A machine can\u2019t just \u201clisten in\u201d to an audiotape to learn voice recognition; it needs the audio to be converted to numbers.<\/p>\n\n\n\n<p>The main building blocks of our dataset are called features which can be categorical or numerical. Simply put, categorical data is used to group data with similar characteristics while numerical data provides information with numbers.<\/p>\n\n\n\n<p>As the features come from two different categories, they need to be treated (preprocessed) in different ways. The best way to learn is to start coding along with me.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn feature encoding<\/h3>\n\n\n\n<p>Feature encoding is a method where we transform categorical variables into numerical ones. 
The most popular ways of doing so are known as one hot encoding and label (ordinal) encoding.<\/p>\n\n\n\n<p>For example, a person can have features such as [\u201cmale\u201d, \u201cfemale\u201d], [\u201cfrom US\u201d, \u201cfrom UK\u201d], [\u201cuses Binance\u201d, \u201cuses Coinbase\u201d]. These features can be encoded as numbers, e.g. if the categories of each feature are numbered in alphabetical order, [\u201cmale\u201d, \u201cfrom US\u201d, \u201cuses Coinbase\u201d] would be [1, 1, 1].<\/p>\n\n\n\n<p>This can be done by using the scikit-learn&nbsp;<code>OrdinalEncoder<\/code>() class as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Install first from a shell: pip install -U scikit-learn\nfrom sklearn import preprocessing\n\nX = &#91;&#91;'male', 'from US', 'uses Coinbase'], &#91;'female', 'from UK', 'uses Binance']]\nencode = preprocessing.OrdinalEncoder()\nencode.fit(X)\n\nencode.transform(&#91;&#91;'male', 'from UK', 'uses Coinbase']])\n\n# Output: array(&#91;&#91;1., 0., 1.]])<\/code><\/pre>\n\n\n\n<p>As you can see, it transformed the features into integers. But this integer representation implies an order between the categories, so many scikit-learn estimators would treat the categories as ordered, which is often not what we want. A popular way to avoid this is one hot encoding.<\/p>\n\n\n\n<p>One hot encoding, also known as dummy encoding, can be obtained through the scikit-learn&nbsp;<code>OneHotEncoder<\/code>() class. 
It works by transforming each categorical feature with N possible values into N binary features where the category that is present is represented as 1 and the rest as 0.<\/p>\n\n\n\n<p>The following example will hopefully make it clear:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>one_hot = preprocessing.OneHotEncoder()\none_hot.fit(X)\n\none_hot.transform(&#91;&#91;'male', 'from UK', 'uses Coinbase'],\n                   &#91;'female', 'from US', 'uses Binance']]).toarray()\n\n# Output: array(&#91;&#91;0., 1., 1., 0., 0., 1.],\n#               &#91;1., 0., 0., 1., 1., 0.]])<\/code><\/pre>\n\n\n\n<p>To see exactly what your encoded features are, you can always use the&nbsp;<code>.categories_<\/code>&nbsp;attribute as shown below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>one_hot.categories_\n\n# Output: &#91;array(&#91;'female', 'male'], dtype=object),\n#          array(&#91;'from UK', 'from US'], dtype=object),\n#          array(&#91;'uses Binance', 'uses Coinbase'], dtype=object)]<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn data scaling<\/h3>\n\n\n\n<p>Feature scaling is a preprocessing method that brings features onto a comparable scale, which improves the performance of some machine learning models. The two most common scaling techniques are known as standardization and normalization.<\/p>\n\n\n\n<p>Standardization makes the values of each feature in the data have zero mean and unit variance. This method is commonly used with algorithms such as SVMs and Logistic regression.<\/p>\n\n\n\n<p>Standardization is done by subtracting the mean of each feature from its values and dividing the result by the feature\u2019s standard deviation. It\u2019s some basic statistics and math, but don\u2019t worry if you don\u2019t get it. 
There are many tutorials that cover it.<\/p>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/sklearn-algotrading101.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<p>In scikit-learn we use the StandardScaler() class to standardize the data. Let us create a random NumPy array and standardize the data by giving it a zero mean and unit variance.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nscaler = preprocessing.StandardScaler()\nX = np.random.rand(3,4)\nX<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-1.jpg\" alt=\" class=\" class=\"wp-image-167125 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>X_scaled = scaler.fit_transform(X)\nX_scaled<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-2.jpg\" alt=\" class=\" class=\"wp-image-167126 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>print(f'The scaled mean is: {X_scaled.mean(axis=0)}\\nThe scaled standard deviation is: {X_scaled.std(axis=0)}')<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-3.jpg\" alt=\" class=\" class=\"wp-image-167127 lazyload\" 
src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<p>Wait a second! Didn\u2019t you say that all mean values need to be 0?<\/p>\n\n\n\n<p>Well, in practice these values are so close to 0 that they can be viewed as zero. Because of the limits of floating-point arithmetic, the scaler can only get the mean very close to zero.<\/p>\n\n\n\n<p>Let\u2019s move on to the next scaling method, called normalization. Normalization is a term with many definitions that change from one field to another and we are going to define it as follows:<\/p>\n\n\n\n<p>Normalization is a scaling technique in which values are shifted and rescaled so that they end up being between 0 and 1. It is also known as Min-Max scaling. In scikit-learn it can be applied with the&nbsp;<code>MinMaxScaler()&nbsp;<\/code>class. Note that the similarly named <code>Normalizer()<\/code> does something different: it rescales each sample (row) to unit norm.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>norm = preprocessing.MinMaxScaler()\n\nX_norm = norm.fit_transform(X)\nX_norm<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-4.jpg\" alt=\" class=\" class=\"wp-image-167128 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<p>So, which one is better? Well, it depends on your data and the problem you\u2019re trying to solve. Standardization is often a good choice when the data roughly follows a Normal distribution, while normalization is often preferred when it does not. If in doubt, try both and see which one improves the model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Sklearn missing values<\/h3>\n\n\n\n<p>In scikit-learn we can use the&nbsp;<code>.impute<\/code>&nbsp;module to fill in the missing values. 
The most used classes are&nbsp;<code>SimpleImputer()<\/code>,&nbsp;<code>KNNImputer()<\/code>&nbsp;and&nbsp;<code>IterativeImputer()<\/code>.<\/p>\n\n\n\n<p>When you encounter a real-life dataset it will almost certainly have missing values in it that can be there for various reasons ranging from rage quits to bugs and mistakes.<\/p>\n\n\n\n<p>There are several ways to treat them. One way is to delete the whole row (candidate) from the dataset, but this can be costly for small to average datasets as you can end up deleting plenty of data.<\/p>\n\n\n\n<p>A better way is often to replace the missing values with the mean or median of the feature. You could also try, if possible, to assign each subject to a subcategory and use the mean\/median of that subcategory as the new value.<\/p>\n\n\n\n<p>Let\u2019s use the&nbsp;<code>SimpleImputer()<\/code>&nbsp;to replace the missing value with the mean:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.impute import SimpleImputer\n\nimputer = SimpleImputer(missing_values=np.nan, strategy=\"mean\")\nimputer.fit_transform(&#91;&#91;10,np.nan],&#91;2,4],&#91;10,9]])<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-5.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<p>The&nbsp;<code>strategy<\/code>&nbsp;hyperparameter can also be set to median, most_frequent, or constant. But Igor, can we impute missing strings? 
Yes, you can!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\ndf = pd.DataFrame(&#91;&#91;'i', 'g'],\n                   &#91;'o', 'r'],\n                   &#91;'i', np.nan],\n                   &#91;np.nan, 'r']], dtype='category')\n\nimputer = SimpleImputer(strategy='most_frequent')\nimputer.fit_transform(df)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-6.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<p>If you want to keep track of the missing values and the positions they were in, you can use the&nbsp;<code>MissingIndicator()<\/code>&nbsp;class:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.impute import MissingIndicator\n\n# Imagine the 3's were imputed by the SimpleImputer()\nY = np.array(&#91;&#91;3,1], \n              &#91;5,3],\n              &#91;9,4], \n              &#91;3,7]])\n\nmissing = MissingIndicator(missing_values=3)\nmissing.fit_transform(Y)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-7.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<p>The&nbsp;<code>IterativeImputer()<\/code>&nbsp;is more sophisticated: it goes through the features one at a time, treating the feature with missing values as the label and the other features as the inputs of a regression model. It then predicts the missing values and repeats this for the number of iterations we specify.<\/p>\n\n\n\n<p>If you\u2019re not sure how regression algorithms work, don\u2019t worry as we will soon go over them. 
As the&nbsp;<code>IterativeImputer()<\/code>&nbsp;is an experimental feature we will need to enable it before use:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.experimental import enable_iterative_imputer\nfrom sklearn.impute import IterativeImputer\n\nimputer = IterativeImputer(max_iter=15, random_state=42)\nimputer.fit_transform(&#91;&#91;1,5],&#91;4,6],&#91;2, np.nan], &#91;np.nan, 8]])<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-8.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Sklearn train test split<\/h2>\n\n\n\n<p>In Sklearn the data can be split into test and training groups by using the&nbsp;<code>train_test_split()<\/code>&nbsp;function which is a part of the&nbsp;<code>model_selection<\/code>&nbsp;module.<\/p>\n\n\n\n<p>But why do we need to split the data into two groups? Well, the training data is the data on which we fit our model, i.e. the data it learns from. In order to evaluate how the model performs on unseen data, we use test data.<\/p>\n\n\n\n<p>An important thing, in most cases, is to allocate more data to the training set. When speaking of the ratio of this allocation there aren\u2019t any hard rules. It all depends on the size of your dataset.<\/p>\n\n\n\n<p>The most used allocation ratio is 80% for training and 20% for testing. Keep in mind that most people actually use a training\/development set split but call the dev set the test set. 
This is more of a conceptual mistake.<\/p>\n\n\n\n<p>Now let us create a random dataset and split it into training and testing sets:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.datasets import make_blobs\nfrom sklearn.model_selection import train_test_split\n\n# Create a random dataset\nX, y = make_blobs(n_samples=1500)\n\n# Split the data\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n\nprint(f'X training set {X_train.shape}\\nX testing set {X_test.shape}\\ny training set {y_train.shape}\\ny testing set {y_test.shape}')<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image img-twothird\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-9.jpg\" alt=\"=\"\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\"><\/figure>\n\n\n\n<p>If your dataset is big enough you\u2019ll often be fine splitting the data this way. But some datasets come with a severe imbalance in them.<\/p>\n\n\n\n<p>For example, if you\u2019re building a model to detect outliers who default on their credit cards you will most often have a very small percentage of them in your data.<\/p>\n\n\n\n<p>This means that the&nbsp;<code>train_test_split()<\/code>&nbsp;function may well allocate too few of the outliers to your training set and the ML algorithm won\u2019t learn to detect them efficiently. 
Let\u2019s simulate a dataset like that:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.datasets import make_classification\nfrom collections import Counter\n\n# Create an imbalanced dataset\nX, y = make_classification(n_samples=1000, weights=&#91;0.95], flip_y=0, random_state=42)\nprint(f'Number of y before splitting is {Counter(y)}')\n\n# Split the data the usual way\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\nprint(f'Number of y in the training set after splitting is {Counter(y_train)}')\nprint(f'Number of y in the testing set after splitting is {Counter(y_test)}')<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-11.jpg\" alt=\" class=\" class=\"wp-image-167268 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<p>As you can see, the training set has 43 examples of the minority class while the testing set has only 7! In order to combat this, we can split the data into training and testing sets by stratification, which is done according to y.<\/p>\n\n\n\n<p>This means that the classes in y will keep the same proportions in both the training and testing sets (20% of each class goes to the test set). 
In scikit-learn this is done by adding the&nbsp;<code>stratify<\/code>&nbsp;argument as shown below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Split the data by stratification\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)\nprint(f'Number of y in the training set after splitting is {Counter(y_train)}')\nprint(f'Number of y in the testing set after splitting is {Counter(y_test)}')<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/Sklearn-data-scaling-algotrading101-12.jpg\" alt=\" class=\" class=\"wp-image-167271 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<p>For a more in-depth guide and understanding of the train test split and cross-validation, please visit the following article that is found on our blog:  <a href=\"https:\/\/algotrading101.com\/learn\/train-test-split\/\">https:\/\/algotrading101.com\/learn\/train-test-split\/<\/a><\/p>\n\n\n\n<p>For more information about scikit-learn preprocessing functions go&nbsp;<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing\">here<\/a>.<\/p>\n\n\n\n<p><em>Visit <a href=\"https:\/\/algotrading101.com\/learn\/sklearn-guide\/\">AlgoTrading101<\/a> to read the rest of the article. 
<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.<\/p>\n","protected":false},"author":815,"featured_media":182425,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[339,343,349,338,350,341,351,352,344],"tags":[865,852,1225,1224,595,6810],"contributors-categories":[13746],"class_list":{"0":"post-182412","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science","8":"category-programing-languages","9":"category-python-development","10":"category-ibkr-quant-news","11":"category-quant-asia-pacific","12":"category-quant-development","13":"category-quant-europe","14":"category-quant-north-america","15":"category-quant-regions","16":"tag-github","17":"tag-machine-learning","18":"tag-numpy","19":"tag-pandas","20":"tag-python","21":"tag-sklearn","22":"contributors-categories-algotrading101"},"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.5) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Sklearn \u2013 An Introduction Guide to Machine Learning<\/title>\n<meta name=\"description\" content=\"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/182412\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" 
content=\"Sklearn \u2013 An Introduction Guide to Machine Learning\" \/>\n<meta property=\"og:description\" content=\"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/\" \/>\n<meta property=\"og:site_name\" content=\"IBKR Campus US\" \/>\n<meta property=\"article:published_time\" content=\"2022-11-22T20:40:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-02-13T22:09:41+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Igor Radovanovic\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Sklearn \u2013 An Introduction Guide to Machine Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Igor Radovanovic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\n\t    \"@context\": \"https:\\\/\\\/schema.org\",\n\t    \"@graph\": [\n\t        {\n\t            \"@type\": \"NewsArticle\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/#article\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/\"\n\t            },\n\t            \"author\": {\n\t                \"name\": \"Igor Radovanovic\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/b43b33f424bad38d84a7b78eb0193592\"\n\t            },\n\t            \"headline\": \"Sklearn \u2013 An Introduction Guide to Machine Learning\",\n\t            \"datePublished\": \"2022-11-22T20:40:00+00:00\",\n\t            \"dateModified\": \"2023-02-13T22:09:41+00:00\",\n\t            \"mainEntityOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/\"\n\t            },\n\t            \"wordCount\": 1076,\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-blue-button.jpg\",\n\t            \"keywords\": [\n\t                \"GitHub\",\n\t                \"Machine Learning\",\n\t                
\"NumPy\",\n\t                \"Pandas\",\n\t                \"Python\",\n\t                \"sklearn\"\n\t            ],\n\t            \"articleSection\": [\n\t                \"Data Science\",\n\t                \"Programming Languages\",\n\t                \"Python Development\",\n\t                \"Quant\",\n\t                \"Quant Asia Pacific\",\n\t                \"Quant Development\",\n\t                \"Quant Europe\",\n\t                \"Quant North America\",\n\t                \"Quant Regions\"\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"WebPage\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/\",\n\t            \"name\": \"Sklearn \u2013 An Introduction Guide to Machine Learning\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\"\n\t            },\n\t            \"primaryImageOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/#primaryimage\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-blue-button.jpg\",\n\t            \"datePublished\": \"2022-11-22T20:40:00+00:00\",\n\t            \"dateModified\": \"2023-02-13T22:09:41+00:00\",\n\t            \"description\": \"Sklearn (scikit-learn) is a Python library 
that provides a wide range of unsupervised and supervised machine learning algorithms.\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"ReadAction\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"ImageObject\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/sklearn-an-introduction-guide-to-machine-learning\\\/#primaryimage\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-blue-button.jpg\",\n\t            \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/02\\\/python-blue-button.jpg\",\n\t            \"width\": 1000,\n\t            \"height\": 563,\n\t            \"caption\": \"Sklearn \u2013 An Introduction Guide to Machine Learning\"\n\t        },\n\t        {\n\t            \"@type\": \"WebSite\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"name\": \"IBKR Campus US\",\n\t            \"description\": \"Financial Education from Interactive Brokers\",\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"SearchAction\",\n\t                    \"target\": {\n\t                        \"@type\": \"EntryPoint\",\n\t                        \"urlTemplate\": 
\"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/?s={search_term_string}\"\n\t                    },\n\t                    \"query-input\": {\n\t                        \"@type\": \"PropertyValueSpecification\",\n\t                        \"valueRequired\": true,\n\t                        \"valueName\": \"search_term_string\"\n\t                    }\n\t                }\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"Organization\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\",\n\t            \"name\": \"Interactive Brokers\",\n\t            \"alternateName\": \"IBKR\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"logo\": {\n\t                \"@type\": \"ImageObject\",\n\t                \"inLanguage\": \"en-US\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\",\n\t                \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"width\": 669,\n\t                \"height\": 669,\n\t                \"caption\": \"Interactive Brokers\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\"\n\t            },\n\t            \"publishingPrinciples\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/about-ibkr-campus\\\/\",\n\t            \"ethicsPolicy\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/cyber-security-notice\\\/\"\n\t        },\n\t        {\n\t            \"@type\": \"Person\",\n\t            \"@id\": 
\"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/b43b33f424bad38d84a7b78eb0193592\",\n\t            \"name\": \"Igor Radovanovic\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/author\\\/igor-radovanovic\\\/\"\n\t        }\n\t    ]\n\t}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Sklearn \u2013 An Introduction Guide to Machine Learning","description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/182412\/","og_locale":"en_US","og_type":"article","og_title":"Sklearn \u2013 An Introduction Guide to Machine Learning","og_description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","og_url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/","og_site_name":"IBKR Campus US","article_published_time":"2022-11-22T20:40:00+00:00","article_modified_time":"2023-02-13T22:09:41+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","type":"image\/jpeg"}],"author":"Igor Radovanovic","twitter_card":"summary_large_image","twitter_title":"Sklearn \u2013 An Introduction Guide to Machine Learning","twitter_misc":{"Written by":"Igor Radovanovic","Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/#article","isPartOf":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/"},"author":{"name":"Igor Radovanovic","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/b43b33f424bad38d84a7b78eb0193592"},"headline":"Sklearn \u2013 An Introduction Guide to Machine Learning","datePublished":"2022-11-22T20:40:00+00:00","dateModified":"2023-02-13T22:09:41+00:00","mainEntityOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/"},"wordCount":1076,"publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","keywords":["GitHub","Machine Learning","NumPy","Pandas","Python","sklearn"],"articleSection":["Data Science","Programming Languages","Python Development","Quant","Quant Asia Pacific","Quant Development","Quant Europe","Quant North America","Quant Regions"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/","url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/","name":"Sklearn \u2013 An Introduction Guide to Machine 
Learning","isPartOf":{"@id":"https:\/\/ibkrcampus.com\/campus\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/#primaryimage"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","datePublished":"2022-11-22T20:40:00+00:00","dateModified":"2023-02-13T22:09:41+00:00","description":"Sklearn (scikit-learn) is a Python library that provides a wide range of unsupervised and supervised machine learning algorithms.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/sklearn-an-introduction-guide-to-machine-learning\/#primaryimage","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","width":1000,"height":563,"caption":"Sklearn \u2013 An Introduction Guide to Machine Learning"},{"@type":"WebSite","@id":"https:\/\/ibkrcampus.com\/campus\/#website","url":"https:\/\/ibkrcampus.com\/campus\/","name":"IBKR Campus US","description":"Financial Education from Interactive 
Brokers","publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ibkrcampus.com\/campus\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ibkrcampus.com\/campus\/#organization","name":"Interactive Brokers","alternateName":"IBKR","url":"https:\/\/ibkrcampus.com\/campus\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","width":669,"height":669,"caption":"Interactive Brokers"},"image":{"@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/"},"publishingPrinciples":"https:\/\/www.interactivebrokers.com\/campus\/about-ibkr-campus\/","ethicsPolicy":"https:\/\/www.interactivebrokers.com\/campus\/cyber-security-notice\/"},{"@type":"Person","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/b43b33f424bad38d84a7b78eb0193592","name":"Igor 
Radovanovic","url":"https:\/\/www.interactivebrokers.com\/campus\/author\/igor-radovanovic\/"}]}},"jetpack_featured_media_url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/02\/python-blue-button.jpg","_links":{"self":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/182412","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/users\/815"}],"replies":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/comments?post=182412"}],"version-history":[{"count":0,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/182412\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media\/182425"}],"wp:attachment":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media?parent=182412"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/categories?post=182412"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/tags?post=182412"},{"taxonomy":"contributors-categories","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/contributors-categories?post=182412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}