{"id":59304,"date":"2020-09-11T09:45:09","date_gmt":"2020-09-11T13:45:09","guid":{"rendered":"https:\/\/ibkrcampus.com\/?p=59304"},"modified":"2022-11-21T09:46:17","modified_gmt":"2022-11-21T14:46:17","slug":"bag-of-words-approach-python-code-limitations-ii","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/","title":{"rendered":"Bag of Words: Approach, Python Code, Limitations &#8211; Part II"},"content":{"rendered":"\n<p><em>Check out the <a href=\"\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations\/\">first installment<\/a> of this series to get started with sentiment analysis. <\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Limitations of Bag of Words<\/strong><\/h2>\n\n\n\n<p>Consider using the Bag of Words method to generate vectors for large documents. The resulting vectors will have a very large dimension and will consist mostly of zeros, that is, they will be sparse. This sparsity is already visible in the small example from the first installment.<\/p>\n\n\n\n<p>Apart from producing sparse representations, Bag of Words does a poor job of capturing the meaning of text. For example, consider the sentence &#8220;I love playing football and hate cricket&#8221; and its reverse, &#8220;I love playing cricket and hate football&#8221;. Because Bag of Words ignores word order, it produces identical vectors for the two sentences even though they carry opposite meanings. 
Attention-based&nbsp;<a href=\"https:\/\/quantra.quantinsti.com\/glossary\/Deep-learning\" target=\"_blank\" rel=\"noreferrer noopener\">deep learning<\/a>&nbsp;models like BERT are used to solve the problem of contextual awareness.<\/p>\n\n\n\n<p>We can mitigate the problem of sparse vectors to some extent using the techniques discussed below:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"converting-all-words-to-the-lower-case\">Converting all words to the lower case<\/h3>\n\n\n\n<p>While tokenizing documents, we may encounter the same word in different cases, e.g. upper \u2018CASE\u2019, lower \u2018case\u2019 or title \u2018Case\u2019. Although the underlying word is the same, a separate token is generated for each variant. This increases the size of the vocabulary and consequently the dimension of the generated vectors. Converting all text to lower case before tokenizing avoids this.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"removing-stop-words\">Removing Stop Words<\/h3>\n\n\n\n<p>Stop words are commonly occurring words such as \u2018the\u2019 and \u2018is\u2019 that carry little meaning on their own. Removing them from the vocabulary results in vectors of lower dimension. No stop-word list is exhaustive, so one can specify custom stop words when building a Bag of Words model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"stemming-and-lemmatization\">Stemming and Lemmatization<\/h3>\n\n\n\n<p>Both techniques aim to reduce a word to its root form, but they go about it differently.&nbsp;<em>Stemming<\/em>&nbsp;strips suffixes from the word under consideration. For example, \u2018playing\u2019 becomes \u2018play\u2019. There is no single standard stemming procedure, and various stemmers are available. Stemming often produces strings that are not real words.&nbsp;<em>Lemmatization<\/em>&nbsp;takes a different approach: it uses linguistic knowledge and returns meaningful root words. 
This method is harder to implement, as it relies on a dictionary to achieve the desired results.<\/p>\n\n\n\n<p>Below is an example using&nbsp;<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.CountVectorizer.html\" target=\"_blank\" rel=\"noreferrer noopener\">Scikit-learn\u2019s CountVectorizer<\/a>, which can remove stop words and convert words to lower case before producing the vectorized representation of the documents.<\/p>\n\n\n\n<p style=\"background-color:#fcfcdb;font-size:11px\" class=\"has-background\">\n# Import CountVectorizer from sklearn<br>\nfrom sklearn.feature_extraction.text import CountVectorizer<br>\n# Remove English stop words and lower-case all tokens<br>\ncv = CountVectorizer(stop_words='english', lowercase=True)<br>\nword_count = cv.fit_transform(corpus) # Learn the vocabulary and count words per document<br><br>\n\n# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()<br>\nprint(cv.get_feature_names_out()) # Print all the words in the vocabulary\n<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1100\" height=\"32\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1-1100x32.png\" alt=\"\" class=\"wp-image-59323 lazyload\" data-srcset=\"https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1-1100x32.png 1100w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1-700x20.png 700w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1-300x9.png 300w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1-768x22.png 768w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/sklearn_remove_stop_words-1.png 1199w\" data-sizes=\"(max-width: 1100px) 100vw, 1100px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" 
style=\"--smush-placeholder-width: 1100px; aspect-ratio: 1100\/32;\" \/><\/figure>\n\n\n\n<p style=\"background-color:#fcfcdb;font-size:11px\" class=\"has-background\">\nimport pandas as pd<br>\ndf_ = pd.DataFrame(word_count.toarray(), columns=cv.get_feature_names_out())<br>\ndf_ # Bag of Words vectorized representation\n<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"826\" height=\"128\" data-src=\"\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/frequency_bag_of_words3-1.png\" alt=\"\" class=\"wp-image-59328 lazyload\" data-srcset=\"https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/frequency_bag_of_words3-1.png 826w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/frequency_bag_of_words3-1-700x108.png 700w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/frequency_bag_of_words3-1-300x46.png 300w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/09\/frequency_bag_of_words3-1-768x119.png 768w\" data-sizes=\"(max-width: 826px) 100vw, 826px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 826px; aspect-ratio: 826\/128;\" \/><\/figure>\n\n\n\n<p>Notice that the vocabulary now contains fewer words than with the basic approach from the first installment.<\/p>\n\n\n\n<p><em>Stay tuned for the next installment in this series, in which the author will discuss Bag of Words vs Word2Vec.<\/em><\/p>\n\n\n\n<p>To download the complete Python code, visit QuantInsti:&nbsp;<a href=\"https:\/\/blog.quantinsti.com\/bag-of-words\/\">https:\/\/blog.quantinsti.com\/bag-of-words\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and 
Lemmatization.<\/p>\n","protected":false},"author":431,"featured_media":33705,"comment_status":"closed","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[339,343,349,338,350,341,344],"tags":[851,8224,8229,8377,806,4582,8376,8378,2860,1224,595,4412,1038,7648,7649,8375,8374,8228,8226,8225,8227],"contributors-categories":[13654],"class_list":{"0":"post-59304","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science","8":"category-programing-languages","9":"category-python-development","10":"category-ibkr-quant-news","11":"category-quant-asia-pacific","12":"category-quant-development","13":"category-quant-regions","14":"tag-algo-trading","15":"tag-bag-of-words","16":"tag-corpus","17":"tag-countvectorizer","18":"tag-data-science","19":"tag-dataframe","20":"tag-lemmatization","21":"tag-machine-learning-natural-language-processing","22":"tag-nlp","23":"tag-pandas","24":"tag-python","25":"tag-scikit-learn","26":"tag-sentiment-analysis","27":"tag-sentiment-data","28":"tag-sentiment-trading","29":"tag-stemming","30":"tag-stemming-and-lemmatization","31":"tag-tokenization","32":"tag-vectorized-text-data","33":"tag-word-cloud","34":"tag-word2vec","35":"contributors-categories-quantinsti"},"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Bag of Words: Approach, Python Code, Limitations &#8211; Part II<\/title>\n<meta name=\"description\" content=\"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link 
rel=\"canonical\" href=\"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/59304\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Bag of Words: Approach, Python Code, Limitations - Part II | IBKR Quant Blog\" \/>\n<meta property=\"og:description\" content=\"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/\" \/>\n<meta property=\"og:site_name\" content=\"IBKR Campus US\" \/>\n<meta property=\"article:published_time\" content=\"2020-09-11T13:45:09+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-11-21T14:46:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"900\" \/>\n\t<meta property=\"og:image:height\" content=\"540\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Naman Swarnkar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Naman Swarnkar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\n\t    \"@context\": \"https:\\\/\\\/schema.org\",\n\t    \"@graph\": [\n\t        {\n\t            \"@type\": \"NewsArticle\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/#article\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/\"\n\t            },\n\t            \"author\": {\n\t                \"name\": \"Naman Swarnkar\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/0711c8311f398d8eb95dd6f0eef86b50\"\n\t            },\n\t            \"headline\": \"Bag of Words: Approach, Python Code, Limitations &#8211; Part II\",\n\t            \"datePublished\": \"2020-09-11T13:45:09+00:00\",\n\t            \"dateModified\": \"2022-11-21T14:46:17+00:00\",\n\t            \"mainEntityOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/\"\n\t            },\n\t            \"wordCount\": 510,\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2020\\\/01\\\/python-circuits-hand.jpg\",\n\t            \"keywords\": [\n\t                \"Algo Trading\",\n\t                \"Bag of Words\",\n\t                
\"Corpus\",\n\t                \"CountVectorizer\",\n\t                \"Data Science\",\n\t                \"Dataframe\",\n\t                \"Lemmatization\",\n\t                \"Machine Learning Natural Language Processing\",\n\t                \"NLP\",\n\t                \"Pandas\",\n\t                \"Python\",\n\t                \"Scikit-learn\",\n\t                \"Sentiment Analysis\",\n\t                \"Sentiment Data\",\n\t                \"Sentiment Trading\",\n\t                \"Stemming\",\n\t                \"Stemming and Lemmatization\",\n\t                \"Tokenization\",\n\t                \"Vectorized Text Data\",\n\t                \"Word Cloud\",\n\t                \"Word2Vec\"\n\t            ],\n\t            \"articleSection\": [\n\t                \"Data Science\",\n\t                \"Programming Languages\",\n\t                \"Python Development\",\n\t                \"Quant\",\n\t                \"Quant Asia Pacific\",\n\t                \"Quant Development\",\n\t                \"Quant Regions\"\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"WebPage\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/\",\n\t            \"name\": \"Bag of Words: Approach, Python Code, Limitations - Part II | IBKR Quant Blog\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\"\n\t            },\n\t            \"primaryImageOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/#primaryimage\"\n\t            },\n\t            \"image\": {\n\t                
\"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2020\\\/01\\\/python-circuits-hand.jpg\",\n\t            \"datePublished\": \"2020-09-11T13:45:09+00:00\",\n\t            \"dateModified\": \"2022-11-21T14:46:17+00:00\",\n\t            \"description\": \"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"ReadAction\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"ImageObject\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/bag-of-words-approach-python-code-limitations-ii\\\/#primaryimage\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2020\\\/01\\\/python-circuits-hand.jpg\",\n\t            \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2020\\\/01\\\/python-circuits-hand.jpg\",\n\t            \"width\": 900,\n\t            \"height\": 540,\n\t            \"caption\": \"Python\"\n\t        },\n\t        {\n\t            \"@type\": \"WebSite\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"name\": 
\"IBKR Campus US\",\n\t            \"description\": \"Financial Education from Interactive Brokers\",\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"SearchAction\",\n\t                    \"target\": {\n\t                        \"@type\": \"EntryPoint\",\n\t                        \"urlTemplate\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/?s={search_term_string}\"\n\t                    },\n\t                    \"query-input\": {\n\t                        \"@type\": \"PropertyValueSpecification\",\n\t                        \"valueRequired\": true,\n\t                        \"valueName\": \"search_term_string\"\n\t                    }\n\t                }\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"Organization\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\",\n\t            \"name\": \"Interactive Brokers\",\n\t            \"alternateName\": \"IBKR\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"logo\": {\n\t                \"@type\": \"ImageObject\",\n\t                \"inLanguage\": \"en-US\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\",\n\t                \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"width\": 669,\n\t                \"height\": 669,\n\t                \"caption\": \"Interactive Brokers\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": 
\"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\"\n\t            },\n\t            \"publishingPrinciples\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/about-ibkr-campus\\\/\",\n\t            \"ethicsPolicy\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/cyber-security-notice\\\/\"\n\t        },\n\t        {\n\t            \"@type\": \"Person\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/0711c8311f398d8eb95dd6f0eef86b50\",\n\t            \"name\": \"Naman Swarnkar\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/author\\\/namanswarnkar\\\/\"\n\t        }\n\t    ]\n\t}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Bag of Words: Approach, Python Code, Limitations &#8211; Part II","description":"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/59304\/","og_locale":"en_US","og_type":"article","og_title":"Bag of Words: Approach, Python Code, Limitations - Part II | IBKR Quant Blog","og_description":"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.","og_url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/","og_site_name":"IBKR Campus US","article_published_time":"2020-09-11T13:45:09+00:00","article_modified_time":"2022-11-21T14:46:17+00:00","og_image":[{"width":900,"height":540,"url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","type":"image\/jpeg"}],"author":"Naman 
Swarnkar","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Naman Swarnkar","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/#article","isPartOf":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/"},"author":{"name":"Naman Swarnkar","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/0711c8311f398d8eb95dd6f0eef86b50"},"headline":"Bag of Words: Approach, Python Code, Limitations &#8211; Part II","datePublished":"2020-09-11T13:45:09+00:00","dateModified":"2022-11-21T14:46:17+00:00","mainEntityOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/"},"wordCount":510,"publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","keywords":["Algo Trading","Bag of Words","Corpus","CountVectorizer","Data Science","Dataframe","Lemmatization","Machine Learning Natural Language Processing","NLP","Pandas","Python","Scikit-learn","Sentiment Analysis","Sentiment Data","Sentiment Trading","Stemming","Stemming and Lemmatization","Tokenization","Vectorized Text Data","Word Cloud","Word2Vec"],"articleSection":["Data Science","Programming Languages","Python Development","Quant","Quant Asia Pacific","Quant Development","Quant 
Regions"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/","url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/","name":"Bag of Words: Approach, Python Code, Limitations - Part II | IBKR Quant Blog","isPartOf":{"@id":"https:\/\/ibkrcampus.com\/campus\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/#primaryimage"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","datePublished":"2020-09-11T13:45:09+00:00","dateModified":"2022-11-21T14:46:17+00:00","description":"QuantInsti demonstrates how to convert all words to the lower case, how to remove Stop Words, and Stemming and Lemmatization.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/bag-of-words-approach-python-code-limitations-ii\/#primaryimage","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","width":900,"height":540,"caption":"Python"},{"@type":"WebSite","@id":"https:\/\/ibkrcampus.com\/campus\/#website","url":"https:\/\/ibkrcampus.com\/campus\/","name":"IBKR Campus US","description":"Financial Education from Interactive 
Brokers","publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ibkrcampus.com\/campus\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ibkrcampus.com\/campus\/#organization","name":"Interactive Brokers","alternateName":"IBKR","url":"https:\/\/ibkrcampus.com\/campus\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","width":669,"height":669,"caption":"Interactive Brokers"},"image":{"@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/"},"publishingPrinciples":"https:\/\/www.interactivebrokers.com\/campus\/about-ibkr-campus\/","ethicsPolicy":"https:\/\/www.interactivebrokers.com\/campus\/cyber-security-notice\/"},{"@type":"Person","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/0711c8311f398d8eb95dd6f0eef86b50","name":"Naman 
Swarnkar","url":"https:\/\/www.interactivebrokers.com\/campus\/author\/namanswarnkar\/"}]}},"jetpack_featured_media_url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2020\/01\/python-circuits-hand.jpg","_links":{"self":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/59304","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/users\/431"}],"replies":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/comments?post=59304"}],"version-history":[{"count":0,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/59304\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media\/33705"}],"wp:attachment":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media?parent=59304"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/categories?post=59304"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/tags?post=59304"},{"taxonomy":"contributors-categories","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/contributors-categories?post=59304"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}