{"id":200750,"date":"2024-01-04T11:05:22","date_gmt":"2024-01-04T16:05:22","guid":{"rendered":"https:\/\/ibkrcampus.com\/?p=200750"},"modified":"2024-01-04T11:05:34","modified_gmt":"2024-01-04T16:05:34","slug":"word-frequency-analysis","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/word-frequency-analysis\/","title":{"rendered":"Word Frequency Analysis"},"content":{"rendered":"\n<p>In a&nbsp;<a href=\"https:\/\/theautomatic.net\/2017\/08\/24\/scraping-articles-about-stocks\/\">previous article<\/a>, we talked about using Python to scrape stock-related articles from the web. As an extension of this idea, we\u2019re going to show you how to use the&nbsp;<a href=\"https:\/\/www.nltk.org\/\">NLTK<\/a>&nbsp;package to figure out how often different words occur in text, using scraped stock articles.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-initial-setup\">Initial Setup<\/h2>\n\n\n\n<p>Let\u2019s import the&nbsp;<strong>NLTK<\/strong>&nbsp;package, along with&nbsp;<strong>requests<\/strong>&nbsp;and&nbsp;<strong>BeautifulSoup<\/strong>, which we\u2019ll need to scrape the stock articles. NLTK\u2019s tokenizer models and stop word lists also require a one-time download.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">'''load packages'''\nimport nltk\nimport requests\nfrom bs4 import BeautifulSoup\n \n'''one-time downloads of the tokenizer models and stop word lists'''\nnltk.download('punkt')\nnltk.download('stopwords')<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Pulling the data we\u2019ll need<\/h2>\n\n\n\n<p>Below, we\u2019re copying code from my&nbsp;<a href=\"https:\/\/theautomatic.net\/2017\/08\/24\/scraping-articles-about-stocks\/\">scraping stocks<\/a>&nbsp;article. 
This gives us a function,&nbsp;<em>scrape_all_articles<\/em>&nbsp;(along with two other helper functions), which we can use to pull the actual raw text from articles linked to from&nbsp;<a href=\"https:\/\/www.nasdaq.com\/\">NASDAQ\u2019s website<\/a>. Note that&nbsp;<em>get_news_urls<\/em>&nbsp;returns&nbsp;<em>None<\/em>&nbsp;when a request fails, so&nbsp;<em>scrape_all_articles<\/em>&nbsp;checks for that before going on.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def scrape_news_text(news_url):\n \n    news_html = requests.get(news_url).content\n \n    '''convert html to BeautifulSoup object'''\n    news_soup = BeautifulSoup(news_html , 'lxml')\n \n    paragraphs = [par.text for par in news_soup.find_all('p')]\n    news_text = '\\n'.join(paragraphs)\n \n    return news_text\n \ndef get_news_urls(links_site):\n    '''scrape the html of the site'''\n    resp = requests.get(links_site)\n \n    if not resp.ok:\n        return None\n \n    html = resp.content\n \n    '''convert html to BeautifulSoup object'''\n    soup = BeautifulSoup(html , 'lxml')\n \n    '''get list of all links on webpage'''\n    links = soup.find_all('a')\n \n    urls = [link.get('href') for link in links]\n    urls = [url for url in urls if url is not None]\n \n    '''Filter the list of urls to just the news articles'''\n    news_urls = [url for url in urls if '\/article\/' in url]\n \n    return news_urls\n \ndef scrape_all_articles(ticker , upper_page_limit = 5):\n \n    landing_site = 'https:\/\/www.nasdaq.com\/symbol\/' + ticker + '\/news-headlines'\n \n    all_news_urls = get_news_urls(landing_site)\n \n    '''Stop early if the landing page request failed'''\n    if all_news_urls is None:\n        return []\n \n    current_urls_list = all_news_urls.copy()\n \n    index = 2\n \n    '''Loop through each sequential page, scraping the links from each'''\n    while (current_urls_list is not None) and (current_urls_list != []) and \\\n        (index &lt;= upper_page_limit):\n \n        '''Construct URL for page in loop based off index'''\n        current_site = 
landing_site + '?page=' + str(index)\n        current_urls_list = get_news_urls(current_site)\n \n        '''Stop if this page's request failed'''\n        if current_urls_list is None:\n            break\n \n        '''Append current webpage's list of urls to all_news_urls'''\n        all_news_urls = all_news_urls + current_urls_list\n \n        index = index + 1\n \n    all_news_urls = list(set(all_news_urls))\n \n    '''Now that we have a list of urls, we need to actually scrape the text'''\n    all_articles = [scrape_news_text(news_url) for news_url in all_news_urls]\n \n    return all_articles<\/pre>\n\n\n\n<p>Let\u2019s run our function to pull a few articles on Netflix (ticker symbol \u2018NFLX\u2019).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">articles = scrape_all_articles('NFLX' , 10)<\/pre>\n\n\n\n<p>Above, we use our function to search through the first ten pages of&nbsp;<a href=\"https:\/\/www.nasdaq.com\/symbol\/nflx\/news-headlines\">NASDAQ\u2019s<\/a>&nbsp;listing of articles for Netflix. This gives us a total of 102 articles (at the time of this writing). The&nbsp;<em>articles<\/em>&nbsp;variable contains a list of the raw text of each article. 
We can view a sample one by printing the following:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(articles[0])<\/pre>\n\n\n\n<p>Now, let\u2019s set&nbsp;<em>article<\/em>&nbsp;equal to one of the articles we have.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">article = articles[0]<\/pre>\n\n\n\n<p>To get word frequencies of this article, we are going to perform an operation called&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Lexical_analysis#Tokenization\">tokenization<\/a>. Tokenization effectively breaks a string of text into individual words, which we\u2019ll need to calculate word frequencies. To tokenize&nbsp;<em>article<\/em>, we use the&nbsp;<em>nltk.tokenize.word_tokenize<\/em>&nbsp;method.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tokens = nltk.tokenize.word_tokenize(article)<\/pre>\n\n\n\n<p>Now, if you print out&nbsp;<em>tokens<\/em>, you\u2019ll see that it includes a lot of words like \u2018the\u2019, \u2018a\u2019, \u2018an\u2019 etc. These are known as \u2018stop words.\u2019 We can filter these out of&nbsp;<em>tokens<\/em>&nbsp;using&nbsp;<em>stopwords<\/em>&nbsp;from&nbsp;<em>nltk.corpus<\/em>. Let\u2019s also make all the words upper case. 
This will allow us to avoid case sensitivity issues when we get any word frequency distributions.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from nltk.corpus import stopwords\n \n'''Get list of English stop words'''\ntake_out = stopwords.words('english')\n \n'''Make all words in tokens uppercase'''\ntokens = [word.upper() for word in tokens]\n \n'''Make all stop words upper case'''\ntake_out = [word.upper() for word in take_out]\n \n'''Filter out stop words from tokens list'''\ntokens = [word for word in tokens if word not in take_out]<\/pre>\n\n\n\n<p>*NLTK also provides stop word lists for several other languages.<\/p>\n\n\n\n<p>In addition to filtering out stop words, we also probably want to get rid of punctuation (e.g. commas etc.). This can be done by filtering out any elements in&nbsp;<em>tokens<\/em>&nbsp;that are in&nbsp;<em>string.punctuation<\/em>, a string of common punctuation characters from Python\u2019s built-in&nbsp;<em>string<\/em>&nbsp;module. The second line below also drops any token that merely starts with a punctuation character.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import string\n \ntokens = [word for word in tokens if word not in string.punctuation]\n \ntokens = [word for word in tokens if word[0] not in string.punctuation]<\/pre>\n\n\n\n<p>Now, we\u2019re ready to get the word frequency distribution of the article in question. This is done using the&nbsp;<em>nltk.FreqDist<\/em>&nbsp;method, like below. The&nbsp;<em>nltk.FreqDist<\/em>&nbsp;method returns a dictionary-like object, where each key is a uniquely occurring word in the text and the corresponding value is how many times that word appears. 
Setting this dictionary equal to&nbsp;<em>word_frequencies<\/em>, we sort the result as a list of tuples (<em>word_frequencies.items()<\/em>) by the frequency of each word in descending order.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">'''Returns a dictionary of words mapped to how\n   often they occur'''\nword_frequencies = nltk.FreqDist(tokens)\n \n'''Sort the above result by the frequency of each word'''\nsorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] ,\n                       reverse = True)<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-getting-a-function-to-calculate-word-frequency\">Getting a function to calculate word frequency\u2026<\/h2>\n\n\n\n<p>Let\u2019s turn what we did into a function that takes a single article and returns its sorted word frequencies.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def get_word_frequency(article):\n \n    tokens = nltk.tokenize.word_tokenize(article)\n \n    '''Get list of English stop words'''\n    take_out = stopwords.words('english')\n    take_out = [word.upper() for word in take_out]\n \n    '''Convert each item in tokens to uppercase'''\n    tokens = [word.upper() for word in tokens]\n \n    '''Filter out stop words and punctuation'''\n    tokens = [word for word in tokens if word not in take_out]\n \n    tokens = [word for word in tokens if word not in string.punctuation]\n \n    tokens = [word for word in tokens if word[0] not in string.punctuation]\n \n    '''Get word frequency distribution'''\n    word_frequencies = nltk.FreqDist(tokens)\n \n    '''Sort word frequency 
distribution by number of times each word occurs'''\n    sorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] ,\n                           reverse = True)\n \n    return sorted_counts<\/pre>\n\n\n\n<p>Now, we could run our function across every article in our list, like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">'''Drop empty articles, then get word frequencies for each one'''\narticles = [article for article in articles if article != '']\nresults = [get_word_frequency(article) for article in articles]<\/pre>\n\n\n\n<p>The&nbsp;<em>results<\/em>&nbsp;variable contains word frequencies for each individual article. Using this information, we can get the most frequently occurring word in each article.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">'''Take the top (word, count) pair from each article'''\nmost_frequent = [pair[0] for pair in results]\n \n'''Keep just the word from each pair'''\nmost_frequent = [x[0] for x in most_frequent]<\/pre>\n\n\n\n<p>Next, we can figure out the most common top-occurring words across the articles.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">most_frequent = nltk.FreqDist(most_frequent)\nmost_frequent = sorted(most_frequent.items() , key = lambda x: x[1] , \n                       reverse = True)<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-filtering-out-articles-using-word-frequency\">Filtering out articles using word frequency<\/h3>\n\n\n\n<p>If you print out&nbsp;<em>most_frequent<\/em>, you can see the 
words&nbsp;<strong>\u2018NETFLIX\u2019<\/strong>,&nbsp;<strong>\u2018PERCENT\u2019<\/strong>, and&nbsp;<strong>\u2018STOCK\u2019<\/strong>&nbsp;are at the top of the list. Word frequencies also give us a quick check on whether an article actually has much to do with the stock it\u2019s listed under. For instance, some of the Netflix articles may be linked to the stock because they mention it in passing, or in a minor part of the text, while actually having more to do with another stock. Using our frequency function above, we could filter out articles that mention the stock name infrequently, like in the snippet below. Remember that each value in&nbsp;<em>results<\/em>&nbsp;is a sorted list of (word, count) tuples, so we convert it back to a dictionary to look up the count for \u2018NETFLIX\u2019.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">'''Create a dictionary that maps each article to its word frequency distribution'''\narticle_to_freq = {article:freq for article,freq in zip(articles , results)}\n \n'''Filter out articles that don't mention 'NETFLIX' at least 3 times'''\narticle_to_freq = {article:freq for article,freq in\n                           article_to_freq.items()\n                           if dict(freq).get('NETFLIX', 0) &gt;= 3}<\/pre>\n\n\n\n<p>Note, this isn\u2019t a perfect form of&nbsp;<a href=\"https:\/\/www.kdnuggets.com\/2016\/07\/text-mining-101-topic-modeling.html\">topic modeling<\/a>, but it\u2019s something you can do quickly to make an educated guess about whether an article actually has to do with the topic you want. You can also improve this process by requiring other words to appear. For instance, if you\u2019re looking for articles specifically about Netflix\u2019s stock, you might not want to include articles about new shows etc. on Netflix. 
So, you could maybe filter out articles that don\u2019t mention words like \u2018stock\u2019 or \u2018investing.\u2019<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">One last note\u2026<\/h2>\n\n\n\n<p>Another way of thinking about word frequency in our situation would be to get word counts across all articles at once. You can do this easily enough by concatenating (or joining together) each article in our list.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">overall_text = ' '.join(articles)\ntop_words = get_word_frequency(overall_text)<\/pre>\n\n\n\n<p>This type of analysis can go&nbsp;<em>much<\/em>&nbsp;deeper into the world of&nbsp;<a href=\"https:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=opensourceautomation-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0596516495&amp;linkId=830f7231d7af2c846d1989592aecdc4b\">natural language processing<\/a>, but that would go well beyond a single blog post, so that\u2019s the end for now!<\/p>\n\n\n\n<p><em>Originally posted on <a href=\"https:\/\/theautomatic.net\/2017\/10\/12\/word-frequency-analysis\/\">TheAutomatic.net<\/a> blog.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s import the NLTK package, along with requests and BeautifulSoup, which we&#8217;ll need to scrape the stock 
articles.<\/p>\n","protected":false},"author":388,"featured_media":186094,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[343,349,338,341],"tags":[14986,806,16509,595,16510],"contributors-categories":[13695],"class_list":{"0":"post-200750","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-programing-languages","8":"category-python-development","9":"category-ibkr-quant-news","10":"category-quant-development","11":"tag-beautifulsoup","12":"tag-data-science","13":"tag-nltk-package","14":"tag-python","15":"tag-word-frequency-analysis","16":"contributors-categories-theautomatic-net"},"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.4) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Word Frequency Analysis | IBKR Quant<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/200750\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Word Frequency Analysis\" \/>\n<meta property=\"og:description\" content=\"Let&#039;s import the NLTK package, along with requests and BeautifulSoup, which we&#039;ll need to scrape the stock articles.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/word-frequency-analysis\/\" \/>\n<meta property=\"og:site_name\" content=\"IBKR Campus US\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-04T16:05:22+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2024-01-04T16:05:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/03\/python-quant-laptop-coding.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Andrew Treadway\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Andrew Treadway\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\n\t    \"@context\": \"https:\\\/\\\/schema.org\",\n\t    \"@graph\": [\n\t        {\n\t            \"@type\": \"NewsArticle\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#article\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/\"\n\t            },\n\t            \"author\": {\n\t                \"name\": \"Andrew Treadway\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/d4018570a16fb867f1c08412fc9c64bc\"\n\t            },\n\t            \"headline\": \"Word Frequency Analysis\",\n\t            \"datePublished\": \"2024-01-04T16:05:22+00:00\",\n\t            \"dateModified\": \"2024-01-04T16:05:34+00:00\",\n\t            \"mainEntityOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/\"\n\t            },\n\t            \"wordCount\": 871,\n\t            \"commentCount\": 0,\n\t            \"publisher\": {\n\t                
\"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/03\\\/python-quant-laptop-coding.jpg\",\n\t            \"keywords\": [\n\t                \"BeautifulSoup\",\n\t                \"Data Science\",\n\t                \"NLTK package\",\n\t                \"Python\",\n\t                \"Word Frequency Analysis\"\n\t            ],\n\t            \"articleSection\": [\n\t                \"Programming Languages\",\n\t                \"Python Development\",\n\t                \"Quant\",\n\t                \"Quant Development\"\n\t            ],\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"CommentAction\",\n\t                    \"name\": \"Comment\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#respond\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"WebPage\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/\",\n\t            \"name\": \"Word Frequency Analysis | IBKR Campus US\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\"\n\t            },\n\t            \"primaryImageOfPage\": {\n\t                \"@id\": 
\"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#primaryimage\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/03\\\/python-quant-laptop-coding.jpg\",\n\t            \"datePublished\": \"2024-01-04T16:05:22+00:00\",\n\t            \"dateModified\": \"2024-01-04T16:05:34+00:00\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"ReadAction\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"ImageObject\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/word-frequency-analysis\\\/#primaryimage\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/03\\\/python-quant-laptop-coding.jpg\",\n\t            \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2023\\\/03\\\/python-quant-laptop-coding.jpg\",\n\t            \"width\": 1000,\n\t            \"height\": 563,\n\t            \"caption\": \"Python Quant\"\n\t        },\n\t        {\n\t            \"@type\": \"WebSite\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"name\": \"IBKR Campus US\",\n\t            
\"description\": \"Financial Education from Interactive Brokers\",\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"SearchAction\",\n\t                    \"target\": {\n\t                        \"@type\": \"EntryPoint\",\n\t                        \"urlTemplate\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/?s={search_term_string}\"\n\t                    },\n\t                    \"query-input\": {\n\t                        \"@type\": \"PropertyValueSpecification\",\n\t                        \"valueRequired\": true,\n\t                        \"valueName\": \"search_term_string\"\n\t                    }\n\t                }\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"Organization\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\",\n\t            \"name\": \"Interactive Brokers\",\n\t            \"alternateName\": \"IBKR\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"logo\": {\n\t                \"@type\": \"ImageObject\",\n\t                \"inLanguage\": \"en-US\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\",\n\t                \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"width\": 669,\n\t                \"height\": 669,\n\t                \"caption\": \"Interactive Brokers\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": 
\"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\"\n\t            },\n\t            \"publishingPrinciples\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/about-ibkr-campus\\\/\",\n\t            \"ethicsPolicy\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/cyber-security-notice\\\/\"\n\t        },\n\t        {\n\t            \"@type\": \"Person\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/d4018570a16fb867f1c08412fc9c64bc\",\n\t            \"name\": \"Andrew Treadway\",\n\t            \"description\": \"Andrew Treadway currently works as a Senior Data Scientist, and has experience doing analytics, software automation, and ETL. He completed a master\u2019s degree in computer science \\\/ machine learning, and an undergraduate degree in pure mathematics. Connect with him on LinkedIn: https:\\\/\\\/www.linkedin.com\\\/in\\\/andrew-treadway-a3b19b103\\\/In addition to TheAutomatic.net blog, he also teaches in-person courses on Python and R through my NYC meetup: more details.\",\n\t            \"sameAs\": [\n\t                \"https:\\\/\\\/theautomatic.net\\\/about-me\\\/\"\n\t            ],\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/author\\\/andrewtreadway\\\/\"\n\t        }\n\t    ]\n\t}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","jetpack_featured_media_url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2023\/03\/python-quant-laptop-coding.jpg"}