{"id":115650,"date":"2021-12-27T09:00:00","date_gmt":"2021-12-27T14:00:00","guid":{"rendered":"https:\/\/ibkrcampus.com\/?p=115650"},"modified":"2022-11-21T09:50:22","modified_gmt":"2022-11-21T14:50:22","slug":"how-to-read-pdf-files-with-python","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/","title":{"rendered":"How to Read PDF Files with Python"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><strong>Background<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In a previous article, we talked about how to&nbsp;<a href=\"https:\/\/theautomatic.net\/2019\/05\/24\/3-ways-to-scrape-tables-from-pdfs-with-python\/\">scrape tables from PDF files with Python<\/a>. In this post, we\u2019ll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages \u2013&nbsp;<strong>pdfminer<\/strong>&nbsp;and&nbsp;<strong>pytesseract<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>pdfminer<\/strong>&nbsp;(specifically&nbsp;<strong>pdfminer.six<\/strong>, which is a more up-to-date fork of&nbsp;<strong>pdfminer<\/strong>) is an effective package to use if you\u2019re handling PDFs whose text is typed, meaning you can highlight it in a PDF viewer. On the other hand, to read scanned-in PDF files with Python, the&nbsp;<strong>pytesseract<\/strong>&nbsp;package comes in handy, as we\u2019ll see later in the post.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-scraping-hightlightable-text\"><strong>Scraping highlightable text<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For the first example, let\u2019s scrape a 10-K form from Apple (<a href=\"https:\/\/s2.q4cdn.com\/470004039\/files\/doc_financials\/2019\/ar\/_10-K-2019-(As-Filed).pdf\">see here<\/a>). First, we\u2019ll download this file to a local directory and save it as \u201capple_10k.pdf\u201d. 
The first package we\u2019ll be using to extract text is&nbsp;<strong>pdfminer<\/strong>. To install the version of the package we need, you can use pip (note we\u2019re installing&nbsp;<strong>pdfminer.six<\/strong>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install pdfminer.six\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, let\u2019s import the&nbsp;<em>extract_text<\/em>&nbsp;function from&nbsp;<strong>pdfminer.high_level<\/strong>. This module within&nbsp;<strong>pdfminer<\/strong>&nbsp;provides higher-level functions for scraping text from PDF files. As can be seen below, the&nbsp;<em>extract_text<\/em>&nbsp;function lets us extract text from a PDF with one line of code (not counting the import)! This is a convenience of&nbsp;<strong>pdfminer<\/strong>&nbsp;compared with some other packages, such as&nbsp;<strong>PyPDF2<\/strong>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pdfminer.high_level import extract_text\n \n# extract the text from every page as a single string\ntext = extract_text(\"apple_10k.pdf\")\n \nprint(text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The code above will extract the text from each page in the PDF. 
If we want to limit our extraction to specific pages, we just need to pass the zero-indexed page numbers to&nbsp;<em>extract_text<\/em>&nbsp;using the&nbsp;<em>page_numbers<\/em>&nbsp;parameter.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># extract text from the first 10 pages (indexes 0 through 9)\ntext10 = extract_text(\"apple_10k.pdf\", page_numbers=range(10))\n \n# get text from the pages at indexes 0, 2, and 4 (the first, third, and fifth pages)\ntext_pages = extract_text(\"apple_10k.pdf\", page_numbers=&#91;0, 2, 4])<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Scraping a password-protected PDF<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If the PDF we want to scrape is password-protected, we just need to pass the password to the same function using the&nbsp;<em>password<\/em>&nbsp;parameter.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>text = extract_text(\"apple_10k.pdf\", password=\"top secret password\")\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Scraping text from scanned-in images<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If a PDF contains scanned-in images of text, it can still be scraped, but doing so requires a few additional steps. In this case, we\u2019re going to be using two other Python packages \u2013&nbsp;<strong>pytesseract<\/strong>&nbsp;and&nbsp;<strong>Wand<\/strong>. The second of these is used to convert PDFs into image files, while&nbsp;<strong>pytesseract<\/strong>&nbsp;is used to extract text from images. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since&nbsp;<strong>pytesseract<\/strong>&nbsp;doesn\u2019t work directly on PDFs, we first have to convert our sample PDF into an image (or a collection of image files).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Initial setup<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s get started by setting up the&nbsp;<strong>Wand<\/strong>&nbsp;package.&nbsp;<strong>Wand<\/strong>&nbsp;can be installed using pip:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install Wand\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This package also requires a tool called&nbsp;<em>ImageMagick<\/em>&nbsp;to be installed (<a href=\"https:\/\/docs.wand-py.org\/en\/latest\/guide\/install.html\">see here for more details<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Other packages can also convert PDFs into image files. For example,&nbsp;<a href=\"https:\/\/github.com\/Belval\/pdf2image\">pdf2image<\/a>&nbsp;is another choice, but we\u2019ll use&nbsp;<strong>Wand<\/strong>&nbsp;in this tutorial.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Additionally, let\u2019s go ahead and install&nbsp;<strong>pytesseract<\/strong>. 
This package can also be installed using pip:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install pytesseract\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>pytesseract<\/strong>&nbsp;depends upon&nbsp;<em>tesseract<\/em>&nbsp;being installed (<a href=\"https:\/\/github.com\/tesseract-ocr\/tessdoc\">see here for instructions<\/a>).&nbsp;<em>tesseract<\/em>&nbsp;is an underlying utility that performs OCR (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\">Optical Character Recognition<\/a>) on images to extract text.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Converting PDFs into image files<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now that our setup is complete, we can convert a PDF into a collection of image files. We do this by converting each individual page into its own image file. In addition to using&nbsp;<strong>Wand<\/strong>, we\u2019re also going to import the&nbsp;<a href=\"https:\/\/theautomatic.net\/2017\/08\/31\/file-manipulation-with-python\/\">os<\/a>&nbsp;module to help create the name of each image output file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For this example, we\u2019re going to take a scanned-in version of the first three pages of the 10-K form from earlier in this post.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from wand.image import Image\nimport os\n \npdf_file = \"scanned_apple_10k_snippet.pdf\"\n \n# collect the name of each image file we write\nfiles = &#91;]\n# read the PDF at 500 DPI; each page becomes one image in conn.sequence\nwith Image(filename=pdf_file, resolution=500) as conn:\n    for index, image in enumerate(conn.sequence):\n        image_name = os.path.splitext(pdf_file)&#91;0] + str(index + 1) + '.png'\n        Image(image).save(filename=image_name)\n        files.append(image_name)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In the&nbsp;<em>with<\/em>&nbsp;statement above, we open a connection to the PDF file. 
The resolution parameter specifies the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Dots_per_inch\">DPI<\/a>&nbsp;we want for the image outputs \u2013 in this case 500. Within the for loop, we specify the output filename, save the image using&nbsp;<em>Image.save<\/em>, and lastly append the filename to the list of image files. This way, we can loop over the list of image files and scrape the text from each.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This should create three separate image files:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;\"scanned_apple_10k_snippet1.png\", \n \"scanned_apple_10k_snippet2.png\", \n \"scanned_apple_10k_snippet3.png\"]<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Using pytesseract on each image file<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we can use&nbsp;<strong>pytesseract<\/strong>&nbsp;to extract the text from each image file. In the code below, we store the extracted text from each page as a separate element in a list. Because the name&nbsp;<em>Image<\/em>&nbsp;was already imported from&nbsp;<strong>Wand<\/strong>&nbsp;above, we import the Pillow&nbsp;<em>Image<\/em>&nbsp;module under the alias&nbsp;<em>PILImage<\/em>&nbsp;(a name chosen here for illustration) to open each file.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pytesseract\nfrom PIL import Image as PILImage\n \nall_text = &#91;]\nfor file in files:\n    text = pytesseract.image_to_string(PILImage.open(file))\n    all_text.append(text)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Alternatively, we can use a&nbsp;<a href=\"https:\/\/theautomatic.net\/tutorial-on-python-list-comprehensions\/\">list comprehension<\/a>&nbsp;like below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>all_text = &#91;pytesseract.image_to_string(PILImage.open(file)) for file in files]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Visit <a href=\"https:\/\/theautomatic.net\/2020\/01\/21\/how-to-read-pdf-files-with-python\/\">TheAutomatic.net<\/a> to learn more about this topic.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, we\u2019ll cover how to extract text from several types of PDFs. 
<\/p>\n","protected":false},"author":388,"featured_media":28581,"comment_status":"closed","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[339,343,349,338,341,352,344],"tags":[806,10808,10807,595],"contributors-categories":[13695],"class_list":["post-115650","post","type-post","status-publish","format-standard","has-post-thumbnail","category-data-science","category-programing-languages","category-python-development","category-ibkr-quant-news","category-quant-development","category-quant-north-america","category-quant-regions","tag-data-science","tag-pdfminer","tag-pytesseract","tag-python","contributors-categories-theautomatic-net"],"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.8) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>How to Read PDF Files with Python | IBKR Quant<\/title>\n<meta name=\"description\" content=\"In this post, we\u2019ll cover how to extract text from several types of PDFs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/115650\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Read PDF Files with Python | IBKR Quant Blog\" \/>\n<meta property=\"og:description\" content=\"In this post, we\u2019ll cover how to extract text from several types of PDFs.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/\" \/>\n<meta property=\"og:site_name\" content=\"IBKR Campus US\" 
\/>\n<meta property=\"article:published_time\" content=\"2021-12-27T14:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-11-21T14:50:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"900\" \/>\n\t<meta property=\"og:image:height\" content=\"540\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Andrew Treadway\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Andrew Treadway\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\n\t    \"@context\": \"https:\\\/\\\/schema.org\",\n\t    \"@graph\": [\n\t        {\n\t            \"@type\": \"NewsArticle\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/#article\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/\"\n\t            },\n\t            \"author\": {\n\t                \"name\": \"Andrew Treadway\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/d4018570a16fb867f1c08412fc9c64bc\"\n\t            },\n\t            \"headline\": \"How to Read PDF Files with Python\",\n\t            \"datePublished\": \"2021-12-27T14:00:00+00:00\",\n\t            \"dateModified\": \"2022-11-21T14:50:22+00:00\",\n\t            \"mainEntityOfPage\": {\n\t                \"@id\": 
\"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/\"\n\t            },\n\t            \"wordCount\": 778,\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2019\\\/12\\\/python-gears.jpg\",\n\t            \"keywords\": [\n\t                \"Data Science\",\n\t                \"pdfminer\",\n\t                \"pytesseract\",\n\t                \"Python\"\n\t            ],\n\t            \"articleSection\": [\n\t                \"Data Science\",\n\t                \"Programming Languages\",\n\t                \"Python Development\",\n\t                \"Quant\",\n\t                \"Quant Development\",\n\t                \"Quant North America\",\n\t                \"Quant Regions\"\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"WebPage\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/\",\n\t            \"name\": \"How to Read PDF Files with Python | IBKR Quant Blog\",\n\t            \"isPartOf\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\"\n\t            },\n\t            \"primaryImageOfPage\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/#primaryimage\"\n\t          
  },\n\t            \"image\": {\n\t                \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/#primaryimage\"\n\t            },\n\t            \"thumbnailUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2019\\\/12\\\/python-gears.jpg\",\n\t            \"datePublished\": \"2021-12-27T14:00:00+00:00\",\n\t            \"dateModified\": \"2022-11-21T14:50:22+00:00\",\n\t            \"description\": \"In this post, we\u2019ll cover how to extract text from several types of PDFs.\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"ReadAction\",\n\t                    \"target\": [\n\t                        \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/\"\n\t                    ]\n\t                }\n\t            ]\n\t        },\n\t        {\n\t            \"@type\": \"ImageObject\",\n\t            \"inLanguage\": \"en-US\",\n\t            \"@id\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/ibkr-quant-news\\\/how-to-read-pdf-files-with-python\\\/#primaryimage\",\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2019\\\/12\\\/python-gears.jpg\",\n\t            \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2019\\\/12\\\/python-gears.jpg\",\n\t            \"width\": 900,\n\t            \"height\": 540,\n\t            \"caption\": \"Python\"\n\t        },\n\t        {\n\t            \"@type\": \"WebSite\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#website\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"name\": \"IBKR Campus US\",\n\t            \"description\": \"Financial 
Education from Interactive Brokers\",\n\t            \"publisher\": {\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\"\n\t            },\n\t            \"potentialAction\": [\n\t                {\n\t                    \"@type\": \"SearchAction\",\n\t                    \"target\": {\n\t                        \"@type\": \"EntryPoint\",\n\t                        \"urlTemplate\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/?s={search_term_string}\"\n\t                    },\n\t                    \"query-input\": {\n\t                        \"@type\": \"PropertyValueSpecification\",\n\t                        \"valueRequired\": true,\n\t                        \"valueName\": \"search_term_string\"\n\t                    }\n\t                }\n\t            ],\n\t            \"inLanguage\": \"en-US\"\n\t        },\n\t        {\n\t            \"@type\": \"Organization\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#organization\",\n\t            \"name\": \"Interactive Brokers\",\n\t            \"alternateName\": \"IBKR\",\n\t            \"url\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/\",\n\t            \"logo\": {\n\t                \"@type\": \"ImageObject\",\n\t                \"inLanguage\": \"en-US\",\n\t                \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\",\n\t                \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"contentUrl\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/wp-content\\\/uploads\\\/sites\\\/2\\\/2024\\\/05\\\/ibkr-campus-logo.jpg\",\n\t                \"width\": 669,\n\t                \"height\": 669,\n\t                \"caption\": \"Interactive Brokers\"\n\t            },\n\t            \"image\": {\n\t                \"@id\": 
\"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/logo\\\/image\\\/\"\n\t            },\n\t            \"publishingPrinciples\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/about-ibkr-campus\\\/\",\n\t            \"ethicsPolicy\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/cyber-security-notice\\\/\"\n\t        },\n\t        {\n\t            \"@type\": \"Person\",\n\t            \"@id\": \"https:\\\/\\\/ibkrcampus.com\\\/campus\\\/#\\\/schema\\\/person\\\/d4018570a16fb867f1c08412fc9c64bc\",\n\t            \"name\": \"Andrew Treadway\",\n\t            \"description\": \"Andrew Treadway currently works as a Senior Data Scientist, and has experience doing analytics, software automation, and ETL. He completed a master\u2019s degree in computer science \\\/ machine learning, and an undergraduate degree in pure mathematics. Connect with him on LinkedIn: https:\\\/\\\/www.linkedin.com\\\/in\\\/andrew-treadway-a3b19b103\\\/In addition to TheAutomatic.net blog, he also teaches in-person courses on Python and R through my NYC meetup: more details.\",\n\t            \"sameAs\": [\n\t                \"https:\\\/\\\/theautomatic.net\\\/about-me\\\/\"\n\t            ],\n\t            \"url\": \"https:\\\/\\\/www.interactivebrokers.com\\\/campus\\\/author\\\/andrewtreadway\\\/\"\n\t        }\n\t    ]\n\t}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"How to Read PDF Files with Python | IBKR Quant","description":"In this post, we\u2019ll cover how to extract text from several types of PDFs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.interactivebrokers.com\/campus\/wp-json\/wp\/v2\/posts\/115650\/","og_locale":"en_US","og_type":"article","og_title":"How to Read PDF Files with Python | IBKR Quant Blog","og_description":"In this post, we\u2019ll cover how to extract text from several types of PDFs.","og_url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/","og_site_name":"IBKR Campus US","article_published_time":"2021-12-27T14:00:00+00:00","article_modified_time":"2022-11-21T14:50:22+00:00","og_image":[{"width":900,"height":540,"url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","type":"image\/jpeg"}],"author":"Andrew Treadway","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Andrew Treadway","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/#article","isPartOf":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/"},"author":{"name":"Andrew Treadway","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/d4018570a16fb867f1c08412fc9c64bc"},"headline":"How to Read PDF Files with Python","datePublished":"2021-12-27T14:00:00+00:00","dateModified":"2022-11-21T14:50:22+00:00","mainEntityOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/"},"wordCount":778,"publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","keywords":["Data Science","pdfminer","pytesseract","Python"],"articleSection":["Data Science","Programming Languages","Python Development","Quant","Quant Development","Quant North America","Quant Regions"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/","url":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/","name":"How to Read PDF Files with Python | IBKR Quant 
Blog","isPartOf":{"@id":"https:\/\/ibkrcampus.com\/campus\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/#primaryimage"},"image":{"@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/#primaryimage"},"thumbnailUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","datePublished":"2021-12-27T14:00:00+00:00","dateModified":"2022-11-21T14:50:22+00:00","description":"In this post, we\u2019ll cover how to extract text from several types of PDFs.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/how-to-read-pdf-files-with-python\/#primaryimage","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","width":900,"height":540,"caption":"Python"},{"@type":"WebSite","@id":"https:\/\/ibkrcampus.com\/campus\/#website","url":"https:\/\/ibkrcampus.com\/campus\/","name":"IBKR Campus US","description":"Financial Education from Interactive Brokers","publisher":{"@id":"https:\/\/ibkrcampus.com\/campus\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ibkrcampus.com\/campus\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ibkrcampus.com\/campus\/#organization","name":"Interactive 
Brokers","alternateName":"IBKR","url":"https:\/\/ibkrcampus.com\/campus\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/","url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","contentUrl":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2024\/05\/ibkr-campus-logo.jpg","width":669,"height":669,"caption":"Interactive Brokers"},"image":{"@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/logo\/image\/"},"publishingPrinciples":"https:\/\/www.interactivebrokers.com\/campus\/about-ibkr-campus\/","ethicsPolicy":"https:\/\/www.interactivebrokers.com\/campus\/cyber-security-notice\/"},{"@type":"Person","@id":"https:\/\/ibkrcampus.com\/campus\/#\/schema\/person\/d4018570a16fb867f1c08412fc9c64bc","name":"Andrew Treadway","description":"Andrew Treadway currently works as a Senior Data Scientist, and has experience doing analytics, software automation, and ETL. He completed a master\u2019s degree in computer science \/ machine learning, and an undergraduate degree in pure mathematics. 
Connect with him on LinkedIn: https:\/\/www.linkedin.com\/in\/andrew-treadway-a3b19b103\/In addition to TheAutomatic.net blog, he also teaches in-person courses on Python and R through my NYC meetup: more details.","sameAs":["https:\/\/theautomatic.net\/about-me\/"],"url":"https:\/\/www.interactivebrokers.com\/campus\/author\/andrewtreadway\/"}]}},"jetpack_featured_media_url":"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2019\/12\/python-gears.jpg","_links":{"self":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/115650","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/users\/388"}],"replies":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/comments?post=115650"}],"version-history":[{"count":0,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/posts\/115650\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media\/28581"}],"wp:attachment":[{"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/media?parent=115650"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/categories?post=115650"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/tags?post=115650"},{"taxonomy":"contributors-categories","embeddable":true,"href":"https:\/\/ibkrcampus.com\/campus\/wp-json\/wp\/v2\/contributors-categories?post=115650"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}