Getting Data from PDFs the Easy Way with R

Originally posted on TheAutomatic.net.

If you don't have tabulizer installed, run install.packages("tabulizer") to get started. The package depends on rJava, so you will also need a working Java installation.

Initial Setup

Once tabulizer is installed, we'll load it and define a variable that points to an example PDF.

library(tabulizer)
 
site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"

The PDFs you work with in this package don't have to be stored on your machine; tabulizer can also read a PDF from a URL. All of the examples below use the sample PDF at the URL stored in site above.

How to extract all the tables from a PDF

You can extract tables from this PDF using the aptly named extract_tables function, like this:

# default call with no parameters changed
matrix_results <- extract_tables(site)
 
# get back the tables as data frames, keeping their headers
df_results <- extract_tables(site, output = "data.frame", header = TRUE)

By default, extract_tables returns a list of matrices, one for each table it finds, as in the first call above. In the second call, we set output = "data.frame" and header = TRUE to get back a list of data frames instead, with each table's header row used for the column names.

Once we have the results back, we can work with any individual table the same way we would with any other data frame in R.

# first table found in the PDF
first_df <- df_results[[1]]
 
# access one of its columns by name
first_df$Number.of.Coils
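
If you keep the default matrix output, you may still want a data frame later on. Below is a minimal sketch of one way to do that in base R. The helper matrix_to_df and the small example_matrix are hypothetical stand-ins (they are not part of tabulizer or the sample PDF), and the sketch assumes the first row of the matrix holds the column headers.

```r
# Hypothetical helper (not part of tabulizer): turn a character matrix
# whose first row holds the headers into a data frame.
matrix_to_df <- function(m) {
  stopifnot(is.matrix(m), nrow(m) >= 1)
  # Everything except the header row becomes the data
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  # Make syntactically valid, unique column names from the header row
  names(df) <- make.names(m[1, ], unique = TRUE)
  # Convert columns that look numeric; leave the rest as character
  df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))
  df
}

# Made-up stand-in for one element of matrix_results
example_matrix <- rbind(
  c("Number of Coils", "Number of Paperclips"),
  c("5", "3"),
  c("10", "7")
)

example_df <- matrix_to_df(example_matrix)
example_df$Number.of.Coils
```

With real results, the equivalent call would be matrix_to_df(matrix_results[[1]]), provided the first row of that matrix really is the header row.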

How to scrape text from a PDF

Scraping text from our sample PDF can be done using extract_text:

text <- extract_text(site)
 
# print text
cat(text)
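
The extracted text comes back as one long string, so a common next step is to break it into lines. Here is a small base R sketch of that step. The sample_text value is a made-up stand-in for what extract_text might return, not the actual contents of the sample PDF.

```r
# Made-up stand-in for the string returned by extract_text(site)
sample_text <- "Table 1\nNumber of Coils  Number of Paperclips\n5  3\n\n10  7\n"

# Split on line breaks (handles both Unix and Windows line endings)
text_lines <- strsplit(sample_text, "\r?\n")[[1]]

# Trim surrounding whitespace and drop empty lines
text_lines <- trimws(text_lines)
text_lines <- text_lines[nzchar(text_lines)]

text_lines
```

From here, ordinary string tools such as grepl or regmatches can be used to pick out the lines you care about.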

How to split up a PDF by its pages

tabulizer can also create separate files for the pages in a PDF. This can be done using the split_pdf function:

# split PDF referenced above
# output separate page files to current directory
split_pdf(site, getwd())
 
# or output to different directory
split_pdf(site, "C:/path/to/other/folder")

The first argument of split_pdf is the filename or URL of your PDF; the second argument is the directory where you want the individual pages to be output.

How to merge a collection of PDFs

What if we want to reverse what we just did? We can use the merge_pdfs function, which takes a vector of file names and the name of the output file that will hold the merged result.

# merge_pdfs expects a vector of file names, so list the PDFs in the folder first
pdf_files <- list.files("C:/path/to/pdf/files", pattern = "\\.pdf$", full.names = TRUE)
merge_pdfs(pdf_files, "C:/path/to/merged_result.pdf")
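
One thing to watch when putting together the vector of file names is the order, since the pages are merged in the order given. An alphabetical sort puts a file ending in 10 ahead of one ending in 2. The sketch below shows one way to sort by page number instead. The helper order_by_page_number and the file names are hypothetical, and it assumes each file name ends in its page number, which may not match how your files are actually named.

```r
# Hypothetical helper: sort file paths by the number at the end of each
# file name (before the extension), rather than alphabetically.
order_by_page_number <- function(paths) {
  base <- tools::file_path_sans_ext(basename(paths))
  # Keep only the trailing run of digits, e.g. "page10" -> "10"
  page <- as.integer(sub("^.*[^0-9]([0-9]+)$", "\\1", base))
  paths[order(page)]
}

# Made-up file names for illustration
files <- c("pages/page10.pdf", "pages/page2.pdf", "pages/page1.pdf")
ordered_files <- order_by_page_number(files)
ordered_files
```

The sorted vector can then be passed to merge_pdfs as its first argument.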

How to get the number of pages in a PDF

Getting the number of pages in a PDF is made easy with the get_n_pages function, which you can call like this:

get_n_pages(site)
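
The page count also gives us a quick sanity check on the split and merge steps: if we split a PDF and merge the pieces back together, the result should have the same number of pages as the original. The sketch below assumes tabulizer is installed and the sample URL is reachable. The names work_dir and merged_file are hypothetical, and the check is skipped if the package is missing.

```r
site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"

# Hypothetical working folder and output file under the session temp directory
work_dir <- file.path(tempdir(), "pdf_round_trip")
dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)
merged_file <- file.path(work_dir, "merged_result.pdf")

# Only run the round trip when tabulizer is available
if (requireNamespace("tabulizer", quietly = TRUE)) {
  # split_pdf returns the paths of the page files it wrote
  page_files <- tabulizer::split_pdf(site, work_dir)

  # Merge the pages back together in the order they were returned
  tabulizer::merge_pdfs(page_files, merged_file)

  # Compare page counts of the original and the merged copy
  same_length <- tabulizer::get_n_pages(merged_file) == tabulizer::get_n_pages(site)
  print(same_length)
}
```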

How to get metadata associated with a PDF

You can get metadata associated with our PDF using extract_metadata:

extract_metadata(site)

This function returns a list with details such as the number of pages, the title, the creation and modification dates, and more.
