Extracting Textual content from PDF Information with Python: A Complete Information | by George Stavrakis | Sep, 2023

Now that we have all the components of the code ready, let's put them together into a fully functional script. You can copy the code from here, or you can find it along with the example PDF in my GitHub repo here.

# Imports needed by the combined script (introduced earlier in the article).
# The helper functions text_extraction, crop_image, convert_to_images,
# image_to_text, extract_table and table_converter are defined in the
# previous sections.
import os
import PyPDF2
import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTRect

# Find the PDF path
pdf_path = 'OFFER 3.pdf'

# Create a PDF file object
pdfFileObj = open(pdf_path, 'rb')
# Create a PDF reader object
pdfReaded = PyPDF2.PdfReader(pdfFileObj)

# Create the dictionary to hold the text extracted from each page
text_per_page = {}
# We extract the pages from the PDF
for pagenum, page in enumerate(extract_pages(pdf_path)):

    # Initialize the variables needed for the text extraction from the page
    pageObj = pdfReaded.pages[pagenum]
    page_text = []
    line_format = []
    text_from_images = []
    text_from_tables = []
    page_content = []
    # Initialize the number of the examined tables
    table_num = 0
    first_element = True
    table_extraction_flag = False
    # Open the PDF file with pdfplumber
    pdf = pdfplumber.open(pdf_path)
    # Find the examined page
    page_tables = pdf.pages[pagenum]
    # Find the tables on the page
    tables = page_tables.find_tables()

    # Find all the elements
    page_elements = [(element.y1, element) for element in page._objs]
    # Sort all the elements as they appear on the page
    page_elements.sort(key=lambda a: a[0], reverse=True)

    # Iterate over the elements that compose the page
    for i, component in enumerate(page_elements):
        # Extract the position of the top side of the element in the PDF
        pos = component[0]
        # Extract the element of the page layout
        element = component[1]

        # Check if the element is a text element
        if isinstance(element, LTTextContainer):
            # Check if the text appeared in a table
            if table_extraction_flag == False:
                # Use the function to extract the text and format for each text element
                (line_text, format_per_line) = text_extraction(element)
                # Append the text of each line to the page text
                page_text.append(line_text)
                # Append the format of each line containing text
                line_format.append(format_per_line)
                page_content.append(line_text)
            else:
                # Omit the text that appeared in a table
                pass

        # Check the elements for images
        if isinstance(element, LTFigure):
            # Crop the image from the PDF
            crop_image(element, pageObj)
            # Convert the cropped PDF to an image
            convert_to_images('cropped_image.pdf')
            # Extract the text from the image
            image_text = image_to_text('PDF_image.png')
            text_from_images.append(image_text)
            page_content.append(image_text)
            # Add a placeholder in the text and format lists
            page_text.append('image')
            line_format.append('image')

        # Check the elements for tables
        if isinstance(element, LTRect):
            # If the first rectangular element
            if first_element == True and (table_num + 1) <= len(tables):
                # Find the bounding box of the table
                lower_side = page.bbox[3] - tables[table_num].bbox[3]
                upper_side = element.y1
                # Extract the information from the table
                table = extract_table(pdf_path, pagenum, table_num)
                # Convert the table information into a structured string format
                table_string = table_converter(table)
                # Append the table string to a list
                text_from_tables.append(table_string)
                page_content.append(table_string)
                # Set the flag as True to avoid extracting the content again
                table_extraction_flag = True
                # Make it another element
                first_element = False
                # Add a placeholder in the text and format lists
                page_text.append('table')
                line_format.append('table')

            # Check if we already extracted the tables from the page
            if element.y0 >= lower_side and element.y1 <= upper_side:
                pass
            elif not isinstance(page_elements[i + 1][1], LTRect):
                table_extraction_flag = False
                first_element = True
                table_num += 1

    # Create the key of the dictionary
    dctkey = 'Page_' + str(pagenum)
    # Add the list of lists as the value of the page key
    text_per_page[dctkey] = [page_text, line_format, text_from_images, text_from_tables, page_content]

# Close the PDF file object
pdfFileObj.close()

# Delete the additional files created
os.remove('cropped_image.pdf')
os.remove('PDF_image.png')

# Display the content of the page
result = ''.join(text_per_page['Page_0'][4])
print(result)

The script above will:

Import the necessary libraries.

Open the PDF file using the PyPDF2 library.

Extract each page of the PDF and iterate through the following steps.

Examine whether there are any tables on the page and create a list of them using pdfplumber.

Find all the elements nested in the page and sort them as they appear in its layout.

Then, for each element:

Examine whether it is a text container that does not appear in a table element. If so, use the text_extraction() function to extract the text along with its format; otherwise, skip this text.

Examine whether it is an image. If so, use the crop_image() function to crop the image section from the PDF, convert it into an image file using convert_to_images(), and extract text from it with OCR using the image_to_text() function.

Examine whether it is a rectangular element. In this case, we check whether the first rect is part of a page's table, and if so, we move on to the following steps:

  1. Find the bounding box of the table so that we don't extract its text again with the text_extraction() function.
  2. Extract the content of the table and convert it into a string.
  3. Then set a boolean flag to indicate that we are extracting text from a table.
  4. This process finishes after the last LTRect that falls into the bounding box of the table, when the next element in the layout is not a rectangular object. (All the other objects that compose the table will be skipped.)
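The element-sorting step above can be sketched in isolation with dummy (y1, element) pairs standing in for real pdfminer layout objects:

```python
# Dummy (y1, element) pairs standing in for pdfminer layout objects.
page_elements = [(120.0, 'footer text'), (700.5, 'title'), (350.2, 'table')]

# In PDF user space the y-axis points up, so a larger y1 means the element
# sits higher on the page; sorting in descending order gives reading order.
page_elements.sort(key=lambda a: a[0], reverse=True)

print([name for _, name in page_elements])  # ['title', 'table', 'footer text']
```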

The outputs of the process will be stored in five lists per iteration, named:

  1. page_text: contains the text coming from text containers in the PDF (a placeholder is inserted when the text was extracted from another element)
  2. line_format: contains the formats of the texts extracted above (a placeholder is inserted when the text was extracted from another element)
  3. text_from_images: contains the texts extracted from images on the page
  4. text_from_tables: contains the table-like strings with the contents of tables
  5. page_content: contains all of the text rendered on the page as a list of elements

All the lists will be stored under a dictionary key that represents the number of the page examined each time.
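As an illustration, the resulting dictionary has the following shape (the content here is made up, not the output of a real PDF):

```python
# Hypothetical result for a one-page PDF: each 'Page_N' key maps to the
# five lists in the fixed order described above.
text_per_page = {
    'Page_0': [
        ['First paragraph\n', 'image', 'table'],         # page_text
        [[('Arial', 11.0)], 'image', 'table'],           # line_format
        ['Text read from the image via OCR\n'],          # text_from_images
        ['|Header A|Header B|\n|1|2|\n'],                # text_from_tables
        ['First paragraph\n', 'OCR text\n', '|1|2|\n'],  # page_content
    ],
}

# Unpack the five lists for the first page
page_text, line_format, images, tables, content = text_per_page['Page_0']
print(len(content))  # 3
```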

Afterwards, we'll close the PDF file.

Then we'll delete all the additional files created during the process.

Finally, we can display the content of the page by joining the elements of the page_content list.
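A minimal illustration of that final join, with placeholder content in place of a real extraction result:

```python
# page_content sits at index 4 of the per-page list; joining it
# reconstructs the page's text in reading order.
text_per_page = {'Page_0': [[], [], [], [], ['Hello ', 'PDF ', 'world\n']]}

result = ''.join(text_per_page['Page_0'][4])
print(result)  # prints "Hello PDF world"
```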
