I found this question, but it uses the command line, and I do not want to call a Python script from the command line via subprocess and then parse HTML files to get the font information.
I want to use PDFMiner as a library. I also found this question, but the answers are all about extracting plain text, without other information such as font name, font size, and so on.
This approach does not use PDFMiner but does the trick.
First, convert the PDF document to docx. Using python-docx you can then retrieve font information, for example all the bold text.
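Collecting the bold text could be sketched roughly like this. It assumes the PDF has already been converted to a .docx file by some other tool (that step is not shown) and that python-docx is installed; the names `bold_texts` and `bold_text_in_docx` are made up for illustration.

```python
def bold_texts(paragraphs):
    """Return the text of every run whose bold flag is explicitly set."""
    found = []
    for paragraph in paragraphs:
        for run in paragraph.runs:
            # run.bold is True, False, or None (None means "inherited from the
            # style"), so inherited bold is NOT detected by this sketch.
            if run.bold and run.text.strip():
                found.append(run.text)
    return found


def bold_text_in_docx(docx_path):
    """Open a .docx file with python-docx and collect its bold runs."""
    from docx import Document  # provided by the python-docx package

    return bold_texts(Document(docx_path).paragraphs)
```

Each run also has `run.font.name` and `run.font.size`, which may be `None` when the value comes from the paragraph style rather than from the run itself.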
If you really want to use PDFMiner, you can try its pdf2txt.py tool. Passing '-t html' converts the PDF into HTML that carries the font names and sizes.
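If shelling out is the sticking point, pdfminer.six (the maintained fork) exposes the same HTML conversion as a library call, `pdfminer.high_level.extract_text_to_fp`. The sketch below is only a rough idea: the regular expression assumes each text run is written as a span with a `font-family: NAME; font-size:NNpx` style, which is what the HTML converter appears to emit but is not a documented contract. The function names are hypothetical.

```python
import io
import re

# Assumed shape of the style attribute written by pdfminer's HTML converter.
SPAN_STYLE = re.compile(r'font-family:\s*([^;"]+);\s*font-size:\s*(\d+)px')


def fonts_in_html(html):
    """Return the distinct (font name, size in px) pairs found in the HTML."""
    pairs = {(name.strip(), int(size)) for name, size in SPAN_STYLE.findall(html)}
    return sorted(pairs)


def pdf_to_html(pdf_path):
    """Convert a PDF to an HTML string without calling pdf2txt.py."""
    from pdfminer.high_level import extract_text_to_fp  # pdfminer.six
    from pdfminer.layout import LAParams

    out = io.BytesIO()  # the HTML converter writes encoded bytes
    with open(pdf_path, "rb") as pdf_file:
        extract_text_to_fp(pdf_file, out, output_type="html",
                           laparams=LAParams(), codec="utf-8")
    return out.getvalue().decode("utf-8")
```

Something like `fonts_in_html(pdf_to_html("file.pdf"))` would then list the fonts, although walking the layout objects directly avoids the HTML round trip entirely.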
Have a look at PDFlib; it can extract the font info you require, and it has a Python library that you can import into your scripts.
If you want to get the font size or font name from a PDF file using the PDFMiner library, you have to interpret the whole PDF page. You also have to decide which word or phrase you want the font size and font name for, since a single page can contain words with different font sizes.
The structure PDFMiner gives you for a page is: PDFPageInterpreter -> LTTextBox -> LTChar (with LTTextLine objects sitting between a text box and its characters).
Once you have found the word you are interested in, read the size attribute of its characters for the font size (which is actually the character height) and the fontname attribute for the font. The code should take the PDF file path, the word whose font size you want, and the number of the page on which that word appears.
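A minimal sketch of that lookup, under some assumptions: it uses `extract_pages` from pdfminer.six instead of wiring up PDFPageInterpreter by hand, it treats the page number as 1-based, and it recognises characters by duck typing (anything with a `fontname` attribute) rather than `isinstance(obj, LTChar)`. The function names are invented for the example.

```python
def iter_chars(layout_obj):
    """Yield every character object (anything with a fontname) below layout_obj."""
    if hasattr(layout_obj, "fontname"):
        yield layout_obj  # LTChar: carries fontname, size and get_text()
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:  # LTPage, LTTextBox, LTTextLine, LTFigure ...
            yield from iter_chars(child)


def font_of_word(chars, word):
    """Return (fontname, size) of the first character of word, or None."""
    if not word:
        return None
    text, owners = "", []
    for char in chars:
        piece = char.get_text()  # may be longer than one character (ligatures)
        text += piece
        owners.extend([char] * len(piece))
    start = text.find(word)
    if start == -1:
        return None
    first = owners[start]
    return first.fontname, first.size


def font_of_word_in_pdf(pdf_path, word, page_number):
    """Look the word up on one page; page_number is 1-based here (an assumption)."""
    from pdfminer.high_level import extract_pages  # pdfminer.six

    for page_layout in extract_pages(pdf_path, page_numbers=[page_number - 1]):
        return font_of_word(iter_chars(page_layout), word)
    return None
```

Because the characters of the whole page are joined into one string, this works best for single words. Spaces that PDFMiner inserts itself are LTAnno objects and are skipped, so a multi-word phrase may not match.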
You can also check which other properties the LTChar class supports.
Some of the information is at a lower level, in the LTChar class. That seems logical, because font size, italic, bold, etc. can be applied to a single character.
More info here: https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L222
But I'm still confused about why font color is not in this class.
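As a rough illustration of what can be read from a single LTChar: `fontname`, `size` and `get_text()` come from the class itself, but as far as I can tell there is no bold or italic flag, so the sketch below guesses those from the font name. That is only a heuristic and will miss fonts whose names do not contain "Bold", "Italic" or "Oblique"; `describe_char` is a hypothetical helper name.

```python
def describe_char(char):
    """Summarise one LTChar-like object as a plain dict."""
    # Embedded subset fonts are often named like "ABCDEF+Arial-BoldMT";
    # drop the subset prefix before inspecting the name.
    base = char.fontname.split("+", 1)[-1]
    lowered = base.lower()
    return {
        "text": char.get_text(),
        "font": base,
        "size": round(char.size, 2),
        # Heuristic only: style is inferred from the font name.
        "bold": "bold" in lowered,
        "italic": "italic" in lowered or "oblique" in lowered,
    }
```

Calling this for each LTChar inside an LTTextLine gives a per-character view of the page.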
I hope this could help you :)
Get the font-family:
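One possible way to do this, sketched with pdfminer.six and its `extract_pages` helper (an assumption, since the thread does not say which PDFMiner version is in use). Characters are detected by the presence of a `fontname` attribute; the function names are illustrative.

```python
from collections import Counter


def iter_chars(layout_obj):
    """Yield every LTChar-like object (anything with a fontname) below layout_obj."""
    if hasattr(layout_obj, "fontname"):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_chars(child)


def font_families(layout_obj):
    """Count how many characters use each font name."""
    return Counter(char.fontname for char in iter_chars(layout_obj))


def font_families_in_pdf(pdf_path):
    """Tally font names over every page of the PDF."""
    from pdfminer.high_level import extract_pages  # pdfminer.six

    totals = Counter()
    for page_layout in extract_pages(pdf_path):
        totals.update(font_families(page_layout))
    return totals
```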
Get the font-size:
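A small sketch for the sizes, again treating anything with a `fontname` attribute as a character. Note that `size` on an LTChar is the character height in points, which can differ slightly from the nominal font size, hence the rounding. The helper names and the default of one decimal place are my own choices.

```python
from collections import Counter


def iter_chars(layout_obj):
    """Yield every LTChar-like object (anything with a fontname) below layout_obj."""
    if hasattr(layout_obj, "fontname"):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_chars(child)


def font_sizes(layout_obj, ndigits=1):
    """Count characters per size, rounded so 9.96 and 10.0 share a bucket."""
    return Counter(round(char.size, ndigits) for char in iter_chars(layout_obj))


def dominant_size(layout_obj):
    """Return the most frequent character size, or None when there is no text."""
    sizes = font_sizes(layout_obj)
    return sizes.most_common(1)[0][0] if sizes else None
```

With pdfminer.six, each page layout yielded by `pdfminer.high_level.extract_pages` can be passed in as `layout_obj`.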
Get the font position:
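Positions come from the `bbox` attribute, a tuple (x0, y0, x1, y1) in PDF points with the origin at the bottom-left corner of the page. The sketch below collects one bounding box per character and adds a hypothetical helper, `to_top_left`, for anyone who prefers coordinates measured from the top of the page.

```python
def iter_chars(layout_obj):
    """Yield every LTChar-like object (anything with a fontname) below layout_obj."""
    if hasattr(layout_obj, "fontname"):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_chars(child)


def char_positions(page_layout):
    """Return (text, (x0, y0, x1, y1)) for each character, in PDF points."""
    return [(char.get_text(), char.bbox) for char in iter_chars(page_layout)]


def to_top_left(bbox, page_height):
    """Convert a bottom-left-origin bbox to (left, top, right, bottom)."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)
```

For an LTPage, the page height needed by `to_top_left` is available as `page_layout.height`.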
Get the image info:
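Images show up in the layout as LTImage objects, usually nested inside an LTFigure. A sketch assuming pdfminer.six: images are detected by their `srcsize` attribute (the pixel dimensions of the embedded image), and the choice of fields in `image_info` is just one possible selection.

```python
def iter_images(layout_obj):
    """Yield every LTImage-like object (anything with srcsize) below layout_obj."""
    if hasattr(layout_obj, "srcsize"):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:  # LTPage, LTFigure, ...
            yield from iter_images(child)


def image_info(image):
    """Collect a few descriptive fields from one image object."""
    return {
        "name": image.name,
        "bbox": image.bbox,            # position on the page, in points
        "pixel_size": image.srcsize,   # (width, height) of the embedded image
        "bits": image.bits,            # bits per colour component
    }


def images_in_pdf(pdf_path):
    """Return (page number, info dict) for every image in the PDF."""
    from pdfminer.high_level import extract_pages  # pdfminer.six

    found = []
    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for image in iter_images(page_layout):
            found.append((page_number, image_info(image)))
    return found
```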