add support for multiple column documents & add read docs
meldonization committed Mar 19, 2020
1 parent dfd5b5b commit 22e666f
Showing 26 changed files with 701 additions and 144 deletions.
128 changes: 116 additions & 12 deletions README.md
@@ -17,7 +17,7 @@ from depdf import DePDF
from depdf import DePage

# general
with DePDF.load('test/test_general.pdf') as pdf
with DePDF.load('test/test.pdf') as pdf
pdf_html = pdf.to_html
print(pdf_html)

@@ -27,7 +27,7 @@ c = Config(
verbose_flag=True,
add_line_flag=True
)
pdf = DePDF.load('test/test_general.pdf', config=c)
pdf = DePDF.load('test/test.pdf', config=c)
page_index = 23 # start from zero
page = pdf_file.pages[page_index]
page_soup = page.soup
@@ -62,12 +62,12 @@ print(page_soup.text)
| `bbox` | bounding box region |
| `save_html` | write html tag to local file|

## DePDf HTML structure
## DePDF HTML structure
```html
<div class="{pdf_class}">
%for <!--page-{pid}-->
<div id="page-{}" class="{}">
%for {html_elements} endfor%
<div id="page-{pid}" class="{page_class}">
%for {in_page_elements} endfor%
</div>
endfor%
</div>
@@ -78,7 +78,7 @@ print(page_soup.text)
### Paragraph
```html
<p>
{paragraph-content}
{text-content}
<span> {span-content} </span>
...
</p>
@@ -93,7 +93,7 @@ print(page_soup.text)
...
</tr>
<tr colspan=2>
<td> {cell_1_0} </td>
<td> {merged_cell_1_0} </td>
...
</tr>
...
@@ -104,15 +104,119 @@ print(page_soup.text)
```
<img src="temp_depdf/$prefix.png"></img>
```

# Configuration encyclopedia

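## PDF parsing

| **keyword** | detail | default |
|:---|---|---|
| logo_flag | Whether to analyze watermark information shared across pages | `True` |
| header_footer_flag | Whether to analyze header and footer information shared across pages | `True` |
| temp_dir_prefix | Whether to analyze header and footer information shared across pages | temp_depdf |
| unique_prefix | File name used for generated temporary image files (usually generated automatically) | |

## Page parsing

| **keyword** | detail | default |
|:---|---|---|
| table_flag | Whether to parse tables | `True` |
| paragraph_flag | Whether to parse paragraphs | `True` |
| image_flag | Whether to parse images | `True` |
| resolution | Resolution of the page preview image generated in debug mode | 300 |
| main_frame_tolerance | Threshold for identifying the main text region within a page | |
| x_tolerance | Horizontal threshold for identifying text lines within a page | |
| y_tolerance | Vertical threshold for identifying text lines within a page | |
| page_num_top_fraction | Ratio of the page number's top-boundary distance to the page height, used to identify page numbers | |
| page_num_left_fraction | Used to identify page number information within a page | |
| page_num_right_fraction | Used to identify page number information within a page | |

## Page column detection

| **keyword** | detail | default |
|:---|---|---|
| multiple_columns_flag | Whether to detect multi-column pages | `True` |
| max_columns | Maximum number of columns to detect on a multi-column page | 3 |
| column_region_half_width | Width of the column boundary used when detecting multi-column pages | |
| min_column_region_objects | Upper limit on the number of objects within a column boundary when detecting multi-column pages | |

## Character extraction

| **keyword** | detail | default |
|:---|---|---|
| char_overlap_size | Threshold for deciding whether characters overlap | |
| default_char_size | Default character size | |
| char_size_upper | Upper limit of the detected character size | |
| char_size_lower | Lower limit of the detected character size | |

## Table extraction

| **keyword** | detail | default |
|:---|---|---|
| dotted_line_flag | Whether to analyze dotted lines within a page | |
| curved_line_flag | Whether to analyze curves within a page | |
| snap_flag | Whether to merge table line segments | |
| add_line_flag | Whether to add horizontal and vertical lines to tables | |
| min_double_line_tolerance | Lower distance limit for deciding whether line segments are adjacent double lines | |
| max_double_line_tolerance | Upper distance limit for deciding whether line segments are adjacent double lines | |
| vertical_double_line_tolerance | Upper distance limit for deciding whether line segments are vertically adjacent double lines | |
| table_cell_merge_tolerance | Width-difference tolerance for merging cells | |
| skip_empty_table | Whether to skip empty tables | |
| add_vertical_lines_flag | Whether to add vertical lines | |
| add_horizontal_lines_flag | Whether to add horizontal lines | |
| add_horizontal_line_tolerance | Threshold for adding horizontal lines | |

## Image extraction

| **keyword** | detail | default |
|:---|---|---|
| min_image_size | Minimum side length, in pixels, for an image to be recognized | 80 |
| image_resolution | Resolution of extracted images | 300 |

## Header and footer detection

| **keyword** | detail | default |
|:---|---|---|
| default_head_tail_page_offset_percent | Offset ratio for headers and footers | |

## Logging output

| **keyword** | detail | default |
|:---|---|---|
| log_level | Log level | `WARNING` |
| verbose_flag | Whether to output intermediate runtime information | `False` |
| debug_flag | Whether to enable debugging (generates boundary information for parsed objects) | `False` |

## Generated HTML tags

| **keyword** | detail | default |
|:---|---|---|
| span_class | Class of the span nodes in the generated HTML | pdf-span |
| paragraph_class | Class of the p nodes in the generated HTML | pdf-paragraph |
| table_class | Class of the table nodes in the generated HTML | pdf-table |
| pdf_class | Class of the outermost pdf div node in the generated HTML | pdf-content |
| image_class | Class of the img nodes in the generated HTML | pdf-image |
| page_class | Class of the page div in the generated HTML | pdf-page |
| mini_page_class | Class of the mini-page div in the generated HTML | pdf-mini-page |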
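
To see how the class defaults in the table above relate to the layout under "DePDF HTML structure", here is a minimal sketch in plain Python. It does not import depdf: `render_pdf` is a hypothetical helper, and the paragraph markup is simplified (the library also adds ids and a `page-{pid}` class).

```python
# Hypothetical helper, not part of depdf. The class names are the
# documented defaults for pdf_class, page_class and paragraph_class.
PDF_CLASS = "pdf-content"
PAGE_CLASS = "pdf-page"
PARAGRAPH_CLASS = "pdf-paragraph"


def render_pdf(pages):
    """Wrap each page's paragraph texts in the documented div structure."""
    parts = ['<div class="{}">'.format(PDF_CLASS)]
    # Page ids start from 1, matching the <!--page-{pid}--> marker.
    for pid, paragraphs in enumerate(pages, start=1):
        parts.append("<!--page-{}-->".format(pid))
        parts.append('<div id="page-{}" class="{}">'.format(pid, PAGE_CLASS))
        for text in paragraphs:
            parts.append('<p class="{}">{}</p>'.format(PARAGRAPH_CLASS, text))
        parts.append("</div>")
    parts.append("</div>")
    return "\n".join(parts)


html = render_pdf([["first page text"], ["second page text"]])
print(html)
```

Overriding a keyword such as `page_class` in the config should change only the class attribute, not the nesting.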


# Update log

* `2020-03-18` add support for multiple-column PDFs
* `2020-03-12` initial depdf release


# Appendix

## todo

* [x] add support for multiple-column pdf page
* [x] better table structure recognition
* [x] recognize embedded objects inside page elements


## DePage element denotations
> Useful element properties within page
![page element](annotations.jpg)

## todo

* [ ] add support for multiple-column pdf page
* [ ] better table structure recognition
* [x] recognize embedded objects inside page elements
9 changes: 9 additions & 0 deletions depdf/__init__.py
@@ -1,3 +1,12 @@
"""
depdf
====================================
An ultimate pdf file disintegration tool.
DePDF is designed to extract tables and paragraphs
into structured markup language [eg. html] from embedding pdf pages.
You can also use it to convert page/pdf to html.
"""

from depdf.api import *
from depdf.config import Config
from depdf.pdf import DePDF
32 changes: 16 additions & 16 deletions depdf/api.py
@@ -31,46 +31,46 @@ def wrapper(pdf_file_path, *args, **kwargs):


@api_load_pdf
def convert_pdf_to_html(pdf_file, **kwargs):
def convert_pdf_to_html(pdf, **kwargs):
"""
:param pdf_file: pdf file absolute path
:param pdf: pdf file path
:param kwargs: config keyword arguments
:return:
:return: pdf html string
"""
return pdf_file.html
return pdf.html


@api_load_pdf
def convert_page_to_html(pdf_file, pid, **kwargs):
def convert_page_to_html(pdf, pid, **kwargs):
"""
:param pdf_file: pdf file absolute path
:param pdf: pdf file path
:param pid: page number start from 1
:param kwargs: config keyword arguments
:return:
:return: page html string
"""
page = pdf_file.pages[pid - 1]
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
return page.html


@api_load_pdf
def extract_page_tables(pdf_file, pid, **kwargs):
def extract_page_tables(pdf, pid, **kwargs):
"""
:param pdf_file: pdf file absolute path
:param pdf: pdf file path
:param pid: page number start from 1
:param kwargs: config keyword arguments
:return:
:return: page tables list
"""
page = pdf_file.pages[pid - 1]
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
return page.tables


@api_load_pdf
def extract_page_paragraphs(pdf_file, pid, **kwargs):
def extract_page_paragraphs(pdf, pid, **kwargs):
"""
:param pdf_file: pdf file absolute path
:param pdf: pdf file path
:param pid: page number start from 1
:param kwargs: config keyword arguments
:return:
:return: page paragraphs list
"""
page = pdf_file.pages[pid - 1]
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
return page.paragraphs
14 changes: 12 additions & 2 deletions depdf/base.py
@@ -1,7 +1,7 @@
from decimal import Decimal

from depdf.utils import convert_html_to_soup
from depdf.error import BoxValueError
from depdf.utils import convert_html_to_soup, repr_str


class Box(object):
@@ -49,6 +49,9 @@ class Base(object):
_cached_properties = ['_html']
_html = ''

def __repr__(self):
return '<depdf.Base: {}>'.format(repr_str(self.soup.text))

@property
def html(self):
return self._html
@@ -61,6 +64,9 @@ def html(self, html_value):
def soup(self):
return convert_html_to_soup(self._html)

def to_soup(self, parser):
return convert_html_to_soup(self._html, parser=parser)

def write_to(self, file_name):
with open(file_name, "w") as file:
file.write(self.html)
@@ -69,7 +75,7 @@ def write_to(self, file_name):
def to_dict(self):
return {
i: getattr(self, i, None) for i in dir(self)
if not i.startswith('_') and i != 'to_dict'
if not i.startswith('_') and i not in ['to_dict', 'refresh', 'reset', 'write_to', 'to_soup']
}

def _get_cached_property(self, key, calculate_function, *args, **kwargs):
@@ -101,4 +107,8 @@ class InnerWrapper(Base):

@property
def inner_objects(self):
return self._inner_objects

@property
def to_dict(self):
return [obj.to_dict if hasattr(obj, 'to_dict') else obj for obj in self._inner_objects]
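
The widened exclusion list in `Base.to_dict` can be illustrated with a stand-in class. This is a sketch only: `Demo` and `_EXCLUDED` are hypothetical names, and the class does not inherit from depdf's `Base`. It shows that names in the list are filtered out before `getattr` runs, so methods such as `write_to` no longer appear in the result.

```python
class Demo(object):
    """Stand-in for a depdf.Base subclass; the names are illustrative only."""

    # Same names the diff excludes; the leading underscore keeps this
    # attribute itself out of the result.
    _EXCLUDED = ['to_dict', 'refresh', 'reset', 'write_to', 'to_soup']

    def __init__(self):
        self.pid = '1'
        self._html = '<p>hi</p>'

    def write_to(self, file_name):
        """Placeholder method that should not show up in to_dict."""

    @property
    def html(self):
        return self._html

    @property
    def to_dict(self):
        # The filter runs before getattr, so 'to_dict' never recurses.
        return {
            name: getattr(self, name, None) for name in dir(self)
            if not name.startswith('_') and name not in self._EXCLUDED
        }


result = Demo().to_dict
print(result)
```

With only `'to_dict'` excluded, as before this change, the bound method `write_to` would also have been included as a value.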
14 changes: 10 additions & 4 deletions depdf/components/image.py
@@ -9,14 +9,20 @@ class Image(Base, Box):
object_type = 'image'

@check_config
def __init__(self, bbox=None, src='', pid=1, img_idx=1, scan=False, config=None):
def __init__(self, bbox=None, src='', percent=100, pid='1', img_idx=1, scan=False, config=None):
self.bbox = bbox
self.scan = scan
width = bbox[2] - bbox[0]
self.src = src
self.img_idx = img_idx
self.pid = pid
img_id = 'page-{pid}-image-{img_idx}'.format(pid=pid, img_idx=img_idx)
img_class = '{img_class} page-{pid}'.format(img_class=getattr(config, 'image_class'), pid=pid)
html = '<img id="{img_id}" class="{img_class}" src="{src}" width="{width}">'.format(
img_id=img_id, img_class=img_class, src=src, width=width
html = '<img id="{img_id}" class="{img_class}" src="{src}" width="{percent}%">'.format(
img_id=img_id, img_class=img_class, src=src, percent=min(round(percent), 100)
)
html += '</img>'
self.html = html

def __repr__(self):
scan_flag = '[scan]' if self.scan else ''
return '<depdf.Image{}: ({}, {}) -> {}>'.format(scan_flag, self.pid, self.img_idx, self.src)
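
The switch from an absolute width to a clamped percentage in `Image` can be sketched as a standalone function. `image_tag` is a hypothetical helper that assumes Python 3 rounding and the documented `pdf-image` default class. It mirrors the string formatting in the diff, not the full constructor.

```python
def image_tag(src, percent, pid='1', img_idx=1, image_class='pdf-image'):
    """Build an <img> tag whose width is a percentage capped at 100."""
    img_id = 'page-{pid}-image-{img_idx}'.format(pid=pid, img_idx=img_idx)
    img_class = '{img_class} page-{pid}'.format(img_class=image_class, pid=pid)
    # Round to the nearest integer, then never exceed the container width.
    width = min(round(percent), 100)
    html = '<img id="{img_id}" class="{img_class}" src="{src}" width="{width}%">'.format(
        img_id=img_id, img_class=img_class, src=src, width=width
    )
    return html + '</img>'


small = image_tag('temp_depdf/example.png', 42.6)
large = image_tag('temp_depdf/example.png', 137)
print(small)
print(large)
```

Capping at 100 keeps an image from overflowing its container when the computed percentage exceeds 100.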
8 changes: 5 additions & 3 deletions depdf/components/paragraph.py
@@ -1,7 +1,7 @@
from depdf.base import Box, InnerWrapper
from depdf.config import check_config
from depdf.log import logger_init
from depdf.utils import calc_bbox, construct_style
from depdf.utils import calc_bbox, construct_style, repr_str

log = logger_init(__name__)

@@ -10,7 +10,7 @@ class Paragraph(InnerWrapper, Box):
object_type = 'paragraph'

@check_config
def __init__(self, bbox=None, text='', pid=1, para_idx=1, config=None, inner_objects=None, style=None, align=None):
def __init__(self, bbox=None, text='', pid='1', para_idx=1, config=None, inner_objects=None, style=None, align=None):
para_id = 'page-{pid}-paragraph-{para_id}'.format(pid=pid, para_id=para_idx)
para_class = '{para_class} page-{pid}'.format(para_class=getattr(config, 'paragraph_class'), pid=pid)
style_text = construct_style(style=style)
@@ -35,7 +35,9 @@ def __init__(self, bbox=None, text='', pid=1, para_idx=1, config=None, inner_obj
self.html = html

def __repr__(self):
return '<depdf.Paragraph: ({}, {})>'.format(self.pid, self.para_id)
if hasattr(self, 'text'):
return '<depdf.Paragraph: ({}, {}) {}>'.format(self.pid, self.para_id, repr_str(self.text))
return '<depdf.Paragraph[InnerObjects]: ({}, {})>'.format(self.pid, self.para_id)

def save_html(self):
paragraph_file_name = '{}_page_{}_paragraph_{}.html'.format(self.config.unique_prefix, self.pid, self.para_id)
5 changes: 4 additions & 1 deletion depdf/components/span.py
@@ -1,7 +1,7 @@
from depdf.base import Base, Box
from depdf.config import check_config
from depdf.log import logger_init
from depdf.utils import construct_style
from depdf.utils import construct_style, repr_str

log = logger_init(__name__)

@@ -18,3 +18,6 @@ def __init__(self, bbox=None, span_text='', config=None, style=None):
self.html = '<span class="{span_class}"{style_text}>{span_text}</span>'.format(
span_class=span_class, span_text=span_text, style_text=style_text
)

def __repr__(self):
return '<depdf.Span: {}>'.format(repr_str(self.text))