Bug
We are extracting text from this PDF with docling in the usual way, and the extracted text contains duplicated characters.
Text in PDF

Text Extracted
- (4) Given the sensitivity of personal electronic health data, this Regulation seeks to provide sufficient safeguards at both Union and national level to ensure a high degree of data protection, security, y, confidentiality and ethical use. Such safeguards are necessary to promote trust in safe handling of electronic health data of natural persons for primary use and secondary use as defined in this Regulation.
Here the extracted text contains "security, y," instead of "security,".
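To check other passages for the same artifact, a small heuristic like the one below may help. It is only a sketch and not part of docling: the helper name `find_duplicated_fragments` is hypothetical, and the regex assumes the duplication always looks like the case above (the last character of a word plus its punctuation mark, repeated after a space), so it can produce false positives on text such as "a, a, b".

```python
import re

# Heuristic pattern: one word character followed by a punctuation mark,
# then a space, then the same character and the same punctuation mark again,
# e.g. "y, y," inside "security, y, confidentiality".
_DUP_FRAGMENT = re.compile(r"(\w)([,.;:]) \1\2")


def find_duplicated_fragments(text: str) -> list[str]:
    """Return every substring that looks like a duplicated trailing fragment."""
    return [match.group(0) for match in _DUP_FRAGMENT.finditer(text)]


# Excerpt of the extracted text quoted in this report.
extracted = (
    "a high degree of data protection, security, y, "
    "confidentiality and ethical use."
)
print(find_duplicated_fragments(extracted))  # ['y, y,']
```

Running this over the full Markdown output would show whether the duplication is limited to this one sentence or occurs throughout the document.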
Steps to reproduce
ocr_options = TesseractOcrOptions()
device = AcceleratorDevice.AUTO
accelerator_options = AcceleratorOptions(
    num_threads=2,
    device=device,
)
pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    ocr_options=ocr_options,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.FAST, do_cell_matching=True
    ),
    do_table_structure=True,
    accelerator_options=accelerator_options,
    artifacts_path=self.docling_artifacts_path or None,
)
converter = CustomDocumentConverter(
    allowed_formats=DOCLING_ALLOWED_FORMATS,
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        ),
    },
)
# Then invoked like this:
md_text = converter.convert()
Docling version
docling 2.54.0
docling-core 2.48.4
docling-ibm-models 3.9.1
docling-parse 4.5.0
Python version
Python 3.12.7
PDF File needed to reproduce this bug