Duplicated characters near comma #2377

@gabinguo

Bug

We are extracting text from this PDF with docling using a standard setup (OCR disabled), and the extracted text contains duplicated characters.

Text in PDF

[Image: the passage as it appears in the PDF]

Text Extracted

  • (4) Given the sensitivity of personal electronic health data, this Regulation seeks to provide sufficient safeguards at both Union and national level to ensure a high degree of data protection, security, y, confidentiality and ethical use. Such safeguards are necessary to promote trust in safe handling of electronic health data of natural persons for primary use and secondary use as defined in this Regulation.

Here the output contains "security, y," instead of "security,": the last character of the word is repeated after the comma.
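To check a longer extraction for the same symptom, a rough heuristic like the one below may help. It is only a sketch and not part of docling: the function names `find_duplicated_tails` and `strip_duplicated_tails` are hypothetical, and the rule (a fragment of up to three characters that repeats the tail of the preceding word and is followed by the same punctuation mark) is an assumption based on the single example above. It can produce false positives on legitimate text such as "cat, at,".

```python
import re

# Matches "<word><punct> <short fragment><same punct>", e.g. "security, y,".
# The three-character limit on the fragment is an arbitrary assumption.
_PATTERN = re.compile(r"\b(\w+)([,.;:]) (\w{1,3})\2")


def _is_duplicated_tail(word: str, fragment: str) -> bool:
    """True if the fragment is a strictly shorter copy of the word's tail."""
    return len(fragment) < len(word) and word.endswith(fragment)


def find_duplicated_tails(text: str) -> list[str]:
    """Return the spans that look like 'word, tail,' duplication artifacts."""
    return [
        m.group(0)
        for m in _PATTERN.finditer(text)
        if _is_duplicated_tail(m.group(1), m.group(3))
    ]


def strip_duplicated_tails(text: str) -> str:
    """Remove the repeated fragment, keeping the word and its punctuation."""

    def _fix(m: re.Match) -> str:
        word, punct, fragment = m.groups()
        if _is_duplicated_tail(word, fragment):
            return word + punct
        return m.group(0)

    return _PATTERN.sub(_fix, text)


sample = "data protection, security, y, confidentiality and ethical use."
print(find_duplicated_tails(sample))
print(strip_duplicated_tails(sample))
```

This only helps to locate or mask the symptom in extracted text; it does not address whatever causes the duplication during extraction.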

Steps to reproduce

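from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
    TesseractOcrOptions,
)
from docling.document_converter import PdfFormatOption

# CustomDocumentConverter, DOCLING_ALLOWED_FORMATS and self.docling_artifacts_path
# are defined in our application code (not shown here).

ocr_options = TesseractOcrOptions()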
device = AcceleratorDevice.AUTO

accelerator_options = AcceleratorOptions(
    num_threads=2,
    device=device,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    ocr_options=ocr_options,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.FAST, do_cell_matching=True
    ),
    do_table_structure=True,
    accelerator_options=accelerator_options,
    artifacts_path=self.docling_artifacts_path or None,
)

converter = CustomDocumentConverter(
    allowed_formats=DOCLING_ALLOWED_FORMATS,
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        ),
    },
)

# Then invoked like this:
md_text = converter.convert()

Docling version

docling                                  2.54.0
docling-core                             2.48.4
docling-ibm-models                       3.9.1
docling-parse                            4.5.0

Python version

Python 3.12.7


PDF File needed to reproduce this bug

PDF.pdf
