
[Data/LLM] Error handling for invalid input rows #57140

@PawaritL

Description


When using build_llm_processor for batch inference, there is currently no mechanism to skip or filter out invalid rows, or rows that raise errors during processing. As a result, the whole pipeline can fail because of just a few bad rows.
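For context, a minimal usage sketch (model name and config values here are illustrative, roughly following the Ray Data LLM examples). Today, a single malformed row is enough to abort the whole run:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # illustrative values only
    concurrency=1,
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

# One malformed row (prompt=None) can make the tokenize/map stage raise
# and abort the entire job, even if every other row is fine.
ds = ray.data.from_items([{"prompt": "hello"}, {"prompt": None}])
ds = processor(ds)
ds.write_parquet("/tmp/llm_outputs")
```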

Proposal:

  • An exception_handler_hook for the LLM stage so that users can customize error handling

Alternatively, support something similar to Gemini and OpenAI batch jobs, which write invalid/errored records to a separate file. A rough sketch of what either option could look like is below.
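Neither `exception_handler_hook` nor `failed_rows_path` exists in Ray today; the names and signature below are illustrative of the proposal only, reusing the `config` / preprocess / postprocess from the example above:

```python
# Hypothetical API sketch -- these arguments do not exist yet.
def on_row_error(row: dict, exc: Exception):
    # Option A: return a fallback record so the row stays in the output...
    return {"prompt": row.get("prompt"), "generated_text": None, "error": repr(exc)}
    # ...or return None to drop the row instead.

processor = build_llm_processor(
    config,
    preprocess=...,   # same preprocess/postprocess as before
    postprocess=...,
    exception_handler_hook=on_row_error,           # proposal 1: user-defined handler
    # failed_rows_path="s3://my-bucket/errors/",   # proposal 2: sidecar error output,
    #                                              #   similar to Gemini/OpenAI batch jobs
)
```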

Use case

When using build_llm_processor for batch inference, there is currently no way to skip invalid rows or rows that raise errors, so the whole pipeline can fail because of just a few bad rows. See the example error below:

2025-10-01 13:47:57,074 ERROR streaming_executor_state.py:553 -- An exception was raised from a task of operator "MapBatches(TokenizeUDF)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.

It’s hard to pre-check everything in the preprocess step, so it would be nice to handle these errors and capture the failed records somewhere instead of failing the whole pipeline (especially on a very large dataset).
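For completeness, the closest existing knob seems to be the one the error message points at, but it works at block granularity and doesn't record which rows failed or why:

```python
import ray

# Existing partial workaround: tolerate failed blocks instead of aborting.
# This drops whole blocks (not individual rows), and the failed records are
# not captured anywhere for later inspection or retry.
ctx = ray.data.DataContext.get_current()
ctx.max_errored_blocks = 10   # or -1 to tolerate any number of errored blocks
```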


Labels

data (Ray Data-related issues), enhancement (Request for new feature and/or capability), llm, stability, triage (Needs triage: priority, bug/not-bug, owning component), usability
