Description
When using `build_llm_processor` to perform batch inference, there is currently no logic to filter out invalid rows or rows that lead to errors, so the whole pipeline fails (potentially due to just a few rows).
Proposal:
- Add an `exception_handler_hook` for the LLM stage so that a user can customize error handling (see the sketch below).
- Alternatively, support something similar to Gemini and OpenAI, which write invalid/errored records to a separate file.
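
A minimal sketch of what the first option could look like, assuming a new `exception_handler_hook` parameter on `build_llm_processor`. The parameter name, its signature, and the placeholder-row behavior are all assumptions for illustration, not existing API:

```python
# Hypothetical sketch of the proposed hook; `exception_handler_hook` does not
# exist in Ray today. Its name, signature, and semantics are assumptions.
from typing import Any, Dict, Optional

from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.1-8B-Instruct")

def handle_llm_error(row: Dict[str, Any], error: Exception) -> Optional[Dict[str, Any]]:
    # Return None to drop the row, or return a placeholder record so the
    # failure can be inspected after the run instead of aborting the pipeline.
    return {**row, "generated_text": None, "error": repr(error)}

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
    exception_handler_hook=handle_llm_error,  # proposed parameter (assumption)
)
```

The same hook could also cover the second option: instead of returning a placeholder row, it could append the failed record to a side output file, mirroring the Gemini/OpenAI batch behavior.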
Use case
When using `build_llm_processor` to perform batch inference, there's currently no logic to filter out invalid rows or rows that lead to errors. This causes the whole pipeline to fail (potentially due to just a few rows); see the example error below:
```
2025-10-01 13:47:57,074 ERROR streaming_executor_state.py:553 -- An exception was raised from a task of operator "MapBatches(TokenizeUDF)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
```
It's hard to pre-check everything in the preprocess step, so it would be nice to handle these failures and capture the affected records somewhere instead of failing the whole pipeline (especially on a very large dataset).
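
For reference, the only knob available today appears to be the one the error message points at, `DataContext.max_errored_blocks`. It lets the run continue past failures, but it drops whole blocks silently rather than capturing the failed rows. A minimal sketch:

```python
# Current workaround: tolerate up to N errored blocks instead of aborting.
# Note this drops entire blocks (not individual rows) and does not record
# which inputs failed -- which is the gap this issue is about.
import ray

ctx = ray.data.DataContext.get_current()
ctx.max_errored_blocks = 10  # or -1 to tolerate any number of failed blocks
```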