Evaluate high school math reasoning in LLMs with baseline and Chain-of-Thought (CoT) prompts. Includes confidence calibration metrics, JSON output parsing, and reliability analysis.
openai gpt json-parsing model-evaluation interpretability reliability-analysis confidence-calibration llm prompt-engineering chain-of-thought-reasoning safe-ai
Updated May 29, 2025 - Python
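
The description mentions JSON output parsing and confidence calibration metrics but does not show how either is done. The following is a minimal sketch of one common approach, not the repository's actual code: it assumes the model is prompted to reply with a JSON object containing `answer` and `confidence` keys (both key names are assumptions), and it uses expected calibration error (ECE) as a representative calibration metric. The function names `parse_reply` and `expected_calibration_error` are hypothetical.

```python
import json
import re


def parse_reply(reply):
    """Pull an (answer, confidence) pair out of a model reply, or return None.

    Assumes the reply contains one JSON object with "answer" and
    "confidence" keys; these key names are illustrative, not the repo's.
    """
    # Take the span from the first "{" to the last "}" so surrounding
    # chatter ("Sure! Here is the result: ...") is ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "answer" not in data or "confidence" not in data:
        return None
    try:
        confidence = float(data["confidence"])
    except (TypeError, ValueError):
        return None
    # Clamp to [0, 1] so out-of-range values cannot break the binning below.
    confidence = min(1.0, max(0.0, confidence))
    return str(data["answer"]).strip(), confidence


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, the same two functions could be run once on baseline replies and once on CoT replies, giving one ECE value per prompt style to compare; a lower value means stated confidence tracks accuracy more closely.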