STRUCTURED REASONING FOR LLMS: A UNIFIED FRAMEWORK FOR EFFICIENCY AND EXPLAINABILITY


Abstract


Recent Large Language Models (LLMs) have made remarkable progress, but they still struggle with complex reasoning tasks such as logical deduction and planning. This is partly because they rely primarily on token-level probability relationships, which limits their ability to reason effectively. In this paper, inspired by cognitive science and neurosymbolic AI, we introduce Structured Reasoning, which aims to enhance the reasoning capabilities of LLMs at the step level. To this end, we first collect high-frequency, domain-agnostic reasoning step tags and construct a structured reasoning dataset with those tags. We then treat a reasoning process as a directed graph, where vertices represent steps and edges indicate the direction of reasoning. In this view, an efficient reasoning process corresponds to a sparse reasoning graph. 1) To construct such a graph, we propose MAX-Flow, which maximizes step-to-step attention flow while minimizing the number of edges; the quality of a sparse reasoning graph is reflected by the total flow from all steps to the final answer. 2) To improve the graph, we propose LCS (Longest Common Sequence), which selects reliable reasoning paths by identifying optimal common subsequences (consecutive steps) shared across multiple generated responses (sequences). Experiments with DeepSeek-R1-Distill-Qwen-1.5B and 7B models show that our method consistently outperforms GRPO and other carefully tuned baselines across various context lengths (0.5k–8k). Structured Reasoning is particularly strong in efficiency (better performance with fewer steps) and stability (consistently generating high-quality outputs across a temperature range of 0.1 to 1.0).

Method


Illustration of our three-stage pipeline for enhancing LLMs with Structured Reasoning. (1) Data Collection: extract structured reasoning labels from unstructured LLM responses, producing outputs with explicit Question, Verify, and Answer components. (2) Step Dependency Computation: compute step-to-step attention matrices to reveal step dependencies and construct a directed graph. (3) Structure-Aware Optimization: apply the Max-Flow algorithm, which captures reasoning step dependencies significantly more accurately than perplexity, and the LCS algorithm, which improves reasoning quality by identifying optimal common subsequences across multiple generated responses and using these consistent steps as reliable reasoning paths.
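Stage (2) can be sketched as pooling a token-level attention map into a step-level matrix. This is a minimal illustration, assuming token spans per step are known; the function name and the choice of mean pooling are our assumptions, not the released implementation.

```python
import numpy as np

def step_attention_matrix(token_attn, step_spans):
    """Aggregate a token-level attention map into a step-level matrix.

    token_attn: (T, T) array, attention from each token (row) to each token (col).
    step_spans: list of (start, end) token index ranges, one per reasoning step.
    Returns an (S, S) matrix whose entry [i, j] is the mean attention that
    tokens of step i pay to tokens of step j.
    """
    S = len(step_spans)
    M = np.zeros((S, S))
    for i, (qs, qe) in enumerate(step_spans):
        for j, (ks, ke) in enumerate(step_spans):
            M[i, j] = token_attn[qs:qe, ks:ke].mean()
    return M
```

Thresholding or sparsifying this matrix then yields the directed reasoning graph, with steps as vertices and sufficiently strong step-to-step attention as edges.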

Performance


We evaluate all models across six mathematics-focused benchmark datasets and three out-of-domain datasets (reading, legal, and massive multitask) to demonstrate the effectiveness of MaxFlow and LCS. Our proposed structure-aware optimization methods consistently outperform the other baselines for 1.5B models. Notably, MaxFlow with a 4k training length achieves a significant average improvement over GRPO and surpasses DeepScaleR-1.5B-Preview, which was trained with a maximum 24k length and evaluated with a 32k length. For 7B models, LCS performs best under the 4k maximum length, while MaxFlow outperforms the baselines by a large margin across the entire length range.
We observe contrasting behaviors between baseline and structured reasoning models across temperature variations. Baseline DeepSeek-R1-Distill models exhibit significant temperature sensitivity, with performance improving substantially as temperature increases from 0.1 to 0.9. For example, the 1.5B baseline shows accuracy gains from 77.47 to 82.33 on MATH500 when temperature rises. This suggests that baseline models rely heavily on sampling diversity to achieve better performance. In contrast, our MaxFlow method maintains consistent performance across all temperature settings, achieving the lowest variance: ±0.53 on MATH500 and ±0.29 on OlympiadBench for the 1.5B model. This temperature robustness indicates that structured reasoning frameworks produce inherently stable outputs without requiring specific sampling parameters, making them more reliable.
Through IISR experiments, we found that as more reasoning steps were removed, our proposed step-matrix-based methods (top-k, top-p, and max-flow) significantly outperformed random removal. In our comparison with perplexity-based algorithms, removing the steps with the lowest PPL (PPL Bottom) performed comparably to (though slightly worse than) our methods on redundant but harmless information, since such information typically has low information content and low perplexity. Interestingly, for logically confusing interference, removing the steps with the highest PPL (PPL Top) performed slightly better, because steps appearing in inappropriate positions sharply increase perplexity. This shows that PPL primarily reflects information quantity and cannot distinguish valuable reasoning from disruptive content; our step-matrix-based methods outperformed PPL-based approaches.
For the IISR (Interference Injection and Selective Removal) experiment, where we randomly inject N interference steps into an M-step reasoning process, the Error Filtering Efficiency is calculated as EFE = 1 - (RetainedIrrelevantSteps / IrrelevantSteps), where IrrelevantSteps is the total number of interference steps injected (N) and RetainedIrrelevantSteps is the number of interference steps incorrectly retained after filtering. EFE measures the algorithm's ability to identify and remove irrelevant steps: a value of 1.0 indicates perfect filtering (all interference steps removed), while 0.0 indicates no filtering capability.
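The EFE definition above is straightforward to compute when steps carry identifiers. A minimal sketch, assuming injected interference steps and retained steps are tracked as ID sets (the function name is ours):

```python
def error_filtering_efficiency(injected, retained):
    """EFE = 1 - RetainedIrrelevantSteps / IrrelevantSteps.

    injected: set of IDs of the N interference steps injected into the trace.
    retained: set of IDs of the steps kept by the filtering algorithm.
    """
    if not injected:
        raise ValueError("no interference steps were injected")
    # Interference steps that the filter failed to remove.
    retained_irrelevant = len(injected & retained)
    return 1.0 - retained_irrelevant / len(injected)
```

For example, if 4 interference steps are injected and the filter keeps 1 of them, EFE = 1 - 1/4 = 0.75; keeping none gives the perfect score of 1.0.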

Exploration


Across 70 samples from the 1.5B and 7B models, with our step attention matrix thresholded at 0.1, layer 0 attends to an average of 6.82 reasoning steps, while layer 1 attends to only 1.41. This broad-versus-local alternation repeats through approximately layers 0–13, suggesting an early division of labor between (i) layers that aggregate multi-step context and (ii) layers that perform local refinement anchored to the immediately preceding step. Beginning around layer 14, all subsequent layers attend to more than 8 steps (peaking at 12.06), marking a transition to a stable broad-span integration regime that more faithfully ranks step importance. The same qualitative pattern appears in both the 1.5B and 7B models: early oscillatory specialization followed by sustained global integration in the mid-to-late layers. The 7B model shows a smoother (less jagged) broadening trajectory, whereas the 1.5B model preserves sharper alternating contrasts before converging. These consistent cross-scale dynamics imply that (1) the broad-span mid-to-late blocks encode globally consolidating reasoning signals, and (2) pruning or distillation strategies could target redundant narrow-focus early layers or alternating pairs while preserving (or selectively enhancing) the globally integrative mid-to-late region.
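The per-layer span statistic above can be sketched as counting, for each layer's step attention matrix, how many steps each step attends to above the threshold and averaging over steps. The exact averaging scheme is our assumption about how the reported numbers are computed:

```python
import numpy as np

def mean_attended_steps(step_attn, threshold=0.1):
    """Average number of steps each step attends to above `threshold`.

    step_attn: (S, S) step attention matrix for one layer.
    """
    M = np.asarray(step_attn)
    return float((M >= threshold).sum(axis=1).mean())

def attended_steps_per_layer(layer_mats, threshold=0.1):
    """Apply the span statistic to one matrix per layer."""
    return [mean_attended_steps(M, threshold) for M in layer_mats]
```

Plotting this list against the layer index would reproduce the broad-versus-local alternation in early layers and the broad-span plateau from the mid layers onward.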

Structured Reasoning Analyzer


Analysis Tools

We show how the MaxFlow method treats the reasoning process as a directed graph, then analyzes and visualizes it. You can change how many reasoning steps are kept by tuning three methods that use the step attention matrix (from structured reasoning models with 14, 16, 19, and 23 layers): MaxFlow, TopP, and TopK. This lets you see, from an LLM’s point of view, which steps matter most for the final answer.

MaxFlow: Based on the “maximum flow – minimum cut” idea. It checks how much the attention flow to the answer drops when we remove each reasoning step. A bigger drop means the step is more important.
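The flow-drop idea can be sketched with a standard Edmonds–Karp maximum-flow routine: score each step by how much the question-to-answer flow decreases when that step is deleted from the attention graph. This is an illustrative sketch, not the released implementation; the graph encoding (dict of dicts with attention weights as capacities) and function names are our assumptions.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity is {u: {v: weight}}."""
    res = defaultdict(dict)                      # residual capacities
    for u in capacity:
        for v, c in capacity[u].items():
            res[u][v] = res[u].get(v, 0.0) + c
            res[v].setdefault(u, 0.0)            # reverse edge for residuals
    flow = 0.0
    while True:
        parent = {source: None}                  # BFS for an augmenting path
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow
        path, v = [], sink                       # reconstruct the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                        # push flow along the path
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

def step_importance(capacity, source, sink, steps):
    """Importance of each step = drop in source-to-sink flow when it is removed."""
    base = max_flow(capacity, source, sink)
    scores = {}
    for s in steps:
        pruned = {u: {v: c for v, c in nbrs.items() if v != s}
                  for u, nbrs in capacity.items() if u != s}
        scores[s] = base - max_flow(pruned, source, sink)
    return scores
```

On a toy graph where the question Q reaches the answer A through steps S1 (capacity 1.0) and S2 (capacity 0.5), removing S1 costs 1.0 unit of flow and removing S2 costs 0.5, so S1 is ranked as the more important step.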

TopK: Start from the answer node and move backward. At each step, keep the K previous steps with the highest attention (greedy). Repeat until you reach the question.

TopP: Start from the answer and move backward. Sort previous steps by attention, then keep steps until their total probability reaches P. Repeat until you reach the question. This is similar to top‑p sampling in LLMs.
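The two backward-selection heuristics above can be sketched over a step attention matrix in which row s holds the attention step s pays to earlier steps, with step 0 the question and the last step the answer. The function names and the normalization inside TopP are our assumptions:

```python
import numpy as np

def select_topk(attn, k):
    """Walk backward from the answer; at each visited step greedily keep the
    k predecessors with the highest attention, until reaching the question."""
    S = attn.shape[0]
    keep, frontier = {0, S - 1}, [S - 1]
    while frontier:
        nxt = []
        for s in frontier:
            preds = np.argsort(attn[s, :s])[::-1][:k]   # strongest predecessors
            for p in preds:
                if int(p) not in keep:
                    keep.add(int(p)); nxt.append(int(p))
        frontier = nxt
    return sorted(keep)

def select_topp(attn, p):
    """Same backward walk, but keep the smallest set of predecessors whose
    normalized attention mass reaches p (cf. nucleus sampling)."""
    S = attn.shape[0]
    keep, frontier = {0, S - 1}, [S - 1]
    while frontier:
        nxt = []
        for s in frontier:
            w = attn[s, :s]
            if w.sum() == 0:
                continue
            order = np.argsort(w)[::-1]
            mass = np.cumsum(w[order]) / w.sum()
            chosen = order[: int(np.searchsorted(mass, p)) + 1]
            for q in chosen:
                if int(q) not in keep:
                    keep.add(int(q)); nxt.append(int(q))
        frontier = nxt
    return sorted(keep)
```

Both walks terminate at the question because step 0 has no predecessors; the returned index sets are the reasoning steps the method would keep.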


Code (Coming soon)


Repository

All code for reproducing our Structured Reasoning pipeline (data extraction, step dependency computation, and structure-aware optimization) will be released here:

GitHub (coming soon)


Checkpoints

Pretrained and fine-tuned checkpoints will be hosted here for direct download:


Datasets

We will release both raw prompts and structured annotations (Question, Verify, Answer), along with step dependency graphs.

Hugging Face datasets page (coming soon)


Note: Links and artifacts are placeholders and will be updated once uploads are complete.