TOKYO, Oct. 1, 2025 /PRNewswire/ — As generative AI use continues to increase, accuracy has become the most important metric and a key factor in decisions around adoption and utilization. APTO is committed to supporting companies and organizations through high-quality AI data.
In recent years, the performance of large language models (LLMs) has improved dramatically. Challenges persist, however, in mathematical tasks that require multi-step calculation or strictly formatted answers, where calculation errors and formatting issues remain common. In response, APTO has released a training dataset specifically designed to improve reasoning and answer accuracy in these mathematical contexts.
Understanding the LLM Dataset for Mathematical Reasoning
Most LLM developers and users have encountered recurring challenges when working with mathematical queries:
- The model often skips step-by-step calculations.
- Accuracy is frequently compromised, with wrong answers originating from flawed solution processes.
- Answers may not conform to the required format, leading to further inaccuracies.
- Critical problem-solving steps may be omitted, leaving only a final answer with no working shown.
These issues show that even advanced models can stumble on complex mathematics. This motivated APTO to draw on its experience in enhancing reasoning abilities to build a robust dataset focused on intricate mathematical problems.
An Overview of the Dataset
The newly released dataset consists of mathematical reasoning data in JSON Lines format. Created through a combination of machine generation and human verification, it is intended for training Process Reward Models (PRMs). Each record includes not only the problem statement and answer but also the reasoning process (Chain-of-Thought) and evaluation information for each step, enabling a qualitative assessment of the reasoning rather than a binary right-or-wrong determination.
Contents of the Dataset
- Problem: The mathematical problem to be solved.
- Expected_answer: The reference answer, used for automated grading and format verification.
- Generated_answer: The model-generated solution, useful for analyzing error patterns.
- Answer_match: Whether the generated answer matches the expected one, which assists in difficulty adjustment and sampling control.
- Step evaluations: Per-step records of {step_index, step_text, verdict} that enable fine-grained supervision.
- Metadata for step evaluations: Classifies each attempt as all_correct, partial_correct, or another relevant category.
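For concreteness, here is a hypothetical record illustrating this schema. The problem, values, and lowercase field casing are invented for illustration; the record is also pretty-printed here, whereas in the actual JSON Lines file each record occupies a single line.

```json
{
  "problem": "A rectangle has perimeter 56 and area 180. Find the positive difference between its side lengths.",
  "expected_answer": "8",
  "generated_answer": "8",
  "answer_match": true,
  "step_evaluations": [
    {"step_index": 1, "step_text": "Let the sides be a and b, so a + b = 28 and ab = 180.", "verdict": "correct"},
    {"step_index": 2, "step_text": "(a - b)^2 = (a + b)^2 - 4ab = 784 - 720 = 64, so a - b = 8.", "verdict": "correct"}
  ],
  "metadata": {"category": "all_correct"}
}
```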
Illustrative Example of Reasoning Breakdown
In one example, a problem's geometric constraints were misunderstood, leading to flawed calculations and conclusions. The initial logic was sound, but an error emerged later in the reasoning, producing a calculated difference of 300 that should not exist; such an output is categorized as partial_correct.
Types of Issues Addressed
To ensure a comprehensive representation of mathematical reasoning, questions in the dataset are categorized into:
- Calculus
- Algebra
- Geometry
- Probability, Statistics, and Discrete Mathematics
Enhancing the Reasoning Process
The reasoning-process (Chain-of-Thought) structure organizes the step-by-step approach to solving each problem. It lets models follow the logical flow of reading the problem, performing calculations incrementally, and arriving at the final answer. Each problem requires at least two reasoning steps, with some extending to eight. The sketch below shows one way to inspect this structure.
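As a sketch, the snippet below loads the dataset and derives an attempt-level label from the step verdicts. It assumes the Hugging Face datasets library, a train split, and lowercase field names matching the schema above; none of these details are confirmed by APTO.

```python
# Sketch: inspect reasoning-step counts and derive attempt-level labels.
# Assumptions: the dataset has a "train" split and lowercase field names
# matching the schema described above.
from datasets import load_dataset

ds = load_dataset("APTOinc/llm-math-reasoning-dataset", split="train")

for record in ds.select(range(3)):
    steps = record["step_evaluations"]          # at least 2, up to 8 steps
    verdicts = [s["verdict"] == "correct" for s in steps]
    if all(verdicts):
        label = "all_correct"
    elif any(verdicts):
        label = "partial_correct"
    else:
        label = "other"
    print(f"{len(steps)} steps, answer_match={record['answer_match']}, label={label}")
```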
Evaluating Performance Enhancements
In evaluating the efficacy of this dataset, models trained on it were compared against external benchmarks such as the AIME problem sets. The process uses a multitask approach that combines PRM training with causal language modeling:
- Fine-Tuning: Models were fine-tuned on the reasoning data, using the per-step evaluation labels to classify the correctness of generated outputs.
- Performance Comparison: Answer accuracy was measured on the benchmarks before and after training, revealing an average improvement of 10.0 percentage points. A minimal sketch of such a multitask objective follows this list.
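APTO has not published its training code, but a multitask objective of this kind could look like the following minimal sketch, assuming a PyTorch model that emits token logits for the language-modeling task and a two-class logit per reasoning step for the PRM task. All names and the loss weighting are illustrative assumptions.

```python
# Minimal sketch of a multitask objective: causal-LM loss over the
# chain-of-thought text plus per-step verdict classification (PRM).
# Illustrative only; not APTO's actual training code.
import torch
import torch.nn as nn

def multitask_loss(lm_logits: torch.Tensor,      # (batch, seq_len, vocab)
                   lm_labels: torch.Tensor,      # (batch, seq_len); -100 ignored
                   step_logits: torch.Tensor,    # (num_steps, 2)
                   step_verdicts: torch.Tensor,  # (num_steps,); 0 = incorrect, 1 = correct
                   prm_weight: float = 1.0) -> torch.Tensor:
    # Next-token prediction over the reasoning text.
    lm_loss = nn.CrossEntropyLoss(ignore_index=-100)(
        lm_logits.reshape(-1, lm_logits.size(-1)), lm_labels.reshape(-1))
    # Binary correct/incorrect classification for each reasoning step.
    prm_loss = nn.CrossEntropyLoss()(step_logits, step_verdicts)
    return lm_loss + prm_weight * prm_loss
```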
Results from the 2024 and 2025 AIME problem sets showcase the dataset's effectiveness:
| Exam Year | No. of Questions | Pre-Training Accuracy | Post-Training Accuracy | Improvement Margin |
| --- | --- | --- | --- | --- |
| 2024 | 30 | 26.7% | 36.7% | +10.0pt |
| 2025 | 30 | 33.3% | 43.3% | +10.0pt |
In both years, the 10.0-point gain corresponds to three additional questions answered correctly out of 30, reflecting an improved ability to carry complex calculations through without breaking down at intermediate steps.
Availability and Future Developments
The newly established dataset is accessible on Hugging Face:
https://huggingface.co/datasets/APTOinc/llm-math-reasoning-dataset
Existing clients will receive updates through APTO's newsletters.
Looking ahead, APTO recognizes the importance of clearly delineating logical steps in addition to arriving at correct answers, and development of logical-reasoning datasets is ongoing to meet the growing needs of AI technologies and their users.
About APTO
APTO specializes in supporting AI development with an emphasis on data—the primary driver for achieving accuracy. Our services include:
- harBest: A platform for data collection and annotation powered by crowdsourced workers.
- harBest Dataset: Streamlines the data preparation process, typically a bottleneck in early development stages.
- harBest Expert: Incorporates expert knowledge to refine dataset accuracy.

