We’ve trained a system that solves grade school math problems with nearly twice the accuracy of a fine-tuned GPT-3 model. It solves about 90% as many problems as real kids: a small sample of 9-12 year olds scored 60% on a test from our dataset, while our system scored 55% on those same problems. This is important because today’s AI is still quite weak at commonsense multistep reasoning, which is easy even for grade school kids. We achieved these results by training our model to recognize its mistakes, so that it can try repeatedly until it finds a solution that works.
Large language models like GPT-3 have many impressive skills, including their ability to imitate many writing styles and their extensive factual knowledge. However, they struggle with tasks that require accurate multistep reasoning, such as solving grade school math word problems. Although these models can mimic the cadence of correct solutions, they regularly produce critical errors in logic.
To match human performance in complex logical domains, our models must learn to recognize their mistakes and to choose their steps carefully. To that end, we train verifiers to evaluate whether or not a proposed solution is correct. To solve a new problem, we use verifiers to select the best among many proposed solutions. We collected the new GSM8K dataset to evaluate our methods, and we are releasing this dataset to facilitate research.
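The selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not our actual implementation: `select_best` and the `Verifier` type are hypothetical names, and `toy_verifier` is a hand-written stand-in that merely checks the arithmetic in each written step, whereas a real verifier is a trained model that scores a solution's likelihood of being correct.

```python
import operator
import re
from typing import Callable, List

# Hypothetical interface: a verifier maps (problem, solution) to a score in [0, 1].
Verifier = Callable[[str, str], float]

def select_best(problem: str, candidates: List[str], verifier: Verifier) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda solution: verifier(problem, solution))

# Matches written steps such as "15 * 2 = 30".
_STEP = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)
_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}

def toy_verifier(problem: str, solution: str) -> float:
    """Toy stand-in, NOT a trained model: the fraction of 'a op b = c'
    steps in the solution whose arithmetic is correct."""
    steps = _STEP.findall(solution)
    if not steps:
        return 0.0
    correct = 0
    for a, op, b, c in steps:
        try:
            value = _OPS[op](float(a), float(b))
        except ZeroDivisionError:
            continue  # a division by zero counts as an incorrect step
        if abs(value - float(c)) < 1e-9:
            correct += 1
    return correct / len(steps)

problem = ("Tim grows 5 trees. Each year he collects 6 lemons from each tree. "
           "How many lemons does he get in a decade?")
candidates = [
    "He gets 5 + 6 = 11 lemons per year. So he gets 11 * 10 = 100 lemons.",
    "He gets 5 * 6 = 30 lemons per year. So he gets 30 * 10 = 300 lemons.",
]
best = select_best(problem, candidates, toy_verifier)
```

Here `best` is the second candidate, because both of its steps check out while the first candidate contains an arithmetic error. In practice the candidates would be sampled from the fine-tuned generator rather than written by hand.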
In the ten examples below, we show solutions generated by our new method, verification, and our baseline method, fine-tuning.
Ali is a dean of a private school where he teaches one class. John is also a dean of a public school. John has two classes in his school. Each class has 1/8 the capacity of Ali’s class which has the capacity of 120 students. What is the combined capacity of both schools?

Tim grows 5 trees. Each year he collects 6 lemons from each tree. How many lemons does he get in a decade?

Anthony had 50 pencils. He gave 1/2 of his pencils to Brandon, and he gave 3/5 of the remaining pencils to Charlie. He kept the remaining pencils. How many pencils did Anthony keep?

John drives to his friend’s house 200 miles away. He drives at a speed of 70 mph. He had to take a detour that added 10 miles to his trip. After he gets there he takes a route home that is 240 miles but he goes 80 mph. How long did the trip take?

Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?

Katherine has 5 hanging baskets to fill. In each basket she wants to add 3 petunias and 2 sweet potato vines. The petunias cost $3.00 apiece and the sweet potato vines cost $2.50 apiece. How much will she spend filling all 5 hanging baskets?

Debra is monitoring a beehive to see how many bees come and go in a day. She sees 30 bees leave the hive in the first 6 hours, and then she sees 1/2 that many bees return in the next 6 hours. She sees two times as many bees as she saw first leave the hive fly from the hive and leave in the next 6 hours. Then every bee that left before that hadn’t already returned returns to the hive in the next 6 hours. How many bees did Debra see return to the hive in the last 6 hours of the day?

Timothy’s locker is 24 cubic inches. Zack’s locker is half as big as Timothy’s locker. Peter’s locker is 1/4 as big as Zack’s locker. How big is Peter’s locker in cubic inches?

Amalia, Megan, and Dior divided the home chores so that each person had something to do while the others were working. Amalia’s work was to mow the lawn, which took her 4 hours. Megan had to walk the dog and this took her 2 hours longer than Amalia to complete her chore. Dior’s work was to do laundry and she took well over 4 hours longer than the time Amalia took to mow the lawn. Calculate the total time they all took to do their chores altogether.

John drinks a bottle of water every half hour. A normal sudoku puzzle takes him 45 minutes. An extreme sudoku takes 4 times that long. How many bottles of water does he drink in that time?
Method: 175B Verification
Ali’s class has a capacity of 120 students.
Each of John’s classes has a capacity of 120/8 = 15 students.
The total capacity of John’s two classes is 15 students * 2 classes = 30 students.
The combined capacity of the two schools is 120 students + 30 students = 150 students.
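As a quick sanity check, the arithmetic in this solution can be reproduced directly. The variable names below are illustrative only and come from the problem statement, not from any released code.

```python
# Recompute each step of the solution to the school-capacity problem.
ali_class_capacity = 120                         # given in the problem
john_class_capacity = ali_class_capacity // 8    # each class is 1/8 of Ali's: 15
john_total_capacity = 2 * john_class_capacity    # John has two classes: 30
combined_capacity = ali_class_capacity + john_total_capacity

print(combined_capacity)  # 150
```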