A high schooler built a website that lets you challenge AI models to a Minecraft build-off

Last updated: March 20, 2025 4:11 pm
Oliver James

As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that’s Minecraft, the Microsoft-owned sandbox-building game.

The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges, in which each model responds to the same prompt with a Minecraft creation. Users vote on which model did a better job, and only after voting can they see which AI made each build.
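To make the blind-voting idea concrete, here is a minimal sketch in Python. It is not MC-Bench’s code: the `Matchup` class, the `make_matchup` helper, and the "A"/"B" labels are all hypothetical names chosen for illustration. It only shows the general pattern of hiding model identities until a vote has been cast.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Matchup:
    """One blind head-to-head comparison between two builds (hypothetical)."""
    prompt: str
    # Maps the anonymous labels shown to the voter ("A", "B") to model names.
    _hidden: dict = field(repr=False)
    vote: Optional[str] = None

    def cast_vote(self, label: str) -> None:
        """Record the voter's choice; only one vote per matchup is allowed."""
        if label not in self._hidden:
            raise ValueError(f"unknown label: {label!r}")
        if self.vote is not None:
            raise RuntimeError("vote already cast")
        self.vote = label

    def reveal(self) -> dict:
        """Return label -> model name, but only once a vote has been cast."""
        if self.vote is None:
            raise PermissionError("vote first, then reveal")
        return dict(self._hidden)


def make_matchup(prompt, model_a, model_b, rng=random):
    """Shuffle the two models so position gives no hint about identity."""
    models = [model_a, model_b]
    rng.shuffle(models)
    return Matchup(prompt=prompt, _hidden={"A": models[0], "B": models[1]})
```

Calling `reveal()` before `cast_vote()` raises an error in this sketch, which mirrors the rule that voters learn who built what only after they have chosen.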

Image Credits: Minecraft Benchmark

For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft lies less in the game itself than in the familiarity people have with it. It is, after all, the best-selling video game of all time. Even people who haven’t played the game can still judge which blocky representation of a pineapple is better realized.

“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run benchmark prompts, per MC-Bench’s website, but the companies are not otherwise affiliated.

“Currently we are just doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”

Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they’re trained, models are naturally gifted at certain, narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

Put simply, it’s hard to glean what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT, but cannot discern how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
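For a rough sense of what "writing code to create a build" can mean, the sketch below generates block coordinates for a simple snowman in Python. The article does not describe the interface MC-Bench gives to models, so everything here is an assumption for illustration: the `sphere` and `build_snowman` helpers, the dictionary of `(x, y, z)` coordinates, and the block names used as plain strings.

```python
def sphere(center, radius, block):
    """Return {(x, y, z): block} for every grid cell within `radius` of `center`."""
    cx, cy, cz = center
    cells = {}
    for x in range(cx - radius, cx + radius + 1):
        for y in range(cy - radius, cy + radius + 1):
            for z in range(cz - radius, cz + radius + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius ** 2:
                    cells[(x, y, z)] = block
    return cells


def build_snowman():
    """Stack three shrinking snow spheres and add a carved-pumpkin face."""
    blocks = {}
    blocks.update(sphere((0, 3, 0), 3, "snow_block"))   # base, y spans 0..6
    blocks.update(sphere((0, 8, 0), 2, "snow_block"))   # torso, y spans 6..10
    blocks.update(sphere((0, 11, 0), 1, "snow_block"))  # head, y spans 10..12
    blocks[(0, 11, 1)] = "carved_pumpkin"               # face on the front of the head
    return blocks
```

A voter never needs to read code like this. They only see the rendered result, which is the point the benchmark relies on.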

But it’s easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal and thus the potential to collect more data about which models consistently score better.
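One simple way to turn such votes into a ranking is a per-model win rate, sketched below in Python. The article does not say how MC-Bench computes its leaderboard, so treat this as a stand-in rather than the project’s method. The `win_rates` and `leaderboard` functions and the `(winner, loser)` vote format are assumptions made for the example.

```python
from collections import defaultdict


def win_rates(votes):
    """Given (winner, loser) model-name pairs, return model -> fraction of matchups won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def leaderboard(votes):
    """Sort models from highest to lowest win rate."""
    rates = win_rates(votes)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

With more voters the per-model estimates become less noisy, which is why a benchmark that anyone can judge at a glance has an advantage in gathering data.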

Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they’re a strong signal, though.

“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they’re heading in the right direction.”
