{"id":306,"date":"2026-02-25T09:29:22","date_gmt":"2026-02-25T17:29:22","guid":{"rendered":"https:\/\/scienceblog.com\/neuroedge\/?p=306"},"modified":"2026-02-25T09:29:22","modified_gmt":"2026-02-25T17:29:22","slug":"ai-scores-3-on-the-hardest-test-humans-could-write","status":"publish","type":"post","link":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/","title":{"rendered":"AI Scores 3% on the Hardest Test Humans Could Write"},"content":{"rendered":"<p>Somewhere in the archive of Humanity&#8217;s Last Exam sits a question about a Roman tombstone. Not the Latin inscription \u2014 that would be too easy \u2014 but the Palmyrene script running alongside it, a language spoken in ancient Syria and dead for seventeen centuries. The question was written by a classicist at Oxford, tested against every major AI system available, and passed into the dataset only after all of them failed. It is one of 2,500 such questions. Together they constitute the most rigorous attempt yet to find the ceiling of what artificial intelligence actually knows.<\/p>\n<p>The need for such an exercise became urgent because the ceiling of existing tests had long since been scraped. Models from OpenAI, Google, Anthropic and others now exceed 90 per cent accuracy on MMLU \u2014 the Massive Multitask Language Understanding benchmark that was, only a few years ago, considered a meaningful measure of machine intelligence. A test that most PhD students would struggle with has become, for frontier AI, something close to routine.<\/p>\n<p>So a global consortium \u2014 nearly 1,000 subject-matter experts affiliated with more than 500 institutions across 50 countries \u2014 spent months designing questions that might actually matter. The result, Humanity&#8217;s Last Exam (HLE), was published in Nature in January and covers mathematics, biology, linguistics, chemistry, history, computer science and much else besides. 
Its questions require not internet retrieval but genuine reasoning: how many paired tendons are supported by a specific sesamoid bone unique to hummingbirds? Which class of graphs satisfies a particular convergence property in Markov chains? The question-setters were mostly professors and graduate researchers, each one working in territory AI couldn&#8217;t easily follow. Questions the models could answer were cut.<\/p>\n<p>The filtering process alone tells you something. More than 70,000 AI attempts were logged during question development; roughly 13,000 stumped the models sufficiently to proceed to human expert review. Of those, 2,500 survived to become HLE. Each surviving question had to have a single unambiguous answer, verifiable by a domain expert and resistant to web search.<\/p>\n<p>When the benchmark was published and the frontier models were finally tested against it properly, the scores were not encouraging \u2014 if you were hoping for superintelligence. GPT-4o managed 2.7 per cent. Claude 3.5 Sonnet reached 4.1 per cent. OpenAI&#8217;s reasoning-specialist o1, the system explicitly designed to think harder before answering, achieved 8 per cent. More recent models have done better: GPT-5, released after HLE was made public, scored around 25 per cent. But the benchmark was engineered to resist saturation. Even a quarter correct leaves three-quarters wrong.<\/p>\n<p>What makes the results especially telling isn&#8217;t just the accuracy numbers. It&#8217;s the calibration. When a model is wrong, does it know it&#8217;s probably wrong? Well-calibrated systems should hedge on hard questions, expressing lower confidence when they&#8217;re guessing. HLE found the opposite: most frontier models exhibited calibration errors above 70 per cent, meaning their stated confidence ran far ahead of their actual accuracy. They were consistently wrong in ways they didn&#8217;t recognise. Confident and mistaken is a more alarming failure mode than uncertain and mistaken.<\/p>\n<p>There is one more curious finding in the data. 
Reasoning models \u2014 those designed to generate extended chains of thought before committing to an answer \u2014 do improve with more thinking, up to a point. Feed them more tokens to reason with and accuracy climbs on a roughly log-linear curve. But beyond about 16,000 reasoning tokens, the trend reverses. More deliberation starts to hurt. Why this happens isn&#8217;t yet understood, but it suggests that simply scaling up compute at inference time isn&#8217;t a path to expert-level knowledge.<\/p>\n<p>Tung Nguyen, an instructional associate professor in computer science and engineering at Texas A&amp;M University, contributed more questions to HLE&#8217;s mathematics and computer science sections than almost anyone else \u2014 73 in total, the second-highest count among nearly 1,000 contributors. He is cautious about what the results mean. &#8220;When AI systems start performing extremely well on human benchmarks, it&#8217;s tempting to think they&#8217;re approaching human-level understanding,&#8221; he said. &#8220;But HLE reminds us that intelligence isn&#8217;t just about pattern recognition \u2014 it&#8217;s about depth, context and specialized expertise.&#8221;<\/p>\n<p>He is equally cautious about the practical stakes. Without rigorous benchmarks, he argues, the risks multiply. &#8220;Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do. Benchmarks provide the foundation for measuring progress and identifying risks.&#8221;<\/p>\n<p>The name Humanity&#8217;s Last Exam invites a particular kind of reading \u2014 the final test before the machines win, the last line of human intellectual defence. Nguyen pushes back on this. &#8220;This isn&#8217;t a race against AI. It&#8217;s a method for understanding where these systems are strong and where they struggle. 
That understanding helps us build safer, more reliable technologies.&#8221; The diversity of question-setters was itself the point: historians, physicists, linguists and medical researchers all probing different corners of knowledge, precisely because different corners catch different failures. &#8220;Perhaps ironically,&#8221; Nguyen said, &#8220;it&#8217;s humans working together&#8221; that exposes the gaps.<\/p>\n<p>There is a $500,000 prize pool attached to the effort, with top-ranked questions earning $5,000 each \u2014 a signal of how seriously the organisers took the quality problem. The questions keep coming in. And the models keep improving, which means HLE-Rolling, a dynamic version of the dataset, will update as frontier performance nudges upward. &#8220;For now,&#8221; Nguyen said, &#8220;Humanity&#8217;s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence \u2014 and despite rapid technological advances, it remains wide.&#8221;<\/p>\n<p>Study link: <a href=\"https:\/\/www.nature.com\/articles\/s41586-025-09962-4\">https:\/\/www.nature.com\/articles\/s41586-025-09962-4<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Somewhere in the archive of Humanity&#8217;s Last Exam sits a question about a Roman tombstone. Not the Latin inscription \u2014 that would be too easy \u2014 but the Palmyrene script running alongside it, a language spoken in ancient Syria and dead for seventeen centuries. 
The question was written by a classicist at Oxford, tested against &#8230; <a title=\"AI Scores 3% on the Hardest Test Humans Could Write\" class=\"read-more\" href=\"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/\" aria-label=\"Read more about AI Scores 3% on the Hardest Test Humans Could Write\">Read more<\/a><\/p>\n","protected":false},"author":1297,"featured_media":307,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[4,10,9,6],"tags":[],"class_list":["post-306","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computational-innovation","category-ethics","category-society","category-technology","generate-columns","tablet-grid-50","mobile-grid-100","grid-parent","grid-50"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.6 (Yoast SEO v27.6) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>AI Scores 3% on the Hardest Test Humans Could Write - NeuroEdge<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"AI Scores 3% on the Hardest Test Humans Could Write\" \/>\n<meta property=\"og:description\" content=\"Somewhere in the archive of Humanity&#8217;s Last Exam sits a question about a Roman tombstone. 
Not the Latin inscription \u2014 that would be too easy \u2014 but the Palmyrene script running alongside it, a language spoken in ancient Syria and dead for seventeen centuries. The question was written by a classicist at Oxford, tested against ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/\" \/>\n<meta property=\"og:site_name\" content=\"NeuroEdge\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-25T17:29:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"900\" \/>\n\t<meta property=\"og:image:height\" content=\"506\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"NeuroEdge\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"NeuroEdge\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/\"},\"author\":{\"name\":\"NeuroEdge\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#\\\/schema\\\/person\\\/a13c664778e7eb97cb71e3e1ad356d2e\"},\"headline\":\"AI Scores 3% on the Hardest Test Humans Could Write\",\"datePublished\":\"2026-02-25T17:29:22+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/\"},\"wordCount\":905,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2026\\\/02\\\/man-vs-robot-1408x792-1.jpg\",\"articleSection\":[\"Computational 
Innovation\",\"Ethics\",\"Society\",\"Technology\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#respond\"]}],\"copyrightYear\":\"2026\",\"copyrightHolder\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/#organization\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/\",\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/\",\"name\":\"AI Scores 3% on the Hardest Test Humans Could Write - NeuroEdge\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2026\\\/02\\\/man-vs-robot-1408x792-1.jpg\",\"datePublished\":\"2026-02-25T17:29:22+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#primaryimage\",\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroe
dge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2026\\\/02\\\/man-vs-robot-1408x792-1.jpg\",\"contentUrl\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2026\\\/02\\\/man-vs-robot-1408x792-1.jpg\",\"width\":900,\"height\":506,\"caption\":\"Despite its apocalyptic name, Humanity\u2019s Last Exam isn\u2019t meant to suggest the end of human relevance. Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/2026\\\/02\\\/25\\\/ai-scores-3-on-the-hardest-test-humans-could-write\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI Scores 3% on the Hardest Test Humans Could Write\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#website\",\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/\",\"name\":\"NeuroEdge\",\"description\":\"A data-driven look at neuroscience and AI, for investors, policymakers, and 
innovators.\",\"publisher\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#organization\",\"name\":\"NeuroEdge\",\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2025\\\/04\\\/cropped-neuroedge_logo.jpg\",\"contentUrl\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/wp-content\\\/uploads\\\/sites\\\/14\\\/2025\\\/04\\\/cropped-neuroedge_logo.jpg\",\"width\":955,\"height\":191,\"caption\":\"NeuroEdge\"},\"image\":{\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/#\\\/schema\\\/person\\\/a13c664778e7eb97cb71e3e1ad356d2e\",\"name\":\"NeuroEdge\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe634bbabec?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe634bbabec?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe634bbabec?s=96&d=mm&r=g\",\"caption\":\"NeuroEdge\"},\"url\":\"https:\\\/\\\/scienceblog.com\\\/neuroedge\\\/author\\\/neuroedge\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium 
plugin. -->","yoast_head_json":{"title":"AI Scores 3% on the Hardest Test Humans Could Write - NeuroEdge","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/","og_locale":"en_US","og_type":"article","og_title":"AI Scores 3% on the Hardest Test Humans Could Write","og_description":"Somewhere in the archive of Humanity&#8217;s Last Exam sits a question about a Roman tombstone. Not the Latin inscription \u2014 that would be too easy \u2014 but the Palmyrene script running alongside it, a language spoken in ancient Syria and dead for seventeen centuries. The question was written by a classicist at Oxford, tested against ... Read more","og_url":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/","og_site_name":"NeuroEdge","article_published_time":"2026-02-25T17:29:22+00:00","og_image":[{"width":900,"height":506,"url":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","type":"image\/jpeg"}],"author":"NeuroEdge","twitter_card":"summary_large_image","twitter_misc":{"Written by":"NeuroEdge","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#article","isPartOf":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/"},"author":{"name":"NeuroEdge","@id":"https:\/\/scienceblog.com\/neuroedge\/#\/schema\/person\/a13c664778e7eb97cb71e3e1ad356d2e"},"headline":"AI Scores 3% on the Hardest Test Humans Could Write","datePublished":"2026-02-25T17:29:22+00:00","mainEntityOfPage":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/"},"wordCount":905,"commentCount":0,"publisher":{"@id":"https:\/\/scienceblog.com\/neuroedge\/#organization"},"image":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#primaryimage"},"thumbnailUrl":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","articleSection":["Computational Innovation","Ethics","Society","Technology"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#respond"]}],"copyrightYear":"2026","copyrightHolder":{"@id":"https:\/\/scienceblog.com\/#organization"}},{"@type":"WebPage","@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/","url":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/","name":"AI Scores 3% on the Hardest Test Humans Could Write - 
NeuroEdge","isPartOf":{"@id":"https:\/\/scienceblog.com\/neuroedge\/#website"},"primaryImageOfPage":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#primaryimage"},"image":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#primaryimage"},"thumbnailUrl":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","datePublished":"2026-02-25T17:29:22+00:00","breadcrumb":{"@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#primaryimage","url":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","contentUrl":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","width":900,"height":506,"caption":"Despite its apocalyptic name, Humanity\u2019s Last Exam isn\u2019t meant to suggest the end of human relevance. 
Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go."},{"@type":"BreadcrumbList","@id":"https:\/\/scienceblog.com\/neuroedge\/2026\/02\/25\/ai-scores-3-on-the-hardest-test-humans-could-write\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/scienceblog.com\/neuroedge\/"},{"@type":"ListItem","position":2,"name":"AI Scores 3% on the Hardest Test Humans Could Write"}]},{"@type":"WebSite","@id":"https:\/\/scienceblog.com\/neuroedge\/#website","url":"https:\/\/scienceblog.com\/neuroedge\/","name":"NeuroEdge","description":"A data-driven look at neuroscience and AI, for investors, policymakers, and innovators.","publisher":{"@id":"https:\/\/scienceblog.com\/neuroedge\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/scienceblog.com\/neuroedge\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/scienceblog.com\/neuroedge\/#organization","name":"NeuroEdge","url":"https:\/\/scienceblog.com\/neuroedge\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/scienceblog.com\/neuroedge\/#\/schema\/logo\/image\/","url":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/cropped-neuroedge_logo.jpg","contentUrl":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/cropped-neuroedge_logo.jpg","width":955,"height":191,"caption":"NeuroEdge"},"image":{"@id":"https:\/\/scienceblog.com\/neuroedge\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/scienceblog.com\/neuroedge\/#\/schema\/person\/a13c664778e7eb97cb71e3e1ad356d2e","name":"NeuroEdge","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe6
34bbabec?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe634bbabec?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28782ec992e8763e1f8d41ddc10864e7d8cd4cb99bacea6224c4abe634bbabec?s=96&d=mm&r=g","caption":"NeuroEdge"},"url":"https:\/\/scienceblog.com\/neuroedge\/author\/neuroedge\/"}]}},"jetpack_featured_media_url":"https:\/\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/02\/man-vs-robot-1408x792-1.jpg","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":291,"url":"https:\/\/scienceblog.com\/neuroedge\/2026\/01\/21\/the-creativity-threshold-when-ai-meets-the-average-mind\/","url_meta":{"origin":306,"position":0},"title":"The Creativity Threshold: When AI Meets the Average Mind","author":"NeuroEdge","date":"January 21, 2026","format":false,"excerpt":"Picture a task so simple it takes four minutes. Generate ten words. That's all. Make them as different from each other as possible, in every way that matters (meaning, usage, the way they sound in the mouth). 
This isn't a test you'd find in an IQ exam or written into\u2026","rel":"","context":"In &quot;Brain Health&quot;","block_context":{"text":"Brain Health","link":"https:\/\/scienceblog.com\/neuroedge\/category\/brain-health\/"},"img":{"alt_text":"abstract AI illustration","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/pexels-googledeepmind-18069158.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/pexels-googledeepmind-18069158.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/pexels-googledeepmind-18069158.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/pexels-googledeepmind-18069158.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":266,"url":"https:\/\/scienceblog.com\/neuroedge\/2025\/11\/21\/why-machines-cant-match-our-wildest-ideas\/","url_meta":{"origin":306,"position":1},"title":"Why Machines Can&#8217;t Match Our Wildest Ideas","author":"NeuroEdge","date":"November 21, 2025","format":false,"excerpt":"Creativity has never been a numbers game, and a new Australian analysis offers a stark reminder of just how far generative AI sits from human imagination. 
In a study grounded in mathematics, researchers show that today\u2019s large language models hit a ceiling long before they reach the ingenuity of society\u2019s\u2026","rel":"","context":"In &quot;Computational Innovation&quot;","block_context":{"text":"Computational Innovation","link":"https:\/\/scienceblog.com\/neuroedge\/category\/computational-innovation\/"},"img":{"alt_text":"person with a painted face","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/11\/pexels-mccutcheon-1209843.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/11\/pexels-mccutcheon-1209843.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/11\/pexels-mccutcheon-1209843.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/11\/pexels-mccutcheon-1209843.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":76,"url":"https:\/\/scienceblog.com\/neuroedge\/2025\/04\/24\/small-ai-models-match-giants-with-sift-algorithm\/","url_meta":{"origin":306,"position":2},"title":"Small AI Models Match Giants With SIFT Algorithm","author":"NeuroEdge","date":"April 24, 2025","format":false,"excerpt":"In a development that could reshape AI efficiency, researchers at ETH Zurich have created an algorithm enabling smaller language models to match the performance of systems 40 times their size, potentially solving one of AI's most persistent challenges. 
The method, called SIFT (Selecting Informative data for Fine-Tuning), targets the fundamental\u2026","rel":"","context":"In &quot;Computational Innovation&quot;","block_context":{"text":"Computational Innovation","link":"https:\/\/scienceblog.com\/neuroedge\/category\/computational-innovation\/"},"img":{"alt_text":"Researchers at ETH Zurich have developed a new algorithm that enhances large language models (LLMs), enabling them to generate more accurate and relevant answers. (Illustration: AI-generated \/ ETH Zurich)","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/colorful-illustration.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/colorful-illustration.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/colorful-illustration.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2025\/04\/colorful-illustration.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":326,"url":"https:\/\/scienceblog.com\/neuroedge\/2026\/05\/15\/teaching-machines-to-listen-to-all-their-sensors-at-once\/","url_meta":{"origin":306,"position":3},"title":"Teaching Machines to Listen to All Their Sensors at Once","author":"NeuroEdge","date":"May 15, 2026","format":false,"excerpt":"Somewhere inside a large manufacturing plant, a turbofan bearing is beginning to fail. It will not announce this clearly. One vibration sensor picks up a faint irregularity in the x-axis; another registers a slight temperature drift; a third is recording torque anomalies that might mean nothing at all. 
Each sensor,\u2026","rel":"","context":"In &quot;Automation &amp; Efficiency&quot;","block_context":{"text":"Automation &amp; Efficiency","link":"https:\/\/scienceblog.com\/neuroedge\/category\/automation-efficiency\/"},"img":{"alt_text":"The new method uses deep neural networks to combine data from multiple sources more effectively. Tests show it outperforms existing approaches on standard benchmarks, with strong potential for use in automation, intelligent control, and data-driven engineering.","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/05\/correlation-diagram.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/05\/correlation-diagram.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/05\/correlation-diagram.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/05\/correlation-diagram.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":323,"url":"https:\/\/scienceblog.com\/neuroedge\/2026\/04\/24\/why-a-slower-ai-might-actually-feel-smarter-to-you\/","url_meta":{"origin":306,"position":4},"title":"Why a Slower AI Might Actually Feel Smarter to You","author":"NeuroEdge","date":"April 24, 2026","format":false,"excerpt":"Type a question into an AI chatbot and hit send. Now watch the cursor blink. Two seconds feels fine, barely noticeable. Nine seconds and something shifts: you start to wonder if the system is really working through the problem. 
By the time you hit twenty seconds you have either concluded\u2026","rel":"","context":"In &quot;Computational Innovation&quot;","block_context":{"text":"Computational Innovation","link":"https:\/\/scienceblog.com\/neuroedge\/category\/computational-innovation\/"},"img":{"alt_text":"Deepseek screenshot","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/04\/pexels-bertellifotografia-30530410.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/04\/pexels-bertellifotografia-30530410.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/04\/pexels-bertellifotografia-30530410.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/04\/pexels-bertellifotografia-30530410.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":296,"url":"https:\/\/scienceblog.com\/neuroedge\/2026\/01\/23\/ai-is-making-scientists-stars-while-dimming-the-light-of-discovery\/","url_meta":{"origin":306,"position":5},"title":"AI Is Making Scientists Stars While Dimming the Light of Discovery","author":"NeuroEdge","date":"January 23, 2026","format":false,"excerpt":"Imagine you\u2019re a PhD student named Leo. You have two choices. You could spend the next five years in a dusty basement lab, trying to figure out a \"weird\" question about how the very first molecules of life sparked into existence. 
There\u2019s no data to help you, the experiments often\u2026","rel":"","context":"In &quot;Society&quot;","block_context":{"text":"Society","link":"https:\/\/scienceblog.com\/neuroedge\/category\/society\/"},"img":{"alt_text":"picasso style abstract science illustration","src":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/ai-science-2.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/ai-science-2.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/ai-science-2.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/scienceblog.com\/neuroedge\/wp-content\/uploads\/sites\/14\/2026\/01\/ai-science-2.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/posts\/306","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/users\/1297"}],"replies":[{"embeddable":true,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/comments?post=306"}],"version-history":[{"count":2,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/posts\/306\/revisions"}],"predecessor-version":[{"id":309,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/posts\/306\/revisions\/309"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/media\/307"}],"wp:attachment":[{"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/media?parent=306"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/categories?post=306"},{"taxono
my":"post_tag","embeddable":true,"href":"https:\/\/scienceblog.com\/neuroedge\/wp-json\/wp\/v2\/tags?post=306"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}