Building with AI Partners

Understanding AI's strengths and limitations is key to building effective partnerships. Let's focus on where human expertise remains essential and how to best leverage AI's capabilities.

1

Benchmark Current Capabilities

Evaluate AI performance across real-world engineering scenarios to identify current strengths and limitations

2

Curate Effective Challenges

Design programming challenges that focus on areas where human expertise remains essential

3

Continuous Evolution

Regularly reassess and update challenges as AI capabilities evolve, maintaining focus on human-AI partnership

The Rise of AI Coding Agents

AI coding agents have demonstrated remarkable capabilities in traditional programming challenges. In 2024, an AI system achieved gold medal-level performance at the International Olympiad in Informatics (IOI), a prestigious competition known for its challenging algorithmic problems. This result highlights the extraordinary progress AI has made in solving well-defined algorithmic tasks.

"While AI agents excel at traditional programming competitions, these challenges represent only a fraction of real-world software development. Our assessment focuses on the complex, ambiguous, and collaborative aspects of modern software engineering that remain uniquely human."

However, traditional programming competitions and interviews often focus on precisely the types of problems where AI agents excel. DeepMind's AlphaCode 2, for example, has performed better than an estimated 85% of human competitors on Codeforces, demonstrating exceptional ability in:

  • Algorithmic problem-solving
  • Data structure implementation
  • Mathematical computations
  • Single-file solutions
  • Well-defined problem statements

It's important to note that nearly any well-specified, self-contained computer science problem can be solved with today's AI coding assistants, given sufficiently detailed prompts and appropriate training data. This capability comes with a critical caveat, however: a person without domain knowledge, codebase context, or an understanding of production software engineering principles cannot effectively and safely solve real-world problems, even with the most advanced AI assistance.

"The effectiveness of AI coding assistants is directly proportional to the user's ability to provide proper context, understand the problem domain, and apply production software engineering principles. Without these human capabilities, AI assistance can lead to dangerous or ineffective outcomes."

Our methodology deliberately moves beyond these traditional challenges to focus on the aspects of software development where human expertise remains essential: system design, code review, performance optimization, and effective collaboration with AI agents.

Coding vs Engineering: Understanding the Difference

To truly understand where AI coding agents excel and where they fall short, we must distinguish between coding and engineering. While these terms are often used interchangeably, they represent fundamentally different aspects of software development.

Coding

  • Writing instructions for computers
  • Implementing algorithms and data structures
  • Solving well-defined problems
  • Creating individual components
  • Following specifications

Engineering

  • Designing systems that solve real problems
  • Making trade-offs between competing requirements
  • Ensuring reliability and maintainability
  • Managing technical debt and complexity
  • Coordinating across teams and systems

"AI coding assistants excel at coding tasks but struggle with engineering concerns. This is why they perform well in programming competitions (coding) but face challenges in production environments (engineering). The ability to make engineering decisions requires understanding not just how to write code, but why certain approaches are better than others in specific contexts."

The Critical Role of Prompt Engineering

The effectiveness of AI coding assistants is heavily dependent on the quality and structure of the prompts provided. Our research has identified several key factors that significantly impact the success rate of AI-assisted development:

Key Prompt Components

  • Clear problem statement and context
  • Specific requirements and constraints
  • Relevant codebase context and dependencies
  • Expected output format and structure
  • Performance and scalability requirements
  • Security and compliance considerations

Common Prompt Pitfalls

  • Vague or ambiguous requirements
  • Missing context about the existing system
  • Unclear success criteria
  • Insufficient technical constraints
  • Lack of business context
  • Ignoring edge cases and error handling

"The quality of the prompt directly correlates with the quality of the AI's output. A well-crafted prompt can dramatically improve success rates compared to basic or poorly structured prompts."

Prompt Evaluation Criteria

  • Clarity and specificity of requirements
  • Completeness of technical context
  • Inclusion of non-functional requirements
  • Consideration of edge cases and error scenarios
  • Alignment with business objectives
  • Feasibility of implementation
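The components and criteria above can be sketched as a simple prompt template. This is a minimal illustration; the section names and field structure below are assumptions for this sketch, not a standard format:

```python
# Hypothetical sketch: assembling a structured prompt from the key
# components listed above. Field names and wording are illustrative.

def build_prompt(problem, requirements, codebase_context,
                 output_format, non_functional, edge_cases):
    """Compose a structured prompt covering the key components above."""
    sections = [
        ("Problem statement and context", problem),
        ("Requirements and constraints", requirements),
        ("Relevant codebase context", codebase_context),
        ("Expected output format", output_format),
        ("Performance, scalability, security", non_functional),
        ("Edge cases and error handling", edge_cases),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)

prompt = build_prompt(
    problem="Apply sales tax to all items in the shopping cart.",
    requirements="Tax rate varies by region; exempt categories skip tax.",
    codebase_context="Cart items are dicts with 'price_cents' and 'category'.",
    output_format="A pure function returning the new cart total in cents.",
    non_functional="Must handle carts with up to 10,000 items in O(n).",
    edge_cases="Empty cart; zero-priced items; unknown region uses 0%.",
)
print(prompt.splitlines()[0])  # → "## Problem statement and context"
```

A template like this forces the author to confront each evaluation criterion explicitly; an empty field is an immediate signal that context is missing.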

AI Performance Comparison

To understand where a gap exists, we must first acknowledge what AI-based coding agents are excellent at. These systems have demonstrated remarkable capabilities in:

  • Solving well-defined algorithmic problems
  • Implementing standard data structures and patterns
  • Writing boilerplate code and common implementations
  • Generating documentation and comments
  • Debugging simple, isolated issues
  • Providing code examples and explanations

The following tables demonstrate the contrast between AI performance on these traditional programming challenges versus enterprise-grade software development tasks. Our evaluation methodology is built on:

  • More than a decade of programming experience at leading tech companies
  • Extensive experience as a hiring manager and bar raiser at multiple tier-one tech companies
  • Deep expertise in evaluating engineering talent and setting technical standards
  • A comprehensive set of real-world engineering challenges
  • Consensus-based evaluation from multiple senior engineers
  • Continuous testing and validation against production scenarios

These performance scores are based on out-of-the-box prompts without significant prompt engineering. For example, a basic prompt like "Add in the ability to apply taxes to all items in our cart" might yield a 30% success rate. However, with a well-structured prompt that includes context about the existing codebase, tax calculation rules, edge cases, and performance requirements, the success rate can increase significantly.
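To make the tax example above concrete, here is a minimal sketch of the feature such a well-structured prompt might specify. The item shape, exemption rules, and rounding policy are assumptions for illustration, not taken from any real codebase:

```python
# Minimal sketch of the cart-tax feature discussed above.
# Item shape, exempt categories, and rounding policy are assumptions.

EXEMPT_CATEGORIES = {"groceries", "prescription"}

def cart_total_with_tax(items, tax_rate):
    """Return the cart total in cents, applying tax to non-exempt items.

    Each item is a dict: {"price_cents": int, "category": str}.
    Tax is rounded half-up per item to avoid accumulating fractions.
    """
    total = 0
    for item in items:
        price = item["price_cents"]
        if item["category"] in EXEMPT_CATEGORIES:
            total += price
        else:
            # int(x + 0.5) rounds half-up for non-negative values
            total += price + int(price * tax_rate + 0.5)
    return total

cart = [
    {"price_cents": 1000, "category": "electronics"},
    {"price_cents": 500, "category": "groceries"},
]
print(cart_total_with_tax(cart, 0.08))  # 1000 + 80 tax + 500 exempt = 1580
```

Notice how much of this sketch (exempt categories, rounding, integer cents) is exactly the context a basic one-line prompt omits, and exactly what a well-structured prompt supplies.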

The conclusion is perhaps obvious: if a person can construct an extremely specific prompt that is highly instructive, precise, technically sound, and elegantly written, a coding agent can score highly across all of these dimensions. One day we will have conversational coding agents that ask excellent clarifying questions to guide their decisions; today, that is not the case.

Detailed Enterprise Performance

| Skill | GPT-4-O | GPT-4-Mini | GPT-4 | GPT-3.5 | Claude | Copilot | Cursor | Replit | Windsurf | Grok |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Basic Algorithms | 96 | 92 | 93 | 85 | 95 | 90 | 88 | 82 | 90 | 85 |
| Basic Data Structures | 98 | 94 | 96 | 88 | 98 | 92 | 90 | 85 | 92 | 88 |
| Basic Math Problems | 96 | 92 | 93 | 85 | 95 | 90 | 88 | 82 | 90 | 85 |
| Data Structures | 42 | 35 | 32 | 18 | 28 | 25 | 22 | 18 | 25 | 22 |
| Algorithms | 40 | 32 | 30 | 18 | 28 | 25 | 22 | 18 | 25 | 22 |
| System Design | 35 | 28 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
| Debugging | 32 | 25 | 22 | 12 | 20 | 18 | 15 | 12 | 18 | 15 |
| Concurrency | 30 | 22 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
| Multi-Threading | 28 | 20 | 18 | 10 | 15 | 15 | 12 | 12 | 15 | 15 |
| Multi-Repo Traversal | 28 | 22 | 18 | 10 | 15 | 12 | 12 | 10 | 15 | 12 |
| Code Optimization | 38 | 32 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
| Legacy Code Maintenance | 32 | 28 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
| Frontend Architecture | 38 | 32 | 28 | 15 | 25 | 22 | 20 | 15 | 22 | 18 |
| Event-Driven Design | 35 | 28 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
| A/B Testing | 20 | 15 | 15 | 8 | 12 | 10 | 10 | 8 | 10 | 10 |
| Configuration Driven Development | 25 | 18 | 18 | 10 | 15 | 12 | 12 | 10 | 12 | 12 |
| Application Versioning | 28 | 22 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
| Overall Enterprise Score | 32 | 26 | 23 | 13 | 20 | 17 | 16 | 13 | 18 | 16 |

Scoring Legend

  • 0-20%: comparable to an engineer with 0-2 years of experience
  • 21-40%: 2-7 years
  • 41-60%: 7-15 years
  • 61-80%: 15+ years
  • 81-100%: Principal level

Designing Programming Challenges for the AI Era

Our benchmarks reveal critical gaps in AI's capabilities. These insights can guide the creation of programming challenges that truly assess a candidate's ability to solve problems that AI cannot yet handle effectively.

Key Areas for Challenge Design

  • Data-driven state management across distributed systems
  • A/B testing strategies for live production systems
  • Multi-repository system design and refactoring
  • Version dependency management
  • Performance optimization in production environments
  • Cross-team collaboration and code review scenarios
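One of the areas above, A/B testing in live production systems, can be probed with a deterministic variant-bucketing question. The sketch below uses a stable hash so a user sees the same variant on every request; the salt scheme and bucket count are assumptions for illustration:

```python
# Sketch: deterministic A/B bucketing for a live system, one of the
# challenge areas listed above. Salt scheme and bucket count are assumptions.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Stably assign a user to 'control' or 'treatment'.

    Hashing user_id together with the experiment name keeps assignment
    sticky per user and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same bucket for a given experiment.
assert assign_variant("user-42", "checkout-v2") == assign_variant("user-42", "checkout-v2")
```

A good challenge here is not writing this function but reasoning about what it omits: ramp-up plans, guardrail metrics, and how to exclude users already enrolled in conflicting experiments.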

Challenge Design Principles

  • Focus on ambiguous requirements and trade-offs
  • Incorporate real-world constraints and limitations
  • Require understanding of business context
  • Test ability to work with existing codebases
  • Evaluate architectural decision-making
  • Assess collaboration and communication skills

"The best programming challenges today aren't about solving algorithmic puzzles - they're about demonstrating the ability to navigate complex, real-world engineering scenarios where AI still falls short."

Example Challenge Structure

A well-designed challenge might present a scenario like:

"A major customer reports that their users are experiencing intermittent 500 errors during checkout. The issue appears to be related to payment processing, but the error logs are inconsistent across services. The system spans multiple repositories, including a legacy payment gateway, a modern microservice for order processing, and a shared authentication service. Production traffic is 10,000 requests per minute, and the error rate is 0.1%. Design an approach to diagnose and resolve this issue while minimizing impact on production traffic."
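Note the arithmetic the scenario implies: 10,000 requests per minute at a 0.1% error rate is roughly 10 failures per minute. A reasonable first triage step is to correlate 500s across services; the minimal sketch below assumes structured log records with a shared request ID (the record shape is hypothetical):

```python
# Sketch of a first triage step for the scenario above: count 500s
# per service from structured logs. The log record shape is an assumption.
from collections import Counter

def failing_services(log_records):
    """Count, per service, how many log records report a 500.

    Each record is a dict: {"request_id": str, "service": str, "status": int}.
    """
    counts = Counter()
    for rec in log_records:
        if rec["status"] == 500:
            counts[rec["service"]] += 1
    return counts

logs = [
    {"request_id": "r1", "service": "payment-gateway", "status": 500},
    {"request_id": "r1", "service": "order-service", "status": 200},
    {"request_id": "r2", "service": "payment-gateway", "status": 500},
]
print(failing_services(logs))  # Counter({'payment-gateway': 2})
```

A strong candidate would extend this by joining records on request_id to see how a failure in one service propagates through the others, rather than treating each service's logs in isolation.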

Evaluation Criteria

  • Diagnostic approach and root cause analysis methodology
  • Understanding of distributed system debugging techniques
  • Production impact assessment and risk mitigation
  • Log analysis and monitoring strategy
  • Communication plan for stakeholders
  • Implementation of safeguards and monitoring improvements

The Future of Programming Challenges

While today's benchmarks reveal significant gaps in AI capabilities, it's inevitable that agentic programming will continue to advance. The true measure of a software engineer's value will increasingly shift from individual coding prowess to their ability to effectively partner with AI systems to achieve complex outcomes.

"The best engineers of tomorrow won't be those who can code better than AI, but those who can best leverage AI to solve problems that neither could solve alone."

Future Challenge Structure

A forward-looking challenge might present a scenario like:

"You're leading a team of AI agents to modernize a legacy payment system. The system processes $1B+ in transactions annually across multiple repositories. Your team consists of three specialized AI agents: one for system analysis, one for code generation, and one for testing. Design and execute a modernization strategy that:

  • Maximizes the strengths of each AI agent
  • Identifies and mitigates potential AI blind spots
  • Maintains system reliability during the transition
  • Ensures compliance with evolving security standards
  • Optimizes the human-AI collaboration workflow"

Evaluation Criteria

  • AI agent selection and role assignment
  • Workflow design and orchestration
  • Quality assurance and validation strategy
  • Risk management and mitigation planning
  • Communication and coordination effectiveness
  • Business impact and value delivery

Key Skills for the Future

Technical Leadership

  • AI agent selection and management
  • System architecture and design
  • Quality assurance and validation
  • Performance optimization

Strategic Thinking

  • Risk assessment and mitigation
  • Business value optimization
  • Resource allocation and planning
  • Stakeholder communication