Understanding AI's strengths and limitations is key to building effective partnerships. Let's focus on where human expertise remains essential and how to best leverage AI's capabilities.
- Evaluate AI performance across real-world engineering scenarios to identify current strengths and limitations
- Design programming challenges that focus on areas where human expertise remains essential
- Regularly reassess and update challenges as AI capabilities evolve, maintaining focus on human-AI partnership
AI coding agents have demonstrated remarkable capabilities in traditional programming challenges. In 2024, an AI system achieved gold-medal-level performance at the International Olympiad in Informatics (IOI), a prestigious competition known for its challenging algorithmic problems. This achievement highlights the extraordinary progress AI has made on well-defined algorithmic tasks.
"While AI agents excel at traditional programming competitions, these challenges represent only a fraction of real-world software development. Our assessment focuses on the complex, ambiguous, and collaborative aspects of modern software engineering that remain uniquely human."
However, traditional programming competitions and interviews often focus on precisely the types of problems where AI agents excel. For example, AlphaCode 2 recently achieved top-tier performance in programming competitions, demonstrating exceptional ability on well-defined, self-contained algorithmic problems.
It's important to note that nearly any computer science problem can be solved using AI coding assistants today, given sufficiently detailed prompts and appropriate training data. However, this capability comes with a critical caveat: a person without domain knowledge, codebase context, or an understanding of production software engineering principles cannot effectively and safely solve real-world problems, even with the most advanced AI assistance.
"The effectiveness of AI coding assistants is directly proportional to the user's ability to provide proper context, understand the problem domain, and apply production software engineering principles. Without these human capabilities, AI assistance can lead to dangerous or ineffective outcomes."
Our methodology deliberately moves beyond these traditional challenges to focus on the aspects of software development where human expertise remains essential: system design, code review, performance optimization, and effective collaboration with AI agents.
To truly understand where AI coding agents excel and where they fall short, we must distinguish between coding and engineering. While these terms are often used interchangeably, they represent fundamentally different aspects of software development.
"AI coding assistants excel at coding tasks but struggle with engineering concerns. This is why they perform well in programming competitions (coding) but face challenges in production environments (engineering). The ability to make engineering decisions requires understanding not just how to write code, but why certain approaches are better than others in specific contexts."
The effectiveness of AI coding assistants is heavily dependent on the quality and structure of the prompts provided. Our research shows that prompt construction significantly impacts the success rate of AI-assisted development:
"The quality of the prompt directly correlates with the quality of the AI's output. A well-crafted prompt can dramatically improve success rates compared to basic or poorly structured prompts."
To understand where the gap exists, we must first acknowledge what AI-based coding agents are excellent at. These systems have demonstrated remarkable capabilities in basic algorithms, basic data structures, and basic math problems.
The following table contrasts AI performance on these traditional programming challenges with enterprise-grade software development tasks, evaluated across ten popular models and tools. Scores are success rates on a 0-100 scale.
These performance scores are based on out-of-the-box prompts without significant prompt engineering. For example, a basic prompt like "Add in the ability to apply taxes to all items in our cart" might yield a 30% success rate. However, with a well-structured prompt that includes context about the existing codebase, tax calculation rules, edge cases, and performance requirements, the success rate can increase significantly.
It is perhaps obvious to conclude that if a person can build an extremely specific prompt that is highly instructive, precise, technically sound, and elegantly written, a coding agent could score highly across all of these dimensions. One day we will have conversational coding agents that ask excellent clarifying questions to guide their decisions; today, that is not the case.
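To make this concrete with the tax example above, compare a basic prompt with a well-structured one. All file names, class names, and business rules below are invented for the illustration:

```python
# A basic prompt: little context, so the agent must guess.
basic_prompt = "Add in the ability to apply taxes to all items in our cart."

# A well-structured prompt: codebase context, rules, edge cases, and
# constraints spelled out. Every detail here is hypothetical.
structured_prompt = """
Task: add tax calculation to the shopping cart.

Codebase context:
- Cart logic lives in cart/models.py (Cart and CartItem classes).
- Prices are stored as integer cents; never use floats for money.

Requirements:
- Apply a per-jurisdiction rate looked up from TaxTable.rate_for(region).
- Skip items where CartItem.tax_exempt is True.

Edge cases:
- An empty cart yields a tax of 0.
- Round half up to the nearest cent.

Constraints:
- Do not change the public Cart API.
- Must stay O(n) in the number of items; carts can hold 10,000+ items.
"""
```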
Skill | GPT-4o | GPT-4o mini | GPT-4 | GPT-3.5 | Claude | Copilot | Cursor | Replit | Windsurf | Grok |
---|---|---|---|---|---|---|---|---|---|---|
Basic Algorithms | 96 | 92 | 93 | 85 | 95 | 90 | 88 | 82 | 90 | 85 |
Basic Data Structures | 98 | 94 | 96 | 88 | 98 | 92 | 90 | 85 | 92 | 88 |
Basic Math Problems | 96 | 92 | 93 | 85 | 95 | 90 | 88 | 82 | 90 | 85 |
Data Structures | 42 | 35 | 32 | 18 | 28 | 25 | 22 | 18 | 25 | 22 |
Algorithms | 40 | 32 | 30 | 18 | 28 | 25 | 22 | 18 | 25 | 22 |
System Design | 35 | 28 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
Debugging | 32 | 25 | 22 | 12 | 20 | 18 | 15 | 12 | 18 | 15 |
Concurrency | 30 | 22 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
Multi-Threading | 28 | 20 | 18 | 10 | 15 | 15 | 12 | 12 | 15 | 15 |
Multi-Repo Traversal | 28 | 22 | 18 | 10 | 15 | 12 | 12 | 10 | 15 | 12 |
Code Optimization | 38 | 32 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
Legacy Code Maintenance | 32 | 28 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
Frontend Architecture | 38 | 32 | 28 | 15 | 25 | 22 | 20 | 15 | 22 | 18 |
Event-Driven Design | 35 | 28 | 25 | 15 | 22 | 20 | 18 | 15 | 20 | 18 |
A/B Testing | 20 | 15 | 15 | 8 | 12 | 10 | 10 | 8 | 10 | 10 |
Configuration Driven Development | 25 | 18 | 18 | 10 | 15 | 12 | 12 | 10 | 12 | 12 |
Application Versioning | 28 | 22 | 20 | 12 | 18 | 15 | 15 | 12 | 15 | 15 |
Overall Enterprise Score | 32 | 26 | 23 | 13 | 20 | 17 | 16 | 13 | 18 | 16 |
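For readers who want a feel for how scores like these can be computed, here is a minimal sketch of a success-rate harness. It is not our benchmark code; Task, agent, and the pass check are hypothetical stand-ins for a real task suite, a real coding agent, and sandboxed test execution:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    passes: Callable[[str], bool]  # stand-in for a hidden test suite

def success_rate(task: Task, agent: Callable[[str], str], trials: int = 20) -> float:
    """Percentage of independent attempts whose output passes the task's tests."""
    wins = sum(1 for _ in range(trials) if task.passes(agent(task.prompt)))
    return 100.0 * wins / trials

# Toy demonstration with stand-ins; a real harness would prompt a coding
# agent and execute the generated code in a sandbox.
if __name__ == "__main__":
    task = Task(prompt="Reverse a string.", passes=lambda code: "[::-1]" in code)
    agent = lambda prompt: "def solve(s): return s[::-1]"
    print(success_rate(task, agent))  # 100.0
```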
Our benchmarks reveal critical gaps in AI's capabilities. These insights can guide the creation of programming challenges that truly assess a candidate's ability to solve problems that AI cannot yet handle effectively.
"The best programming challenges today aren't about solving algorithmic puzzles - they're about demonstrating the ability to navigate complex, real-world engineering scenarios where AI still falls short."
A well-designed challenge might present a scenario like:
"A major customer reports that their users are experiencing intermittent 500 errors during checkout. The issue appears to be related to payment processing, but the error logs are inconsistent across services. The system spans multiple repositories, including a legacy payment gateway, a modern microservice for order processing, and a shared authentication service. Production traffic is 10,000 requests per minute, and the error rate is 0.1%. Design an approach to diagnose and resolve this issue while minimizing impact on production traffic."
While today's benchmarks reveal significant gaps in AI capabilities, it's inevitable that agentic programming will continue to advance. The true measure of a software engineer's value will increasingly shift from individual coding prowess to their ability to effectively partner with AI systems to achieve complex outcomes.
"The best engineers of tomorrow won't be those who can code better than AI, but those who can best leverage AI to solve problems that neither could solve alone."
A forward-looking challenge might present a scenario like:
"You're leading a team of AI agents to modernize a legacy payment system. The system processes $1B+ in transactions annually across multiple repositories. Your team consists of three specialized AI agents: one for system analysis, one for code generation, and one for testing. Design and execute a modernization strategy that: