The Project

What is the Pacino project about?

This project documents the co-design of a high-performance, RVA23S64-compliant, 8-issue out-of-order RISC-V core, called Pacino. The goal is an investigation of LLM co-design methodology on a substantial hardware design task. We chose RISC-V because it is an open ISA with open-source designs and a significant body of related literature. We will also be exploring a methodology that allows hardware design innovation, which we will describe as it is developed.

The hypothesis we are testing is that a domain expert directing an LLM using a structured co-design process can produce production-quality RTL with fewer resources and a smaller team. For the purpose of evaluating the methodology we will document the process, results, successes and failures for further analysis, replication and extension.

What is new about this project?

Prior AI-assisted RTL work has tended towards simple pipelines or isolated modules. We believe these targets do not fully expose AI failure modes. This project runs a context-isolated agent methodology against a standards-compliant, high-performance OoO core where microarchitectural judgment has implications for PPA and mistakes compound across sessions. We believe this is a different class of test.

Beyond the RTL, the methodology is also a first-class artifact. Every session produces a structured co-design record: prompts, iteration history, results capture, and explicit documentation of where the AI was successful or not, and why. The co-design record captures results at the session level — where in the iteration loop the AI went wrong, or right, what correction was required, where the AI made correct architectural assumptions not explicitly specified, and what that implies for prompt strategy.

The methodology uses a dual-agent scheme: a strategic planning assistant (PA) for architectural decisions and a separate implementation assistant (IA) for RTL generation, with enforced context isolation between IA sessions and no shared context between PA and IA. The intent is to run controlled experiments on prompting strategy: every session starts from the same baseline context, so each one is a clean trial.

To our knowledge, the combination of a complex OoO design target, expert-directed methodology, and a published co-design record with session-level analysis has no direct analog in the published work.

Where is the design today?

This is the very start of the project. We intend to publish periodic updates, both as articles discussing recent development and as a commit stream.

What is the expected development sequence?

Frontend-first. The frontend cluster — branch predictor, FTQ, IFU, I-cache, pre-decode, instruction buffer — is the current phase. Once the frontend reaches sufficient maturity, the midcore (rename/dispatch, register file, ROB) and execution backend (integer ALUs, FPU, vector execution) follow, with special focus on load/store, and then the memory system (L1D, L2, TLB/PTW) developed in parallel where interfaces permit.

Clean interface contracts are defined at each cluster boundary before downstream implementation begins. This makes co-design sessions tractable — each module has a real, fixed consumer it must satisfy rather than an abstract specification.

The Design

What RISC-V profile does this processor target, and what does RVA23 compliance require?

The target is RVA23S64 — the 2024 ratified high-performance application processor profile for 64-bit RISC-V. It is a substantial extension set beyond baseline RV64GC.

RVA23S64 includes the baseline RV64I, compressed, single- and double-precision floating point, vector and H (hypervisor) extensions, scalar cryptography, cache management operations, and a range of bit-manipulation and floating-point sub-extensions. RVA23 compliance is tracked per module; known gaps and risks are documented in the co-design record.

The full ISA string used for compilation and spike validation is:

rv64imafdc_v_h_sscofpmf_sstc_svinval_svnapot_svpbmt_zawrs_zba_zbb_zbc_zbs_zfa_zfh
_zfhmin_zicbom_zicboz_zicntr_zifencei_zicond_zihintntl_zihintpause_zihpm_zkt_zkn
_zknd_zkne_zknh_zbkb_zbkc_zbkx_zicbop_zcb_zvbb_zvfh_zvfhmin_zvkt_zacas_zvl128b
_zksed_zksh
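The string above is wrapped across four lines for display, while tools generally expect it as one unbroken token. The following is a minimal Python sketch, assuming the line breaks are display wrapping only, that rejoins the lines and splits the result into extension names. The helper name parse_isa_string is hypothetical, and the split rule is a simplification rather than the full RISC-V naming grammar.

```python
# The ISA string exactly as printed above, including the display line breaks.
WRAPPED = """
rv64imafdc_v_h_sscofpmf_sstc_svinval_svnapot_svpbmt_zawrs_zba_zbb_zbc_zbs_zfa_zfh
_zfhmin_zicbom_zicboz_zicntr_zifencei_zicond_zihintntl_zihintpause_zihpm_zkt_zkn
_zknd_zkne_zknh_zbkb_zbkc_zbkx_zicbop_zcb_zvbb_zvfh_zvfhmin_zvkt_zacas_zvl128b
_zksed_zksh
"""


def parse_isa_string(wrapped):
    """Return (single_line_isa, base, extensions) from a wrapped ISA string."""
    # Drop all whitespace so the wrapped lines become one token.
    isa = "".join(wrapped.split())
    head, *multi = isa.split("_")
    if not head.startswith(("rv32", "rv64")):
        raise ValueError(f"unexpected base in {head!r}")
    base = head[:4]
    # Letters after the base are single-letter extensions (i, m, a, f, d, c);
    # underscore-separated tokens follow.
    extensions = list(head[4:]) + multi
    return isa, base, extensions


isa, base, extensions = parse_isa_string(WRAPPED)
print(base, len(extensions), "extensions")
```

The single-line value of isa is the form that would be handed to a compiler or to Spike; the extension list is convenient for checking a module's coverage against the profile.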

Other extensions:

smaia ssaia smcsrind sscsrind smdbltrp ssdbltrp smmpm smnpm ssnpm sspm supm smrnmi
smstateen ssstateen svade svbare sv39 sv48 shcounterenw shgatpa shlcofideleg
shtvala shvsatpa shvstvala shvstvecd ss1p13 ssccptr sscounterenw ssstrict
sstvala sstvecd ssu64xl sdtrig sha zic64b ziccamoa ziccif zicclsm ziccrse
za64rs zimop zcmop zicsr

Why 8-issue? What drove that microarchitectural decision?

8-issue puts the design in the high-performance application processor tier — comparable to the widest cores in modern mobile and server SoCs. A narrower machine makes the co-design methodology less interesting to test: scheduling pressure, fetch bandwidth, branch predictor complexity, and rename/dispatch design all become harder and more consequential at 8-issue. That is where methodology choices matter.

It is also a deliberate research stake. If AI-assisted co-design can produce correct, synthesizable RTL at this width, that is a stronger result than demonstrating it on a 2-issue in-order pipeline where the design space is much more constrained.

Performance

What is the performance target?

Our performance target is SPECint2006 20/GHz — aggressive by current published RISC-V standards.

This is an AI-assisted methodology driven by a small team. A result in the 15+/GHz range would remain a strong proof point for the methodology and would be competitive with published RISC-V designs.

The combination of strong branch prediction, a decoupled frontend, instruction fusion, I/D prefetchers, a sophisticated load-store unit, and a well-tuned memory subsystem gives us confidence in the higher number.

The full optimization phase — methodology, tuning decisions, and results — will be published, and we expect it to be a meaningful contribution to the public record.

Published RISC-V SPECint2006 scores at time of writing:

Core                        | Vendor            | Score (pts/GHz) | Notes
Xuantie C910                | Alibaba T-HEAD    | 6.11            | Published at ISCA 2020
XiangShan Yanqihu (Gen 1)   | CAS / Open Source | 7               | Measured silicon, 28nm@1GHz
SiFive P550                 | SiFive            | 8.65            | Vendor published
SiFive P670                 | SiFive            | 13.2            | Derived from P870 figure
Alibaba XuanTie C930        | Alibaba T-HEAD    | >15             | Claimed at launch, March 2025
XiangShan Kunminghu (Gen 3) | CAS / Open Source | 16.5            | Measured silicon, announced March 2026
SiFive P870                 | SiFive            | >17             | 6-issue IP core; SiFive pre-silicon estimate, Hot Chips 2023
XiangShan Nanhu (Gen 2)     | CAS / Open Source | ~19.10          | Design estimate, 14nm, pre-silicon
Akeana 5300                 | Akeana            | 25              | 10-issue OOO, pre-silicon

All figures SPECint2006/GHz. Vendor-published figures are unverified unless noted.

What is the plan for performance validation?

The performance validation plan has two stages: a C++ cycle-accurate performance model and RTL simulation, correlated against each other and validated against SPEC CPU workloads. Correlation is done on named microarchitectural events, such as branch prediction results, cache hits and misses, instruction schedule/dispatch/retire, and exceptions.
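As an illustration of correlating on named events, here is a minimal Python sketch. It assumes each simulator can emit a list of (cycle, event_name) records; the event names, the correlate function, and the 2% tolerance are hypothetical and not taken from the project. It compares only per-event counts, which is the coarsest useful form of correlation.

```python
from collections import Counter


def correlate(model_events, rtl_events, tolerance=0.02):
    """Compare per-event counts between a performance model and RTL.

    Each input is a list of (cycle, event_name) records.
    Returns {event_name: (model_count, rtl_count, rel_err, within_tolerance)}.
    """
    model_counts = Counter(name for _, name in model_events)
    rtl_counts = Counter(name for _, name in rtl_events)
    report = {}
    for name in sorted(model_counts.keys() | rtl_counts.keys()):
        mc, rc = model_counts[name], rtl_counts[name]
        # RTL is treated as the reference; guard against division by zero.
        rel_err = abs(mc - rc) / max(rc, 1)
        report[name] = (mc, rc, rel_err, rel_err <= tolerance)
    return report


# Toy traces with hypothetical event names.
model = [(1, "bp_mispredict"), (2, "icache_miss"), (5, "retire"), (6, "retire")]
rtl = [(1, "bp_mispredict"), (3, "icache_miss"), (5, "retire"),
       (7, "retire"), (9, "icache_miss")]

for name, (mc, rc, err, ok) in correlate(model, rtl).items():
    print(f"{name:14s} model={mc} rtl={rc} rel_err={err:.2f} {'ok' if ok else 'MISMATCH'}")
```

Matching counts do not imply matching timing; a fuller comparison would also align events by cycle or by instruction sequence number.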

SPEC CPU2006 and CPU2017 SimPoints drive projected IPC and score estimates from the performance model. Linux boot on FPGA then provides practical end-to-end correctness validation of the RTL before any silicon commitment.
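For readers unfamiliar with SimPoint-style projection, the arithmetic can be sketched as follows. Because every simulation point represents an interval of equal instruction count, the whole-program estimate is a weighted sum of per-point CPI, not of IPC. The function name and the sample numbers below are hypothetical.

```python
def projected_ipc(simpoints):
    """Project whole-program IPC from (weight, ipc) pairs.

    Weights are the fraction of execution each simulation point represents
    and must sum to 1. Intervals have equal instruction counts, so cycles
    add up as a weighted sum of CPI (1 / IPC).
    """
    total_weight = sum(weight for weight, _ in simpoints)
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    cpi = sum(weight / ipc for weight, ipc in simpoints)
    return 1.0 / cpi


# Two equally weighted points with made-up IPC values.
print(projected_ipc([(0.5, 2.0), (0.5, 4.0)]))
```

Note that the result for this toy input is lower than the simple average of the two IPC values, since the slower interval contributes more cycles.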

Why use a performance model in addition to RTL simulation?

RTL simulation at full SPEC workload scale is prohibitively slow for design space exploration. A C++ performance model runs orders of magnitude faster, allowing rapid evaluation of microarchitectural tradeoffs — cache sizing, predictor configuration, issue width policy — before committing them to RTL. The model's value depends on how closely it correlates to the RTL, which is why the execution-event correlation methodology is a prerequisite.

Industry practice for RTL-to-model event correlation is largely proprietary; publishing the methodology alongside the model and RTL adds reusable value beyond the processor design itself.

Development Flow

What does the development flow look like end to end?

Each module follows a consistent arc. Architectural review sessions use the PA to resolve open design questions and lock decisions into a planning document before any RTL is written. The PA then produces a task file — a templated document specifying required context, constraints, deliverables, and acceptance criteria. The IA prompt is derived from the task file.
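A task file of this kind can be pictured as a small record that is rendered into the IA prompt. The sketch below is illustrative only: the four section names come from the description above, while the TaskFile class, the module field, the rendered layout, and the sample content are assumptions rather than the project's actual template.

```python
from dataclasses import dataclass


@dataclass
class TaskFile:
    """Hypothetical in-memory form of a task file."""
    module: str
    required_context: list
    constraints: list
    deliverables: list
    acceptance_criteria: list

    def to_prompt(self):
        """Render the task file as plain text for the IA prompt."""
        sections = [
            ("Required context", self.required_context),
            ("Constraints", self.constraints),
            ("Deliverables", self.deliverables),
            ("Acceptance criteria", self.acceptance_criteria),
        ]
        lines = [f"Task: {self.module}"]
        for title, items in sections:
            # An empty section means the task is under-specified.
            if not items:
                raise ValueError(f"task file section '{title}' is empty")
            lines.append("")
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


task = TaskFile(
    module="ftq",
    required_context=["frontend interface contract"],
    constraints=["SystemVerilog only"],
    deliverables=["RTL module", "directed testbench"],
    acceptance_criteria=["all directed tests pass under Verilator"],
)
print(task.to_prompt())
```

Rejecting empty sections at render time is one way to keep an under-specified task from reaching the IA.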

The IA works from the prompt and reports results back into the task file on completion. The domain expert and PA jointly assess those results and determine the next task file — which may extend the current work, close gaps, or change direction based on what the IA produced. Status is recorded in a project-level planning document and the cycle repeats.

PA sessions are bounded by context. When a session approaches its limit, the PA generates a structured handoff document from current state and the previous handoff. Every new PA session opens with that handoff and latest status, ensuring continuity without accumulated context drift.

A key element of the flow is human oversight of both PA and IA outputs, primarily to detect and correct goal drift — the tendency of both agents to expand scope, shift direction, or missize tasks over successive sessions. IA context usage and runtime are captured per session and folded into the analysis record.

What is the difference between the PA and IA roles, and why separate them?

PA (Planning Assistant) is Claude.ai, used for architectural discussion, design decision-making, methodology development, documentation, and session analysis. IA (Implementation Assistant) is Claude Code, used for RTL generation, testbench writing, and tool scripting against a precisely specified prompt.

The separation exists because the two tasks have fundamentally different context requirements and failure modes. The PA role needs broad design judgment and benefits from iterative dialogue. The IA role needs strict, reproducible execution against a fixed spec; context contamination between experiments invalidates comparisons. Combining the two roles in one agent serves neither well.

Why does context isolation matter — what goes wrong without it?

Every message in a conversation shapes how subsequent responses are generated. If you test prompt variant A then prompt variant B in the same conversation, B's output is partly a function of A's presence — the comparison is contaminated. For controlled experiments on prompting strategy, one experiment must equal one conversation.
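The rule "one experiment equals one conversation" can be sketched in a few lines. ToySession below is a stand-in that only records what it was sent; it is not a real model client, and run_isolated is a hypothetical helper name.

```python
class ToySession:
    """Stand-in for a model conversation; it only records its history."""

    def __init__(self):
        self.history = []

    def send(self, prompt):
        self.history.append(prompt)
        # The reply depends on how much prior context the session holds.
        return f"reply seen after {len(self.history)} message(s)"


def run_isolated(variants, new_session=ToySession):
    """Run each prompt variant in its own fresh session."""
    results = {}
    for name, prompt in variants.items():
        session = new_session()  # no context carried over from other variants
        results[name] = session.send(prompt)
    return results


variants = {"A": "prompt variant A", "B": "prompt variant B"}
print(run_isolated(variants))

# Contaminated alternative: both variants share one session.
shared = ToySession()
shared.send(variants["A"])
print(shared.send(variants["B"]))
```

In the isolated run both variants see identical starting context; in the shared session, variant B is answered with variant A already in the history.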

There is also a subtler problem: a long implementation session builds up implicit shared assumptions between the architect and the model that are not written down. When that session ends, those assumptions disappear. If they were not captured in the handoff document, the next session produces silent inconsistencies that can take multiple rounds to diagnose.

What does a session handoff document contain and why is it necessary?

A session handoff (session_handoff-NNN.md) captures: what was completed in the current session, what decisions were made and why, deferred decisions, known issues flagged but not resolved, and the exact starting state for the next session. It is read alongside PROJECT_STATE.md and CLAUDE.md at the beginning of every implementation session.
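A small helper can pick out the most recent handoff and name the next one. This is a sketch under one assumption not stated above: that NNN is a zero-padded, three-digit decimal sequence number. The function name latest_and_next is hypothetical.

```python
import re

# Assumption: NNN is a zero-padded three-digit sequence number.
HANDOFF_RE = re.compile(r"^session_handoff-(\d{3})\.md$")


def latest_and_next(filenames):
    """Return (latest_handoff_or_None, next_handoff_name)."""
    numbers = []
    for filename in filenames:
        match = HANDOFF_RE.match(filename)
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        return None, "session_handoff-001.md"
    latest = max(numbers)
    return (f"session_handoff-{latest:03d}.md",
            f"session_handoff-{latest + 1:03d}.md")


files = ["CLAUDE.md", "PROJECT_STATE.md",
         "session_handoff-001.md", "session_handoff-012.md"]
print(latest_and_next(files))
```

Files that do not match the naming pattern are ignored, so the helper can be pointed at a directory listing that also contains the other project documents.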

Without the handoff, each session reconstructs context from the code itself — which is slow, error-prone, and systematically misses the reasoning behind decisions that are not obvious from the RTL. The handoff externalizes working memory so the implementation assistant starts with the full picture rather than a cold start.

Verification

How do you know the RTL works?

The first layer is directed testing. The PA defines expected behavior and acceptance criteria; the IA implements the test vectors from that specification. Keeping the two roles separate means the agent generating the RTL is never also the arbiter of what correct output looks like — a necessary condition for the results to be meaningful.

The next layer is functional coverage — directed tests target 95% functional point coverage. This is followed by formal verification using SymbiYosys, Halmos (symbolic testing), and riscv-formal.

We use the set of tests in riscv-tests to exercise the ISA implementation correctness. The same repository includes micro-benchmarks useful for both correctness and early performance correlation/verification.

In the later phases, Linux boot is the ultimate goal, approached in stages. Initially, snippets captured from Spike running OpenSBI and a minimal Linux kernel built with Buildroot will contribute to a Linux-focused test suite. Ultimately, booting Linux on an FPGA platform will give high confidence that the RTL "works".

For performance verification, a C++ performance model, correlated against the RTL via execution events, executes standard benchmarks and reports IPC and projected scores. Benchmarks include SPEC CPU2006, SPEC CPU2017, CoreMark-PRO, CoreMark, and Dhrystone. The riscv-tests repository also includes micro-benchmarks useful for performance characterisation and debug.

Since many of the directed tests and vectors are IA-generated, independent corroboration is a structural requirement of the verification strategy. The external test suites, formal tools, reference models, and standard benchmarks described above are each independent of the co-design process that produced the RTL — that independence is the point.

How are RVA23 compliance risks tracked during implementation?

The results capture section of every prompt includes a mandatory "RVA23 compliance risks and gaps noticed" field that the implementation assistant must fill in before a session is considered complete. This produces a live, module-level compliance risk register directly in the prompt files.

At the project level, PROJECT_STATUS.md maintains a technical debt table where compliance gaps are tracked with owner and resolution status. Gaps are never silently deferred — they are named, assigned a debt number, and either resolved in a subsequent experiment or explicitly carried as known limitations in the module's README.
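The rule that gaps are named, numbered, owned, and either resolved or explicitly carried can be expressed as a small data structure. The sketch below is illustrative: the DebtItem class, its field names, the three status strings, and the sample entries are assumptions, not the actual layout of the technical debt table.

```python
from dataclasses import dataclass

# Hypothetical status values: open, resolved in a later experiment,
# or carried as a known limitation in the module's README.
STATUSES = {"open", "resolved", "carried"}


@dataclass(frozen=True)
class DebtItem:
    number: int
    module: str
    gap: str
    owner: str
    status: str = "open"

    def __post_init__(self):
        # A gap without an owner or a valid status would be silently deferred.
        if not self.owner:
            raise ValueError(f"debt #{self.number} has no owner")
        if self.status not in STATUSES:
            raise ValueError(f"debt #{self.number}: unknown status {self.status!r}")


def unresolved(items):
    """Return open and carried items, ordered by debt number."""
    return sorted((i for i in items if i.status != "resolved"),
                  key=lambda i: i.number)


register = [
    DebtItem(3, "decoder", "example gap C", "architect", "carried"),
    DebtItem(1, "ifu", "example gap A", "architect", "resolved"),
    DebtItem(2, "ftq", "example gap B", "architect"),
]
print([item.number for item in unresolved(register)])
```

Validating at construction time mirrors the intent that a gap cannot enter the register without an owner and an explicit status.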

Toolchain

What tools are used?
  • Design Language: SystemVerilog
  • RTL Simulator: Verilator 5.x
  • SW Compiler: GCC 16.x, LLVM 22.x
  • Functional Simulator: riscv-isa-sim (Spike)
  • Formal Verification: SymbiYosys, Halmos, riscv-formal
  • Mutation Testing: mcy (YosysHQ) -- planned
  • Gate Synthesis: Yosys
  • FPGA Mapping: Quartus / Vivado
  • Performance Model: riscv-perf-model (interim); ground-up C++ cycle-accurate model -- planned
  • Code Style/Lint: Verible
  • Trace Compression: STF Lib
  • ISA Compliance: riscv-opcodes, RISCOF, riscv-arch-test
  • Test Suites: riscv-tests, riscv-formal, uarchlabs-developed test suites
  • Linux Boot: Spike + OpenSBI, buildroot minimal kernel

AI Methods

What does "AI-assisted co-design" actually mean here?

LLMs are active participants in the design iteration loop — not just for RTL generation, but for exploring microarchitectural tradeoffs, critiquing proposed implementations, generating test stimulus, and documenting design rationale. A domain expert drives every session. The model accelerates iteration; the architect owns correctness and design decisions.

An important corollary: you cannot evaluate AI-generated RTL without the expertise to know if it's correct. The novice problem in AI-assisted hardware design is not just "how do you learn to design" — it's "how do you learn to judge." This project is designed to be studied by practitioners who already have that judgment.

Which models are used?

The specific model and version are documented in each co-design session record so results are reproducible to the extent the provider allows. We work across frontier models and note performance differences where they are relevant to the task — model selection is itself part of the methodology under study, not a fixed constant.

Do you publish the prompts?

Yes — the full prompt history is part of the co-design record. This includes initial context-setting prompts, per-experiment implementation prompts, architecture Q&A sessions, and the results capture written by the implementation assistant after each experiment. Prompt engineering is a first-class artifact here, not an implementation detail.

Where did AI assistance add the most leverage, and where did it fall short?

Based on completed modules to date:

High leverage: generating boilerplate RTL structure, writing directed testbenches, producing consistent documentation, and working through well-specified sub-problems quickly. The decoder track — a large but mechanically regular mapping from instruction encodings to decode packets — was completed faster with AI assistance than would have been practical manually.

Lower leverage, higher correction rate: anything involving subtle timing contracts, interface semantics requiring cross-module understanding, or decisions where the right answer depends on downstream microarchitectural context. The TAGE implementation sessions produced the most correction work — hallucinated interfaces, misapplied Seznec allocation rules, and incorrect CTR update logic all appeared and required expert correction. Each failure is documented in the results capture, which is where the real methodology value accumulates.

Future Plans

What future projects do you have in mind?

Pacino is a significant development effort in its own right, but the project also creates infrastructure for several adjacent research directions.

Benchmark contamination: The prompt documentation and evaluation process that Pacino produces may also address a real gap in LLM benchmarking. Existing benchmarks are increasingly contaminated by training data exposure. A complex, novel hardware design task exposing many possible solutions — with ground truth determined by simulation correctness and expert review — provides an evaluation surface that is difficult to game and largely absent from current benchmark suites.

Mutation testing: The co-design process also creates a natural foundation for AI-guided mutation testing of SystemVerilog. Architecture-aware mutation — where the mutation strategy is informed by microarchitectural knowledge rather than applied blindly at the source level — is an underexplored area in processor verification. Pacino's RTL and co-design record provide the substrate to develop and evaluate this.

Performance models: Parameterized RTL implies a parameterized performance model. Whether that means deep modification of an existing model or a ground-up implementation is an open question — but the ability to co-tune RTL and model parameters against a target workload set enables in-line optimization and is a prerequisite for automated design space exploration.

Design Space Exploration: The combination of parameterized RTL, a correlated performance model, and a benchmark suite with known workload characteristics creates the conditions for a design space exploration tool — one that can evaluate microarchitectural tradeoffs against a constrained application set with compiler, ISA, and microarchitecture all in the loop. That is the longer-term direction.

More Info

Where can I find more info?
URL                         | Notes
github.com/uarchlabs/pacino | Pacino Repo
uarchlabs/repositories      | Other uarchlabs Projects
uarchlabs.com               | uarchlabs Main Web Site
uarchlabs.github.io         | uarchlabs GitHub Landing Page
pacino.github.io            | Pacino GitHub Landing Page

Licensing

What license is the RTL released under?

All RTL is released under the Apache 2.0 license unless explicitly stated otherwise in a given repository. You can use it in commercial and non-commercial projects, modify it, and redistribute it with attribution.

Can I use this in a commercial tape-out?

Yes, subject to the Apache 2.0 terms. We don't require revenue sharing or notification, though we appreciate hearing about derivative work — particularly if you've extended the methodology. As with any open-source RTL: perform your own verification before committing silicon. The published test coverage is a floor, not a guarantee.

About

What is uarchlabs?

uarchlabs is a research-first open source lab investigating how large language models can be used as active participants in microarchitecture design. We release synthesizable RTL alongside the full co-design record — prompts, iteration history, evaluation harnesses, and design rationale. The goal is reproducibility: anyone should be able to audit not just what we built, but exactly how.

Who is behind uarchlabs?

uarchlabs was started by a practicing architect to systematically understand where LLMs add genuine leverage in the RTL design loop and where they don't, and how this dynamic affects team size, productivity and outcome.

The lab is designed from the start to be larger than any one contributor.

Is this a commercial product?

No. uarchlabs is a research lab, not a product or startup. Everything we publish is open source. There is no paid tier, no proprietary tooling being sold, and no IP being withheld.