For Investment Teams
Private SWE Bench

Turn Portfolio Code Into a Private, Licensable Engineering Benchmark

Repo Evaluator finds high-signal engineering work in a target company's repository history, scores each repository's benchmark readiness, and helps you decide which codebases are worth licensing for a company-specific SWE-bench style dataset.

Plain-English Flow

This is the non-technical version you can share with PE stakeholders.

flowchart LR
  A["1) You select target repositories"] --> B["2) Repo Evaluator reads code + PR history"]
  B --> C["3) It filters for high-signal engineering work"]
  C --> D["4) It scores each repo for benchmark readiness"]
  D --> E["5) It outputs candidate SWE-bench style tasks"]
  E --> F["6) Your team licenses approved codebases"]

Scope
You pick which portfolio repositories are in scope for diligence and licensing review.

Quality Signal
We evaluate test quality, CI discipline, and PR patterns to isolate meaningful engineering changes.

Benchmark Curation
Accepted historical PRs become candidate tasks for a company-specific private SWE bench.

Decision Support
You receive a score, recommendation, and supporting evidence to guide licensing decisions.
What You Get

1. A benchmark readiness score for each repository.

2. A filtered set of candidate tasks based on real PR history.

3. Evidence you can use in licensing and diligence discussions.

Data Handling Guardrails

Repository clones are created in temporary working directories during evaluation.

Temporary cloned code is deleted at the end of each run.

The current web app stores final results in browser session storage for viewing.

Access is token-based and scoped to repositories the user can already access.
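
A minimal sketch of the temporary-clone guardrail described above, assuming a plain git clone into a scratch directory; the helper names are illustrative and not the production clone_repo():

import subprocess
import tempfile

def evaluate_with_temp_clone(repo_url: str, gh_token: str) -> dict:
    # The clone lives only inside this context manager; the directory (and all
    # cloned code) is deleted when the block exits, even if evaluation fails.
    with tempfile.TemporaryDirectory(prefix="repo-eval-") as workdir:
        authed_url = repo_url.replace("https://", f"https://x-access-token:{gh_token}@")
        subprocess.run(
            ["git", "clone", authed_url, workdir],
            check=True,
            capture_output=True,  # keep the token out of any visible logs
        )
        return analyze_checkout(workdir)  # stand-in for RepoEvaluator.evaluate()

def analyze_checkout(workdir: str) -> dict:
    # Placeholder: the real evaluator returns scores and evidence, never source code.
    return {"evaluated_path": workdir}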

Technical Appendix (for engineering teams)
flowchart TD
  U["User"] --> SEL["/select<br/>Pick repositories"]
  SEL --> DASH["/evaluate<br/>EvaluationDashboard"]
  DASH --> API["POST /api/evaluate/stream"]

  API --> AUTH["Read gh_token from cookie"]
  AUTH --> MODAL["Forward request to MODAL_EVALUATE_URL"]

  MODAL --> WEB["Modal web_app()<br/>FastAPI streaming endpoint"]
  WEB --> PART["Create stream partition (UUID)"]
  PART --> SPAWN["Spawn evaluate_single_repo() per repo"]

  SPAWN --> CLONE["clone_repo()"]
  CLONE --> EVAL["RepoEvaluator.evaluate()"]

  EVAL --> RM["RepoAnalyzer.analyze()<br/>repo quality score"]
  EVAL --> PR["PRAnalyzer.analyze_prs()<br/>accept/reject merged PRs"]
  PR --> F2P["Optional _run_f2p_analysis()<br/>validate F2P/P2P tests"]
  RM --> SCORE["overall = 0.6 * repo + 0.4 * PR acceptance"]
  PR --> SCORE
  F2P --> SCORE
  SCORE --> RESULT["to_json() + snake_to_camel + derive_verdict"]

  SPAWN --> QUEUE["Modal Queue partition"]
  QUEUE --> WEB
  SPAWN --> LOGS["Emit log/progress/result events"]
  LOGS --> QUEUE
  RESULT --> QUEUE

  WEB --> SSE["Emit SSE: log | progress | result | complete"]
  SSE --> API
  API --> DASH

  DASH --> PARSE["parseSSEStream()"]
  PARSE --> UI["Update per-repo logs, phase, progress"]
  PARSE --> STORE["saveResults() in sessionStorage"]
  STORE --> RES["/results and /results/[owner]/[repo]"]

Scoring model: overall = 0.6 * repo_quality_score + 0.4 * pr_acceptance_rate_percent
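
A minimal sketch of the final scoring and result-shaping step. The 0.6/0.4 weights come from the diagram; the verdict thresholds and field names are illustrative assumptions, not the production derive_verdict():

def snake_to_camel(name: str) -> str:
    head, *rest = name.split("_")
    return head + "".join(part.title() for part in rest)

def derive_verdict(overall: float) -> str:
    # Hypothetical thresholds; the real derive_verdict() may differ.
    if overall >= 70:
        return "recommend"
    if overall >= 40:
        return "review"
    return "pass"

def score_repo(repo_quality_score: float, pr_acceptance_rate_percent: float) -> dict:
    overall = 0.6 * repo_quality_score + 0.4 * pr_acceptance_rate_percent
    result = {
        "repo_quality_score": repo_quality_score,
        "pr_acceptance_rate_percent": pr_acceptance_rate_percent,
        "overall_score": overall,
        "verdict": derive_verdict(overall),
    }
    # Keys are camelCased before they reach the dashboard, matching the
    # snake_to_camel step in the diagram.
    return {snake_to_camel(k): v for k, v in result.items()}

Under these weights, for example, a repo quality score of 80 combined with a 55% PR acceptance rate yields an overall score of 70.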