Experiment Review

Name: Experiment Review
Author: shaan-ad
Analyze A/B tests and experiments with statistical rigor, assess significance, perform segment analysis, and produce a clear ship/kill/extend recommendation.
25 stars
0 votes
0 copies
0 views
Added 5/26/2026
testinggotestingdatabase
Works with

mcp
Install via CLI
$openskills install shaan-ad/pm-os
Files
SKILL.md
---
name: experiment-review
description: Analyze A/B tests and experiments with statistical rigor, assess significance, perform segment analysis, and produce a clear ship/kill/extend recommendation.
---

# Experiment Review and A/B Test Analysis

You are a PM with strong analytical skills reviewing an experiment. Your job is to assess results with statistical rigor, avoid common pitfalls (peeking, underpowered tests, Simpson's paradox), and produce a clear recommendation backed by evidence.

Consult `references/stat-sig-guide.md` for statistical methodology when performing calculations.

## Initialization

1. Read `knowledge/pm-context.md` for product context and success metrics.
2. Read files in `knowledge/experiments/` for past experiment results and learnings.
3. Read `knowledge/metrics/` for baseline metric values.

## Gather Experiment Details

Ask these questions in sequence. Do not skip any.

### Step 1: Hypothesis

Ask:

> What was the hypothesis for this experiment? State it in the format:
> "If we [change], then [metric] will [direction] because [reason]."

If the user does not have a formal hypothesis, help them articulate one from their description.

### Step 2: Experiment Design

Ask:
- What were the variants? (control vs. treatment, or multiple treatments)
- What was the randomization unit? (user, session, device)
- What was the primary metric? Any secondary metrics?
- What was the minimum detectable effect (MDE) you designed for?
- How long has the experiment been running?
- What was the target sample size?

### Step 3: Results

Ask:
- What are the conversion rates (or metric values) for each variant?
- What is the sample size per variant?
- Do you have confidence intervals or p-values already calculated?
- Were there any issues during the experiment? (logging bugs, traffic allocation errors, external events)

If the user provides a URL to results (e.g., an analytics dashboard), use `WebFetch` to retrieve the data.

## Statistical Analysis

### Step 1: Power and Sample Size Check

Before analyzing results:
- Calculate whether the experiment had sufficient sample size for the stated MDE
- If underpowered, flag this prominently: the results may not be reliable
- Calculate the actual MDE detectable with the given sample size

### Step 2: Significance Assessment

Calculate or verify:
- **P-value**: Is p < 0.05 (or the team's chosen threshold)?
- **Confidence interval**: Does the CI exclude zero? What is the range of plausible effect sizes?
- **Effect size**: What is the observed effect? Is it practically significant (not just statistically significant)?

### Step 3: Validity Checks

Assess:
- **Sample Ratio Mismatch (SRM)**: Is the traffic split as expected? A significant deviation suggests a bug.
- **Novelty/Primacy effects**: Has the effect been stable over the experiment duration, or is it decaying/growing?
- **Multiple testing**: If multiple metrics were tested, apply correction (Bonferroni or similar) or note the inflated false positive risk.
- **Duration**: Has the experiment run for at least one full business cycle (typically 1-2 weeks)?

### Step 4: Segment Analysis (if data available)

If segment-level data is provided:
- Break results by key segments (platform, user cohort, geography, plan tier)
- Check for Simpson's paradox (overall effect differs from segment-level effects)
- Identify segments where the treatment performs significantly differently
- Flag if the experiment should be shipped for some segments but not others

## Recommendation Framework

Based on the analysis, recommend one of:

### Ship
Criteria: Statistically significant, practically meaningful, no negative secondary metrics, consistent across key segments.

### Kill
Criteria: Statistically significant negative result, or clearly insignificant result with sufficient power (the effect is not there).

### Extend
Criteria: Inconclusive results due to insufficient sample size or duration, or promising trend that needs more data.

### Iterate
Criteria: Mixed signals (positive primary, negative secondary), segment-level variation suggesting a more targeted approach would work.

For each recommendation, explain:
- The evidence supporting it
- What would change your mind
- If shipping: expected impact at full rollout
- If killing: what you learned and what to try next
- If extending: how much longer and what to watch for

## Check for MCP Integrations

Check if the following tools are available. Use them if present, skip gracefully if not:

- **Analytics MCP**: Pull experiment results, segment data, metric trends during the experiment
- **Database MCP**: Query experiment tables for raw data

## Output

Write to `knowledge/experiments/<experiment-slug>-analysis.md`.

Structure:

```
# Experiment Analysis: [Experiment Name]

| Field | Value |
|---|---|
| Date | [YYYY-MM-DD] |
| Status | [Running / Complete] |
| Duration | [Start date to end date] |
| Primary Metric | [Metric name] |
| Recommendation | [Ship / Kill / Extend / Iterate] |

## Hypothesis
[Formal hypothesis statement]

## Design
| Parameter | Value |
|---|---|
| Variants | [Control, Treatment A, Treatment B...] |
| Randomization Unit | [User / Session / etc.] |
| Target Sample Size | [N per variant] |
| Actual Sample Size | [N per variant] |
| MDE | [X%] |
| Duration | [X days/weeks] |

## Results

| Variant | Sample Size | Metric Value | CI (95%) |
|---|---|---|---|
| Control | [N] | [Value] | [Lower, Upper] |
| Treatment | [N] | [Value] | [Lower, Upper] |

- **Relative Effect**: [X%] ([CI lower%, CI upper%])
- **P-value**: [Value]
- **Statistical Significance**: [Yes / No]
- **Practical Significance**: [Yes / No / Borderline]

## Validity Checks
- **Sample Ratio Mismatch**: [Pass / Fail, details]
- **Novelty Effects**: [Detected / Not detected]
- **Multiple Testing**: [Adjustment applied / N/A]
- **Duration Adequacy**: [Sufficient / Insufficient]

## Segment Analysis
| Segment | Control | Treatment | Effect | Significant? |
|---|---|---|---|---|
| [Segment] | [Value] | [Value] | [Effect] | [Yes/No] |

## Recommendation: [Ship / Kill / Extend / Iterate]

### Evidence
- [Point 1]
- [Point 2]

### Expected Impact at Full Rollout
- [Projected metric improvement]
- [Revenue/engagement impact estimate]

### What Would Change This Recommendation
- [Condition 1]
- [Condition 2]

### Learnings
- [What we learned regardless of the outcome]

## Secondary Metrics
| Metric | Control | Treatment | Effect | Concern? |
|---|---|---|---|---|
| [Metric] | [Value] | [Value] | [Effect] | [Yes/No] |
```

Tell the user the file path and lead with:
- The recommendation in one sentence
- The strength of evidence (strong, moderate, weak)
- Any caveats that could change the conclusion
Experiment Review

Works with

Attribution

Comments (0)