Journal · March 6, 2026

What is the IPIP-NEO-120, and why we use it

Everything about the public-domain Big Five personality test that powers Are We Friends? — its history, validation, and what makes it different.

If you signed up for Are We Friends? and had to take a 120-item personality test, you took the IPIP-NEO-120. This page explains what it is, where it came from, and why we picked it over every other Big Five instrument out there.

If you'd rather just take it: the free standalone version is here.

The short version

The IPIP-NEO-120 is a 120-item public-domain Big Five personality test, developed by John A. Johnson based on Lewis R. Goldberg's International Personality Item Pool (IPIP). It scores all five Big Five domains plus their 30 underlying facets. It takes ~15 minutes. It has hundreds of academic citations. It's free for anyone to administer.

It's the gold standard for cheap, research-grade personality scoring. We use it because it's the best instrument that meets three constraints: real psychometric validity, full 30-facet resolution, and reasonable testing time.

The history

Modern personality psychology converged on the five-factor model (Big Five) through three independent research traditions in the 1980s. The dominant commercial instrument that emerged was the NEO-PI-R (Costa & McCrae's Revised NEO Personality Inventory), a 240-item proprietary test sold through Psychological Assessment Resources.

The NEO-PI-R is excellent — but expensive, slow (~45 minutes), and tightly licensed. You can't use it in a free product. You can't even reproduce a subset of items in a paper without paying.

In 1996, Goldberg started publishing public-domain personality items — questions worded to capture the same constructs as the NEO-PI-R but with no copyright restrictions. This grew into the International Personality Item Pool (IPIP), now hosting thousands of items across hundreds of constructs at ipip.ori.org. The original 300-item IPIP-NEO was a public-domain analog to the NEO-PI-R.

Johnson's contribution was to take the IPIP-NEO-300 and produce a 120-item version that retained the 30-facet resolution while cutting testing time to roughly a third (about 15 minutes instead of 45). Each facet is measured by exactly 4 items, and the inventory mixes positively keyed and reverse-keyed items, which helps detect random clicking and consistent-response bias. The result was published in 2014 (Measuring thirty facets of the Five Factor Model with a 120-item public domain inventory: Development of the IPIP-NEO-120, Journal of Research in Personality, 51, 78-89).

The shorter test was meant for situations where 45 minutes is too long but a true Big Five profile (not just five domain scores) is still needed. Friendship matching is exactly that situation.

Why 120 items, not 50 or 240

A reasonable question. Why not just use a 50-item Big Five test? Why bother with 120?

The answer is per-facet item count. Personality scoring works by averaging multiple items into a facet score, then averaging facets into a domain score. The reliability of any facet depends on how many items measure it.

50-item Big Five tests: 10 items per domain, 0 facets. You get the five domain scores and that's it.
120-item IPIP-NEO-120: 4 items per facet, 24 items per domain. You get all 30 facets at decent reliability.
300-item IPIP-NEO-300: 10 items per facet (the 240-item NEO-PI-R has 8). Higher reliability, much longer test.

Four items per facet is about the minimum at which facet scores are stable enough to use for matching. Below that, facet scores wobble between sittings: a friend who took the same test twice on different days would get notably different facet scores. That's noise, not signal, and you can't rank-match on noise.

Above 4 items per facet, each added item buys less precision than the one before it. The 300-item version is more accurate, but it needs 2.5 times as many items to get there. For our use case (friendship matching, not clinical assessment), 120 items at 4 per facet hits the sweet spot.
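The diminishing return can be illustrated with the Spearman-Brown prediction formula, a standard psychometric result for estimating how reliability changes as a scale gets longer. This is a sketch under stated assumptions, not a figure from Johnson's paper: the single-item reliability of 0.35 is an assumed value chosen for illustration, and the function name is hypothetical.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a scale built from n_items parallel items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Assumed single-item reliability, for illustration only.
R_ITEM = 0.35

for n in (1, 2, 4, 8, 10):
    predicted = spearman_brown(R_ITEM, n)
    print(f"{n:>2} items per facet -> predicted reliability {predicted:.2f}")
```

Under that assumption, going from 1 item to 4 adds about 0.33 of predicted reliability, while going from 4 items to 10 adds about 0.16 for twice as many extra items.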

What the 30 facets are

Each Big Five domain has six underlying facets:

Openness to Experience: Imagination, Artistic Interests, Emotionality, Adventurousness, Intellect, Liberalism

Conscientiousness: Self-Efficacy, Orderliness, Dutifulness, Achievement-Striving, Self-Discipline, Cautiousness

Extraversion: Friendliness, Gregariousness, Assertiveness, Activity Level, Excitement-Seeking, Cheerfulness

Agreeableness: Trust, Morality, Altruism, Cooperation, Modesty, Sympathy

Neuroticism: Anxiety, Anger, Depression, Self-Consciousness, Immoderation, Vulnerability

Two people with identical Big Five domain scores can have wildly different facet patterns. The facets are where the actual signal for friendship matching lives.
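To make that concrete, here is a small sketch with two made-up Extraversion profiles. The facet scores (on an assumed 1-5 scale) are invented for illustration, the helper names are hypothetical, and the domain score is taken as the plain mean of the six facets, as described earlier.

```python
from statistics import mean

FACETS = ["Friendliness", "Gregariousness", "Assertiveness",
          "Activity Level", "Excitement-Seeking", "Cheerfulness"]

# Made-up facet scores on an assumed 1-5 scale.
person_a = dict(zip(FACETS, [4.5, 4.5, 2.0, 2.0, 2.5, 4.5]))
person_b = dict(zip(FACETS, [2.0, 2.0, 4.5, 4.5, 4.5, 2.5]))


def domain_score(facets: dict[str, float]) -> float:
    """Domain score as the plain mean of the six facet scores."""
    return mean(facets.values())


def facet_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Mean absolute difference between two people, facet by facet."""
    return mean(abs(a[f] - b[f]) for f in FACETS)


print(domain_score(person_a) == domain_score(person_b))  # True
print(round(facet_distance(person_a, person_b), 2))      # 2.33
```

Both people land on the same Extraversion score, yet they differ by more than two scale points on the average facet: one is warm and sociable but low-energy, the other assertive and thrill-seeking but reserved.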

The validation status

A few specific things you'd want to know about psychometric validity:

Test-retest reliability. Johnson reports test-retest correlations in the 0.80-0.90 range across a several-week interval. That means people who take the test twice get similar scores — the test is measuring something stable, not a fluctuating mood.

Internal consistency. Cronbach's alpha for the domain scales is in the 0.85-0.92 range, comparable to the longer NEO-PI-R. For facets, alpha is in the 0.65-0.85 range, which is adequate at the 4-item-per-facet level but lower than what the 8-item facets of the NEO-PI-R achieve.
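For readers who want to see what alpha measures, the following is a minimal sketch of the standard Cronbach's alpha formula applied to one 4-item facet. The response data are made up, the function name is hypothetical, and responses are assumed to be reverse-scored already.

```python
from statistics import pvariance


def cronbach_alpha(responses: list[list[int]]) -> float:
    """responses[p][i] is person p's (already reverse-scored) answer to item i."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up answers from five people to the four items of one facet.
sample = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]
print(f"alpha = {cronbach_alpha(sample):.2f}")
```

Alpha rises when people who score high on one item also score high on the others, which is why it is read as a measure of how well a facet's items hang together.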

Convergent validity. Scores on the IPIP-NEO-120 correlate strongly (r ~0.85-0.92) with scores on the NEO-PI-R for the same individual. They're measuring the same constructs.

Use in published research. Hundreds of peer-reviewed papers have used the IPIP-NEO-120 since publication. It's a workhorse instrument in research contexts where the NEO-PI-R is too expensive to license at scale.

This is the difference between a research-validated psychometric instrument and a Buzzfeed quiz. A Buzzfeed quiz has none of these properties. The IPIP-NEO-120 has all of them.

How we administer it

A few things that matter about how Are We Friends? handles the test:

All 120 items are required. We don't let users skip items or take a shorter version, because partial tests produce too much per-facet noise to rank on. Yes, 15 minutes is a long onboarding step. The matching depth depends on it.

Item order is randomized to discourage lazy-clicking patterns. A facet's items are spread across the test rather than presented in clumps, so consistent-response bias (rating everything the same) is detectable.

Reverse-keyed items are flagged only at scoring time, not at presentation. Users see every item in its natural wording, and we handle the keying invisibly. This avoids the "wait, am I supposed to flip my answer here?" confusion.
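As a rough illustration of keying applied at scoring time, here is a minimal sketch. It assumes a 5-point response scale and uses hypothetical function names and made-up answers; it is not a description of our production scoring code.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point response scale


def score_item(response: int, reverse_keyed: bool) -> int:
    """Flip reverse-keyed answers so a high score always means more of the trait."""
    if not SCALE_MIN <= response <= SCALE_MAX:
        raise ValueError(f"response {response} is outside the scale")
    return SCALE_MIN + SCALE_MAX - response if reverse_keyed else response


def facet_score(responses: list[int], reverse_flags: list[bool]) -> float:
    """Facet score as the mean of its keyed item scores."""
    return mean(score_item(r, k) for r, k in zip(responses, reverse_flags))


# Hypothetical facet with two positively keyed and two reverse-keyed items.
answers = [5, 4, 1, 2]
reverse = [False, False, True, True]
print(facet_score(answers, reverse))  # 4.5
```

The user only ever supplies the raw answers; the flip from 1 to 5 and from 2 to 4 happens after submission.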

The test is auto-saved between sessions. You can stop at item 47, close the tab, come back tomorrow, and continue where you left off. Very few users take the full 15 minutes in one sitting; most people do it across 2-3 sittings over a day or two.

The full 30-facet result is shown to the user. Some products only show domain scores; we show every facet. We think this is good for the test-taker (more useful self-knowledge) and good for the matching (members can see both their own profile and their matches' profiles in detail).

Why not the longer version?

Tradeoffs.

The IPIP-NEO-300 is more reliable per facet (10 items each instead of 4). If we cared only about psychometric perfection, we'd use it. But:

45 minutes is a brutal onboarding step. Drop-off rates would be much higher.
The marginal reliability gain doesn't translate to better matches at the resolution we care about. Once a facet is reliable enough to rank on, more reliability doesn't move the matching outcome much.
Updating your profile (re-taking the test as you change) becomes a 45-minute decision instead of a 15-minute one. We want users to feel comfortable re-taking the test annually.

The 120-item version is the right balance for our use case.

Why not MBTI?

This comes up a lot. MBTI is the most-known personality framework in popular culture. Why don't we use it?

A few reasons, each significant:

MBTI's test-retest reliability is poor. Roughly 50% of people get a different 4-letter type when they re-take the test. That's because MBTI forces continuous traits into binary categories — if you score 51% on Introversion you're an "I," if 49% you're an "E," and the cutoff is arbitrary. The IPIP-NEO-120 reports continuous percentile scores, so 51% and 49% are nearly identical and stable.
MBTI's four dimensions aren't truly independent. Factor analyses on MBTI data don't recover the four hypothesized factors cleanly. The Big Five dimensions, by contrast, were derived from factor analysis, so they are designed to be largely independent of one another.
MBTI's empirical validity for predicting outcomes is weak. Few rigorous studies show MBTI types predicting much of practical importance. The Big Five domains predict job performance, marital stability, longevity, mental health, and many other outcomes at moderate effect sizes.
MBTI is proprietary. The official test is owned by The Myers-Briggs Company and licensed expensively. Public-domain Big Five instruments like IPIP-NEO are openly available.
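The first point above, about forcing a continuous trait into a binary type, can be sketched in a few lines. The 0-100 preference scale, the cutoff of 50, and the function name are assumptions made for illustration; this is not the official MBTI scoring procedure.

```python
CUTOFF = 50.0  # assumed midpoint on a 0-100 preference scale


def letter(introversion_pct: float) -> str:
    """Collapse a continuous introversion score into a binary type letter."""
    return "I" if introversion_pct >= CUTOFF else "E"


first_sitting, second_sitting = 51.0, 49.0

print(letter(first_sitting), letter(second_sitting))  # I E
print(abs(first_sitting - second_sitting))            # 2.0
```

A two-point shift, well within ordinary measurement noise, changes the reported type, while the continuous score barely moves.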

If you came in knowing your MBTI type, the rough mapping to Big Five is: I/E ≈ Extraversion (inverted), N/S ≈ Openness, T/F ≈ Agreeableness, J/P ≈ Conscientiousness. The full bridge is at /big-five-vs-mbti.

How to read your scores

If you've taken the test, your scores will be presented as percentile ranks against the IPIP normative sample. A 70 on Openness means you scored higher than 70% of the people in the comparison sample.
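A percentile rank of this kind can be computed as follows. This is a sketch only: the comparison sample is made up, the function name is hypothetical, and ties are handled by counting only strictly lower scores, which is one of several common conventions.

```python
from bisect import bisect_left


def percentile_rank(raw_score: float, norm_scores: list[float]) -> float:
    """Percentage of the comparison sample scoring strictly below raw_score."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)
    return 100.0 * below / len(ordered)


# Made-up comparison sample of ten raw Openness scores.
norm_sample = [52, 60, 61, 67, 70, 73, 75, 80, 84, 91]

print(percentile_rank(79, norm_sample))  # 70.0
```

Here a raw score of 79 beats 7 of the 10 people in the sample, so it is reported as a 70.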

We have a full guide to interpreting your scores here. The short version: 50 is exactly average, 60-79 is moderately distinctive, 80+ is genuinely distinctive (top 1-in-5), 95+ is rare and worth taking seriously as a defining feature of your personality.

The bottom line

The IPIP-NEO-120 is the public-domain 120-item Big Five test that powers Are We Friends? matching. It's faster than the gold-standard 300-item version while retaining the 30-facet resolution that makes facet-level matching possible. It's research-validated, free to administer, and academically respected.

If you haven't taken it yet, the free standalone version is here. If you want to use the result for friend matching, sign up at arewefriends.org. The whole thing exists because there wasn't a friendship app built on real personality science yet, and we thought there should be.
