In late 2022, Pioneer started reaching out to startups we discovered by running our own models on every website on the Internet. We scraped 100M domains per week and used ML to filter them down to the top 0.002% for human review.

Founders were often stunned we found them:

“It literally feels like black magic that you guys found [company]”
“How did you find us? We are still in stealth :)”
“Curious to know how you found my info & that I'm working on [company]. Genius.”
“Out of curiosity, how did you find [company] and link it to me? I haven't publicized it much yet so color me impressed. 🙂”
“I'm curious, how did you hear about [company]? We haven't officially launched yet, I'm impressed that you've heard about it.”
“Just curious: how did you find [company]? It hasn't been announced publicly yet.”
“Do you mind sharing how you found [company]? I'm curious because I'm keeping it in stealth mode, and never shared it in public.”
“This was a surprise for me, we’ve never promoted [company] outside of our close network (yet). How did you find us?”

While many of you are familiar with the Pioneer Tournament, you’ve likely never heard of this piece of software, something we called Dreamlifter, which was our internal focus for the past three years.


Dreamlifter started with a simple observation: one of the very first things technical founders often do—even before incorporating, opening a bank account, or quitting their day job—is put up a landing page. If somehow we could surface these websites right after they launched, we’d be able to find, fund, and support the best founders just as they were getting started.

While this was a very difficult problem to reason about (more on that below), a solution started taking shape. We’d continually scrape the entire internet, run machine learning models to find any interesting startup projects, and reach out to the people behind them. It wasn’t obvious that this experiment would work, but it did! Over 80% of our investments in 2023 were companies surfaced by this software.

Why is this problem so hard? A few reasons:

  • Lots of great startups initially have unimpressive landing pages. If you train your models on companies that have already raised money, you’re too late.
  • Because of this, we had to build a proprietary dataset of manually-labeled websites. Unfortunately, the question of what makes an early-stage startup website “good” is not clear cut, and indeed even changes over time—2 years ago, an AI writing assistant was a uniquely interesting idea! Label ambiguity is a serious issue for model performance, and therefore QA, data cleaning, and recalibration all had to be performed on a regular basis. Systematizing this type of workflow isn’t easy; just ask Andrej Karpathy.
  • “Needle in a haystack” problems are notoriously difficult in ML, and this problem in particular has a benchmark-obliterating class imbalance: less than 0.001% of websites are interesting early-stage startups.
  • Model performance is largely a function of parameter count, but so is compute cost. Running a very large model on hundreds of millions of scraped websites every day would be prohibitively expensive.

How did we solve these issues? A few key insights made it work:

  • A multi-stage model pipeline made it possible to run the whole system quickly and cost-effectively (a minimal pipeline sketch follows this list). We first ran a fast, recall-focused model to filter out over 90% of websites, then ran several much larger & more computationally expensive precision-focused models to filter, rank, and autolabel the remainder. While the first model looked at just website text, later models could incorporate a wider selection of features including DNS records, grammar mistakes, sitemaps and external links, the presence of various libraries, etc.
  • We framed the task as multi-class classification, with labels ranging from “not a startup” to “early-stage startup” plus an interest level (see the label sketch after this list). This gave us the flexibility to either make multi-class predictions or reduce to a binary classification problem, depending on the specific model in the pipeline.
  • While noisy data (as a result of label ambiguity) is still a problem worth remediating, we found that deep learning is actually rather robust to noise, so long as the evaluation dataset is error-free. So instead of optimizing training data quality, we focused our efforts on curating noise-free splits to tune hyperparameters (validation set), set confidence thresholds (calibration set), and measure real-world performance (evaluation set).
  • Given the continual concept drift of what constitutes an “interesting” startup, we developed an iterative workflow: labeling model output → using the new data to train a better model → labeling the better model’s output, and so on (the loop is sketched after this list).
  • To overcome issues arising from high class imbalance and low-confidence predictions, we leaned heavily on techniques like uncertainty estimation, probability calibration, and conformal prediction to identify good startups with high certainty (a conformal-prediction sketch rounds out the examples below).
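
To make the cascade concrete, here is a minimal sketch of how a recall-then-precision pipeline like this could be wired up. Everything in it is illustrative rather than Dreamlifter’s actual code: the `Site` record, the `cascade` function, the thresholds, and the rule that every precision-focused model must agree are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Site:
    domain: str
    text: str       # visible page text (stage-1 input)
    features: dict  # richer signals: DNS records, sitemap, external links, libraries, etc.

def cascade(
    sites: Iterable[Site],
    cheap_model: Callable[[str], float],               # fast, text-only, recall-focused scorer
    expensive_models: list[Callable[[Site], float]],   # slower, feature-rich, precision-focused scorers
    recall_threshold: float = 0.05,                    # permissive: keep anything plausibly interesting
    precision_threshold: float = 0.9,                  # strict: surface only high-confidence hits
) -> list[Site]:
    # Stage 1: keep only sites the fast text-only model finds plausibly interesting;
    # this is where the bulk of the pool gets pruned.
    survivors = [s for s in sites if cheap_model(s.text) >= recall_threshold]

    # Stage 2: every remaining site must clear every precision-focused model
    # (one possible combination rule).
    hits = []
    for site in survivors:
        scores = [m(site) for m in expensive_models]
        if min(scores) >= precision_threshold:
            hits.append(site)
    return hits
```

The important property is the asymmetry of the thresholds: the first stage errs toward keeping anything plausibly interesting, while the second stage only surfaces sites the larger models are confident about.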
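
Here is a rough sketch of what such a label scheme might look like, along with the reduction to a binary target; the class names and the cutoff for “interesting” are hypothetical.

```python
from enum import IntEnum

class Label(IntEnum):
    # Illustrative taxonomy; the actual labels and interest levels differed.
    NOT_A_STARTUP = 0
    STARTUP_LOW_INTEREST = 1
    STARTUP_MEDIUM_INTEREST = 2
    STARTUP_HIGH_INTEREST = 3

def to_binary(label: Label) -> int:
    # Collapse the multi-class taxonomy to "interesting early-stage startup or not"
    # for pipeline stages that only need a yes/no decision.
    return int(label >= Label.STARTUP_MEDIUM_INTEREST)
```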
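
The labeling loop itself is easy to sketch. The `labeling_loop` helper, its selection rule (rank the pool by the current model’s score), batch size, and round count are assumptions, not the exact workflow we ran.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # a candidate website record; any representation works here

def labeling_loop(
    model: Callable[[T], float],
    unlabeled: Sequence[T],
    human_label: Callable[[Sequence[T]], list],   # returns (item, label) pairs from reviewers
    train: Callable[[list], Callable[[T], float]],  # fits a new model on all labels so far
    rounds: int = 5,
    batch_size: int = 500,
) -> Callable[[T], float]:
    pool = list(unlabeled)
    labeled: list = []
    for _ in range(rounds):
        # Rank the remaining pool by the current model and send the top of it for human review.
        pool.sort(key=model, reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled += human_label(batch)
        # Retrain on everything labeled so far; the next round then labels the new
        # model's output, which also tracks drift in what counts as "interesting".
        model = train(labeled)
    return model
```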
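
Finally, for the conformal-prediction piece, here is a minimal split-conformal sketch. It assumes class probabilities from an upstream (ideally calibrated) model and a clean calibration split like the one described above; the helper names are illustrative.

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal threshold on nonconformity scores (1 - p(true class)).

    cal_probs: (n, n_classes) predicted probabilities on a clean calibration set
    cal_labels: (n,) integer labels for the same set
    alpha: target miscoverage rate (e.g. 0.1 -> roughly 90% coverage)
    """
    n = len(cal_labels)
    # Nonconformity: how much probability mass the model failed to put on the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile, per standard split conformal prediction.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, q_level, method="higher"))

def prediction_set(probs: np.ndarray, qhat: float) -> list:
    # A class is included whenever its nonconformity score is below the threshold.
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]
```

A website would then only be flagged when its prediction set collapses to a single “interesting” class at the chosen coverage level.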

As a result of this work, we built what we believe is the best software in the world for discovering promising new startup websites to invest in. Had we run today’s pipeline retroactively, an analyst on our team would have seen about 60% of in-thesis YC S23 companies[1] before the batch even began (and, had we kept working on it, there was a path to getting that number as high as 70-75%).

Of course, this is only a small piece of the seed-stage sourcing puzzle; many investors, for example, opt for a LinkedIn-oriented approach, where they search for founders based on their credentials. We landed on our particular strategy because we were galvanized by the idea of betting on founders regardless of their backgrounds.


[1] “in-thesis” excludes companies in industries we haven’t recently invested in, like e-commerce, capital-intensive biotech & healthcare, and more. It also excludes companies that don’t have websites before the batch begins.