
Our ML-powered startup discovery pipeline

In late 2022, Pioneer started reaching out to startups we discovered by running our own models on every website on the Internet. We scraped 100M domains per week and used ML to filter them down to the top 0.002% for human review.

Founders were often stunned we found them:

"It literally feels like black magic that you guys found [company]"
"How did you find us? We are still in stealth :)"
"Curious to know how you found my info & that I'm working on [company]. Genius."
"Out of curiosity, how did you find [company] and link it to me? I haven't publicized it much yet so color me impressed. 🙂"
"I'm curious, how did you hear about [company]? We haven't officially launched yet, I'm impressed that you've heard about it."
"Just curious: how did you find [company]? It hasn't been announced publicly yet."
"Do you mind sharing how you found [company]? I'm curious because I'm keeping it in stealth mode, and never shared it in public."
"This was a surprise for me, we've never promoted [company] outside of our close network (yet). How did you find us?"

While many of you are familiar with the Pioneer Tournament, you've likely never heard of this piece of software, which we called Dreamlifter and which was our internal focus for the past 3 years.


Dreamlifter started with a simple observation: one of the very first things technical founders often do—even before incorporating, opening a bank account, or quitting their day job—is put up a landing page. If somehow we could surface these websites right after they launched, we'd be able to find, fund, and support the best founders just as they were getting started.

While this was a very difficult problem to reason about (more on that below), a solution started taking shape. We'd continually scrape the entire Internet, run machine learning models to find any interesting startup projects, and reach out to the people behind them. It wasn't obvious that this experiment would work, but it did! Over 80% of our investments in 2023 were companies surfaced by this software.
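To give a feel for the scale of that filter, here is a minimal sketch of a "score everything, keep the top fraction" step. It is not Dreamlifter's actual code: the function names, the example domains, and the scores are all hypothetical, and the real models are not described in this post. The only numbers taken from the post are 100M domains per week and a top 0.002% cutoff.

```python
import heapq
from typing import Iterable, List, Tuple


def review_queue_size(total_domains: int, keep_fraction: float) -> int:
    """How many domains reach human review for a given keep fraction."""
    return round(total_domains * keep_fraction)


def top_fraction(
    scored: Iterable[Tuple[str, float]], keep_fraction: float
) -> List[Tuple[str, float]]:
    """Return the highest-scoring (domain, score) pairs, best first.

    `scored` stands in for the output of some model; the scores here are
    made up for illustration. At least one item is kept when any exist.
    """
    items = list(scored)
    if not items:
        return []
    k = max(1, round(len(items) * keep_fraction))
    return heapq.nlargest(k, items, key=lambda pair: pair[1])


if __name__ == "__main__":
    # 0.002% expressed as a fraction is 0.00002.
    print(review_queue_size(100_000_000, 0.00002))

    # Hypothetical scored domains, purely for illustration.
    sample = [(f"site{i}.example", i / 10) for i in range(10)]
    print(top_fraction(sample, 0.2))
```

With the figures quoted earlier in this post, a 0.002% cutoff on 100M domains works out to roughly 2,000 sites per week for human review.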

Why is this problem so hard? A few reasons:

How did we solve these issues? There were a few particular insights that made it work:

As a result of this work, we built what we believe is the best software in the world for discovering promising new startup websites to invest in. Had we run today's pipeline retroactively, an analyst on our team would have seen about 60% of in-thesis YC S23 companies[1] before the batch even began (and, were we still working on it, there was a path to getting that number as high as 70-75%).

Of course, this is only a small piece of the seed-stage sourcing puzzle; many investors, for example, opt for a LinkedIn-oriented approach, where they search for founders based on their credentials. We landed on our particular strategy because we were galvanized by the idea of betting on founders regardless of their backgrounds.


[1] "in-thesis" excludes companies in industries we haven't recently invested in, like e-commerce, capital-intensive biotech & healthcare, and more. It also excludes companies that don't have websites before the batch begins.