MIT Technology Review » Artificial Intelligence, December 4, 2024
The startup trying to turn the web into a database

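Exa is a startup building Websets, a new kind of search engine based on large language models. Unlike traditional search engines, Exa encodes the contents of web pages as embeddings and uses large language model technology to predict relevant links, aiming to return results that are more precise and closer to what the user wants. Websets is particularly suited to complex queries such as finding specific types of people or companies, for example “startups making futuristic hardware.” Although it is currently slow, Exa’s approach could fundamentally change how search is done and give investors, recruiters, and others more powerful access to data.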

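🤔 Exa uses large language model technology to encode web page contents as embeddings instead of indexing keywords as traditional search engines do, so it can capture the meaning of a page rather than just match keywords.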

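🚀 Exa’s search engine returns results by predicting the next link rather than matching keywords. For example, a search for “startups making futuristic hardware” returns a list of specific companies rather than hit-or-miss links to web pages that mention those terms.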

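⏳ Exa’s searches are slow and can take several minutes, but they can return thousands or even tens of thousands of results, which suits users who need large amounts of data.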

💡 Exa’s search engine excels at complex queries, such as finding specific types of people or companies, which traditional search engines handle poorly.

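⚠️ For simple lookups, such as a person’s relationships or personal details, Exa currently falls short of traditional search engines. Exa is working to improve its embedding method to address this.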

A startup called Exa is pitching a new spin on generative search. It uses the tech behind large language models to return lists of results that it claims are more on point than those from its rivals, including Google and OpenAI. The aim is to turn the internet’s vast and chaotic tangle of web pages into a lookup table, where queries return specific, precise results.

Exa already provides its search engine as a back-end service to companies that want to build their own applications on top of it. Today it is launching the first consumer version of that search engine, called Websets.  

“The web is a collection of data, but it’s a mess,” says Exa cofounder and CEO Will Bryk. “There’s a Joe Rogan video over here, an Atlantic article over there. There’s no organization. But the dream is for the web to feel like a database.”

Websets is aimed at power users who need to look for things that other search engines aren’t great at finding, such as types of people or companies. Ask it for “startups making futuristic hardware” and you get a list of specific companies hundreds long rather than hit-or-miss links to web pages that mention those terms. Google can’t do that, says Bryk: “There’s a lot of valuable use cases for investors or recruiters or really anyone who wants any sort of data set from the web.”

Things have moved fast since MIT Technology Review broke the news in 2021 that Google researchers were exploring the use of large language models in a new kind of search engine. The idea soon attracted fierce critics. But tech companies took little notice. Three years on, giants like Google and Microsoft jostle with a raft of buzzy newcomers like Perplexity and OpenAI, which launched ChatGPT Search in October, for a piece of this hot new trend.

Exa isn’t (yet) trying to out-do any of those companies. Instead, it’s proposing something new. Most other search firms wrap large language models around existing search engines, using the models to analyze a user’s query and then summarize the results. But the search engines themselves haven’t changed much. Perplexity still directs its queries to Google Search or Bing, for example. Think of today’s AI search engines as sandwiches with fresh bread but stale filling.

Exa provides users with familiar lists of links but uses the tech behind large language models to reinvent how search itself is done. Here’s the basic idea: Google works by crawling the web and building a vast index of keywords that then get matched to users’ queries. Exa crawls the web and encodes the contents of web pages into a format known as embeddings, which can be processed by large language models.

Embeddings turn words into numbers in such a way that words with similar meanings become numbers with similar values. In effect, this lets Exa capture the meaning of text on web pages, not just the keywords.
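To make the idea concrete, here is a minimal sketch of how “similar meanings become similar numbers” can be measured. The three-dimensional vectors below are hand-picked, hypothetical values for illustration only; real embedding models learn vectors with hundreds or thousands of dimensions, and nothing here reflects Exa’s actual model. Cosine similarity is one common way to compare two embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
# Real models learn much larger vectors from data.
TOY_EMBEDDINGS = {
    "startup": [0.9, 0.1, 0.2],
    "company": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with related meanings should score closer to 1.0 than unrelated ones.
sim_close = cosine_similarity(TOY_EMBEDDINGS["startup"], TOY_EMBEDDINGS["company"])
sim_far = cosine_similarity(TOY_EMBEDDINGS["startup"], TOY_EMBEDDINGS["banana"])

print(f"startup vs company: {sim_close:.3f}")
print(f"startup vs banana:  {sim_far:.3f}")
```

With these toy values, “startup” lands much closer to “company” than to “banana,” which is the property a search engine built on embeddings relies on.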

A screenshot of Websets showing results for the search: “companies; startups; US-based; healthcare focus; technical co-founder”

Large language models use embeddings to predict the next words in a sentence. Exa’s search engine predicts the next link. Type “startups making futuristic hardware” and the model will come up with (real) links that might follow that phrase.
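The article does not describe how Exa’s model predicts links, so the following is only a rough stand-in: a nearest-neighbor lookup that ranks pre-encoded pages by how close their embeddings are to the query’s embedding. The URLs, the vectors, and the `top_links` helper are all hypothetical, chosen to show the general shape of meaning-based retrieval rather than Exa’s method.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical pages with made-up, pre-computed embeddings.
PAGE_INDEX = {
    "https://example.com/robot-arms": [0.9, 0.8, 0.1],
    "https://example.com/fusion-reactor-startup": [0.8, 0.9, 0.2],
    "https://example.com/sourdough-recipe": [0.1, 0.0, 0.9],
}


def top_links(query_vector, index, k=2):
    """Return the k URLs whose embeddings are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [url for url, _ in ranked[:k]]


# Made-up vector standing in for the encoded query
# "startups making futuristic hardware".
query_vector = [0.85, 0.85, 0.1]
results = top_links(query_vector, PAGE_INDEX)

for url in results:
    print(url)
```

Note that none of the toy pages has to contain the words of the query; the ranking depends only on how close the vectors are.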

Exa’s approach comes at a cost, however. Encoding pages rather than indexing keywords is slow and expensive. Exa has encoded some billion web pages, says Bryk. That’s tiny next to Google, which has indexed around a trillion. But Bryk doesn’t see this as a problem: “You don’t have to embed the whole web to be useful,” he says. (Fun fact: “exa” means a 1 followed by 18 0s and “googol” means a 1 followed by 100 0s.)

Websets is very slow at returning results. A search can sometimes take several minutes. But Bryk claims it’s worth it. “A lot of our customers started to ask for, like, thousands of results, or tens of thousands,” he says. “And they were okay with going to get a cup of coffee and coming back to a huge list.”

“I find Exa most useful when I don’t know exactly what I’m looking for,” says Andrew Gao, a computer science student at Stanford University who has used the search engine. “For instance, the query ‘an interesting blog post on LLMs in finance’ works better on Exa than Perplexity.” But they’re good at different things, he says: “I use both for different purposes.”

“I think embeddings are a great way to represent entities like real-world people, places, and things,” says Mike Tung, CEO of Diffbot, a company using knowledge graphs to build yet another kind of search engine. But he notes that you lose a lot of information if you try to embed whole sentences or pages of text: “Representing War and Peace as a single embedding would lose nearly all of the specific events that happened in that story, leaving just a general sense of its genre and period.”

Bryk acknowledges that Exa is a work in progress. He points to other limitations, too. Exa is not as good as rival search engines if you just want to look up a single piece of information, such as the name of Taylor Swift’s boyfriend or who Will Bryk is: “It’ll give a lot of Polish-sounding people, because my last name is Polish and embeddings are bad at matching exact keywords,” he says.

For now Exa gets around this by throwing keywords back into the mix when they’re needed. But Bryk is bullish: “We’re covering up the gaps in the embedding method until the embedding method gets so good that we don’t need to cover up the gaps.”
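One way to picture “throwing keywords back into the mix” is a hybrid score that blends an embedding-based similarity with a plain exact-term match. The sketch below is an assumption about the general technique, not a description of Exa’s system: the `keyword_weight` parameter, the helper names, the page titles, and the semantic scores are all invented for illustration.

```python
import re


def keyword_overlap(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    text_terms = set(re.findall(r"\w+", text.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_score(semantic_score, query, text, keyword_weight=0.5):
    """Blend a semantic similarity score with an exact-keyword score."""
    return (1 - keyword_weight) * semantic_score + keyword_weight * keyword_overlap(
        query, text
    )


query = "Will Bryk"

# Hypothetical candidates. The semantic scores are made up to mimic the
# failure described above, where an unrelated page edges out the exact match.
pages = [
    {"title": "Will Bryk, cofounder and CEO of Exa", "semantic_score": 0.70},
    {"title": "Profile of a Polish software engineer", "semantic_score": 0.75},
]

ranked = sorted(
    pages,
    key=lambda page: hybrid_score(page["semantic_score"], query, page["title"]),
    reverse=True,
)

for page in ranked:
    print(page["title"])
```

On semantic score alone the second page would rank first; adding the keyword term pulls the page that actually contains the name back to the top.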
