LessWrong, May 28, 12:07
If you're not sure how to sort a list or grid—seriate it!

 

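This article introduces seriation, a data-analysis method for finding a suitable linear order for a set of objects. Because of its combinatorial nature, seriation is challenging for large datasets, but both exact solution methods and heuristics exist. The article highlights the R package `seriation`, which provides an infrastructure for seriation, including data structures, ordering methods, loss-function computation, and visualization techniques. It also explores seriation's potential in practical applications, such as organizing link recommendations, ordering tag entries, and sorting images, and describes using an LLM to automate seriation.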

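🧩 Seriation is a data-analysis technique for finding a linear ordering of a dataset; even without an explicit sorting criterion, it yields an approximate ordering by minimizing the differences between adjacent items.

💡 The R package `seriation` provides a complete infrastructure for seriation: data structures that represent linear orders, a variety of seriation methods, a method for computing loss functions, and visualization techniques.

🖼️ Seriation applies in many settings, such as organizing similar-link recommendations on Gwern.net, ordering tag entries to improve readability, and ordering collections of images such as photographs.

🤖 The article also mentions the possibility of using an LLM (such as GPT-4o) to seriate automatically, for example to organize notes or tidy lists of text: the LLM reorders and cleans up the data to give it a better structure.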

Published on May 28, 2025 3:54 AM GMT

"Getting Things in Order: An Introduction to the R Package seriation", Hahsler et al 2008:

Seriation [or "ordination"], i.e., finding a suitable linear order for a set of objects given data and a loss or merit function, is a basic problem in data analysis. Caused by the problem's combinatorial nature, it is hard to solve for all but very small sets. Nevertheless, both exact solution methods and heuristics are available.

In this paper we present the package seriation which provides an infrastructure for seriation with R. The infrastructure comprises data structures to represent linear orders as permutation vectors, a wide array of seriation methods using a consistent interface, a method to calculate the value of various loss and merit functions, and several visualization techniques which build on seriation.

To illustrate how easily the package can be applied for a variety of applications, a comprehensive collection of examples is presented.
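
To make the abstract's vocabulary concrete, here is a minimal Python sketch (not the R package's API) of a linear order stored as a permutation vector, one possible loss function (the summed distance between adjacent items), and an exact solver that simply tries every order. The function names and the toy 1-D data are assumptions for illustration only.

```python
from itertools import permutations

def path_length(order, dist):
    """Loss of a linear order: sum of distances between adjacent items."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def exact_seriation(dist):
    """Exact solution by brute force: evaluates all n! permutation vectors,
    so it is only feasible for very small n."""
    n = len(dist)
    return min(permutations(range(n)), key=lambda p: path_length(p, dist))

# Toy data: five objects described by a single number each.
points = [0.0, 5.0, 1.0, 4.0, 2.0]
dist = [[abs(a - b) for b in points] for a in points]

best = exact_seriation(dist)
print(best)                     # (0, 2, 4, 3, 1)
print(path_length(best, dist))  # 5.0
```

With only 5 objects there are 120 orders to check; at 15 objects there are already more than a trillion, which is why heuristics matter in practice.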


Have you ever wondered how to sort a list or a folder of files where no strict sorting comparison operator like 'newer than' is quite right? It turns out that it is perfectly possible to loosen the definition of 'sorting' to something more approximate like 'try to minimize how different each item is from the next one'; this approximate or generalized sorting is called 'seriation'. It is obscure (I had never heard of the term until a year ago or so), but highly useful: it works for everything from seriating Egyptian graves by rough burial time to organizing tag entries by topic. Since we now have neural embeddings for just about every modality there is, that means you can seriate anything.

What might you use it for? I use it on Gwern.net (background) to organize the 'similar links' recommendations in a way much smarter than the naive k-NN embedding distance retrieval approach. If you just sort them by 'distance', it is mostly meaningless and produces a jumble; if you seriate them, however, suddenly you see clear clusters/topics emerge out of the chaos, and it's easier to skim the list. Because it works so well, and is so simple to implement (simple greedy distance minimization, no need for TSP solvers), I initially called it "sort by magic". This could be used to seriate LW2 tag-entries by something more relevant than either just date or upvotes. It could also be used to present the tags themselves as a big 2D grid. (A tag embedding is the average of all its members' embeddings. Then you just seriate them as if they were normal tagged items.) You can do further nice tricks with it, like inferring the number of clusters/topics from where the distances between adjacent items spike, and labeling them automatically with an LLM. I also sometimes seriate ordinary lists of text in my writing when I can't come up with a meaningful order myself.
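
As a rough illustration of the ideas in this paragraph, the sketch below shows one way to read "greedy distance minimization": start from one item and keep appending the nearest unvisited one. It also shows the distance-spike trick for guessing where clusters end, and the averaging of member embeddings into a tag embedding. This is an assumption about the approach, not the actual Gwern.net code; the function names, the cosine distance, the 2x-median spike threshold, and the toy 2-D "embeddings" are all hypothetical.

```python
import math
import statistics

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def greedy_seriate(embeddings, start=0):
    """Nearest-neighbor chain: repeatedly append the closest unvisited item."""
    remaining = set(range(len(embeddings))) - {start}
    order = [start]
    while remaining:
        last = embeddings[order[-1]]
        # Ties are broken by index so the result is deterministic.
        nearest = min(remaining, key=lambda i: (cosine_distance(last, embeddings[i]), i))
        order.append(nearest)
        remaining.remove(nearest)
    return order

def cluster_breaks(order, embeddings, factor=2.0):
    """Positions in the order where the step distance exceeds factor x median."""
    steps = [cosine_distance(embeddings[a], embeddings[b])
             for a, b in zip(order, order[1:])]
    median = statistics.median(steps)
    return [i + 1 for i, d in enumerate(steps) if d > factor * median]

def mean_embedding(vectors):
    """A tag embedding as the element-wise average of its members' embeddings."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy 2-D "embeddings": animals point along x, hardware along y.
items = ["cats", "gpus", "dogs", "cpus", "birds", "ram"]
embeddings = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2], [0.2, 1.0], [1.0, 0.3], [0.3, 1.0]]

order = greedy_seriate(embeddings)
breaks = cluster_breaks(order, embeddings)
print([items[i] for i in order])  # ['cats', 'dogs', 'birds', 'ram', 'cpus', 'gpus']
print(breaks)                     # [3]
```

The single break at position 3 splits the seriated list into an animal group and a hardware group. The greedy chain depends on the starting item and can leave a poor jump near the end of the order, which is the price of skipping a proper TSP solver.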

Seriation also works beautifully for photographs or other big jumbles of images, where usually there is no way to 'sort' them that matters. (The filenames may be hash gibberish, and the file-created times arbitrary, or indicate nothing more than when you happened to download a file.)

I've also played around with an OpenAI API LLM script for doing seriation on short natural language text lists, which could be used to automatically seriate lists in anything you write, or to clean up messy notes. (For example, you could seriate your chaotic, incoherent notes, then tell an LLM to rewrite them strictly line by line, and then, with the organization and grammar/spelling all cleaned up, start working on them yourself.) It works, but the data formatting issues are a bit tricky, so I think I may have to scrap my little prototype and start over with a JSON/'function-calling'-style approach, which I am not too familiar with. But it's not hard: a GPT-4o-scale LLM understands a prompt like "Reorder them to group similar items, but do NOT rename, add, or remove any items." pretty well, so you can easily make your own tool.
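
One way to tame the formatting issues is to check that the model's reply is an exact reordering of the input before accepting it. The sketch below is not the prototype described above: it wraps the quoted prompt in a hypothetical line-per-item format, and `call_llm` is a stand-in for whatever function sends a prompt to your LLM and returns its text reply. No real API call is made here; a stub plays the model.

```python
from collections import Counter

PROMPT = "Reorder them to group similar items, but do NOT rename, add, or remove any items."

def build_prompt(items):
    """Render the list one item per line (hypothetical format) plus the instruction."""
    listing = "\n".join(f"- {item}" for item in items)
    return f"{listing}\n\n{PROMPT}\nReturn one item per line, each prefixed with '- '."

def parse_reply(reply):
    """Keep only lines that look like list items; ignore any extra chatter."""
    return [line[2:].strip() for line in reply.splitlines() if line.startswith("- ")]

def is_valid_reordering(original, reordered):
    """True only if nothing was renamed, added, or removed (duplicates counted)."""
    return Counter(s.strip() for s in original) == Counter(reordered)

def seriate_with_llm(items, call_llm, max_attempts=3):
    """Ask the model to reorder; retry on invalid replies, else keep the original order."""
    prompt = build_prompt(items)
    for _ in range(max_attempts):
        reordered = parse_reply(call_llm(prompt))
        if is_valid_reordering(items, reordered):
            return reordered
    return list(items)

# Stub standing in for a real model call, for demonstration only.
def fake_llm(prompt):
    return "Sure!\n- apple\n- pear\n- hammer\n- wrench"

items = ["hammer", "apple", "wrench", "pear"]
print(seriate_with_llm(items, fake_llm))  # ['apple', 'pear', 'hammer', 'wrench']
```

Falling back to the original order on repeated failure means the tool can never lose or invent items, at the cost of sometimes doing nothing.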


Keep this in mind the next time you see a list or a grid anywhere, where there's not an obviously correct way to sort it: it doesn't have to be sorted in a dumb way or left sorted at random, when it could be... seriated!


