Write Good Enough Code, Quickly

Published on December 15, 2024 4:45 AM GMT

At the start of my Ph.D. 6 months ago, I was generally wedded to writing "good code". The kind of "good code" you learn in school and standard software engineering these days: object oriented, DRY, extensible, well-commented, and unit tested. I like writing "good code" - in undergrad I spent hours on relatively trivial assignments, continually refactoring to construct clean and intuitive abstractions. Part of programming's appeal is this kind of aesthetic sensibility - there's a deep pleasure in constructing a pseudo-platonic system of types, objects, and functions that all fit together to serve some end. More freedom than math, but more pragmatism than philosophy. This appreciation for the "art" of programming can align with more practical ends - beauty often coincides with utility. Expert programmers are attracted to Stripe because of their culture of craftsmanship, and Stripe promotes a culture of craftsmanship (in part, presumably) because "building multi-decadal abstractions" (in the words of Patrick Collison) is useful for the bottom line.

And this is all well and good, if (as is more often the case than not) you expect your code to be used in 1, 2, 10 years. But in research, this is often not the case! Projects typically last on the order of weeks to months, not years to decades. Moreover, projects typically involve a small number (often 1 or 0) of highly involved collaborators, as opposed to the large, fragmented teams typical of industry. And speed is paramount. Research is a series of bets, and you want to discover the outcome of each bet as fast as possible. Messy code might incur technical debt, but you don't have to pay if you scrap the entire project.

I had heard advice like this going into my PhD, both in the context of research and product development generally (MVP, Ballmer Peak, etc). It took me a while to internalize it though, in part, I suspect, because there's an art to writing "messy" code too. Writing error-prone spaghetti code is not the answer - you need stuff to work quickly to get results quickly. The goal is to write good enough code, efficiently, but learning what good enough means is a skill unto itself.

Principles for Good-Enough Code

Below is a first pass at some guiding principles. I focused on ML research in Python, but I suspect the lessons are generalizable.

    Future-you at time t is the target user
      where t is drawn from something like an exponential distribution with median 1 day.
    Minimize indirection - Have as much of the code as is reasonable in a single notebook

      This is the kind of advice that’s horrible for freshman CS students, but probably helpful for first-year PhD students.[1] Having everything in one place increases context - you can just read the program logic, without having to trace through various submodules and layers of abstraction. It also encourages you to constantly review code which otherwise might be tucked away, naturally helping you to catch errors, identify improvements, or notice additional axes of variation in the system.

    Only refactor when you need to - but always refactor when you need to
      When you’ve had the impulse two or three times to pull something out into a separate function or object, do it. As a default though, be very suspicious of coding activity that isn’t directly doing the thing.
    Use your context
      Lots of best practices in software engineering revolve around providing context and constraints to future readers and editors, in the form of comments, assertions, documentation, and type checks. Some of this will be useful for you 5 minutes after writing it, but a lot of it won't. So don't worry too much about using enums instead of strings, adding docstrings for each function, and avoiding magic numbers. Semantic naming and sparse commenting are useful enough.
    Copy and paste is your friend
      In keeping with minimizing indirection, it's often better to reuse a component by copying and pasting rather than sharing it across functions/scripts. Not only does this improve context, but it also promotes decoupling. If you end up needing to modify the component for a particular use case, you can do so without worrying about how the change will affect functionality elsewhere (the converse of this is that if you want to make the same modification everywhere, you have to do it twice, so, as always, user discretion is required).
    You're still allowed to think - slow is smooth and smooth is fast
      When told to prioritize speed in coding, we often imagine the rogue hacker, whizzing away at a terminal, no time wasted without a keystroke. And sure, maybe 10x engineers operate something like this. But for mere mortals, it's important to remember that you can still do a bit of planning before getting to work. For me, planning usually takes the form of pseudo-code comments, but a little diagram sketching and rubber ducking won't hurt either. The key is to efficiently execute an imperfect plan - and this requires having an imperfect plan to begin with.
    Avoid unit tests - at least early on
      The most obvious case of trading speed for reliability. In research, you should be constructing your code incrementally, running it at each step in a REPL or notebook. By the time you're done, you've basically covered the central use case (running the full experiment script), and don't have to worry about arbitrary users exploiting weird edge cases. You are the target user. And running the script is often the only (integration) test you need (do check tensor shapes though, ML debugging is hard and all).
    Use an LLM
      This should be obvious. As of December 14th 2024, I'd recommend Cursor with Sonnet 3.5 (though I occasionally use o1 to work through some math).
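To make a couple of these principles concrete, here's a sketch of planning-by-comments combined with the cheap shape checks mentioned above. All names, shapes, and the toy computation are hypothetical - the point is the pattern (plan in comments, fill in flat code, assert shapes as you go), not this particular script:

```python
import numpy as np

# Plan (pseudo-code comments first, fill in after):
# 1. load a batch of embeddings        -> (batch, d_model)
# 2. project with a random linear map  -> (batch, d_probe)
# 3. average over the batch            -> (d_probe,)

rng = np.random.default_rng(0)
batch, d_model, d_probe = 32, 64, 8

embeddings = rng.normal(size=(batch, d_model))
projection = rng.normal(size=(d_model, d_probe))

projected = embeddings @ projection
# Cheap shape asserts stand in for unit tests while iterating.
assert projected.shape == (batch, d_probe), projected.shape

mean_probe = projected.mean(axis=0)
assert mean_probe.shape == (d_probe,), mean_probe.shape
```

The asserts cost one line each and catch the most common class of ML bug immediately, without any test harness.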

Again, all this advice assumes a baseline of "standard software engineering practices" - I want to help cure you of deontic commitments like never repeating yourself. But if you don't need curing in the first place, you should probably reverse this advice.

My ML Research Workflow

With these principles in mind, I'll walk through my current research workflow. My goal is to fluidly transition back and forth between a rough experimental notebook and a full experiment pipeline with tracking, sweeps, and results visualization.

Mileage on this exact setup may vary, but thus far I’ve found it strikes a great balance between flexibility and efficiency. Most significantly, I've found my "ugh field" around moving from local experimental notebook to submitting cluster jobs has been substantially reduced.
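One way a workflow like this can hang together is to route everything through a single config object, so the notebook and the cluster job share one interface. A minimal sketch, assuming a plain dataclass (the field names and the `run_experiment` helper are hypothetical; you might equally use a dict or a config library):

```python
from dataclasses import dataclass, asdict


@dataclass
class Config:
    # Everything the experiment varies lives here, so sweeps and
    # job submission only need to construct Config instances.
    model_name: str = "gpt2"
    lr: float = 3e-4
    batch_size: int = 32
    n_steps: int = 1000
    seed: int = 0


def run_experiment(cfg: Config) -> dict:
    # ...training loop would go here...
    # Returning a plain dict (config included) keeps experiment
    # tracking and results visualization trivial.
    return {"config": asdict(cfg), "final_loss": None}


# In the notebook: edit one object, rerun one cell.
cfg = Config(lr=1e-3)
result = run_experiment(cfg)
```

A sweep is then just a loop over `Config` instances, and a cluster job is just `run_experiment` called on a serialized config, which is what shrinks the gap between notebook and pipeline.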

Conclusion

So yeah, those are my tips and basic setup. Again, they apply most strongly to early stage research, and most weakly to developing large, comprehensive pieces of infrastructure (including research infrastructure like PyTorch, Hugging Face, and TransformerLens). In some sense, the core mistake is to assume that early stage research requires novel, extensive research infrastructure.[2] Developing open source infrastructure is, to a first approximation,[3] prosocial: the gains are largely borne by other users. So by all means, develop nice open-source frameworks - the world will benefit from your work. But if you have new research ideas that you're eager to try out, the best approach is often to just try them ASAP.

Related Articles / Sources of Inspiration

  1. ^

     I was initially shocked by how “messy” this GPT training script was - now I think it's the Way

  2. ^

     This meme has been propagated to a certain extent by big labs, who make the (true) point that (infrastructure produced by) research engineers dramatically accelerates research progress. But this can be true while it is also the case that, for a small research team with a limited budget, myopically pursuing results is a better bet.

  3. ^

     Reputational gains aside


