LessWrong, July 23, 11:47
TT Self Study Journal # 3

 

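In this post the author reviews research progress from their second sprint, which focused on AI Alignment (AIA), in particular Mechanistic Interpretability (MI) and work on defining the Outcome Influencing System (OIS). The author made some progress on writing the OIS article, reading Vanessa Kosoy's Learning Theoretic Agenda (VK LTA), and initial planning for math study and tool development, but also admits that on many days the plan was not carried out. The post records a detailed daily worklog, reflects on the goals that were not met, and sets new goals for the third sprint, stressing the importance of seeking feedback and of building "feedback friendly" work.

🎯 **OIS article progress and reflection**: During the second sprint the author researched the definitions section of the OIS (Outcome Influencing System) article and reached an initial understanding of how to define "system" in particular. Despite the time spent, the author concludes that the OIS definition should be fleshed out first, before writing up terminology mappings from other fields. The author also recognizes that finishing the OIS article will take a lot of time, and plans to switch next sprint to a review of AI alignment terminology to test how messy current terminology is.

📚 **VK LTA reading and thoughts**: The author spent time reading VK LTA (Vanessa Kosoy's Learning Theoretic Agenda) but regrets not reading actively with notes. Initial thoughts center on Value Ontology and on Computational Resource Constraints and Frequentist Guarantees, which the author connects to Sutton's Bitter Lesson, while expressing greater interest in semantic spaces and preference specification.

📧 **Math study planning**: Although no math was studied directly during the second sprint, the author began drafting an email to university math professors to discuss research interests and a math study plan. This unexpectedly led to sorting out research directions and the related math; the author considers it worthwhile and has listed it as a writing task for the next sprint.

🛠️ **Tool development and literature review plans**: The author did not start "Transformers From Scratch" or the literature review on Mechanistic Interpretability (MI) tooling and high-dimensional data visualization. These tasks are carried over to the third sprint, along with a list of MI tools worth attention, including TransformerLens, LIT, and BERTViz.

💡 **Networking and feedback strategy**: The author adds "networking" to the third-sprint goals and stresses the importance of getting feedback. This includes making work "feedback friendly", for example by presenting research through quick experiments or a convenient "map" of documents so that peers and mentors can quickly understand it and comment. The author also plans to seek feedback by reading differing viewpoints and posting on forums, and to consider strategies for future career development and funding.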

Published on July 23, 2025 3:46 AM GMT

[Epistemic Status: This is an artifact of my self study. I am using it to remember links and help manage my focus. As such, I don't expect anyone to fully read it. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even just to say "good work/good luck". I'm hoping for a feeling of accountability and would like input from peers and mentors. This may also serve as a guide for others who wish to study in a similar way.]

Previous Entry: SSJ #2

List of acronyms: Mechanistic Interpretability (MI), AI Alignment (AIA), Outcome Influencing System (OIS), n-Dimensional Scatter Plot (NDSP), Vanessa Kosoy's Learning Theoretic Agenda (VK LTA), Machine Learning (ML), Large Language Model (LLM).

Review of 2nd Sprint

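My goals for this sprint were:

SSJ--1 -- Work on my OIS article.
SSJ--2 -- Read VK LTA and write a small summary with my thoughts.
SSJ--3 -- Math.
SSJ--4 -- Go through Transformers From Scratch.
SSJ--5 -- Literature review on MI Tooling and Etc...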

So how did I do?

Daily Worklog

Tu, July 8: Spent about 4 or 5 hours writing SSJ #2 and then started the document for SSJ #3. About 2 hours of that time was spent writing the section on Neel's MI guide, transcribing from my handwritten notes. The other 2 hours were split between everything else.
Wd, July 9: No progress. Woke early to go jogging, but didn't get enough sleep, so I ended up tired and distracted and eventually napped instead of working on this.
Th, July 10: SSJ--2. Spent about an hour reading VK LTA while on the bus.
Fr, July 11: SSJ--2. Spent about 2 hours reading VK LTA.
Sa, July 12: No progress. Went for a hike :-)
Su, July 13: No progress.
Mo, July 14: No progress.
Tu, July 15: No progress.
Wd, July 16: No progress.
Th, July 17: No progress.
Fr, July 18: SSJ--1. About 3 hours researching and thinking about the definition of a "system" in the context of OIS. I think I have a grasp on the idea I want to describe now, but I still need to figure out how to write it down.
Sa, July 19: No progress.
Su, July 20: No progress.
Mo, July 21: SSJ--1. Worked on the definitions of "outcome", "influence", and "system" while on the bus ride home from a lecture.
Tu, July 22: SSJ--3. Spent 3 or 4 hours starting to draft an explanation of my research interests to reference while asking math profs at my university for help honing my math study plan.

Sprint Summary

Well, I'm glad I am now including a daily worklog. It is embarrassing that I failed to get any work done on so many days, and I do not wish to repeat this during the next sprint. But as the Litany of Gendlin says, "What is true is already so. Owning up to it doesn't make it worse." Another good one, in the form of the Litany of Tarski: "If I haven't been managing my time well, I desire to believe that I haven't been managing my time well." Or, a personal saying of my own: "The first step to influencing a variable is being able to read its current value."

How did I do with each of my goals?

SSJ--1 -- work on my OIS article

I did get some work done on this. I referenced definitions in other fields, but ended up using them to inform my thinking on the OIS definition. I think it makes more sense to get that fairly fleshed out before actually writing about other fields since the goal is to describe a mapping from the terminology of each field into OIS terminology. So it's still useful to study other fields, but not to start writing sections on them yet.

Still, I think it would be good to focus on something else for the next sprint. The OIS document is going to take me a good amount of time to complete.

I think next sprint I will switch to writing a literature review of AIA glossaries and terminology. This will be good in itself, and will help me verify my intuition that current AIA terminology is a mess and that we need a new paradigm such as OIS. Alternatively, if I disprove that intuition, I will save myself a lot of wasted effort!

SSJ--2 -- Read VK LTA and write a small summary with my thoughts.

I spent a good amount of time reading this, but not in a context where I was taking notes as I read, which I think was a mistake. For future reading I'm going to prioritize reading only when I can be active about it, rather than treating it as something I can passively do on my phone.

The thoughts I do have on VK's LTA are:

Also, a career advisor in an EA thread recommended I read Shallow Review of Technical AI Safety 2024, so I'm setting that as next sprint's reading. I will continue VK LTA some other time.

SSJ--3 -- Math

Didn't spend any time studying math, but I did start writing an email to send to math professors and immediately ended up yak shaving, writing a description of my current research directions and what math I am aware of relating to them. Oh well, that's probably a good thing to do anyway, so I've added it as an SSJ--1 writing task for the next sprint.

SSJ--4 -- Go through Transformers From Scratch.

Did not start this 😥 Adding it unchanged to the next sprint.

SSJ--5 -- Literature review on MI Tooling and Etc...

Did not start this 😥 Adding it unchanged to the next sprint.

Goals for 3rd Sprint

In addition to my 5 focuses, I'm adding a 6th! I realize a lot of the work I want to do is getting feedback from people and networking, so I'm making that more explicit by giving it its own category going forward.

Additionally, I want to put a focus on making things that are "feedback friendly". What do I mean by this?

I want to keep some focus on the idea of "feedback ready work" going forward. Critiquing other agendas, pointing out things I think are flaws, and showing how my work fits into the context of those flaws seems like a valuable strategy. I shouldn't just be reading agendas I agree with, but also ones I disagree with.

The Goals:


