Published on July 23, 2025 3:46 AM GMT
[Epistemic Status: This is an artifact of my self study. I am using it to remember links and help manage my focus. As such, I don't expect anyone to fully read it. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even just to say "good work/good luck". I'm hoping for a feeling of accountability and would like input from peers and mentors. This may also help to serve as a guide for others who wish to study in a similar way to me. ]
List of acronyms: Mechanistic Interpretability (MI), AI Alignment (AIA), Outcome Influencing System (OIS), n-Dimensional Scatter Plot (NDSP), Vanessa Kosoy's Learning Theoretic Agenda (VK LTA), Machine Learning (ML), Large Language Model (LLM).
Review of 2nd Sprint
My goals for this sprint were:
- SSJ--1 -- Write
  - Continue work on my OIS article. (Maybe take a break from this on the next sprint)
    - Do some work on the "Prompting for Interdisciplinary Attention" section, with a focus on gathering definitions and conceptions for the "system" and "substrate" subsections of the definition section.
    - Finish writing the first draft of the definition section. Skip "system" and "substrate" for now.
- Read VK LTA and write a small summary with my thoughts.
- Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
- Keep studying Topoi. Next sprint switch to Linear Algebra or Computational Mechanics.
- Go through Transformers From Scratch.
- Do an informal literature review on MI Tooling and Data Visualization for High Dimensional Data.
- Places to start for MI Tooling:
  - The Interpretability Toolkit
  - TransformerLens & Callum McDougall's guide for it.
  - Nostalgebraist’s transformer-utils library
  - Google PAIR’s Learning Interpretability Tool (LIT)
  - Google PAIR’s What-If Tool
  - Jesse Vig’s BERTViz
  - LOOM
  - CircuitsVis
So how did I do?
Daily Worklog
Tu, July 8 | Spent about 4 or 5 hours writing SSJ #2 and then started the document for SSJ #3. About 2 hours of that time was spent writing the section on Neel's MI guide, transcribing from my handwritten notes. The other 2 hours were split between everything else. |
Wd, July 9 | No progress. Woke early to go jogging, but didn't get enough sleep so ended up tired and distracted and eventually napped instead of working on this. |
Th, July 10 | SSJ--2. Spent about an hour reading VK LTA while on the bus. |
Fr, July 11 | SSJ--2. Spent about 2 hours reading VK LTA. |
Sa, July 12 | No progress. Went for a hike :-) |
Su, July 13 | No progress. |
Mo, July 14 | No progress. |
Tu, July 15 | No progress. |
Wd, July 16 | No progress. |
Th, July 17 | No progress. |
Fr, July 18 | SSJ--1. About 3 hours researching and thinking about the definition of a "system" in the context of OIS. I think I have a grasp on the idea I want to describe now, and just need to figure out how to write it down. |
Sa, July 19 | No progress. |
Su, July 20 | No progress. |
Mo, July 21 | SSJ--1. Worked on the definitions of "outcome", "influence", and "system" while on the bus ride home from a lecture. |
Tu, July 22 | SSJ--3. Spent 3 or 4 hours starting to draft an explanation of my research interests to reference while asking math profs at my university for help honing my math study plan. |
Sprint Summary
Well, I'm glad I am now including a daily worklog. It is embarrassing that I failed to get any work done on so many days, and I do not wish to repeat this during the next sprint. But as the Litany of Gendlin says, "What is true is already so. Owning up to it doesn't make it worse." Another good one is the Litany of Tarski: "If I haven't been managing my time well, I desire to believe that I haven't been managing my time well." Or, a personal saying of my own: "The first step to influencing a variable is being able to read its current value."
How did I do with each of my goals?
SSJ--1 -- work on my OIS article
I did get some work done on this. I referenced definitions in other fields, but ended up using them to inform my thinking on the OIS definition. I think it makes more sense to get that fairly fleshed out before actually writing about other fields since the goal is to describe a mapping from the terminology of each field into OIS terminology. So it's still useful to study other fields, but not to start writing sections on them yet.
Still, I think it would be good to focus on something else for the next sprint. The OIS document is going to take me a good amount of time to complete.
I think next sprint I will switch to writing a literature review of AIA glossaries and terminology. This will be good in itself, and will help me verify my intuition that current AIA terminology is a mess and that we need a new paradigm such as OIS. Alternatively, if I disprove that intuition, I will save myself a lot of wasted effort!
SSJ--2 -- Read VK LTA and write a small summary with my thoughts.
I spent a good amount of time reading this, but not in a context where I was taking notes on it as I read, which I think is a mistake. For future reading I'm going to prioritize only reading when I can be active about it, not treating it like something I can passively do on my phone.
The thoughts I do have on VK's LTA are:
- Setting the Value Ontology as "the expectation of the utility of statespaces given the direct observation" seems like it would miss the instrumental utility of taking specific actions to move to a location in statespace where direct observations more strictly constrain the statespaces that are possible given that observation. I'm not sure I'm understanding the math or its implications correctly, though. Something to look into and ask about.
- Computational Resource Constraints and Frequentist Guarantees seem more likely to fall to Sutton's Bitter Lesson than Value Ontology does. Value Ontology, by contrast, if it falls to Sutton's Bitter Lesson, might result in a machine intelligence rather than humankind reaping the benefits. I guess this explains my relatively greater interest in semantic spaces and preference specification.
Also, a career advisor in an EA thread recommended I read Shallow Review of Technical AI Safety 2024, so I'm setting that as next sprint's reading. I will continue VK LTA some other time.
SSJ--3 -- Math
Didn't spend any time studying math, but I did start writing an email to send to math professors and immediately ended up yak shaving, writing a description of my current research directions and what math I am aware of relating to them. Oh well, that's probably a good thing to do anyway, so I've added it as an SSJ--1 writing task for the next sprint.
SSJ--4 -- Go through Transformers From Scratch.
Did not start this 😥 Adding it unchanged to the next sprint.
SSJ--5 -- Literature review on MI Tooling and Etc...
Did not start this 😥 Adding it unchanged to the next sprint.
Goals for 3rd Sprint
In addition to my 5 focuses, I'm adding a 6th! I realize a lot of the work I want to do involves networking and getting feedback from people, so I'm making that more explicit by giving it its own category going forward.
Additionally, I want to put a focus on making things that are "feedback friendly". What do I mean by this?
- If I can get feedback from reality in the form of a quick experiment, that is very good.
- Otherwise, I'm looking for feedback from peers and mentors. For that to happen:
  - They need to know I am looking for it.
  - It must be easy to find the things I'm looking for feedback on. That is, they are not mentioned briefly in the middle of long, otherwise irrelevant journal articles, and there is an easy-to-navigate "map" of the work I've done that I'm looking for feedback on.
  - The things I am looking for feedback on are easy to understand and engage with. Ideally mentors can quickly get a sense of what I'm doing and whether the direction I'm going seems good or needs adjusting.
I want to keep some focus on the idea of "feedback ready work" going forward. Critiquing other agendas, pointing out what I think are flaws and how my work fits into the context of those flaws, seems like a valuable strategy. I shouldn't just be reading agendas I agree with, but also ones I disagree with.
The Goals:
- SSJ--1 -- Write
  - AIA Terminology Lit Review
  - Math in my AI Alignment Goals
- Do some Linear Algebra reading and practice.
- Go through Transformers From Scratch.
- Do an informal literature review on MI Tooling and Data Visualization for High Dimensional Data.
- Places to start for MI Tooling:
  - The Interpretability Toolkit
  - TransformerLens & Callum McDougall's guide for it.
  - Nostalgebraist’s transformer-utils library
  - Google PAIR’s Learning Interpretability Tool (LIT)
  - Google PAIR’s What-If Tool
  - Jesse Vig’s BERTViz
  - LOOM
  - CircuitsVis
- Email math profs after I finish writing "Math in my AI Alignment Goals".
- Consider other places to find potential mentors.
- Consider what kinds of feedback I am looking for.
- Should I reach out to specific people for general advice on my SSJ, or only once I have specific questions for them about their work?
- Make some posts in various forums asking for people willing to review and comment on my SSJ.