Published on June 18, 2025 11:36 PM GMT
So, rough elevator pitch, what is this?
I quit my job as a technologist to get a CS degree because I want to work on AI Alignment (AIA) and Mechanistic Interpretability (MI). This summer I am taking the final class in my program, so I want to use a Self Study Journal (SSJ) to improve my AIA-relevant skills. I hope to get peer and mentor engagement to help me become a valuable researcher, and to network my way toward funding opportunities or paid fellowships. My convocation is in November. My goal is to have found a role by that time.
I want feedback both for the value of other people's insight and to help me stay motivated through extra accountability, so please lower your inhibitions about commenting here. If you would normally think "I don't have anything valuable to contribute" or "it would take too long to write up my thoughts", please leave a comment saying "Good Luck" instead. Thanks : )
I am planning a rough, overarching outline and then making more concrete plans for sprints of work, each of which will last one or two weeks. After each sprint I will publish the results of that sprint and the plans for the next one.
My overarching outline is divided into 5 categories:
- SSJ--1: Write articles developing my own ideas and understanding
- SSJ--2: Survey agendas and other AIA ideas I’m interested in
- SSJ--3: Study and practice math
- SSJ--4: Do some small projects to familiarize myself with Transformers and Language Models
- SSJ--5: Continue work on my ongoing project, NDSP
SSJ--1. Articles to Write
I have a few original ideas that I’m not aware of other people working on. I’d like to write up the ideas to help me practice the development and communication of original ideas, as well as to explore whether any of these ideas have merit that I can communicate to others. A good outcome would be any of:
- Getting people focused on some new topics relating to AIA.
- Learning the existing terminology for the exploration of the ideas I'm thinking about.
- Coming to understand the flaws in my ideas and how I am communicating them, both specific to the ideas I present, and to general trends in my ideas and presentation.
The following is a bullet point list of the articles I’m currently interested in writing. I don’t think they will be fully legible here, but if you are curious, please leave a comment asking about them.
- Outcome Influencing Systems (OISs)
  - This is my idea that “AI” or “model” is the wrong object of study for AIA. Just as airplanes require aerodynamics, not bird studies, I think the relevant objects of study are various kinds of OISs. Having a terminology divorced from Sci-Fi and other historical contexts would be valuable, especially if that terminology is linked to rigorous definitions.
  - I would love comments on my WIP here: OIS
  - It would be nice to include discussion of composition, task-space, and semantic spaces.
- Threat model based on OISs
  - Survey of RSI forecasting? (SSJ--2?)
  - Can all possible threat models be formalized / abstracted in OIS terms?
- I think current work in MI has too much focus on vector “direction” and not enough on “position”. I’d like to review if that is the case, and explore the case for “position” replacing “direction” as the fundamental object of study in network activations.
- I frequent circles with artists who really dislike AI. I think much of their pessimism is tied to current-day incentive and compensation structures. I’d like to separate that from the potential of AI for the arts, which I think could be very good and noble.
- With relation to Simulator Theory, I think LLMs can be understood as mappings from sequences of words first to a potential author context, and then from that author context to the next word that author would choose. I think “author contexts” combine the semantics of the sequence with the author implied by that sequence. If this is true, it would be possible to separate the semantic and author spaces, which would allow really cool study of the space of possible authors as well as some progress on ELK. This rests on the assumption that semantic spaces can be understood as orthogonal to the authors, which may turn out to be false.
- The n-dimensional scatter plot (NDSP) is a tool I think could be very valuable, especially for interpretability research. I’ve been working on it for a while, but I would like to write out a cleaner explanation of a bunch of my ideas and findings.
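To make the “direction” versus “position” item in the list above a little more concrete, here is a minimal sketch. It assumes one possible reading of the distinction: “direction” compares vectors by angle (cosine similarity) and “position” compares them by where they sit in the space (Euclidean distance). The three vectors are made-up toy values, not real network activations, and the function names are only illustrative.

```python
import math

def cosine_similarity(u, v):
    """Compare two vectors by direction only; overall scale is ignored."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    """Compare two vectors by where they sit in the space."""
    return math.dist(u, v)

# Made-up 3-d "activations", chosen only so the arithmetic is easy to check.
a = [1.0, 2.0, 2.0]   # norm 3
b = [3.0, 6.0, 6.0]   # same direction as a, norm 9
c = [2.0, 1.0, 2.0]   # same norm as a, different direction

print("a vs b:", cosine_similarity(a, b), euclidean_distance(a, b))
print("a vs c:", cosine_similarity(a, c), euclidean_distance(a, c))
```

Under this reading, a and b share a direction (cosine similarity of 1, up to floating point) yet sit 6 units apart, while a and c point in different directions but are only about 1.41 apart. A direction-only view treats a and b as the same thing; a position view does not.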
SSJ--2. Survey of AIA ideas
I have been collecting topics I want to get a better understanding of for a long time, but now that my school curriculum is lighter, I will have time to actually dive into these topics. There’s too much here to write up the “what” and “why” of each item, but as I am working through them I will try to provide a summary of my understandings and opinions which I hope will be valuable both for expanding focus on the topics and for checking my own understanding.
- AIA Stuff:
  - VK LTA
  - AIXI
- InfoVis particularly:
  - Dimension Reduction
  - Clustering (density?)
- Mech Interp tools
  - A Barebones Guide to Mechanistic Interpretability Prerequisites (Neel Nanda)
  - Concrete Steps to Get Started in Transformer Mechanistic Interpretability (Neel Nanda)
  - A Comprehensive Mechanistic Interpretability Explainer & Glossary (Neel Nanda)
SSJ--3. Math
I enjoy math, so I know I’m going beyond what is necessary for MI. But I also think having a rigorous definition of what you are talking about is very valuable in many contexts. For those reasons, I want to learn some new math topics and to review and practice some old ones. The topics I’m interested in are:
- Logic
- Category Theory
- Computational Mechanics
- Abstract Algebra
- Linear Algebra
- Probability and Statistics:
  - That “all of statistics” book
  - ET Jaynes Probability book
I think I may start out by going through “Topoi, The Categorial Analysis of Logic” by Robert Goldblatt and “Linear Algebra Problem Book” by Paul R. Halmos.
I chose the category theory book because of my interest in logic and proof, and because I find the idea that category theory can help one understand the connections between various branches of math very satisfying. I chose the linear algebra book because I want to have good intuitions about ℝⁿ, where neural net parameters and activations live.
SSJ--4. LLM Projects
In the pursuit of becoming an AIA and MI researcher it is important to actually research some AI models. I have worked with convolutional VAE and RL models, but have never worked with transformers. I need to get familiar with them and I also want to get some experience using cloud resources to work with larger models.
I think I’ll start out doing some mucking around which I may or may not write up in much detail before trying to choose some minor MI experiments to try. I will probably also want to combine these efforts with NDSP as I make progress on making a more general tool.
SSJ--5. NDSP
I’m very inspired by Mingwei Li’s work, especially Toward Comparing DNNs with UMAP Tour. I would like to build tools for working with and understanding data distributions in high dimensional spaces. I have two major goals with this project.
(1) Develop some easy-to-use tools. This could look like a library, like matplotlib, that can be used from within Jupyter notebooks, or it may look more like a web-based data analysis tool. Ideally it would have both.
(2) Make high dimensional structures more intuitively understandable. The first aspect of this is developing a visual language for displaying these structures and the second aspect is making tutorials to help people generalize from simple objects such as hyper-cubes, simplexes, and hyper-spheres to more complicated scenes that may appear in actual high dimensional data distributions. I might also be interested in writing some simple games like 4d pong, n-d maze, or n-d minesweeper. I think games are a great way for people to build intuitions.
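As a rough illustration of the kind of simple object mentioned above, the sketch below builds the vertices and edges of an n-dimensional hypercube and projects them onto a random 2D plane. This is not NDSP itself; the function names, the random-plane projection, and the fixed seed are all assumptions made for the example, and it uses only the Python standard library.

```python
import itertools
import math
import random

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def hypercube_vertices(n):
    """All 2**n corners of the unit hypercube in n dimensions."""
    return [tuple(float(x) for x in bits)
            for bits in itertools.product((0, 1), repeat=n)]

def hypercube_edges(vertices):
    """Index pairs of vertices that differ in exactly one coordinate."""
    edges = []
    for i, j in itertools.combinations(range(len(vertices)), 2):
        if sum(a != b for a, b in zip(vertices[i], vertices[j])) == 1:
            edges.append((i, j))
    return edges

def random_orthonormal_pair(n, rng):
    """Two orthonormal n-d vectors spanning a random plane (Gram-Schmidt)."""
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm_u = math.hypot(*u)
    u = [x / norm_u for x in u]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    overlap = dot(u, v)
    v = [b - overlap * a for a, b in zip(u, v)]  # remove the component along u
    norm_v = math.hypot(*v)
    v = [x / norm_v for x in v]
    return u, v

def project(points, u, v):
    """Orthogonal projection of n-d points onto the plane spanned by u and v."""
    return [(dot(p, u), dot(p, v)) for p in points]

vertices = hypercube_vertices(4)
edges = hypercube_edges(vertices)
u, v = random_orthonormal_pair(4, random.Random(0))
points_2d = project(vertices, u, v)
print(len(vertices), len(edges))
```

For the 4-dimensional case this gives 16 vertices and 32 edges. The resulting `points_2d` and `edges` could be handed to any 2D plotting library to draw the familiar tesseract wireframe, and changing the plane gives a different view of the same object.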
I think the first tasks here are to write up some documentation of my ideas and to explore tensorflowjs as a library to use in development.
Goals for my 1st Sprint
- SSJ--1: Finish writing the first draft of the definition section of my OIS article.
- SSJ--2: Read VK LTA and write a small summary with my thoughts.
- SSJ--3:
  - Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
  - Start studying Topoi and Linear Algebra textbooks.
- SSJ--4:
  - Read Neel’s “Mech Interp Prereqs”.
  - Do some research and write a little bit about my plans for messing around with LLMs in some capacity.
- SSJ--5:
  - Review my NDSP notes.
  - Experiment with tensorflowjs.
Wish me luck : )