There is not much controversial about the idea that an AGI should have, among its goals, the goal of radically improving itself.
A bit dodgier is the notion that an AGI should have, among its goals, the goal of updating and improving its goals based on its increasing knowledge and understanding and intelligence.
Of course, this sort of ongoing goal-refinement and even outright goal-revolutionizing is a key part of human personal development. But where AGIs are involved, there is concern that if an AI starts out with goals that are human-friendly and then revises and improves its goals, it may come up with new goals that are less and less copacetic to humans.
In principle, if one’s goal is to create for oneself a new goal that is compatible with the spirit of one’s old goal, then one shouldn’t run into major problems. The new goal will be compatible with the spirit of the old goal, and part of the spirit of the old goal is that any new goals emerging should be compatible with the spirit of the old goal. So the new goal should also contain the proviso that any further goals it spawns will be compatible with its spirit, and thus with the spirit of the old goal. Etc. etc. ad infinitum.
But this does seem like a “What could possibly go wrong??” situation: small errors could accumulate as each goal replaces itself with its improved version, then the improved version of the improved version, etc., eventually compounding to yield something totally different from the starting point.
My goal here is to present a novel way of exploring the problem mathematically, along with an amusing and interesting, if not entirely reassuring, tentative conclusion:
- For an extremely powerful AGI mind that is the result of repeated intelligent, goal-driven recursive self-modifications, it may actually be the case that recursive self-modification leaves goals approximately invariant in spirit
- For AGIs with closely human-like goal systems — which are likely to be the start of a sequence of repeated intelligent, goal-driven recursive self-modifications — there is no known reason (so far) to believe recursive self-modification won’t cause radical “goal drift”
Quasi-Formalizing Goal-Driven Recursive Self-Improvement
Consider the somewhat vacuous goal:
My goal is to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version
or better yet the less vacuous
My goal is to achieve A and also to improve my goal (in a way that is consistent with the spirit of the original goal) and to fulfill the improved version
where say
A = “militate toward a world where all sentient beings experience copious growth, joy and choice”
or whatever formulation of “highly beneficial” you prefer.
We might formulate this quasi-mathematically as
Fulfill G = {achieve A; and create G1 so that G1 > G and G ==> G1; and fulfill G1}
Here by G ==> G1 I mean that G1 fulfills the spirit of G (the interpretation of “spirit” here is part of the formulation of G), and by G1 > G I mean that G1 can be produced by combining G with some other entity H of nonzero complexity (so that G1 = G + H).
A more fleshed out version of this might be, verbally,
My goal is to 1) choose actions highly compatible with all sentient beings experiencing a lot of growth, joy and choice; 2) increase my intelligence and knowledge; 3) improve the details of this goal appropriately based on my increased knowledge and intelligence, in a manner compatible with the spirit of the current version of the goal; 4) fulfill the improved version of the goal
This sort of goal obviously can lead to a series such as
G, G1, G2, G3, …
One question that emerges here is: Under what conditions might this series converge, so that once one gets far enough along in the series, the adjacent goals in the series are almost the same as each other?
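To make this question concrete, here is a toy numerical illustration in Python. Everything specific in it is my own construction, not part of the argument above: each goal Gn is modeled as a byte-string, each refinement appends a clause Hn (so G(n+1) = Gn + Hn), and the distance between successive goals is a normalized information distance crudely approximated with zlib compression. If the clauses shrink, the series looks like it is converging; if each clause is as large as the goal itself, successive goals stay a roughly constant distance apart, i.e. the goal keeps drifting.

```python
# Toy sketch: does the goal series G, G1, G2, ... converge under a crude
# information-distance metric?  All modeling choices here are illustrative.

import zlib
import random

def nid(a: bytes, b: bytes) -> float:
    """Normalized information distance, approximated with zlib compressed lengths."""
    ca, cb, cab = len(zlib.compress(a)), len(zlib.compress(b)), len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def goal_series(n_steps, clause_size, rng):
    """Build G0, G1, ... with G(k+1) = Gk + Hk and |Hk| = clause_size(k, |Gk|)."""
    g = bytes(rng.getrandbits(8) for _ in range(64))      # initial goal G0
    series = [g]
    for k in range(n_steps):
        h = bytes(rng.getrandbits(8) for _ in range(clause_size(k, len(g))))
        g = g + h
        series.append(g)
    return series

rng = random.Random(0)
shrinking = goal_series(8, lambda k, n: max(1, 64 >> k), rng)  # |Hk| halves each step
drifting  = goal_series(8, lambda k, n: n, rng)                # |Hk| as big as Gk itself

# Successive distances shrink in the first case, stay roughly constant in the second.
print([round(nid(a, b), 2) for a, b in zip(shrinking, shrinking[1:])])
print([round(nid(a, b), 2) for a, b in zip(drifting, drifting[1:])])
```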
To explore this, we can look at the “limit case”
Fulfill Ginf = {achieve A; and create Ginf so that Ginf > Ginf and Ginf ==> Ginf; and fulfill Ginf}
The troublesome part here is Ginf > Ginf, which looks like it doesn't make sense, but actually makes perfect sense so long as Ginf is an infinite construct, just as
(1, 1, 1, …) = append( 1, (1,1,…))
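A minimal illustration of the same point in Python (my own analogy, not part of the argument): a lazily unfolded stream of ones satisfies the self-referential equation ones = append(1, ones), and consuming it never leads to a contradiction, only to more of the same structure.

```python
from itertools import islice

def ones():
    """The stream (1, 1, 1, ...), defined in terms of itself: ones = append(1, ones)."""
    yield 1
    yield from ones()

print(list(islice(ones(), 10)))   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```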
Inasmuch as we are interested in finite systems, the question is then: Is there a sense in which we can look at the series of finite Gn as converging to this infinite limit?
Self-referential entities like Ginf can be modeled in a perfectly consistent way within ZFC set theory with the Axiom of Foundation replaced by the Anti-Foundation Axiom (AFA). This set theory corresponds to classical logic enhanced with a certain sort of coinductive logical definition.
One can also put a geometry on sets under the AFA, in various different ways. It's not clear what geometry makes most sense in this context, so I'll just describe one approach that seems relatively straightforward.
Each hyperset (each set under AFA) is associated with an accessible pointed graph, its apg: a directed graph with a distinguished root node. Given a digraph plus functions r and p assigning contraction ratios and probabilities to its edges, one gets a DGIFS (Directed Graph Iterated Function System), whose attractor is a subset of finite-dimensional real space. Call a function that assigns (r, p) pairs to the edges of a digraph a DLF, or Digraph Labeling Function. A digraph then corresponds to a function mapping DLFs into spatial regions.

Given two digraphs D1 and D2 and a DLF F, let F1e and F2e denote the spatial regions produced by applying F to D1 and D2, discretized to ceil(1/e) bits of precision. One can then look at the average, over all DLFs F (assuming some reasonable distribution on DLFs), of the least upper bound over all e > 0 of the normalized information distance NID(F1e, F2e). This gives a measure of the distance between two hypersets, in terms of the distance between their corresponding apgs. It has the downside of requiring a "reference computer" to measure information distance (the same reference computer can then be used to define a Solomonoff distribution over DLFs). But intuitively, a series of ordinary sets that appears to logically converge to a certain hyperset should then also metrically converge to that hyperset.
Measuring the distance between two non-well-founded sets by applying this distance measure to their associated apgs yields a metric in which it seems plausible that the series of Gn converges to Ginf.
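For concreteness, here is a rough numerical sketch of the construction in Python. All of the specifics are my own simplifying choices rather than part of the proposal above: edge maps are simple 2-D affine contractions with arbitrary per-edge translations, the attractor is approximated by a chaos-game random walk, Kolmogorov complexity (and hence NID) is approximated by zlib compressed length, and the average over DLFs is a small Monte Carlo sample rather than a Solomonoff-weighted expectation.

```python
import zlib
import random
import numpy as np

def sample_dlf(graph, rng):
    """Sample a toy DLF: a (contraction ratio, probability) pair for each edge."""
    dlf = {}
    for v, targets in graph.items():
        raw = [rng.random() for _ in targets]
        total = sum(raw) or 1.0
        for w, x in zip(targets, raw):
            dlf[(v, w)] = (rng.uniform(0.2, 0.6), x / total)   # (r, p)
    return dlf

def chaos_game(graph, dlf, root, n_points=20000):
    """Approximate the DGIFS attractor seen from the root node by a random walk."""
    rng = random.Random(0)
    pts, x, y, v = [], 0.0, 0.0, root
    for i in range(n_points):
        targets = graph[v]
        if not targets:                        # terminal node: restart the walk
            x, y, v = 0.0, 0.0, root
            continue
        weights = [dlf[(v, w)][1] for w in targets]
        w = rng.choices(targets, weights=weights)[0]
        r, _ = dlf[(v, w)]
        # An arbitrary fixed translation per edge keeps the edge maps distinct.
        tx, ty = (hash((v, w)) % 7) / 7.0, (hash((w, v)) % 7) / 7.0
        x, y = r * x + tx, r * y + ty
        v = w
        if i > 100:                            # discard burn-in points
            pts.append((x, y))
    return np.array(pts)

def rasterize(points, bits):
    """Discretize a point cloud to a 2**bits x 2**bits occupancy grid."""
    grid = np.zeros((2 ** bits, 2 ** bits), dtype=np.uint8)
    lo, hi = points.min(axis=0), points.max(axis=0)
    scaled = (points - lo) / np.maximum(hi - lo, 1e-9) * (2 ** bits - 1)
    grid[scaled[:, 1].astype(int), scaled[:, 0].astype(int)] = 1
    return grid.tobytes()

def nid(a, b):
    """Normalized information distance, crudely approximated with zlib."""
    ca, cb, cab = len(zlib.compress(a)), len(zlib.compress(b)), len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def hyperset_distance(g1, root1, g2, root2, n_dlfs=10, bits=6):
    """Monte Carlo estimate of the DLF-averaged, resolution-maximized NID between two apgs."""
    rng, dists = random.Random(42), []
    for _ in range(n_dlfs):
        dlf1, dlf2 = sample_dlf(g1, rng), sample_dlf(g2, rng)
        pts1, pts2 = chaos_game(g1, dlf1, root1), chaos_game(g2, dlf2, root2)
        per_res = [nid(rasterize(pts1, b), rasterize(pts2, b)) for b in range(3, bits + 1)]
        dists.append(max(per_res))             # sup over resolutions (truncated)
    return sum(dists) / len(dists)

# Toy comparison: the apg of the hyperset Omega = {Omega} (a single self-loop)
# versus the apg of the well-founded set {emptyset, {emptyset}}.
omega = {0: [0]}
pair = {0: [1, 2], 1: [], 2: [1]}
print(hyperset_distance(omega, 0, pair, 0))
```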
“Practical” Conclusions
Supposing the above sketch works out when explored in more detail -- what would that mean?
It would mean that approximate goal-preservation under recursive self-improvement is feasible — for goals that are fairly far along the path of iterated recursive self-improvement.
So it doesn’t reassure us that iterated self-improvement starting from human goals is going to end up with something ultimately resembling human goals in a way we would recognize or care about.
It only reassures us that, if we launch an AGI starting with human values into a process of recursive self-improvement, eventually one of the AGIs in the resulting series will face a situation where it has confidence that ongoing recursive self-improvement isn’t going to result in anything it finds radically divergent from itself (according to the hyperset metric outlined above, based on normalized information distance).
The image at the top of this post is quite relevant here -- a series of iterates converging to the fractal Koch Snowflake curve. The first few iterates in the series are fairly different from each other. By the time you get to the 100th iterate in the series, the successive iterates are quite close to each other according to standard metrics for subsets of the plane. This is not just metaphorically relevant, because the metric on hyperset space outlined above works by mapping each hyperset into a probability distribution over fractals (where each fractal is something like the Koch Snowflake curve but more complex and intricate).
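As a small concrete check on this picture, here is a sketch (my own illustration, not from the post) that constructs successive Koch-curve iterates and measures the distance between consecutive iterates' vertex sets with a plain Hausdorff distance, used here as a crude stand-in for "standard metrics for subsets of the plane". The successive distances shrink geometrically, which is the sense in which later iterates become "quite close to each other".

```python
import numpy as np

def koch_iterate(points):
    """One Koch refinement: replace each segment with four smaller segments."""
    out = [np.asarray(points[0], float)]
    for a, b in zip(points[:-1], points[1:]):
        a, b = np.asarray(a, float), np.asarray(b, float)
        d = (b - a) / 3.0
        p1, p3 = a + d, a + 2 * d
        # Peak of the equilateral bump: rotate d by 60 degrees about p1.
        rot = np.array([[0.5, -np.sqrt(3) / 2], [np.sqrt(3) / 2, 0.5]])
        p2 = p1 + rot @ d
        out += [p1, p2, p3, b]
    return np.array(out)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

prev = np.array([[0.0, 0.0], [1.0, 0.0]])
for n in range(1, 6):
    nxt = koch_iterate(prev)
    print(n, hausdorff(prev, nxt))   # successive iterates get closer and closer
    prev = nxt
```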
It may be there are different and better ways to think about approximate goal preservation under iterative self-modification. The highly tentative and provisional conclusions outlined here are what ensue from conceptualizing and modeling the issue in terms of self-referential forms and iterative convergence thereto.