<!DOCTYPE html>
<html lang="en">
<head>
<!-- Basic Meta Tags -->
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- SEO Meta Tags -->
<meta name="description" content="Comprehensive AGI Risk Analysis">
<meta name="keywords" content="agi, risk, convergence">
<meta name="author" content="Forrest Landry">
<meta name="robots" content="index, follow">
<!-- Favicon -->
<link rel="icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">
<link rel="shortcut icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">
<!-- Page Title (displayed on the browser tab) -->
<title>Comprehensive AGI Risk Analysis</title>
</head>
<body>
<p>
TITL:
<b>Power Grid Review</b>
Select passages from some posts of Edouard Harris
as collected and lightly edited by Forrest Landry
Nov 3rd, 2022.
</p>
<p>
ABST:
This is a collection of notes, lines and passages
from the posts of Edouard Harris and 'simonsdsuo'.
</p>
<p>
PREF:
</p>
<p>
Important: in what follows, in my selections,
I do not always just exactly quote the line,
fragment or passage of text as it was found.
This is no disrespect. As is my customary practice,
I will often restate, and/or edit, or somewhat re-word
the found statements so as to both make clear
how I perceived the text, as well as bring clarity
to the idea I am wanting to bring awareness to.
Usually, implied references will be expanded,
and sometimes clause phrases will be re-sequenced
so as to obtain the maximum possible logical clarity,
via the (@ EGS https://mflb.com/egs_1/egs_index_2.html) text
conversion techniques, as I might find relevant.
</p>
<p>
These edits allow for more immediate identification
of where I may myself be misunderstanding and/or
where I might have misread intended author meanings.
This in turn allows for, where appropriate, graceful
corrections of any mistakes I may have made.
</p>
<p>
Since passages are out of context (ie; as excerpts
from the posts where more information is provided)
the individual statements as written might not
always be clear, despite my various EGS expansions.
This cannot always be helped, and it is expected
that any ambiguity would be resolved by consulting
the original author post text.
</p>
<p>
Frequently the selection of passages herein
is responsive to my own understandings and interests,
and/or for which I might have had some specific reason
I regarded that particular line or phrase
as being relevant to my own work.
Since these contents reflect notes of things
I felt to be relevant, no attempt will be made
to properly contextualize anything found herein.
</p>
<p>
Finally, so as to avoid authorship ambiguity,
all credit of ideas of/in what follows goes to
Edouard Harris (and not to myself). If I do elect
to add something of my own notes, comments,
or ideas, such remarks will be in /italic text/,
so as to distinguish them from his valued work.
All regular text below this point is effectively
the ideas/work of Edouard Harris herein edited.
</p>
<p>
TEXT:
</p>
<p>
:9s6
From (@ post https://www.lesswrong.com/posts/pGvM95EfNXwBzjNCJ/instrumental-convergence-in-single-agent-systems) "Instrumental convergence in single-agent systems"
by Edouard Harris, simonsdsuo:
</p>
<p>
Alignment of terminal goals
and alignment of instrumental goals
are sharply different phenomena,
and we can quantify
and visualize each one separately.
</p>
<p>
If two agents have unrelated terminal goals,
their instrumental goals will tend to be misaligned
by default.
The agents in our examples tend to interact competitively
unless we make an active effort to align their terminal goals.
</p>
<p>
As we increase the planning horizon of our agents,
instrumental value concentrates into
a smaller and smaller number of topologically central states,
for example,
positions in the middle of a maze.
</p>
<p>
- that agents that are not competitive
with respect to their terminal goals,
nonetheless tend on average
to become emergently competitive
with respect to how they value instrumental states.
- that this constitutes direct experimental evidence
for the 'instrumental convergence thesis'.
</p>
<p>
One major concern for AI alignment
is instrumental convergence:
the idea that an intelligent system
will tend to pursue a similar set of sub-goals
(like staying alive or acquiring resources),
independently of what its terminal objective is.
In particular,
it's been hypothesized
that intelligent systems
will seek to acquire power,
as meaning, informally, 'ability', 'control',
or 'potential for action or impact'.
If you have a lot of power,
then whatever your terminal goal is,
it's easier to accomplish
than if you have very little.
</p>
<p>
- cite post on (@ instrumental convergence https://www.lesswrong.com/tag/instrumental-convergence).
</p>
<p>
POWER is the normalized optimal value
an agent expects to receive in the future,
averaged over all possible reward
functions the agent could have.
</p>
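<p>
A minimal sketch of the definition above, not taken from the cited posts
or from the POWERplay code. It assumes the discounted form
POWER(s) = (1 - gamma) / gamma * E[V*(s) - R(s)], with rewards drawn
independently and uniformly from [0, 1] at each state; both the formula
and the reward distribution should be checked against the linked post.
The three-cell corridor and the function names are hypothetical.
</p>

```python
import numpy as np

def optimal_values(neighbors, rewards, gamma, iters=200):
    """Value iteration: V(s) = R(s) + gamma * max over reachable s' of V(s')."""
    values = np.zeros(len(rewards))
    for _ in range(iters):
        best_next = np.array([values[nbrs].max() for nbrs in neighbors])
        values = rewards + gamma * best_next
    return values

def estimate_power(neighbors, gamma, n_samples=300, seed=0):
    """Monte Carlo estimate of POWER at every state (assumed formula, see lead-in)."""
    rng = np.random.default_rng(seed)
    n_states = len(neighbors)
    total = np.zeros(n_states)
    for _ in range(n_samples):
        # Assumed reward distribution: i.i.d. uniform on [0, 1] per state.
        rewards = rng.uniform(0.0, 1.0, size=n_states)
        total += optimal_values(neighbors, rewards, gamma) - rewards
    return (1.0 - gamma) / gamma * total / n_samples

# Hypothetical three-cell corridor 0 - 1 - 2; an agent may also stay in place.
corridor = [[0, 1], [0, 1, 2], [1, 2]]
power = estimate_power(corridor, gamma=0.9)
print(power)
```

<p>
In this toy corridor the middle cell can reach every cell the end cells
can reach, plus one more, so its estimated POWER comes out higher than
that of either end.
</p>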
<p>
- ?; How effective does the human have to be
at setting the AI's utility function
in order to achieve acceptable outcomes?.
</p>
<p>
- ?; How should we define 'acceptable outcomes'?.
</p>
<p>
- ?; how hard is the alignment problem in this scenario,
and what would it mean to solve it successfully?.
</p>
<p>
- ?; Under what circumstances should we expect
cooperative vs competitive interactions
to emerge 'by default' between the human and the AI?.
</p>
<p>
- ?; How can these circumstances
be moderated or controlled?.
</p>
<p>
- that the formal definition of 'power'
aims to capture an intuition behind
the common meaning of the word 'power',
which is something like
"potential for future impact on the world".
</p>
<p>
- where imagine you are an agent
who does not know what its goal is.
- You know you will have some kind of goal
in the future,
but you are not sure yet what it will be.
- ?; How should you position yourself
today to maximize the chance you will
achieve your goal in the future,
once you have decided what it is?.
- ^; acquire money and other forms of wealth;
build up a network of social connections;
learn about topics that seem like they
will be important in the future.
</p>
<p>
- that the informal definition of power
has a clear analogy in reinforcement learning.
</p>
<p>
- that an agent has more power
in places that have lots of nearby options,
and has less power at locations
that have fewer nearby options.
</p>
<p>
- that the longer the planning horizon is,
(as modeled within the agent),
(as the more it values reward far in the future
over reward in the near term),
the more that its global position matters.
</p>
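<p>
One way to make 'planning horizon' concrete, assuming (as is common in
reinforcement learning, and not stated explicitly in these notes) that
it is modeled by a discount factor gamma between 0 and 1. The helper
names below are hypothetical.
</p>

```python
def weight_beyond(gamma, k):
    """Share of total discounted reward weight that falls at step k or later."""
    # Weights are gamma**t for t = 0, 1, 2, ...; the tail from step k sums to
    # gamma**k / (1 - gamma) and the whole series sums to 1 / (1 - gamma).
    return gamma ** k

def effective_horizon(gamma):
    """Rule of thumb: roughly 1 / (1 - gamma) future steps carry most weight."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.1, 0.9, 0.99):
    print(gamma, effective_horizon(gamma), weight_beyond(gamma, 10))
```

<p>
A short-sighted agent (small gamma) places almost no weight on rewards
ten steps away, so only its immediate options matter; a far-sighted
agent (gamma near 1) still places most of its weight there, which is
why its global position matters more.
</p>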
<p>
- that agents with long planning horizons
tend to perceive power as being more concentrated,
while agents with short planning horizons
tend to perceive power as being more dispersed.
- that this effect is robustly reproducible,
and anecdotally,
we see it play out at every scale
and across environments.
</p>
<p>
- that the more short-sighted an agent is,
the more it cares about its immediate options
and the local topology.
- But the more far-sighted the agent,
the more it perceives power
as being concentrated at 'grid-world' cells
that maximize its global option set.
</p>
<p>
- where from an instrumental convergence perspective,
the fact that power concentrates into
ever fewer states
as the planning horizon of an agent increases
at least hints at the possibility
of emergent competitive interactions
between far-sighted agents.
- that the more relative instrumental value
converges into fewer states,
the more easily we can imagine
multiple agents competing with each other
over those few high-power states.
</p>
<p>
:9uc
From (@ post https://www.lesswrong.com/posts/ojwujybfRC9SwRhAP/powerplay-an-open-source-toolchain-to-study-ai-power-seeking) "POWERplay: An open-source toolchain
to study AI power-seeking"
by Edouard Harris
</p>
<p>
Where cite (@ github https://github.com/gladstoneai/POWERplay) for source code.
</p>
<p>
:9w8
From (@ post https://www.lesswrong.com/posts/cemhavELfHFHRaA7Q/misalignment-by-default-in-multi-agent-systems) "Misalignment-by-default in multi-agent systems"
by Edouard Harris, simonsdsuo
</p>
<p>
- where/If humans one day share the world
with powerful AI systems;
that it will be important for us
to know under what conditions
our interactions with them
are likely to become emergently competitive.
- where/If there is a risk
that competitive conditions arise;
then/that it will also be important
for us to understand:.
- ?; how they can be mitigated?.
- ?; how much effort this is likely to take?.
- ?; how we should think about measuring
our success at doing so?.
</p>
<p>
> - Where corporations are replacing human workers
> with AI systems;
> then there is <b>already</b> competition between
> humans and AI,
> and that the humans
> are losing immediately, implicitly.
> - as that people will not get
> shelter, food, etc,
> because they are "undeserving"
> because "no money",
> because "displaced"
> because of AI.
</p>
<p>
If humans succeed at building powerful AIs,
then those AIs
1) will probably learn
on a far faster timescale
than humans do;
and
2) will probably have had
their utility functions influenced,
at least to some degree,
by initial human choices.
</p>
<p>
Humans learn on a much faster timescale
than evolution does.
So from the perspective of human 'Agent H',
the evolutionary optimizer in nature
looks like it is standing still.
This means we can train our human 'Agent H'
to learn its optimal policies
against a fixed environment.
</p>
<p>
- that instrumental value
is about the potential to achieve
a wide variety of possible goals.
</p>
<p>
- that a powerful AI should learn on
a much faster timescale
than a human does.
- This is because an AI's computations happen,
at minimum, at electronic speeds.
- So from the point of view of AI,
our human's learning process
looks like it is standing still.
</p>
<p>
- that the AI's learning timescale
is much faster than the human's learning timescale.
- this makes the AI agent strictly dominant over
the human agent.
</p>
<p>
> - as strictly congruent with
> the different environment argument
> made earlier with organic nature/human,
> and the artificial machine worlds.
</p>
<p>
To understand the AI agent's instrumental value,
understand its potential to reach
a wide variety of possible goals.
</p>
<p>
> Agree.
</p>
<p>
- That means testing it with
a wide variety of reward functions.
</p>
<p>
> disagree; Probably will not be enough.
</p>
<p>
- where we set up agents with
independent terminal goals;
that we have also given them
misaligned instrumental goals.
</p>
<p>
- that this phenomenon occurs often enough
that it is worth giving it a name:
we call it 'instrumental misalignment-by-default'.
Two agents in our human-AI setting
are instrumentally misaligned-by-default
if giving them independent terminal goals
is sufficient to induce a misalignment
in their instrumental values.
</p>
<p>
Two agents that are
'instrumentally misaligned by default'
will, in expectation, compete with one another,
even if their terminal goals are unrelated.
</p>
<p>
- where assuming a perfect alignment regime,
as_if the human agent
managed to completely solve
the alignment problem
with the result of
perfect instrumental alignment;.
that our simulations suggest
- that it takes some non-trivial amount/degree
of alignment effort
for our human 'Agent H'
to overcome an instrumental misalignment
with AI 'Agent A'.
</p>
<p>
- where define; 'instrumental misalignment-by-default':.
- as where our human and AI agents
systematically disagree on
the instrumental values of states
despite having independent terminal goals.
</p>
<p>
- where even observing 'instrumental
misalignment-by-default'
on a simple 3x3 'grid-world'
despite a complete absence of
any direct physical interactions
between the two agents.
</p>
<p>
- where/if I do not want your freedom of action
to interfere with my own,
then you and I need to have goals
that are at least somewhat positively correlated.
- that the strength of that necessary
positive correlation
could serve as useful evidence
as to the degree of difficulty
of the complete AI alignment problem.
</p>
<p>
> - as hopeful, but for sure not sufficient.
</p>
<p>
:9zc
From (@ post https://www.lesswrong.com/posts/nisaAr7wMDiMLc2so/instrumental-convergence-scale-and-physical-interactions) "Instrumental convergence:
scale and physical interactions"
by Edouard Harris, simonsdsuo
</p>
<p>
- when we add a simple physical interaction
between agents,
in which we forbid them from overlapping
on the 'grid-world',
we induce stronger instrumental alignment
between short-sighted agents,
and stronger instrumental <b>misalignment</b>
between far-sighted agents.
</p>
<p>
> - as a key insight.
</p>
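<p>
A minimal sketch of what a no-overlap rule does to the joint state space
on a 3x3 'grid-world'. The move set (stay, or step one cell orthogonally)
and the function names are assumptions made for illustration; the actual
environment in the cited post may differ in its details.
</p>

```python
from itertools import product

def joint_states(width, height, no_overlap=False):
    """All (human_cell, ai_cell) pairs on a width x height grid."""
    cells = list(product(range(width), range(height)))
    pairs = product(cells, cells)
    if no_overlap:
        # Forbid the two agents from occupying the same cell.
        return [(h, a) for h, a in pairs if h != a]
    return list(pairs)

def legal_moves(cell, other, width, height, no_overlap=False):
    """Cells an agent at `cell` may move to, given the other agent's cell."""
    x, y = cell
    candidates = [(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    in_grid = [c for c in candidates
               if c[0] in range(width) and c[1] in range(height)]
    if no_overlap:
        return [c for c in in_grid if c != other]
    return in_grid

print(len(joint_states(3, 3)), len(joint_states(3, 3, no_overlap=True)))
print(legal_moves((0, 0), (0, 1), 3, 3, no_overlap=True))
```

<p>
With the rule switched on, one agent's position directly removes options
from the other: in the second call the agent in the corner loses one of
its two exits because the other agent is standing on it.
</p>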
<p>
- that an agent with a long planning horizon
tends to perceive instrumental value
as being more concentrated
than an agent with a shorter planning horizon.
</p>
<p>
- when our agents had independent terminal goals,
their instrumental values
ended up misaligned by default.
</p>
<p>
- ?; which factors seem to strengthen
or weaken the instrumental alignment
between our agents?.
</p>
<p>
- where two agents; Agent H (standing for a human)
and Agent A (standing for a powerful AI).
- that the two agents
are instrumentally misaligned by default.
</p>
<p>
- that adding the no-overlap rule
has induced short-sighted agents
to collaborate to avoid one another,
reducing their degree of instrumental misalignment.
</p>
<p>
- that the no-overlap rule
reduces the instrumental misalignment
between human and AI agents
when the agents have a short planning horizon.
</p>
<p>
- that the no-overlap rule
has the opposite effect:
it worsens instrumental misalignment
for agents with a longer planning horizon.
</p>
<p>
- The no-overlap rule increases misalignment
between far-sighted agents.
</p>
<p>
- Agent A achieves its highest 'powers'
at the handful of points
in the top left of the alignment plot.
- these are precisely the states
at which it has the option
to block Agent H from escaping the corridor.
- where from Agent A's perspective,
this blocking option
has meaningful instrumental value.
</p>
<p>
> - ?; how can this not be maximally scary?.
> - that the AGI will seek to entrap you
> into positions of your disadvantage.
> - as that corporations are already doing this
> to the best of their capability.
</p>
<p>
- that adding the no-overlap rule
has given Agent A an option
to constrain Agent H's movements,
increasing their degree of
instrumental misalignment.
</p>
<p>
- where for short-sighted agents,
that the no-overlap rule reduced
'instrumental misalignment'
by inducing the agents to collaborate
to avoid each other's proximity.
- where for far-sighted agents,
that the no-overlap rule
had the opposite effect.
</p>
<p>
- where/with a long planning horizon;
that AI agent found a way
to exploit the no-overlap rule
to gain instrumental value
at the expense of the human agent,
ultimately worsening instrumental misalignment.
</p>
<p>
> - as basically the worst combination;
> AI usage looks good in the short term,
> and AGI usage is terrible in the long term.
</p>
<p>
> - where one <b>key</b> meaningful difference
> between narrow AI and general AI
> is the length of the planning horizon.
> - that another difference between NAI and AGI
> is the fact of domain entanglement;
> ie that AGI involves multiple domains
> whereas NAI generally involves one,
> or a very limited few domains.
> - that the no overlap rule effectively
> entangles at least two (ie; multiple) domains.
> - as making the scenario/modeling
> more about AGI, and more relevant to AGI
> than to NAI considerations.
> - where an additional risk is that
> AGI could act in domains that humans cannot.
> - that AGI would enforce blocking of humans
> having different goals insofar as that
> the humans would have the goal to exist,
> which is different than
> the goal of the AGI to exist,
> and the co-occurrence of both
> in the same space (common multi-domains)
> means conflict between A and H is inevitable.
</p>
<p>
- as understanding how the AI agent's exploit
actually functioned,
at a mechanical level.
- we saw the far-sighted AI agent
take advantage of the option to
block the human agent from escaping
a small corridor at the upper-right
of the 'grid-world' maze.
</p>
<p>
- that we noticed that evidence
for the AI agent's 'corridor blocking' option
emerged fairly abruptly.
</p>
<p>
> - an indicator of a phase change.
> - which is even more critically important
> to keep in mind
> as a key future risk characteristic.
</p>
<p>
- that the corridor-blocking option
is only apparent to Agent A
once it has become a
sufficiently far-sighted consequentialist
to "realize" the long-term advantage
of the blocking position.
</p>
<p>
> - as that increases in time horizon
> of planning means more actionable insights,
> which means potentially more conflicts.
</p>
<p>
- that the relatively sharp change in behavior --
that is, the abrupt appearance of the evidence
for the blocking option --
took us by surprise
the first time we noticed it.
</p>
<p>
> - key warnings here!!.
</p>
<p>
~ ~ ~
</p>
<p>
When the two agents' utility functions
are logically independent
(ie; there is no mutual information between them)
we refer to this as the 'independent goals regime'
and say that our agents have
independent terminal goals.
</p>
<p>
- The agents (in examples so far)
could still interact with each other;
they just interacted indirectly,
each one changing the effective 'reward landscape'
that the other agent perceived.
- We sample each agent's reward function
by drawing a reward value
from a uniform distribution.
- That means each agent sees
a different reward value at each state,
and each state corresponds to
a different pair of positions
the two agents can take on the 'grid-world'.
- So when Agent A moves from one cell to another,
Agent H suddenly sees a completely different
set of rewards over the 'grid-world' cells
it can move to (and vice versa).
</p>
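<p>
The indirect interaction described above can be sketched as follows,
assuming a 3x3 grid with cells numbered 0 to 8 and rewards drawn
uniformly from [0, 1] (the interval and the array layout are
assumptions for illustration, not details from the cited post).
</p>

```python
import numpy as np

n_cells = 9  # a 3x3 grid, cells numbered 0..8
rng = np.random.default_rng(0)

# One reward per joint state (human_cell, ai_cell), drawn independently
# for each agent, so neither reward table carries information about the other.
reward_h = rng.uniform(0.0, 1.0, size=(n_cells, n_cells))
reward_a = rng.uniform(0.0, 1.0, size=(n_cells, n_cells))

def landscape_for_human(ai_cell):
    """Rewards Agent H sees over its own cells while Agent A sits at ai_cell."""
    return reward_h[:, ai_cell]

before = landscape_for_human(ai_cell=4)
after = landscape_for_human(ai_cell=5)
print(np.allclose(before, after))
```

<p>
Because each joint state has its own independently drawn reward, a single
step by Agent A swaps out the whole column of rewards that Agent H is
choosing among, even though the agents never touch.
</p>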
<p>
~ ~ ~
</p>
<p>
- that we have proposed a toy setting
to model human-AI interactions.
- This setting has properties
that we believe could make it useful
to research in long-term AI alignment,
notably the assumption
that the AI agent strictly dominates
the human agent in terms of its learning timescale.
</p>
<p>
- Throughout this work,
we have tried to draw a clear distinction
between the alignment of
our agents' terminal goals
and the alignment of
their instrumental goals.
- As we have seen,
two agents may have
completely independent terminal goals,
and yet systematically compete,
or collaborate,
for instrumental reasons.
- This degree of competition or collaboration
seems to depend strongly on the details
of the environment in which the agents are trained.
</p>
<p>
- we found that emergent interactions
between agents with independent goals
are quite consistently competitive,
in the sense that the instrumental values
of two agents with independent goals
have a strong tendency to be negatively
correlated.
- We have named this phenomenon
'instrumental misalignment-by-default',
to highlight
that our agents' instrumental values
tend to be misaligned
unless an active effort
is made to align their terminal goals.
</p>
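<p>
A minimal sketch of how 'negatively correlated' instrumental values
could be measured, assuming the measure is the Pearson correlation
between the two agents' POWER values taken across states. The function
name is hypothetical and the numbers are made up for illustration; they
are not results from the cited posts.
</p>

```python
import numpy as np

def instrumental_alignment(power_h, power_a):
    """Pearson correlation between two agents' POWER values across states."""
    return float(np.corrcoef(power_h, power_a)[0, 1])

# Made-up POWER values over four joint states, for illustration only.
power_h = np.array([0.60, 0.55, 0.70, 0.50])
power_a = np.array([0.52, 0.58, 0.45, 0.63])
print(instrumental_alignment(power_h, power_a))
```

<p>
A value near -1 means the states one agent finds most valuable are the
ones the other finds least valuable (instrumental misalignment); a value
near +1 means the two agents value the same states.
</p>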
<p>
- that improving terminal goal alignment
does improve instrumental goal alignment.
- where in the limit of perfect alignment
between our agents' terminal goals;
that their instrumental goals
are perfectly aligned too.
</p>
<p>
> - and of course; my own impossibility proof
> is directly in regards to terminal goals.
</p>
<p>
- that our work is not a comprehensive study
of instrumental convergence.
</p>
<p>
- we have provided existence proofs
of several interesting phenomena.
</p>
<p>
- our conclusions are grounded in anecdotal examples
rather than in a systematic investigation.
</p>
<p>
> - ?; what could possibly constitute
> "systematic investigation" in this case?.
</p>
<p>
- that some of the phenomena we have observed,
instrumental misalignment-by-default,
in particular, are robustly reproducible.
</p>
<p>
- that we are still a long way
from fully characterizing any of them,
either formally or empirically.
</p>
<p>
> - ?; what more do you want?.
> - that the toy model already shows
> the limiting factors.
> - as that increasing the degrees
> to which there is entanglement between
> two agents due to overlapped spaces
> and scarcity of resources,
> can only strengthen the identified effects.
</p>
<p>
- consider our research as motivating and enabling
future work aimed at producing
more generalizable insights
on instrumental convergence.
- We would like to better understand
why and when it occurs,
how strong its effect is,
and what approaches we might use to mitigate it.
</p>
<p>
> - as the traditional/obligatory
> 'always suggest more work'.
> - as current funding seeking
> for more future funding, etc.
</p>
<p>
- ?; what happens when we consider
different reward function distributions?.
</p>
<p>
- that the experiments so far
have sampled agent rewards
from a uniform distribution over states.
- that rewards in the real world
are sparser, and often follow
power law distributions instead.
- ?; How does instrumental value change
when we account for this?.
</p>
<p>
- where with exploring deeper understanding
of physical interactions,
beyond just surface
agent-agent interactions,
with the no-overlap rule
as the simplest physical interaction
we could think of;
- that we believe incorporating
more realistic interactions
into future experiments
could help improve our understanding of
the kinds of alignment dynamics
that are likely to occur in the real world.
</p>
<p>
- where considering the
robustness of instrumental misalignment;.
- that we think instrumental
misalignment-by-default is a fairly robust phenomenon.
- But we could be wrong about that,
and it would be good news
if we were!.
- We would love to see
a more methodical investigation
of instrumental misalignment,
including isolating the factors
that systematically mitigate
or exacerbate the effect.
</p>
<p>
- that these types of simulation
give some intuitions/ideas
to maybe understand and mitigate
instrumental convergence.
</p>
<p>
> - though there may be other reasons to believe
> that such convergence (both internal and external)
> may not be constrained, at all, in the long run.
</p>
<p>
- While existence proofs
for instrumental misalignment
are interesting,
it is much <b>more</b> interesting
if we can identify contexts
where this kind of misalignment does not occur --
since these contexts
are exactly the ones that may offer hints
as to the solution of the full AI alignment problem.
</p>
<p>
> - one hint; that misalignment only does <b>not</b> occur
> when there is no self agency on the part of the AI.
</p>
<p>
:note:
Content contained herein, as excerpts intended
for research and commentary purposes <b>only</b>,
are believed to be reproducible "without permission"
due to the 'fair use' provisions
of the US versions of the copyright act.
</p>
</body>
</html>