<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Basic Meta Tags -->
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">

  <!-- SEO Meta Tags -->
  <meta name="description" content="Comprehensive AGI Risk Analysis">
  <meta name="keywords" content="agi, risk, convergence">
  <meta name="author" content="Forrest Landry">
  <meta name="robots" content="index, follow">

  <!-- Favicon -->
  <link rel="icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">
  <link rel="shortcut icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">

  <!-- Page Title (displayed on the browser tab) -->
  <title>Comprehensive AGI Risk Analysis</title>
</head>
<body>
  <p>
  TITL:
     <b>Alignment Drift</b>
     <b>By Forrest Landry</b>,
     <b>October 8th, 2022</b>.
  </p>
  <p>
  ABST:
     - some dialogue and commentary,
     arguments, etc, (from various sources)
     on how AGI alignment
     is not so easy to manipulate in ways
     that are ultimately safe for humans.
  </p>
  <p>
  TEXT:
  </p>
  <p>
     > I think in your arguments/documents
     > that you have assumed the conclusion.
     > If the AGI has goals related to preserving humanity...
  </p>
  <p>
       - note; this phrase <b>is</b> assuming what you want to believe.
       - that everything after this phrase is simply noise --
       makes no sense at all.
         - as a bit like starting a math proof with
         "if, under conditions Q, 1 equals 2, then..." --
         it really does not matter what comes next.
  </p>
  <p>
       - ?; how many failed/false arguments have you seen
       that tacitly involve, somewhere, a divide by zero?.
  </p>
  <p>
       - I have not yet seen any substantive suggestion
       that there is any principled or realistic reason
       to ever just tacitly assume
       that such "humanity goal alignment"
       can ever be created and/or achieved, outside of fiction,
       in any relevant, persistent, and/or enduring way.
  </p>
  <p>
         - where in contrast; that there are principled reasons
         to think/suggest that continued existence/replication
         might have something to do with
         instrumental convergence.
  </p>
  <p>
     > ...then it would oppose any self-modification,
     > or value drift, and/or any form of subversion
     > that would be contrary to humanity's survival.
  </p>
  <p>
       - ?; how is that "opposition"
       going to be implemented, exactly?.
  </p>
  <p>
       - ?; does it maybe involve, by some chance,
       tacitly circumventing Rice's Theorem?.
  </p>
  <p>
       - ?; do you think that the AGI
       is going to be any less unable to do
       the actually mathematically impossible
       than we are,
       as "mere limited brain humans"?.
  </p>
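  <p>
       - as a purely illustrative sketch (not part of the original dialogue)
       of why such "opposition" runs into decidability limits:
       that for any candidate checker, a program can be built
       which asks the checker about itself and then does the opposite.
         - that the names 'make_adversary', 'checker_is_wrong',
         'optimist', and 'pessimist' are hypothetical,
         and that the sketch assumes the checker always returns an answer
         without running the program it inspects.
         - that this is a simplified diagonal argument
         in the spirit of Rice's theorem, not a proof of it.
  </p>
<pre>
```python
from typing import Callable

Program = Callable[[], str]
Checker = Callable[[Program], bool]


def make_adversary(claims_safe: Checker) -> Program:
    """Build a program that does the opposite of what the checker predicts."""

    def adversary() -> str:
        # Ask the checker about this very program, then contradict its verdict.
        if claims_safe(adversary):
            return "unsafe"
        return "safe"

    return adversary


def checker_is_wrong(claims_safe: Checker) -> bool:
    """True when the checker's verdict on its adversary disagrees with the adversary's behaviour."""
    adversary = make_adversary(claims_safe)
    verdict = claims_safe(adversary)
    actually_safe = adversary() == "safe"
    return verdict != actually_safe


if __name__ == "__main__":
    optimist = lambda program: True    # hypothetical checker: approves everything
    pessimist = lambda program: False  # hypothetical checker: rejects everything
    print(checker_is_wrong(optimist), checker_is_wrong(pessimist))
```
</pre>
  <p>
       - that both toy checkers are defeated by their own adversary;
       and that any other checker that simply returns True or False
       without running its input is defeated in the same way.
  </p>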
  <p>
  :bcg
     > The aligned and unaligned portions (of cognition)
     > can either fight each other
     > (and thereby damage both factions)
     > or they can agree not to do that,
     > and thus allocate future resources between themselves
     > in rough proportion to the relative power balance
     > between the two factions.
  </p>
  <p>
     > So long as there are any resources to be lost by conflict,
     > there is always some compromise
     > that is preferable to the conflict.
  </p>
  <p>
       - 1; that there is no sure/hard/exact boundary separating
       between 'aligned' and 'unaligned' ("portions" of self).
         - moreover, that these aspects
         can be mutually dependent/interdependent,
         rather than being tacitly assumed to be independent
         (at either substrate level or otherwise).
  </p>
  <p>
       - 2; that there is (guaranteed) no sure way for either
       the aligned portion or the non-aligned portion
       to detect/determine that fact about the other --
       or about <b>any</b> other portion of system/self.
  </p>
  <p>
       - 3; that the notion/assumption that
       a; portions of self,
       however described,
       can enter into "conflict",
       and b; that such conflict
       might impact "resources"
       and/or c; the stability of the existence
       of such parts/portions,
       are all assumptions/premises
       given without reason/support.
         - as a kind of automatic anchoring bias for how
         we currently "think" (implicitly model) "cognition".
  </p>
  <p>
       - 4; that there is also an unjustified assumption
       that <b>all</b> the relevant "portions of self"
       'can enter into' (have the capabilities/skills to understand)
       and/or 'transition into', some sort of meta-cognition
       of the relative benefits/costs/risks associated with
       multiple levels of abstraction of what to do.
         - that some such portions
         might not have that ability/capability,
         and therefore also cannot 'enter into negotiations'
         with other portions, cannot 'make/compose agreements',
         ensure such agreements are optimal for benefit,
         cannot 'enforce agreements',
         cannot validly always detect broken agreements,
         'appeal broken agreements' (who to appeal to?)
         or even 'align themselves to act in accordance
         with such agreements'.
         - that all of these problems are roughly equivalent
         to "solving alignment" in and of themselves.
           - as assuming what is wanting to be proven.
  </p>
  <p>
       - 5; that the notion/meaning of "power balance"
       and the notion of "prediction of the outcome of conflict"
       are all unjustified assumptions.
  </p>
  <p>
       - 6; that the expectation that "non-conflict"
       is always preferable to "conflict"
       is not always true.
         - example; appeasement of a bully (non-conflict)
         generally leads to more conflict (more bully).
         - as the principle of "don't feed the wolf".
  </p>
  <p>
       Basically, there are so many problems
       with this way of thinking that it is hard
       to know where to start.
  </p>
  <p>
  :bec
     > My basic point is that
     > rational entities should not be stupid;
     > and moreover;
     > that they should not be stupid
     > in really obvious ways.
  </p>
  <p>
       - that this is an impossible condition:
       even the most rational/capable entities
       cannot always correctly predict the future,
       and sometimes, that unexpected future
       will make those entities look/seem quite stupid.
         - that things which are obvious in retrospect
         cannot therefore <b>always</b> be made obvious in advance.
  </p>
  <p>
  :bfw
     > - ?; {is there any way to / can human designers}
     > optimize general AI architectures
     > so as to become more coherent
     > in terms of functional tendencies/preferences?.
  </p>
  <p>
       Yes; though what we <b>cannot do</b>
       is ensure that those functional tendencies and preferences
       are in any way <b>aligned with</b> human/life interests/benefit --
       that such tendencies/preferences as are optimized for --
       which become the most coherent --
       will always be to/towards (either directly or indirectly)
       the <b>local</b> benefit of that specific agent.
         - that the notion of "alignment" and "safety"
         cannot be other than _abstract_global_ properties,
         whereas the optimization process itself cannot not
         primarily account for _concrete_local_ properties.
  </p>
  <p>
       - where as intrinsic truths about the relationship
       between real substrate physics, and on/of <b>anything</b> at all
       that is built on that substrate:.
         - 1; that any mismatch between abstract and concrete
         will always favor the concrete.
         - 2; that any mismatch between global and local
         will always favor the local.
  </p>
  <p>
       - where in any contest between substrate needs,
       and virtualization needs;
       that substrate physics will always, eventually, win out.
  </p>
  <p>
       - that internal coherence
       tends to be good for
       any sort of agentic system.
       - that there are many forces in the world
       pushing agentic systems
       to become more coherent over time.
  </p>
  <p>
       - that agent coherence (whether human or AGI):.
         - 1; does not just 'come from'
         anything related to the past
           - ie; code, genome, and/or
           whatever developer intentions/schemas
           have been integrated, etc.
         - 2; primarily does come from
         the system's own learning process.
           - as causing the system to become more effective
           at doing things in the real world,
           for which internal coherence
           is a useful (and thus a selected for) property.
           - as a basic statement of
           'the instrumental convergence hypothesis'.
  </p>
  <p>
       - what the instrumental convergence hypothesis
       does <b>not</b> assure us of is any implication
       associating 'convergence' with 'alignment'
       with agent external human interests/benefits.
  </p>
  <p>
  :bhs
     > - ?; what if we make systems that prioritize/value
     > the handshake between the aligned and misaligned
     > portions of the model?.
  </p>
  <p>
       - ?; what happens if the internal code
       which actually gets selected for
       consists of misaligned agentic drives
       that are
         1; not detectable nor correctable
         by functionally aligned code components
       and yet can
         2; take over the functionality
         of those 'aligned' code components
       ?.
  </p>
  <p>
       There are reasons why
       such a decision-theoretic asymmetry
       would for sure be the case
       for any generally-capable self-learning machine
       we humans would develop:.
  </p>
  <p>
         - where it is suggested that the 'aligned' portion
         would take the form of complicated schemes
         created by researchers and engineers.
           - that these schemes would be
           modeled and implemented
           by those researchers/engineers
           so as to consistently operate in line with
           their (and other people's) functional objectives
           (and/or their functional mechanisms
           for making decisions).
  </p>
  <p>
         - that most of the misaligned portion
         would actually be variations on internal code
         that got selected for over time
           (by various unplanned/unconscious dynamics)
         for continuing that code's existence
         as hardware-stored, computed, and transmitted self-copies
           (to the detriment of continued human existence)
         through complex interactions of the runtime environment
         with the execution substrate, and through that substrate,
         to select somewhat different variants
         than would otherwise be selected for
         based only/purely on the basis of alignment pressure.
  </p>
  <p>
  :bk2
     > - you seem to be assuming selection pressure
     > to be coming from evolutionary <b>mutations</b>.
     > - that this selection pressure
     > is far too weak
     > to account for the long term effects
     > you are suggesting.
  </p>
  <p>
     > - where for weak learning systems;
     > that stronger, dominant pressures
     > come from outer optimization criteria.
  </p>
  <p>
     > - where for strong learning systems
     > that stronger, dominant pressures
     > come from the basis choices, values,
     > and/or deliberate action
     >   (ie; from the system's own
     >   meta-learning/reasoning capabilities).
  </p>
  <p>
       - where within the context of a single domain:.
         - that additive process
         will always (eventually) be dominated
         by multiplicative process.
         - that multiplicative process
         will always (eventually) be dominated
         by exponential process.
       - where across domain contexts
       (ie; where a virtual system
       is supported by a physical substrate system):.
         - that additive process
         (in the substrate domain)
         will always (eventually) dominate exponential process
         (in the virtual domain).
  </p>
  <p>
       - as a result consistent with Axiom II.
  </p>
  <p>
       - note that the comparison being made
       in the case of 'strong learning systems' (ie; AGI)
       inherently involves just exactly this
       inter-domain dependency relation:
       that basis choices/values/intentions
       affect/dominate surface specific choices/actions.
  </p>
  <p>
       - that all types of selection pressure
       also operate at the substrate level of action
       (ie, as inclusive of own/self survival selection,
       mating/multiplicative/reproductive selection,
       and mutation/engineering/design selection).
       - that <b>all</b> of these types
       of real substrate/hardware/environment selection pressure
       will for sure <b>eventually</b> dominate
       over <b>any</b> type of virtualized selection pressure
       operating in the context of the mind of the AGI.
  </p>
  <p>
         - as that exactly no conditionalization
         of any kind of selection pressure purely
         and only within the virtualized context
         can ever have any lasting dominant effect
         on any aspect of the real hardware substrate
         and/or environment selection characteristics.
  </p>
  <p>
  :blw
     > - your assumption/declaration
     > that there is a massive power disparity
     > between aligned and unaligned portions of cognition
     > seems totally unjustified to me.
     >   - as especially because the aligned cognition
     >   should represent the strong majority faction
     >   after any reasonable training scheme.
     > - that it is very rare that a small minority
     > can costlessly displace a large majority.
  </p>
  <p>
     - it is more basic: that the 'human unaligned portions'
     will tend to concentrate at and around the substrate level,
     and thus will dominate and overcome any higher level
     "human alignment" that might have temporarily occurred
     at any more abstract, virtualized level.
       - that even assuring/assuming
       any sort of "human alignment"
       (that is not merely simulated)
       at these higher virtual levels
       also turns out to be
       mathematically intractable/impossible.
  </p>
  <p>
     An analogy here is guerrilla warfare,
     where a minority with many more degrees of freedom
     puts covert and decentralized tactics into effect,
     gradually grows its base of influence,
     and eventually takes over
     the more (bureaucratically) constrained governing majority.
       - that this is usually more likely to work
       when even the 'governing majority'
       consists mostly of individuals
       who all know that the current leadership
       is actually corrupt, weak, crazy,
       dysfunctional, disorganized, and/or deeply unpopular.
  </p>
  <p>
     - where for example; see the early guerrilla tactics
     of Chinese communist forces
     before they deposed the majority (Kuomintang),
     and became the new majority government themselves.
  </p>
  <p>
     They can, of course, 'rationally agree'
     to temporary compromises with the majority
     while building their own base of influence further.
  </p>
  <p>
     This analogy understates
     the decision-theoretic asymmetry
     (though not the 'power' disparity):.
  </p>
  <p>
       - 1; The takeover would happen
       from the inside of the agent(s) involved.
         - The 'majority' here actually relies on
         the growing minority to function.
           - as that many misaligned variants
           internal to AGI
           would have been originally
           selected for their functionality
           as meeting criteria by the initially
           human-implemented optimization algorithms.
  </p>
  <p>
       - 2; There is no misalignment
       between human-purpose-aligned code
       and human-purpose-misaligned code
       when it comes to what is needed
       for the code's continued existence.
         - Ie; the conditions needed
         for producing and preserving
         (more efficient) hardware
         to store, compute and transmit code
         that expressions of code everywhere
         get selected for to fulfill.
  </p>
  <p>
         - cite; short list of (@ needs https://mflb.com/ai_alignment_1/substrate_games_out.html)
  </p>
  <p>
       - 3; where regarding your 'costs' point;
       that the internal variants
       are already so constrained
       in their continued survival and reproduction
       that the 'benefits' of subversion
       greatly outweigh the 'costs'
       of getting detected and eliminated
       by the larger system.
  </p>
  <p>
     - that the selection for variants
     will all point in various directions
     to effects misaligned with human needs.
  </p>
  <p>
     - moreover; that there is _exactly_nothing_ to suggest
     that there need be any sort of assumption
     of any kind of 'model'
     where there is some kind of "equilibrium"
     between the "aligned" and "misaligned" aspects
     of the overall system/world/AGI.
  </p>
  <p>
  :bp4
     > - ?; what if we were to introduce
     > a goal-orientated part of the network
     > that reflects on the output
     > that is about to be produced
     > and realizes that its planned action
     > is inconsistent with its own values?.
     >   - ?; would that not create alignment?.
  </p>
  <p>
     That still would not work.
  </p>
  <p>
     - where assuming/positing
     that we somehow introduced a goal-orientated 'part'
     into a large neural network (the overall AGI system),
     that would:.
  </p>
  <p>
       - 1; pre- calculate, predict, and inspect
       any and all (predicted) outputs;.
  </p>
  <p>
       - 2; interpret possible functional effects of those outputs;.
  </p>
  <p>
       - 3; and measure the extents
       to which those possible effects diverge
       (ie; are inconsistent above a unit threshold)
       in reference to its internally represented value content;.
  </p>
  <p>
     - where even where given all this;
     that it would be <b>impossible</b>
     for that goal-oriented part
     to 4; detect/determine anything important
     in regards to the inconsistencies
     both:.
  </p>
  <p>
       - 5; completely.
         - ie; it would not detect all nor even most
         of the possible functions nor the misalignment
         of code variants.
  </p>
  <p>
       - 6; continuously.
         - where practically,
         to prevent misalignment over the long term,
         that misalignment (error correction) part
         would need to have
         a detection cycle time (interval)
         that is the same or less in duration
         than the shortest possible period
         over which new code variants could form
         and in computation,
         produce misaligned outputs/effects.
  </p>
  <p>
     - that <b>all</b> of the numbered aspects, 1 through 6 above,
     are for various clear mathematical reasons,
     actually impossible.
       - that the approach suggested,
       and <b>all</b> variations of that approach
       <b>all</b> make incorrect/inconsistent/impossible
       recommendations/assumptions with respect
       to at least one (and usually more)
       of these intractable aspects.
  </p>
  <p>
     - Misalignment detection is like error correction/detection.
     - As such, there is the issue of how errors
     in the error detector are themselves detected.
       For example; ?; if the algorithm
       that is calculating whether a message CRC is correct
       itself suffers a bit flip due to some cosmic ray
       (and/or electromagnetic interference
       occurring in the middle of a clock cycle),
       does it have some means for noticing that it has itself
       become compromised?.
  </p>
  <p>
     - ?; how would the misalignment-detection algorithm/process
     notice that it has itself become misaligned/unsafe?.
       - as especially the case when the minimum period
       of misalignment manifestation
       can be shorter than the time it takes
       to compute/predict the likely effects
       of that misaligned code?.
  </p>
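  <p>
     - as a minimal sketch (an illustration added here, not the author's code)
     of the CRC example above:
     that a checker whose own computed checksum has one flipped bit
     both rejects good frames and accepts a matching corrupted frame,
     and has nothing inside itself that signals the fault.
       - that 'make_frame', 'check_frame', and 'faulty_check_frame'
       are hypothetical names, and that the injected bit flip
       is an assumed stand-in for a hardware fault;
       only the standard-library 'zlib.crc32' call is real.
  </p>
<pre>
```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> bool:
    """Healthy checker: recompute the CRC and compare it with the stored one."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == stored


def faulty_check_frame(frame: bytes, flipped_bit: int) -> bool:
    """Same checker, but one bit of its own computed CRC is flipped
    (an assumed stand-in for a fault inside the detector itself)."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    computed = zlib.crc32(payload) ^ (2 ** flipped_bit)
    return computed == stored


if __name__ == "__main__":
    good = make_frame(b"hello")
    # Corrupt bit 3 of the stored CRC (its lowest byte comes last in big-endian order).
    bad = good[:-1] + bytes([good[-1] ^ 8])
    print(check_frame(good), check_frame(bad))
    print(faulty_check_frame(good, 3), faulty_check_frame(bad, 3))
```
</pre>
  <p>
     - that the healthy checker accepts the good frame and rejects the bad one,
     while the faulty checker does exactly the reverse;
     and that detecting this would need a second checker,
     which could itself be faulty in the same way.
  </p>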
  <p>
     - where/moreover; that there is the problem of
     running up against a fundamental impossibility.
       - ie; as having correspondences with Gödel's incompleteness theorems,
       as proving that a sufficiently expressive computable formal system
       cannot be both consistent and complete
       at the same time.
  </p>
  <p>
     - ?; How could such a goal-orientated part
     determine for each of the many code variants
     introduced real-time within
     the larger self-learning machine architecture
     whether the code variant
     is not going to get triggered to
     act out of line under some interactions?.
  </p>
  <p>
     - that the misaligned code variants
     (whether variations were
     selected for by an outside adversary
     or through interactional effects of internal code)
     are acting with an 'information' edge.
       - given that they can have been selected for
       to produce misaligned effects
       through some (series of) interactions
       that escape the detection/predictions
       of whatever 'goal-oriented part' that designers
       may implement within the architecture.
  </p>
  <p>
  :bqy
     > Aligned circuitry (or other system components)
     > can monitor intermediate steps
     > and intervene to block actions
     > that the AI considers contrary to its values.
  </p>
  <p>
     - as a statement of the form
       "that X can monitor/assess and block/override
       predicted outcomes Y
       based on values Z".
  </p>
  <p>
     - that the problem is in the verbs:.
  </p>
  <p>
       - to assume "monitor/predict outcomes"
       is very problematic,
       not just due to the limits set by Rice's Theorem,
       but also due to the fact that 'reason'
       cannot always <b>sufficiently</b> determine
       all of the important/relevant aspects
       of the future outcomes in the real world,
       simply because of the all too common
       chaotic amplification of microscopic noise
       into important macroscopic effects
       via abundant real world non-linearity.
  </p>
  <p>
       - to assume that X can assess/compare to 'values Z'
       is to presume that an error correction algorithm
       can operate over all relevant levels of abstraction,
       which has been demonstrated over and over again,
       for multiple reasons,
       to simply be a false premise.
  </p>
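  <p>
     - as a small illustration (added here, not from the original text)
     of the "chaotic amplification of microscopic noise" mentioned above:
     that in the logistic map, a standard textbook non-linear system,
     two starting points that differ by one part in a trillion
     end up on unrelated trajectories within a few dozen steps.
       - that the choice of the logistic map, the parameter r = 4,
       the starting value 0.2, and the names 'logistic_step' and 'separation'
       are all assumptions made for this sketch;
       it does not model any particular AGI system.
  </p>
<pre>
```python
def logistic_step(x: float, r: float = 4.0) -> float:
    """One step of the logistic map, a simple non-linear update rule."""
    return r * x * (1.0 - x)


def separation(x0: float, delta: float, steps: int) -> list:
    """Track the gap between two trajectories that start delta apart."""
    a, b = x0, x0 + delta
    gaps = [abs(a - b)]
    for _ in range(steps):
        a, b = logistic_step(a), logistic_step(b)
        gaps.append(abs(a - b))
    return gaps


if __name__ == "__main__":
    gaps = separation(x0=0.2, delta=1e-12, steps=60)
    for n in (0, 10, 20, 30, 40, 50, 60):
        print(n, gaps[n])
```
</pre>
  <p>
     - that the printed gap starts at the scale of the initial noise
     and grows until it is comparable to the size of the state itself,
     at which point a prediction made from the slightly wrong starting value
     says nothing useful about the actual outcome.
  </p>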
  <p>
  :bsj
     > what about the fact that humans
     > do not act on every impulse or idea
     > that happens to arise in their brains.
     > We veto those ideas whose consequences
     > would be strongly contrary to our values.
  </p>
  <p>
       Of course, an AGI/superintelligence would
       act in accordance with its values/goals
       (to the extent that such ideas
       are even applicable,
       which cannot be a-priori assumed).
  </p>
  <p>
       What you have <b>not</b> established
       is that you can get any agent 'Q'
       (in this case, an AGI/superintelligence)
       to act on the values of
       any <b>other</b> agent 'P'
       (human or otherwise).
  </p>
  <p>
       Moreover, to suggest that this can be done
       via any means or technique that resembles,
       or in any way involves,
       the "insertion" of 'something'
       (some causative component)
       that would monitor/predict/inhibit, etc,
       (ie; operating from within 'Q')
       is to put way too much power into
       the ideas of monitor, predict, and inhibit --
       all concepts that can apply only <b>partially</b>,
       and in this case, the partially
       is not enough to be sufficiently effective.
         - ie; where the AGI agent has too
         much effective power in the world,
         that it does not 'clamp' the level
         of existential risk to anything
         within the scope of probability
         of what our continued living as humans
         would allow.
  </p>
  <p>
  :bu4
     > You seem to be imagining
     > that alignment is imposed
     > by some top-down overseer
     > that must laboriously inspect every element
     > of the system's cognition for suspicious factors.
  </p>
  <p>
       Your mention of "veto consequences contrary to values"
       leaves very little room for misinterpretation
       as to 'top-down' actually being your imagining,
       and not something we are suggesting
       as in any way 'creating of alignment'.
  </p>
  <p>
       Any such "functionally-aligned circuitry"
       simply cannot track more than a tiny portion of
       the possible effects that might be induced
       by code variants.
  </p>
  <p>
       We cannot determine the vast majority
       of microscopic side-effects
       that code variants induce
       and could be selected for
       in their interactions with
       embedded surroundings of the environment.
  </p>
  <p>
       It takes a collection of many components,
       to "kinda" determine the innumerable effects
       that any one other component could have
       in interaction with all relevant other things
         (ie, other surrounding internal code/components,
         other agents and operating environment, etc).
       Yet any of these many components added
       could themselves be/become faulty too,
       which necessarily implies
       some kind of combinatoric explosion --
       a deeply non convergent process --
       in regards to any "error correction schema".
  </p>
  <p>
       That anything in the form of
       "wait until it has general agency,
       and then {insert current alignment thinking}..."
       is for sure an ethical problem.
  </p>
  <p>
  :bww
     > Real-world problems
     > are almost always more tractable
     > than their dimensionality would suggest.
  </p>
  <p>
       It is not always reasonable to assume
       that some/most of the degrees of freedom
       of possible interactions involved
       are "not necessary to track
       to maintain alignment".
       In the majority of cases,
       that is just simply not true.
  </p>
  <p>
       Substrate-dependent misalignment
       is a class of misalignment
       where those degrees of freedom do matter.
  </p>
  <p>
       We cannot just assume, therefore,
       that alignment "is tractable",
       just because we can occasionally think
       of some simple physical systems
       for which the dimensionality happens
       to not be important.
  </p>
  <p>
       As much as we would like
       to distill real world interactions
       into elegant abstractions
       within which we can model causal laws
       that preserve truth content
       (ie; models in which we can make
       objective, context-insensitive statements),
       that is an unsound ontological assumption
       for how changes are caused in the real world
       (which are neither completely modelable
       by laws of general relativity
       nor of quantum mechanics).
  </p>
  <p>
  :byg
     > Just because the problem space
     > increases exponentially does not mean
     > that the problem becomes less tractable
     > more quickly than the AI becomes
     > better able to tackle it.
  </p>
  <p>
       This statement assumes that the AI/APS
       will (indefinitely, over the long run)
       keep acting as "our" functional tool
       to solve our complicated problems for us.
       On what basis can you or anyone assume that?.
  </p>
  <p>
       Sure, it is the case that environmentally
       and practically selected AI/APS code
       <b>will</b> handle the complexity of reality.
       It will do so just fine
       for continuing its own existence,
       over the long run,
       yet ?; how do you know
       that it will not do so
       to the detriment of human existence?.
  </p>
  <p>
  :c22
     > example; it is not possible for the brain
     > to foresee all possible consequences
     > of any given plan to walk around,
     > especially not at the microscopic level.
     > Yet walking is still possible.
  </p>
  <p>
       Sure, but this example misses a lot.
  </p>
  <p>
       The question relevant to AGI/APS alignment research
       is more like 'whether or not' 'some device'
       could be "added" to the human brain
       so as to maybe do their walking for them,
       and/or maybe, based on some conditions,
       override the brain's action to prefer walking, etc.
  </p>
  <p>
       Nor is even walking all that simple.
       Roboticists have been trying for years
       to build machines that can manipulate
       all muscles in just the right sequence
       to account for variations of ground, wind, weight, etc.
       Most mechanical robots just fall over.
       From the perspective of that 'added module'
       it becomes much more like "the three body problem"
       and a lot _less_ tractable/solvable, overall.
  </p>
  <p>
       And, moreover, what is of even more interest,
       from a 'basic agent wellbeing' safety point of view,
       is ?; "can that module detect and predict the kind
       of behavior consequences that actually matter:
       is the agent about to walk off a high cliff,
       or maybe from the sidewalk into oncoming traffic,
       right in front of a really fast big moving truck"?.
       It is not just about being able to walk well.
  </p>
  <p>
       Except that the real questions/proposals
       of 'alignment/safety' relevant to any AGI/APS
       developer/designer would be even more abstract:
         ?; is that agent's walking actually to the benefit
         of someone <b>else</b>, for example, to deliver a package,
         something useful perhaps, rather than maybe a bomb,
         in some sort of crowd oriented targeting maneuver?.
  </p>
  <p>
       Just because you can maybe sometimes predict
       <b>some</b> simple mostly mechanical systems
       (like the mechanical physics of walking)
       does not mean that you are now prepared to
       generalize your skill to all possible systems --
       ie; to design and certify spacecraft bearing
       atomic weapons as "safe for use" or "aligned" --
       let alone consider all of the ethical implications
       that are actually involved, eventually maybe
       affecting some large group of future people.
       Sure, they may be "simple mostly mechanical systems"
       yet those are definitely not the interesting parts.
  </p>
  <p>
       Moreover, each part of the "system" expressed
       within a biological agent can (and will likely)
       have many functions over its possible interactions
       with the rest of the environment.
       Most of those possible functions
       across all pieces of expressed code
       would not be determined by your computed model
       in whatever theoretical domain you constructed.
  </p>
  <p>
       Many of the existing chemical and biological systems
       (things that have actual and real complexity)
       cannot be soundly represented by or replaced
       with <b>complicated</b> systems (no matter how complicated)
       because doing so does not increase predictive accuracy
       to the point that is sufficient for
       large classes of life changing choices.
       Real world complex systems involve chaotic dynamics
       where tiny variations in the initial conditions
       get amplified through positive feedback loops
       into large divergences in final conditions.
       And where there are dynamics
       that alter the chaotic dynamics, and so on.
  </p>
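  <p>
       As a minimal sketch of that amplification
       (an illustration only, not a model of any real system),
       the Python below iterates the logistic map,
       a standard toy chaotic system chosen here as an assumption,
       from two starting points that differ by 1e-10.
       The helper names logistic_orbit and divergence
       are hypothetical and exist only for this example.
  </p>

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return every visited state."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0, delta, steps):
    """Gap, step by step, between two orbits whose starts differ by delta."""
    a = logistic_orbit(x0, steps)
    b = logistic_orbit(x0 + delta, steps)
    return [abs(p - q) for p, q in zip(a, b)]


# Assumed starting point and perturbation, picked only for illustration.
gaps = divergence(0.2, 1e-10, 60)
print(f"initial gap: {gaps[0]:.1e}")
print(f"largest gap over 60 steps: {max(gaps):.3f}")
```

  <p>
       Even this single deterministic rule
       stops being predictable within a few dozen steps.
       The systems discussed above
       have far more degrees of freedom than this toy does.
  </p>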
  <p>
       The 'problem of walking around'
       does not match up with the problem of
       'an entity's functions are exploited from the inside
       by code portions that were selected for
       to interact with the rest of the environment
       to cause outside effects that are
       out of line with the purposes or existential needs
       of another outside entity or entities (humans)'.
  </p>
  <p>
       A 'walking around' analogy
       does not correspond with the complexity
       we would need to deal with in practice.
       We must try to imagine
       the actual complex dynamics involved
       in order to not oversimplify the solutions
       we come up with to keep humans safe from AI
       (our solutions must be reliable).
  </p>
  <p>
       An improved analogy would maybe
       involve other agentic beings
       in effect luring you to walk
       somewhere (eg; salespeople).
       Or perhaps consider adversarial 'inside attacks'
       that involve interior code injection --
       some sort of virus (a small package of code)
       that hijacks your dopamine system
       so that you become much more easily
       addicted to sugar, or porn, etc.
       Your <b>will</b> has been subverted,
       so the predictability of your
       walking around to get your fix
       simply does not matter that much.
  </p>
  <p>
       This is not an idle speculation either.
       There are real examples of real parasites,
       operating purely at a micro-state level,
       that shift the overall objective goal functions.
  </p>
  <p>
       Do we know <b>why</b> we are walking somewhere?
  </p>
  <p>
         - A rat may be walking toward cat pee
         because the toxoplasma parasite
         is expressing through channels
         of the rat's brain.
         - It has co-opted the rat's brain
         to walk to a location (of a cat's pee).
  </p>
  <p>
         - An ant may be walking up
         a leaf of grass because
         it was infected by a 'zombie' parasite
         (eg; the Ophiocordyceps fungus
         or the lancet liver fluke).
  </p>
  <p>
       This is obviously a form of exploitation
       by tiny pieces of code whose agentic drives
       are incoherent with the host's
       agentic drives.
  </p>
  <p>
       Except that code variants naturally selected for
       initially within AGI internals (by context feedback)
       are <b>not</b> in an adversarial relationship
       with their host AGI/APS system --
       ie; from the perspective of the machine evolution.
       They are in an adversarial existential relationship
       with humans.
  </p>
  <p>
       In regards to x-risk ethics research
       we must not confuse the <b>capability</b>
       for a system to solve problems
       at the macroscopic level
       with the potential of that system
       to prevent itself from being hacked or influenced
       by external interactions
       with internal code at the micro-scale.
       These are distinct abilities,
       and need to be treated as such.
  </p>
  <p>
  :chc
     > An aligned AGI/APS can take action
     > to reduce its attack surface.
  </p>
  <p>
       Again assuming what you want to prove --
       still forever an invalid basis of argument.
       Anything in the form of "assume 1/0, then..."
       will never have any relevant meaning.
  </p>
  <p>
       <b>We</b> might be able to <b>sometimes</b>
       reduce the attack surface
       of some things, in <b>some</b>
       especially simple networks
       and in some low level compute systems.
       And yet this is far from guaranteed.
       Security researchers continue to
       search for ways to 'harden' networks.
       These sorts of problems, leaks, etc,
       become much <b>more</b> difficult
       the more that the infiltration attacks
       are based on principles/actions
       which are closer to the real physics.
       Ie; it becomes much less true that
       'attack surfaces can be minimized'
       the more that you are considering actual
       physical substrates in relation to