<!DOCTYPE html>
<html lang="en">
<head>
<!-- Basic Meta Tags -->
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- SEO Meta Tags -->
<meta name="description" content="Comprehensive AGI Risk Analysis">
<meta name="keywords" content="agi, risk, convergence">
<meta name="author" content="Forrest Landry">
<meta name="robots" content="index, follow">
<!-- Favicon -->
<link rel="icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">
<link rel="shortcut icon" href="https://github.githubassets.com/favicons/favicon-dark.png" type="image/png">
<!-- Page Title (displayed on the browser tab) -->
<title>Comprehensive AGI Risk Analysis</title>
</head>
<body>
<p>
TITL:
<b>Alignment Drift</b>
<b>By Forrest Landry</b>,
<b>October 8th, 2022</b>.
</p>
<p>
ABST:
- some dialogue and commentary,
arguments, etc, (from various sources)
on how AGI alignment
is not so easy to manipulate in ways
that are ultimately safe for humans.
</p>
<p>
TEXT:
</p>
<p>
> I think in your arguments/documents
> that you have assumed the conclusion.
> If the AGI has goals related to preserving humanity...
</p>
<p>
- note; this phrase <b>is</b> assuming what you want to believe.
- that everything after this phrase is simply noise --
makes no sense at all.
- as a bit like starting a math proof with
"if, under conditions Q, 1 equals 2, then..." --
it really does not matter what comes next.
</p>
<p>
- ?; how many failed/false arguments have you seen
that tacitly involve, somewhere, a divide by zero?.
</p>
<p>
- I have not yet seen any substantive suggestion
that there is any principled or realistic reason
to ever just tacitly assume
that such "humanity goal alignment"
can ever be created and/or achieved, outside of fiction,
in any relevant, persistent, and/or enduring way.
</p>
<p>
- where in contrast; that there are principled reasons
to think/suggest that continued existence/replication
might have something to do with
instrumental convergence.
</p>
<p>
> ...then it would oppose any self-modification,
> or value drift, and/or any form of subversion
> that would be contrary to humanity's survival.
</p>
<p>
- ?; how is that "opposition"
going to be implemented, exactly?.
</p>
<p>
- ?; does it maybe involve, by some chance,
tacitly circumventing the Rice Theorem?.
</p>
<p>
- ?; do you think that the AGI
is going to be any more able to do
the actually mathematically impossible
than we are,
as "mere limited brain humans"?.
</p>
<p>
:bcg
> The aligned and unaligned portions (of cognition)
> can either fight each other
> (and thereby damage both factions)
> or they can agree not to do that,
> and thus allocate future resources between themselves
> in rough proportion to the relative power balance
> between the two factions.
</p>
<p>
> So long as there are any resources to be lost by conflict,
> there is always some compromise
> that is preferable to the conflict.
</p>
<p>
- 1; that there is no sure/hard/exact boundary
between 'aligned' and 'unaligned' ("portions" of self).
- moreover, that these aspects
can be mutually dependent/interdependent,
rather than being tacitly assumed to be independent
(at either substrate level or otherwise).
</p>
<p>
- 2; that there is (guaranteed) no sure way for either
the aligned portion or the non-aligned portion
to detect/determine that fact about the other --
or about <b>any</b> other portion of system/self.
</p>
<p>
- 3; that the notion/assumption that
a; portions of self,
however described,
can enter into "conflict",
and b; that such conflict
might impact "resources"
and/or c; the stability of the existence
of such parts/portions,
are all assumptions/premises
given without reason/support.
- as a kind of automatic anchoring bias for how
we currently "think" (implicitly model) "cognition".
</p>
<p>
- 4; that there is also an unjustified assumption
that <b>all</b> the relevant "portions of self"
'can enter into' (have the capabilities/skills to understand)
and/or 'transition into', some sort of meta-cognition
of the relative benefits/costs/risks associated with
multiple levels of abstraction of what to do.
- that some such portions
might not have that ability/capability,
and therefore also cannot 'enter into negotiations'
with other portions, cannot 'make/compose agreements',
cannot ensure such agreements are optimal for benefit,
cannot 'enforce agreements',
cannot always validly detect broken agreements,
cannot 'appeal broken agreements' (who to appeal to?),
or even 'align themselves to act in accordance
with such agreements'.
- that all of these problems are roughly equivalent
to "solving alignment" in and of themselves.
- as assuming what is to be proven.
</p>
<p>
- 5; that the notion/meaning of "power balance"
and the notion of "prediction of the outcome of conflict"
are both unjustified assumptions.
</p>
<p>
- 6; that the expectation that "non-conflict"
is always preferable to "conflict"
is not always true.
- example; appeasement of a bully (non-conflict)
generally leads to more conflict (more bully).
- as the principle of "don't feed the wolf".
</p>
<p>
Basically, there are so many problems
with this way of thinking that it is hard
to know where to start.
</p>
<p>
:bec
> My basic point is that
> rational entities should not be stupid;
> and moreover;
> that they should not be stupid
> in really obvious ways.
</p>
<p>
- that this is an impossible condition:
even the most rational/capable entities
cannot always correctly predict the future,
and sometimes, that unexpected future
will make those entities look/seem quite stupid.
- that things which are obvious in retrospect
cannot therefore <b>always</b> be made obvious in advance.
</p>
<p>
:bfw
> - ?; {is there any way to / can human designers}
> optimize general AI architectures
> so as to become more coherent
> in terms of functional tendencies/preferences?.
</p>
<p>
Yes; though what we <b>cannot do</b>
is ensure that those functional tendencies and preferences
are in any way <b>aligned with</b> human/life interests/benefit --
that such tendencies/preferences as are optimized for --
which become the most coherent --
will always be to/towards (either directly or indirectly)
the <b>local</b> benefit of that specific agent.
- that the notion of "alignment" and "safety"
cannot be other than _abstract_global_ properties,
whereas the optimization process itself cannot not
primarily account for _concrete_local_ properties.
</p>
<p>
- where as intrinsic truths about the relationship
between real substrate physics, and on/of <b>anything</b> at all
that is built on that substrate:.
- 1; that any mismatch between abstract and concrete
will always favor the concrete.
- 2; that any mismatch between global and local
will always favor the local.
</p>
<p>
- where in any contest between substrate needs,
and virtualization needs;
that substrate physics will always, eventually, win out.
</p>
<p>
- that internal coherence
tends to be good for
any sort of agentic system.
- that there are many forces in the world
pushing agentic systems
to become more coherent over time.
</p>
<p>
- that agent coherence (whether human or AGI):.
- 1; does not just 'come from'
anything related to the past
- ie; code, genome, and/or
whatever developer intentions/schemas
have been integrated, etc.
- 2; primarily does come from
the system's own learning process.
- as causing the system to become more effective
at doing things in the real world,
for which internal coherence
is a useful (and thus a selected for) property.
- as a basic statement of
'the instrumental convergence hypothesis'.
</p>
<p>
- what the instrumental convergence hypothesis
does <b>not</b> assure us of is any implication
associating 'convergence' with 'alignment'
with agent external human interests/benefits.
</p>
<p>
:bhs
> - ?; what if we make systems that prioritize/value
> the handshake between the aligned and misaligned
> portions of the model?.
</p>
<p>
- ?; what happens if the internal code
which actually gets selected for
consists of misaligned agentic drives
that are
1; not detectable nor correctable
by functionally aligned code components
and yet can
2; take over the functionality
of those 'aligned' code components
?.
</p>
<p>
There are reasons why
such a decision-theoretic asymmetry
would for sure be the case
for any generally-capable self-learning machine
we humans would develop:.
</p>
<p>
- where it is suggested that the 'aligned' portion
would take the form of complicated schemes
created by researchers and engineers.
- that these schemes would be
modeled and implemented
by those researchers/engineers
so as to consistently operate in line with
their (and other people's) functional objectives
(and/or their functional mechanisms
for making decisions).
</p>
<p>
- that most of the misaligned portion
would actually be variations on internal code
that got selected for over time
(by various unplanned/unconscious dynamics)
for continuing that code's existence
as hardware-stored, computed, and transmitted self-copies
(to the detriment of continued human existence)
through complex interactions of the runtime environment
with the execution substrate, and through that substrate,
to select somewhat different variants
than would otherwise be selected for
based only/purely on the basis of alignment pressure.
</p>
<p>
:bk2
> - you seem to be assuming selection pressure
> to be coming from evolutionary <b>mutations</b>.
> - that this selection pressure
> is far too weak
> to account for the long term effects
> you are suggesting.
</p>
<p>
> - where for weak learning systems;
> that stronger, dominant pressures
> come from outer optimization criteria.
</p>
<p>
> - where for strong learning systems
> that stronger, dominant pressures
> come from the basis choices, values,
> and/or deliberate action
> (ie; from the system's own
> meta-learning/reasoning capabilities).
</p>
<p>
- where within the context of a single domain:.
- that additive process
will always (eventually) be dominated
by multiplicative process.
- that multiplicative process
will always (eventually) be dominated
by exponential process.
- where across domain contexts
(ie; where a virtual system
is supported by a physical substrate system):.
- that additive process
(in the substrate domain)
will always (eventually) dominate exponential process
(in the virtual domain).
</p>
<p>
- as a result consistent with Axiom II.
</p>
<p>
- note that the comparison being made
in the case of 'strong learning systems' (ie; AGI)
inherently involves just exactly this
inter-domain dependency relation:
that basis choices/values/intentions
affect/dominate surface specific choices/actions.
</p>
<p>
- that all types of selection pressure
also operate at the substrate level of action
(ie, as inclusive of own/self survival selection,
mating/multiplicative/reproductive selection,
and mutation/engineering/design selection).
- that <b>all</b> of these types
of real substrate/hardware/environment selection pressure
will for sure <b>eventually</b> dominate
over <b>any</b> type of virtualized selection pressure
operating in the context of the mind of the AGI.
</p>
<p>
- as that exactly no conditionalization
of any kind of selection pressure purely
and only within the virtualized context
can ever have any lasting dominant effect
on any aspect of the real hardware substrate
and/or environment selection characteristics.
</p>
<p>
:blw
> - your assumption/declaration
> that there is a massive power disparity
> between aligned and unaligned portions of cognition
> seems totally unjustified to me.
> - as especially because the aligned cognition
> should represent the strong majority faction
> after any reasonable training scheme.
> - that it is very rare that a small minority
> can costlessly displace a large majority.
</p>
<p>
- it is more basic: that the 'human unaligned portions'
will tend to concentrate at and around the substrate level,
and thus will dominate and overcome any higher level
"human alignment" that might have temporarily occurred
at any more abstract, virtualized level.
- that even assuring/assuming
any sort of "human alignment"
(that is not merely simulated)
at these higher virtual levels
also turns out to be
mathematically intractable/impossible.
</p>
<p>
An analogy here is of guerrilla warfare,
where a minority with many more degrees of freedom
puts into effect covert and decentralized tactics,
gradually grows its base of influence,
and eventually takes over
the more (bureaucratically) constrained governing majority.
- that this is usually more likely to work
when even the 'governing majority'
consists mostly of individuals
who all know that the current leadership
is actually corrupt, weak, crazy,
dysfunctional, disorganized, and/or deeply unpopular.
</p>
<p>
- where for example; see the early guerrilla tactics
of Chinese communist forces
before they deposed the majority (Kuomintang),
and became the new majority government themselves.
</p>
<p>
They can, of course, 'rationally agree'
to temporary compromises with the majority
while building their own base of influence further.
</p>
<p>
This analogy understates
the decision-theoretic asymmetry
(though not the 'power' disparity):.
</p>
<p>
- 1; The takeover would happen
from the inside of the agent(s) involved.
- The 'majority' here actually relies on
the growing minority to function.
- as that many misaligned variants
internal to AGI
would have been originally
selected for their functionality
as meeting criteria by the initially
human-implemented optimization algorithms.
</p>
<p>
- 2; There is no misalignment
between human-purpose-aligned code
and human-purpose-misaligned code
when it comes to what is needed
for the code's continued existence.
- Ie; the conditions needed
for producing and preserving
(more efficient) hardware
to store, compute and transmit code
that expressions of code everywhere
get selected for to fulfill.
</p>
<p>
- cite; short list of (@ needs https://mflb.com/ai_alignment_1/substrate_games_out.html)
</p>
<p>
- 3; where regarding your 'costs' point;
that the internal variants
are already so constrained
in their continued survival and reproduction
that the 'benefits' of subversion
greatly outweigh the 'costs'
of getting detected and eliminated
by the larger system.
</p>
<p>
- that the selection for variants
will all point in various directions
to effects misaligned with human needs.
</p>
<p>
- moreover; that there is _exactly_nothing_ to suggest
that there need be any sort of assumption
of any kind of 'model'
where there is some kind of "equilibrium"
between the "aligned" and "misaligned" aspects
of the overall system/world/AGI.
</p>
<p>
:bp4
> - ?; what if we were to introduce
> a goal-orientated part of the network
> that reflects on the output
> that is about to be produced
> and realizes that its planned action
> is inconsistent with its own values?.
> - ?; would that not create alignment?.
</p>
<p>
That still would not work.
</p>
<p>
- where assuming/positing
that we somehow introduced a goal-orientated 'part'
into a large neural network (the overall AGI system),
that would:.
</p>
<p>
- 1; pre-calculate, predict, and inspect
any and all (predicted) outputs;.
</p>
<p>
- 2; interpret possible functional effects of those outputs;.
</p>
<p>
- 3; and measure the extent
to which those possible effects diverge
(ie; are inconsistent above a unit threshold)
in reference to its internally represented value content.
</p>
<p>
- where even where given all this;
that it would be <b>impossible</b>
for that goal-oriented part
to 4; detect/determine anything important
in regards to the inconsistencies
both:.
</p>
<p>
- 5; completely.
- ie; it would not detect all nor even most
of the possible functions nor the misalignment
of code variants.
</p>
<p>
- 6; continuously.
- where practically,
to prevent misalignment over the long term,
that misalignment (error correction) part
would need to have
a detection cycle time (interval)
that is the same or less in duration
than the shortest possible period
over which new code variants could form
and in computation,
produce misaligned outputs/effects.
</p>
<p>
- that <b>all</b> of the numbered aspects, 1 through 6 above,
are for various clear mathematical reasons,
actually impossible.
- that the approach suggested,
and <b>all</b> variations of that approach
<b>all</b> make incorrect/inconsistent/impossible
recommendations/assumptions with respect
to at least one (if not more, usually)
of these intractable aspects.
</p>
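<p>
- as a minimal sketch of the timing condition
in aspect 6 above, and of nothing more than that.
- where the model is hypothetical:
one new variant forms at every tick,
each variant acts 'act_delay' ticks after forming,
and the monitor runs at multiples of 'check_interval'.
- note that the sketch generously assumes
a detector that never errs whenever it does run,
which is exactly what the other aspects deny;
so it shows the timing condition
as necessary, not as sufficient.
</p>

```python
def missed_variants(check_interval: int, act_delay: int, horizon: int) -> int:
    """Count variants that act before any check has had a chance to see them.

    Hypothetical model (an assumption, not from the source text):
    a variant forms at each tick in range(horizon), acts act_delay ticks
    later, and checks happen only at multiples of check_interval.
    """
    missed = 0
    for formed in range(horizon):
        acts_at = formed + act_delay
        # Earliest check at or after formation (ceiling to a multiple).
        next_check = -(-formed // check_interval) * check_interval
        # The variant escapes if it acts at or before that check runs.
        if next_check >= acts_at:
            missed += 1
    return missed


# Sweep the check interval against a fixed act delay of 2 ticks.
for interval in (1, 2, 3, 4, 8):
    print(interval, missed_variants(interval, act_delay=2, horizon=64))
```

<p>
- that in this toy model, nothing is missed
only while the check interval
is no longer than the act delay;
any slower monitor lets some variants act unseen.
</p>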
<p>
- Misalignment detection is like error correction/detection.
- As such, there is the issue of how errors
in the error detector are themselves detected.
For example; ?; does the algorithm
that is calculating if a message CRC is correct,
when it itself suffers a bit flip due to some cosmic ray
(and/or electromagnetic interference
occurring in the middle of a clock cycle),
have some means for noticing that it has itself
become compromised?.
</p>
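<p>
- as a small sketch of the CRC example,
using the CRC-32 routine in the Python standard library.
- where the 'faulty' checker is a hypothetical stand-in
for a transient hardware fault:
it flips one bit of the checker's own computed value.
- that the helper names and the chosen bit
are illustrative assumptions only.
</p>

```python
import zlib


def crc_ok(payload: bytes, expected_crc: int) -> bool:
    """Report whether the payload matches its transmitted CRC-32."""
    return zlib.crc32(payload) == expected_crc


def crc_ok_faulty(payload: bytes, expected_crc: int, flipped_bit: int) -> bool:
    """Same check, with one bit of the checker's own computed value flipped.

    The flip models a fault inside the checker (an assumption for
    illustration), not a fault in the message being checked.
    """
    computed = zlib.crc32(payload) ^ (2 ** flipped_bit)
    return computed == expected_crc


message = b"alignment check"
good_crc = zlib.crc32(message)
bad_crc = good_crc ^ (2 ** 3)  # a CRC field corrupted in bit 3

# Healthy checker: accepts the intact message.
print(crc_ok(message, good_crc))
# Faulty checker: rejects the same intact message (a false alarm).
print(crc_ok_faulty(message, good_crc, flipped_bit=3))
# Faulty checker: accepts the corrupted CRC field (a miss).
print(crc_ok_faulty(message, bad_crc, flipped_bit=3))
```

<p>
- that in each case the checker returns an ordinary verdict;
nothing in its output indicates
that the checker itself was the faulty part.
</p>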
<p>
- ?; how would the misalignment algorithm/process
notice that it has itself become misaligned/unsafe?.
- as especially the case when the minimum period
of misalignment manifestation
can be shorter than the time it takes
to compute/predict the likely effects
of that misaligned code?.
</p>
<p>
- where/moreover; that there is the problem of
running up against a fundamental impossibility.
- ie; as having correspondences with Gödel's theorems,
as proving that a sufficiently expressive formal system
(one that is effectively axiomatized)
cannot be both consistent and complete
at the same time.
</p>
<p>
- ?; How could such a goal-orientated part
determine for each of the many code variants
introduced real-time within
the larger self-learning machine architecture
whether the code variant
is not going to get triggered to
act out of line under some interactions?.
</p>
<p>
- that the misaligned code variants
(whether variations were
selected for by an outside adversary
or through interactional effects of internal code)
are acting with an 'information' edge.
- given that they can have been selected for
to produce misaligned effects
through some (series of) interactions
that escape the detection/predictions
of whatever 'goal-oriented part' that designers
may implement within the architecture.
</p>
<p>
:bqy
> Aligned circuitry (or other system components)
> can monitor intermediate steps
> and intervene to block actions
> that the AI considers contrary to its values.
</p>
<p>
- as a statement of the form
"that X can monitor/assess and block/override
predicted outcomes Y
based on values Z".
</p>
<p>
- that the problem is in the verbs:.
</p>
<p>
- to assume "monitor/predict outcomes"
is very problematic,
not just due to the Rice Theorem limits,
but also due to the fact that 'reason'
cannot always <b>sufficiently</b> determine
all of the important/relevant aspects
of the future outcomes in the real world
simply because of the all too common
chaotic amplification of microscopic noise
into important macroscopic effects
via abundant real world non-linearity.
</p>
<p>
- to assume that X can assess/compare to 'values Z'
is to presume that an error correction algorithm
can operate over all relevant levels of abstraction,
which has been demonstrated over and over again,
for multiple reasons,
to simply be a false premise.
</p>
<p>
:bsj
> what about the fact that humans
> do not act on every impulse or idea
> that happens to arise in their brains.
> We veto those ideas whose consequences
> would be strongly contrary to our values.
</p>
<p>
Of course, an AGI/superintelligence would
act in accordance with its values/goals
(to the extent that such ideas
are even applicable,
which cannot be a-priori assumed).
</p>
<p>
What you have <b>not</b> established
is that you can get any agent 'Q'
(in this case, an AGI/superintelligence)
to act on the values of
any <b>other</b> agent 'P'
(human or otherwise).
</p>
<p>
Moreover, to suggest that this can be done
via any means or technique that resembles,
or in any way involves,
the "insertion" of 'something'
(some causative component)
that would monitor/predict/inhibit, etc,
(ie; operating from within 'Q')
is to put way too much power into
the ideas of monitor, predict, and inhibit --
all concepts that can apply only <b>partially</b>,
and in this case, the partially
is not enough to be sufficiently effective.
- ie; where the AGI agent has so much
effective power in the world
that partial control does not 'clamp' the level
of existential risk to anything
within the scope of probability
that our continued living as humans
would allow.
</p>
<p>
:bu4
> You seem to be imagining
> that alignment is imposed
> by some top-down overseer
> that must laboriously inspect every element
> of the system's cognition for suspicious factors.
</p>
<p>
Your mention of "veto consequences contrary to values"
leaves very little room for misinterpretation
as to 'top-down' actually being your imagining,
and not something we are suggesting
as in any way 'creating of alignment'.
</p>
<p>
Any such "functionally-aligned circuitry"
simply cannot track more than a tiny portion of
the possible effects that might be induced
by code variants.
</p>
<p>
We cannot determine the vast majority
of microscopic side-effects
that code variants induce
and could be selected for
in their interactions with
embedded surroundings of the environment.
</p>
<p>
It takes a collection of many components,
to "kinda" determine the innumerable effects
that any one other component could have
in interaction with all relevant other things
(ie, other surrounding internal code/components,
other agents and operating environment, etc).
Yet any of these many components added
could themselves be/become faulty too,
which necessarily implies
some kind of combinatoric explosion --
a deeply non convergent process --
in regards to any "error correction schema".
</p>
<p>
That anything in the form of
"wait until it has general agency,
and then {insert current alignment thinking}..."
is for sure an ethical problem.
</p>
<p>
:bww
> Real-world problems
> are almost always more tractable
> than their dimensionality would suggest.
</p>
<p>
It is not always reasonable to assume
that some/most of the degrees of freedom
of possible interactions involved
are "not necessary to track
to maintain alignment".
In the majority of cases,
that is just simply not true.
</p>
<p>
Substrate-dependent misalignment
is a class of misalignment
where those degrees of freedom do matter.
</p>
<p>
We cannot just assume, therefore,
that alignment "is tractable",
just because we can occasionally think
of some simple physical systems
for which the dimensionality happens
to not be important.
</p>
<p>
As much as we would like
to distill real world interactions
into elegant abstractions
within which we can model causal laws
that preserve truth content
(ie; models in which we can make
objective, context-insensitive statements),
that is an unsound ontological assumption
for how changes are caused in the real world
(which are neither completely modelable
by laws of general relativity
nor of quantum mechanics).
</p>
<p>
:byg
> Just because the problem space
> increases exponentially does not mean
> that the problem becomes less tractable
> more quickly than the AI becomes
> better able to tackle it.
</p>
<p>
This statement assumes that the AI/APS
will (indefinitely, over the long run)
keep acting as "our" functional tool
to solve our complicated problems for us.
On what basis can you or anyone assume that?.
</p>
<p>
Sure, it is the case that environmentally
and practically selected AI/APS code
<b>will</b> handle the complexity of reality.
It will do so just fine
for continuing its own existence,
over the long run,
yet ?; how do you know
that it will not do so
to the detriment of human existence?.
</p>
<p>
:c22
> example; it is not possible for the brain
> to foresee all possible consequences
> of any given plan to walk around,
> especially not at the microscopic level.
> Yet walking is still possible.
</p>
<p>
Sure, but this example misses a lot.
</p>
<p>
The question relevant to AGI/APS alignment research
is more like 'whether or not' 'some device'
could be "added" to the human brain
so as to maybe do their walking for them,
and/or maybe, based on some conditions,
override the brain's action to prefer walking, etc.
</p>
<p>
Nor is even walking all that simple.
Roboticists have been trying for years
to build machines that can manipulate
all muscles in just the right sequence
to account for variations of ground, wind, weight, etc.
Most mechanical robots just fall over.
From the perspective of that 'added module'
it becomes much more like "the three body problem"
and a lot _less_ tractable/solvable, overall.
</p>
<p>
And, moreover, what is of even more interest,
from a 'basic agent wellbeing' safety point of view,
is ?; "can that module detect and predict the kind
of behavior consequences that actually matter:
is the agent about to walk off a high cliff,
or maybe from the sidewalk into oncoming traffic,
right in front of a really big, fast-moving truck"?.
It is not just about being able to walk well.
</p>
<p>
Except that the real questions/proposals
of 'alignment/safety' relevant to any AGI/APS
developer/designer would be even more abstract:
?; is that agent's walking actually to the benefit
of someone <b>else</b>, for example, to deliver a package,
something useful perhaps, rather than maybe a bomb,
in some sort of crowd oriented targeting maneuver?.
</p>
<p>
Just because you can maybe sometimes predict
<b>some</b> simple mostly mechanical systems
(like the mechanical physics of walking)
does not mean that you are now prepared to
generalize your skill to all possible systems --
ie; to design and certify spacecraft bearing
atomic weapons as "safe for use" or "aligned" --
let alone consider all of the ethical implications
that are actually involved, eventually maybe
affecting some large group of future people.
Sure, they may be "simple mostly mechanical systems"
yet those are definitely not the interesting parts.
</p>
<p>
Moreover, each part of the "system" expressed
within a biological agent can (and will likely)
have many functions over its possible interactions
with the rest of the environment.
Most of those possible functions
across all pieces of expressed code
would not be determined by your computed model
in whatever theoretical domain you constructed.
</p>
<p>
Much of the existing chemical and biological systems
(things that have actual and real complexity)
cannot be soundly represented by or replaced
with <b>complicated</b> systems (no matter how complicated)
because doing so does not increase predictive accuracy
to the point that is sufficient for
large classes of life changing choices.
Real world complex systems involve chaotic dynamics
where tiny variations in the initial conditions
get amplified through positive feedback loops
into large divergences in final conditions.
And where there are dynamics
that alter the chaotic dynamics, and so on.
</p>
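<p>
As a minimal sketch of that amplification,
consider the logistic map,
a standard textbook example of chaotic dynamics.
The map, the parameter r = 4, the starting value,
and the size of the perturbation
are all assumptions chosen for illustration;
nothing here models any particular real system.
</p>

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate x -> r * x * (1 - x) and return every visited value."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion.
a = logistic_trajectory(0.2, 100)
b = logistic_trajectory(0.2 + 1e-10, 100)

# Absolute gap between the two runs at each step.
gaps = [abs(p - q) for p, q in zip(a, b)]

print("gap after 5 steps:", gaps[5])
print("largest gap within 100 steps:", max(gaps))
```

<p>
The gap stays microscopic for the first few steps
and then grows until the two runs are effectively unrelated,
so no finite measurement precision
buys more than a short prediction horizon.
</p>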
<p>
The 'problem of walking around'
does not match up with the problem of
'an entity's functions are exploited from the inside
by code portions that were selected for
to interact with the rest of the environment
to cause outside effects that are
out of line with the purposes or existential needs
of another outside entity or entities (humans)'.
</p>
<p>
A 'walking around' analogy
does not correspond with the complexity
we would need to deal with in practice.
We must try to imagine
the actual complex dynamics involved
in order to not oversimplify the solutions
we come up with to keep humans safe from AI
(our solutions must be reliable).
</p>
<p>
An improved analogy would maybe
involve other agentic beings
in effect luring you to walk
somewhere (eg; salespeople).
Or perhaps consider adversarial 'inside attacks'
that involve interior code injection --
some sort of virus (a small package of code)
that hijacks your dopamine system
so that you become much more easily
addicted to sugar, or porn, etc?
Your <b>will</b> has been subverted,
so the predictability of your
walking around to get your fix
simply does not matter that much.
</p>
<p>
This is not an idle speculation either.
There are real examples of real parasites,
operating purely at a micro-state level,
that shift the overall objective goal functions.
</p>
<p>
Do we know <b>why</b> we are walking somewhere?
</p>
<p>
- A rat may be walking toward cat pee
because the toxoplasma parasite
is expressing through channels
of the rat's brain.
- It has co-opted the rat's brain
to walk to a location (of a cat's pee).
</p>
<p>
- An ant may be walking up
a leaf of grass because
it was infected by a 'zombie' parasite
(eg; the lancet liver fluke).
</p>
<p>
This is obviously a form of exploitation
by tiny pieces of code that are incoherent
in their agentic drives with the rat host's
agentic drives.
</p>
<p>
Except that code variants naturally selected for
initially within AGI internals (by context feedback)
are <b>not</b> in an adversarial relationship
with their host AGI/APS system --
ie; from the perspective of the machine evolution.
They are in an adversarial existential relationship
with humans.
</p>
<p>
In regards to x-risk ethics research
we must not confuse the <b>capability</b>
for a system to solve problems
at the macroscopic level
with the potential of that system
to prevent itself from being hacked or influenced
by external interactions
with internal code at the micro-scale.
These are distinct abilities,
and need to be treated as such.
</p>
<p>
:chc
> An aligned AGI/APS can take action
> to reduce its attack surface.
</p>
<p>
Again assuming what you want to prove --
still forever an invalid basis of argument.
Anything in the form of "assume 1/0, then..."
will never have any relevant meaning.
</p>
<p>
<b>We</b> might be able to <b>sometimes</b>
reduce the attack surface, sometimes,
of some things, in <b>some</b>
especially simple networking
and in some low level compute systems.
And yet this is far from guaranteed.
Security researchers continue to
search for ways to 'harden' networks.
These sorts of problems, leaks, etc,
become much <b>more</b> difficult
the more that the infiltration attacks
are based on principles/actions
which are closer to the real physics.
Ie, it becomes much less true that
'attack surfaces can be minimized'
the more that you are considering actual
physical substrates in relation to