<!DOCTYPE html>
<html lang="en-US">

  <head>
    <meta charset='utf-8'>
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <meta name="viewport" content="width=device-width,maximum-scale=2">
    <meta name="description" content="ThousandMonkeysTypewriter.github.io : ">

    <link rel="stylesheet" type="text/css" media="screen" href="style.css?v=1a4ff4e20d94f254d1ba1a1add0b77cf8298ef1c">
    <link rel="icon" href="http://thousandmonkeystypewriter.github.io/favicon.ico">

    <title> ThousandMonkeysTypewriter </title>
    <meta property="og:title" content="Welcome to GitHub" />
    <meta property="og:locale" content="en_US" />
    <link rel="canonical" href="https://thousandmonkeystypewriter.github.io/" />
    <meta property="og:url" content="https://thousandmonkeystypewriter.github.io/" />
    <meta property="og:site_name" content="ThousandMonkeysTypewriter.github.io" />
    <script type="application/ld+json">
    {"name":"ThousandMonkeysTypewriter.github.io","description":null,"author":null,"@type":"WebSite","url":"https://thousandmonkeystypewriter.github.io/","image":null,"publisher":null,"headline":"Welcome to GitHub","dateModified":null,"datePublished":null,"sameAs":null,"mainEntityOfPage":null,"@context":"http://schema.org"}</script>

  </head>

  <body>

    <!-- HEADER -->
    <div id="header_wrap" class="outer">
        <header class="inner">
          <a id="forkme_banner" href="https://github.com/ThousandMonkeysTypewriter">View on GitHub</a>

          <h1 id="project_title"><a href="https://thousandmonkeystypewriter.github.io">Program code generation using neural networks</a></h1>


        </header>
    </div>

    <!-- MAIN CONTENT -->
    <div id="main_content_wrap" class="outer"  style="background: white">
      <section id="main_content" class="inner">
        <h2><a href="https://habr.com/post/358564/">Automatic test script generation</a> (in Russian)</h2>
        <h2>15.05.2018</h2>
   <br/>
        <br/>
        <br/>
        <br/>
<h2>What is neural program synthesis?</h2>
        <h2>12.04.2018</h2>

<p>In recent years, <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> has made <a href="https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf">considerable progress</a> in areas such as online advertising, speech recognition and image recognition. The success of DL lets us change how we think about the way software itself is created. We can use neural nets to gradually increase automation in the process of program creation and help engineers get more results with less effort.</p>

<p>There are many applications for program synthesis. Successful systems could one day
automate a job that is currently very secure for humans: computer programming. Imagine a world
in which debugging, refactoring, translating and synthesizing code from sketches can all be done
without human effort.</p>

<h2 id="what-is-thousand-monkeys-typewriter">What is Thousand Monkeys Typewriter?</h2>

<p>TMT is a system for <a href="https://arxiv.org/abs/1703.07469">program induction</a> that generates simple scripts in a domain-specific language. The system combines <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised</a> and <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised</a> learning. Its core is the <a href="https://arxiv.org/abs/1511.06279">Neural Programmer-Interpreter</a>, which is capable of abstraction and higher-order control over the program. The system is used for error detection in both user logs and software source code.</p>

<p>TMT also incorporates the concepts most commonly used in program synthesis today: <a href="http://rsta.royalsocietypublishing.org/content/375/2104/20150403">satisfiability modulo theories (SMT) and counter-example-guided inductive synthesis (CEGIS)</a>.</p>

<h3 id="types-of-data">Types of data</h3>

<p>There are two types of data (logs) that we are analyzing:</p>

<ul>
  <li>user logs</li>
  <li>program traces</li>
</ul>

<h3 id="supervised-and-unsupervised">Supervised and unsupervised</h3>

<p>To analyze logs, we use both an unsupervised technique (<a href="https://arxiv.org/pdf/1802.03903.pdf">Donut</a> for user logs) and a supervised one (engineers mark anomalies in software traces using JUnit tests).</p>

<h3 id="npi">NPI</h3>

<p>NPI is the core of the system. It takes logs and traces and learns the probabilities of the next operation at each timestep, given the environment.</p>

<p>The Neural Programmer-Interpreter (NPI) consists of:</p>
<ol>
  <li>An <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">RNN</a> controller that takes sequential state encodings built from (a) the world environment
(which changes with actions), (b) the program call (actions) and (c) the arguments for the called
program. The entire input is fed in at the first timestep; after that, every action taken by the NPI
produces an output that is fed back as the next input.</li>
  <li><a href="https://github.com/ThousandMonkeysTypewriter/DomainSpecificLanguage">DSL functions</a></li>
  <li>The domain itself, where functions are executed (the “scratchpad”)</li>
</ol>

<p><img src="https://thousandmonkeystypewriter.github.io/npi.gif" alt="NPI illustration" /></p>
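
<p>The interaction between these three parts can be sketched as a loop. The following is a minimal Python sketch, not the TMT implementation: the trained RNN controller is replaced by a hand-written stub (<code class="highlighter-rouge">stub_controller</code>), and the behavior of the two toy DSL functions, the <code class="highlighter-rouge">limit</code> argument and the dictionary-based scratchpad are all assumptions made for illustration. The command names are borrowed from the sample scripts further down the page. The only point is to show how each step's output (program, arguments, changed environment) becomes the next step's input.</p>

```python
from typing import Callable, Dict, List, Tuple

Env = Dict[str, int]
Args = Dict[str, int]
State = Tuple[Env, str, Args]  # (environment, program, arguments)


def diff(env: Env, args: Args) -> Env:
    """Toy DSL function: write the gap between two dates to the scratchpad."""
    out = dict(env)
    out["output"] = abs(env["date2"] - env["date1"])
    return out


def check(env: Env, args: Args) -> Env:
    """Toy DSL function: flag the gap if it exceeds the limit argument."""
    out = dict(env)
    out["answer"] = 1 if env["output"] > args["limit"] else 0
    return out


DSL: Dict[str, Callable[[Env, Args], Env]] = {"DIFF": diff, "CHECK": check}


def stub_controller(state: State) -> Tuple[str, Args]:
    """Stand-in for the RNN: maps the current state to the next program call."""
    env, program, _ = state
    if program == "BEGIN":
        return "DIFF", {}
    if program == "DIFF":
        return "CHECK", {"limit": 3}
    # MO_ALARM is spelled as in the sample script on this page.
    return ("ALARM" if env["answer"] else "MO_ALARM"), {}


def run(env: Env, max_steps: int = 10) -> List[str]:
    """Feed each step's output back in as the next step's input."""
    state: State = (env, "BEGIN", {})
    script = ["BEGIN"]
    for _ in range(max_steps):
        program, args = stub_controller(state)
        script.append(program)
        if program not in DSL:  # terminal command, nothing left to execute
            break
        state = (DSL[program](state[0], args), program, args)
    return script


print(run({"date1": 15, "date2": 20, "output": 0, "answer": 0}))
```

<p>In a real NPI the stub would be replaced by the trained controller, which outputs a probability for each possible next program instead of a hard-coded choice.</p>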

<p>At the moment, TMT generates simple scripts for anomaly detection in production logs.</p>

<h2 id="how-generator-works"><a name="scripts"></a>How the generator works</h2>

<h3 id="data">DATA</h3>

<p>At the moment, we analyze three types of logs: user logs, database logs, software traces.</p>

<h3 id="detect-anomalies">Detect anomalies</h3>

<p>Next, we try to detect any problems that the logs contain. What exactly are anomalies? Simply put, an anomaly is any deviation from standard behavior.</p>

<p>Normal data representation:
<img src="https://thousandmonkeystypewriter.github.io/Picture1.png" alt="data" /></p>

<p>Point anomalies, which are anomalies in a single value in the data:
<img src="https://thousandmonkeystypewriter.github.io/Picture2.png" alt="data" /></p>
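
<p>To make the idea of a point anomaly concrete, here is a small Python sketch that flags single values far from the rest of a series using a modified z-score (median and median absolute deviation). This is a deliberately simple statistical rule, not Donut and not the classifier TMT uses; the function name, the threshold of 3.5 and the sample query times are assumptions chosen for illustration.</p>

```python
from statistics import median
from typing import List, Sequence


def point_anomalies(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return the indices whose modified z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a spread estimate a single spike cannot inflate.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half of the values are identical; flag whatever differs.
        return [i for i, v in enumerate(values) if v != center]
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Made-up query execution times in milliseconds with one obvious spike.
query_times_ms = [102, 98, 101, 97, 103, 99, 450, 100, 96, 104]
print(point_anomalies(query_times_ms))
```

<p>The median-based score is used here instead of mean and standard deviation because, in a short series, one large spike can inflate the standard deviation enough to hide itself.</p>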

<p>Query execution time anomalies:
<img src="https://thousandmonkeystypewriter.github.io/log.png" alt="detectum" /></p>

<p>We aim to detect anomalies in situations such as memory leaks, bottlenecks in the Java runtime, server infrastructure problems, etc.</p>

<p>As a result, we acquire training data, labeled either manually (supervised) or by an automatic classifier (unsupervised).</p>

<h3 id="train-neural-programmer">Train Neural Programmer</h3>

<p>Once we have a list of events labeled as normal or abnormal, we train our core to distinguish what’s normal from what’s not in the future.</p>

<p>In case of unsupervised learning, the process can be described as “one neural net teaching another”:</p>

<p>event in log was labeled as normal:
<img src="https://thousandmonkeystypewriter.github.io/scheme/normal_log.png" alt="detectum" /></p>

<p>event in log was labeled as abnormal:
<img src="https://thousandmonkeystypewriter.github.io/scheme/anomaly_log.png" alt="detectum" /></p>

<p>db query was labeled as normal:
<img src="https://thousandmonkeystypewriter.github.io/scheme/normal_db.png" alt="detectum" /></p>

<p>db query was labeled as abnormal:
<img src="https://thousandmonkeystypewriter.github.io/scheme/anomaly_db.png" alt="detectum" /></p>

<p>In some cases, where situations are labeled as normal by default, we only have to decide which command to call next.
<img src="https://thousandmonkeystypewriter.github.io/scheme/npi_only.png" alt="detectum" /></p>

<h3 id="working-with-the-scrpits-in-runtime">Working with the scripts at runtime</h3>

<p>Having a trained NPI means that, at each step, we have an operation predicted from the arguments and the environment. Thus we expect a well-trained model to predict each command at each step, indicating whether the observed situation in the logs (software traces) is normal or not. If it is normal, we expect one outcome; if not, another.</p>

<p>In other words, the model predicts an outcome from a given state: label (by default, “normal”), argument and environment. Each combination of these parameters can produce a different outcome.</p>

<p>sample normal runtime script with environment:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN
DIFF
DIFF
CHECK
MO_ALARM
</code></pre></div></div>

<p>alert runtime script:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN
DIFF
DIFF
CHECK
ALARM
</code></pre></div></div>

<p>Data environment</p>

<p><code class="highlighter-rouge">DIFF ({'program': {'program': 'diff', 'id': 6}, 'environment': {'date1': 15, 'output': 0, 'answer': 2, 'terminate': False, 'client_id': 2, 'date2': 20, 'date2_diff': 45, 'date1_diff': 93}, 'args': {'id': 29}})</code></p>
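
<p>The trace line above is a command name followed by a Python-style dictionary, so it can be read back with the standard library. The sketch below assumes that every trace line has this <code class="highlighter-rouge">NAME ({...})</code> shape; <code class="highlighter-rouge">parse_step</code> is a hypothetical helper, not part of TMT.</p>

```python
import ast
from typing import Any, Dict, Tuple


def parse_step(line: str) -> Tuple[str, Dict[str, Any]]:
    """Split 'NAME ({...})' into the command name and its state dictionary."""
    name, _, payload = line.partition(" ")
    # literal_eval only accepts Python literals, so nothing in the line is executed.
    state = ast.literal_eval(payload.strip())
    return name, state


line = (
    "DIFF ({'program': {'program': 'diff', 'id': 6}, "
    "'environment': {'date1': 15, 'output': 0, 'answer': 2, 'terminate': False, "
    "'client_id': 2, 'date2': 20, 'date2_diff': 45, 'date1_diff': 93}, "
    "'args': {'id': 29}})"
)
command, state = parse_step(line)
print(command, state["program"]["id"], state["environment"]["terminate"])
```

<p>The three top-level keys match the three inputs of the controller described earlier: the program call, the environment and the arguments.</p>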

<h3 id="challenges">Challenges</h3>

<p>One of the problems with NPIs is that we can only measure generalization by running the trained NPI on various environments and observing the results. And as we explained earlier, every change in the parameters can produce a new script.</p>

<p>For the sake of simplicity, we want to create one script that will cover many (ideally all) situations:
<img src="https://thousandmonkeystypewriter.github.io/scheme/general.png" alt="detectum" /></p>

<p>This means that we still have to build a module that will merge all possible scripts from this particular NPI into the smallest number of scripts possible, preferably a single script.</p>
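
<p>One possible way to approach such a merge, offered as a sketch and not as the planned module: fold the scripts into a prefix tree, so that shared commands appear once and the merged script branches only where the runs diverge. The names <code class="highlighter-rouge">merge_scripts</code> and <code class="highlighter-rouge">render</code> are hypothetical, and the two input scripts are the normal and alert samples shown earlier.</p>

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]


def merge_scripts(scripts: List[List[str]]) -> Tree:
    """Fold scripts into a prefix tree; shared prefixes are stored once."""
    root: Tree = {}
    for script in scripts:
        node = root
        for command in script:
            node = node.setdefault(command, {})
    return root


def render(tree: Tree, indent: int = 0) -> List[str]:
    """Flatten the tree to lines, indenting only where the scripts diverge."""
    depth = indent + (1 if len(tree) > 1 else 0)
    lines: List[str] = []
    for command, child in tree.items():
        lines.append("  " * depth + command)
        lines.extend(render(child, depth))
    return lines


normal = ["BEGIN", "DIFF", "DIFF", "CHECK", "MO_ALARM"]  # spelled as in the sample
alert = ["BEGIN", "DIFF", "DIFF", "CHECK", "ALARM"]
merged = merge_scripts([normal, alert])
print("\n".join(render(merged)))
```

<p>A prefix tree only merges scripts that start the same way. Scripts that differ early but share a common tail would need a different structure, so this covers just the simplest case.</p>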

<h3 id="examples">Examples</h3>

<ul>
  <li><img src="https://thousandmonkeystypewriter.github.io/detectum.png" alt="detectum" /><a href="https://github.com/ThousandMonkeysTypewriter/GeneratedScripts/tree/master/web/detectum">web/detectum/logs</a></li>
  <li><img src="https://thousandmonkeystypewriter.github.io/yandex.png" alt="yandex" /><a href="https://github.com/ThousandMonkeysTypewriter/GeneratedScripts/tree/master/db/yandex">db/yandex/clickhouse/logs</a>
<!-- ![facebook](https://thousandmonkeystypewriter.github.io/facebook.png)[app/facebook/swift/logs](https://github.com/ThousandMonkeysTypewriter/GeneratedScripts/tree/master/app/facebook/swift/logs) --></li>
</ul>

<h2 id="references">References</h2>

<p><a href="https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf">Deep Learning: A Critical Appraisal</a></p>

<p><a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Andrej Karpathy: Software 2.0</a></p>

<p><a href="https://www.microsoft.com/en-us/research/publication/neuro-symbolic-program-synthesis-2/">Neuro-Symbolic Program Synthesis</a></p>

<p><a href="https://arxiv.org/abs/1802.02696">Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction</a></p>

<p><a href="https://github.com/src-d/awesome-machine-learning-on-source-code/">A curated list of awesome machine learning frameworks and algorithms that work on top of source code</a></p>

<h2 id="neural-programmer-concepts"><a name="npc"></a>Neural programmer concepts</h2>

<p><a href="https://arxiv.org/abs/1703.07469">RobustFill (Microsoft)</a></p>

<p><a href="https://openreview.net/pdf?id=ByldLrqlx">DeepCoder (Microsoft)</a></p>

<p><a href="https://arxiv.org/abs/1801.03526">Program Synthesis with Reinforcement Learning (Google)</a></p>

<p><a href="https://arxiv.org/abs/1703.05698">Bayou (https://github.com/capergroup/bayou)</a></p>

<p><a href="https://openreview.net/forum?id=Skp1ESxRZ">Tree-to-tree parser</a></p>

<p><a href="https://arxiv.org/abs/1712.07388">Kayak (DiffBlue)</a></p>

<h2 id="anomaly-detection"><a name="npc"></a>Anomaly detection</h2>

<p><a href="https://arxiv.org/abs/1802.03903">Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications</a></p>

<p><a href="https://arxiv.org/abs/1804.02998">Anomaly Detection for Industrial Big Data</a></p>

<p><a href="https://arxiv.org/abs/1804.03065">Faster Anomaly Detection via Matrix Sketching</a></p>

<p><img src="https://thousandmonkeystypewriter.github.io/220px-Chimpanzee_seated_at_typewriter.jpg" alt="Our monkey" /></p>


      </section>
    </div>

    <!-- FOOTER  -->
    <div id="footer_wrap" class="outer">
      <footer class="inner">

        <p>Published with <a href="https://pages.github.com">GitHub Pages</a></p>
      </footer>
    </div>


  </body>
</html>