<!DOCTYPE html>
<html data-wf-page="5f71dd169010d6326b65485d">
<head>
<meta charset="utf-8" />
<title>Tapestry • Case Study</title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
<link href="assets/css/style.css" rel="stylesheet" type="text/css" />
<script src="https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js" type="text/javascript"></script>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Inter:regular,500,600,700" media="all" />
<script type="text/javascript">
WebFont.load({ google: { families: ["Inter:regular,500,600,700"] } });
</script>
<script type="text/javascript">
!(function (o, c) {
var n = c.documentElement,
t = " w-mod-";
(n.className += t + "js"),
("ontouchstart" in o ||
(o.DocumentTouch && c instanceof DocumentTouch)) &&
(n.className += t + "touch");
})(window, document);
</script>
<link href="assets/images/tapestry_graphic_mono.svg" rel="shortcut icon" type="image/x-icon" />
<link href="assets/images/tapestry_graphic_mono.svg" rel="apple-touch-icon" />
<script src="https://kit.fontawesome.com/d019875f94.js" crossorigin="anonymous"></script>
<meta name="image" property="og:image" content="assets/images/tapestry_logo_color.svg" />
</head>
<body>
<div class="navigation-wrap">
<div data-collapse="medium" data-animation="default" data-duration="400" role="banner" class="navigation w-nav">
<div class="navigation-container">
<div class="navigation-left">
<a href="/" aria-current="page" class="brand w-nav-brand w--current" aria-label="home">
<img src="assets/images/tapestry_graphic_color.svg" alt="" class="template-logo">
</a>
<nav role="navigation" class="nav-menu w-nav-menu">
<a href="/case-study" class="link-block w-inline-block">
<div>Case Study</div>
</a>
<a href="/team" class="link-block w-inline-block">
<div>The Team</div>
</a>
</nav>
</div>
<div class="navigation-right">
<div class="login-buttons">
<a href="https://github.com/tapestry-pipeline" target="_blank">
<span style="color: #161d6f">
<i class="fab fa-github fa-lg"></i>
</span>
</a>
</div>
</div>
</div>
<div class="w-nav-overlay" data-wf-ignore="" id="w-nav-overlay-0"></div>
</div>
</div>
<div id="sidebar" class="toc">
</div>
<div class="section header">
<article class="container case-study-container">
<div class="hero-text-container">
<h1 class="h1 centered">Case Study</h1>
</div>
<div id="case-study">
<br />
<br />
<!-- Section 1 -->
<h2 class="h2">1 Introduction</h2>
<br>
<p>
As the quantity and diversity of data continue to grow, user data is becoming spread across a variety of
third-party services catering to the needs of specific departments within a company (e.g., marketing, customer
service, finance). Unfortunately, this places data in silos, fragmenting what should be a more unified user
view.
</p>
<br>
<p>
With the advent of the cloud-native data warehouse has come the ability to collect data from disparate sources
into one central repository and treat the warehouse as the “single source of truth”.
</p>
<br>
<p>
Data modeling and business intelligence or analytics tools built on top of the warehouse can now offer a
consistent view of users. However, mapping this data back into the same third-party tools to impact day-to-day
operations is still a huge challenge. In other words, there is a lag between insights and operations.
</p>
<br />
<h3>1.1 What is Tapestry?</h3>
<br />
<p>
Tapestry is an open-source orchestration framework for the deployment of user entity data pipelines.
</p>
<figure>
<div data-w-id="c8b52af7-58af-18c7-65ce-aa2950b6601f" class="hero-images-container case-study-image"><img
src="assets/images/demo/30_kickstart_full.gif" alt="" class="hero-image-left"
style="will-change: transform; transform: translate3d(0px, 0px, 0px) scale3d(1, 1, 1) rotateX(0deg) rotateY(0deg) rotateZ(0deg) skew(0deg, 0deg); transform-style: preserve-3d;"><img
src="assets/images/demo/tap_ui_grouparoo_v2.png" alt="" class="hero-image-right">
</div>
<figcaption>Tapestry CLI and Dashboard</figcaption>
</figure>
<br />
<p>
Tapestry allows developers to easily configure and launch an end-to-end data pipeline hosted on Amazon Web
Services. With Tapestry, your user data pipeline will be automatically deployed and ready to ingest data from
all your sources, store and transform that data in a warehouse, and sync it back to your tools for immediate
use.
</p>
<br />
<p>
To illustrate exactly what Tapestry does, let’s use the example of a company named Rainshield.
</p>
<br />
<h3>1.2 Hypothetical</h3>
<figure>
<img src="assets/images/introduction/4_Rainshield_business.png" class="case-study-image">
</figure>
<br />
<p>
Rainshield is a company in the umbrella space, with a thriving e-commerce store that is quickly gaining
traction amongst a larger user base. This has attracted the interest of investors, and they’ve raised funding
that has allowed them to expand.
</p>
<br />
<p>
The small team that started Rainshield is no longer able to wear all hats. They are beginning to staff larger
departments, such as sales, marketing, and customer support, to manage the influx of new business they’re
experiencing.
</p>
<br />
<h3>1.3 SaaS Bloat</h3>
<figure>
<img src="assets/images/introduction/5_Rainshield_SaaS.png" class="case-study-image">
</figure>
<br />
<p>
As the business evolves, Rainshield begins to utilize various SaaS tools to engage with their user base in new
ways and to meet the daily operational needs of different departments.
</p>
<br />
<p>
For example, they begin to use Stripe to handle all of their online transactions. The company’s sales team has
started using Salesforce to organize and track their leads, and the Rainshield customer support team is now
incorporating Zendesk to help with managing all of the support tickets that are being generated. The marketing
team is even planning on hosting a Zoom webinar on a design-your-own umbrella product they are about to unveil
soon!
</p>
<br />
<p>
Prior to this growth, user data was largely managed in the production database. However, these teams not
only require different views of this data to accomplish their goals but also create new sources of
user data as they interact with customers through a variety of platforms and tools.
</p>
<br />
<h3>1.4 Data Silos</h3>
<br />
<p>While these third-party SaaS tools cater to the needs of each department well, user data is beginning to
proliferate across the organization, in terms of both the data produced and the data collected. Data is
becoming scattered across the different tools each team is using.
</p>
<figure>
<img src="assets/images/introduction/6_data_silos_gif_alt.gif" class="case-study-image" alt="">
<figcaption>Data captured by different tools lives in data silos.</figcaption>
</figure>
<br />
<p>
These tools were not designed with integration in mind, and it is becoming more challenging to have a unified
understanding of a single user and how they are interacting with Rainshield and its product. Each tool has
access to only a portion of the customer’s information, but not the whole picture.
</p>
<br />
<p>
These SaaS tools, which have increased productivity, have effectively become data silos. <strong>Data goes in,
but it doesn’t come out</strong>.
</p>
<br />
<p>
The industry agrees that this is a challenge:
</p>
<figure>
<img src="assets/images/introduction/7_colinzima.png" class="case-study-image" alt="">
</figure>
<h3>1.5 User Data</h3>
<br />
<p>
To better understand the ramifications of data silos, let's turn back to our company Rainshield.
</p>
<br />
<figure>
<img src="assets/images/introduction/8_Susy.png" class="case-study-image" alt="">
<figcaption>A customer's data is fragmented by third-party tools.</figcaption>
</figure>
<br />
<p>
Meet Susy, a Rainshield customer. According to her profile in Salesforce, Susy has purchased six umbrellas for
her friends and family. However, Zendesk indicates that she called twice to complain about the color of
some of her umbrellas. Susy is excited about picking her own umbrella color and is signed up to attend the
Zoom webinar unveiling the new design-your-own product. When this data about Susy lives in different tools, it
becomes difficult to access a composite picture of Susy.
</p>
<br />
<h3>1.6 Analyzing and Leveraging User Data</h3>
<br />
<p>
Susy's data lives in only three tools, but it's easy to imagine that Rainshield could use many more tools that
capture different pieces of data for other users. They might have thousands of different customers buying
their umbrellas, some of whom fit Susy’s exact profile. When data can be collected in one place, Rainshield
can begin to see patterns among these users.
</p>
<figure>
<img src="assets/images/introduction/9_rainshield_users.png" class="case-study-image" alt="">
<figcaption>Aggregating user profiles and finding patterns.</figcaption>
</figure>
<br />
<p>
But even with this user data gathered for analysis in one place, how can these insights be used to impact
day-to-day operations?
</p>
<figure>
<img src="assets/images/introduction/10_insights_action.png" class="case-study-image" alt="">
<figcaption>Turning insights into action.</figcaption>
</figure>
<br />
<p>
Let’s say Rainshield would like to use what they know about Susy and other customers that match her profile in
an attempt to increase sales. They believe that this group of users will be especially interested in the new
umbrella colors that Rainshield just rolled out, and they would like to prompt these customers with a custom
chat message via Intercom the next time they log on. But before this desired action can take place, Rainshield
still needs to provide Intercom with this specific list of customers. In other words, making insights
actionable still requires work. Rainshield needs to map relevant user data to Intercom in the particular
format that Intercom requires. This process can be thought of as <strong>data syncing</strong>.
</p>
<br />
<figure>
<img src="assets/images/introduction/11_users_to_intercom.png" class="case-study-image" alt="">
<figcaption>Syncing user data to third-party tools.</figcaption>
</figure>
<br />
<p>
So let’s take a step back and recap the obstacles that Rainshield and companies like it are facing. Important
user data is being trapped in silos as the number of Rainshield’s SaaS tools increases.
</p>
<br />
<p>Companies would like to: </p>
<ul>
<li><em>Aggregate</em> data from these disparate third-party sources into one location for better analysis.
</li>
<li><em>Sync</em> relevant data to other third-party destinations to drive operations based on their findings.
</li>
</ul>
<br />
<h3>1.7 Challenges of User Data Integration</h3>
<br />
<p>
This type of data integration poses several challenges. User data stored in SaaS applications is
similar in structure to the data we see in traditional relational databases. However, unlike relational
databases, data in SaaS applications cannot be accessed with a simple query.
</p>
<figure>
<img src="assets/images/introduction/data_query.png" alt="" class="case-study-image-small">
<figcaption>User data accessed via simple query to production database.</figcaption>
</figure>
<br />
<p>
Instead, this data must be retrieved via unique REST APIs, making it difficult to determine how to communicate
with each tool.
</p>
<figure>
<img src="assets/images/introduction/data_api_horizontal.png" alt="" class="case-study-image">
<figcaption>User data accessed via call to REST API.</figcaption>
</figure>
<br />
<p>
Furthermore, factors such as limited documentation, rate limits on API requests, managing potential network
errors, and ever-changing API schemas can make transporting large amounts of data a challenging and slow
process.
</p>
<br />
<!-- Section 2 -->
<h2>2 Existing Solutions</h2>
<br />
<p>
There are four main options for integrating data between third-party tools:
</p>
<ol>
<li>Manually move files between tools</li>
<li>Use pre-built connectors</li>
<li>Create custom connectors</li>
<li>Build a complete data pipeline</li>
</ol>
<br />
<h3>2.1 Manually Move Files</h3>
<br />
<p>Let’s talk about the first option at Rainshield’s disposal, manually moving files between tools.
</p>
<figure>
<img src="assets/images/solutions/16_csvfile_gif_sm.gif" class="case-study-image" alt="">
<figcaption>Manually exporting CSV files.</figcaption>
</figure>
<br />
<p>
Let’s say that Rainshield wanted to make sure that Salesforce had all of the contacts from the Zoom webinar
revealing the new design-your-own umbrella.
</p>
<br />
<p>
They could simply export the list of webinar attendees from Zoom to a CSV file and then import that file into
Salesforce. However, this might result in duplicate data and could become tedious if you had to do this task
often.
</p>
<br />
<h3>2.2 Use Pre-built Connectors</h3>
<br />
<p>
Another possibility would be to use a company that creates these connectors for you, like Zapier. After
inputting some information about their Zoom and Salesforce accounts, Rainshield can choose from a menu of
pre-built connectors to set up the flow of data between them.
</p>
<figure>
<img src="assets/images/solutions/17_zapier.png" class="case-study-image">
<figcaption>Zapier's Connectors.</figcaption>
</figure>
<br />
<p>
This, however, would not allow much flexibility regarding which parts of the data would be shared between the
two apps. There may still be duplicate data, or the selection of apps available may not fit all use cases.
</p>
<br />
<p>
Another type of pre-built connection exists in some tools' settings. For example, Zoom can integrate directly
with Salesforce by simply configuring your settings to export your data. This isn’t always the case though,
and more than likely, Rainshield will not find connectors for every tool it uses.
</p>
<br />
<h3>2.3 Create Custom Connectors</h3>
<br />
<p>
The third course of action is that Rainshield could designate one or two software engineers to begin building
custom connections to pipe data directly into all of the tools they use. The benefit of this option is you can
flexibly choose what data to send. However, these engineers would have to research these tools' APIs and write
connectors not only to extract data but also to sync it.
</p>
<br />
<p>
This might not be too bad if the number of tools the company used was very small. For example, if Rainshield
only needed to connect Zoom and Salesforce with Mailchimp, then they may only have to write a few connectors
to ensure they all shared the same data.
</p>
<figure>
<img src="assets/images/solutions/18_threesources.png" class="case-study-image">
<figcaption>Custom connectors with only three tools.</figcaption>
</figure>
<br />
<p>
However, if your company already uses several tools or plans on growing in the future, this can quickly get
out of hand. And this is to say nothing of the fact that these connectors would also have to be maintained.
</p>
<figure>
<img src="assets/images/solutions/18_custom_connectors_crop.gif">
<figcaption>Rapid growth in the number of custom API connectors as tools are added.</figcaption>
</figure>
<br />
<p>
If one API changed, every connector that touched it would also need to be updated. And the reality is that
even small companies use anywhere from 10 to 50 tools. Maintaining connectors at that scale would consume a
lot of valuable engineering time.
</p>
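<br />
<p>
To make that growth concrete, here is a minimal counting sketch. It assumes the worst case, in which every tool
exchanges data with every other tool in both directions, and compares it with a layout where each tool connects
only to one central hub (one connector in, one out). The function names are illustrative and are not part of
Tapestry.
</p>
<pre>
```python
def point_to_point_connectors(tools: int) -> int:
    """One connector per ordered pair of tools (A to B and B to A)."""
    return tools * (tools - 1)


def hub_connectors(tools: int) -> int:
    """One inbound and one outbound connector per tool around a central hub."""
    return 2 * tools


for n in (3, 10, 50):
    # Columns: number of tools, point-to-point connectors, hub connectors
    print(n, point_to_point_connectors(n), hub_connectors(n))
```
</pre>
<p>
Under these assumptions, three tools need 6 connectors either way, but 50 tools need 2,450 point-to-point
connectors compared with 100 around a hub.
</p>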
<br />
<h3>2.4 Build a Complete Data Pipeline</h3>
<br />
<p>
The best and most complete solution is our last option, implementing an end-to-end user data pipeline, with a
cloud data warehouse at the center.
</p>
<br />
<figure>
<img src="assets/images/solutions/3_pipelineoverviewcolor.png" class="case-study-image">
<figcaption>A complete user data pipeline.</figcaption>
</figure>
<br />
<p>
From an engineering perspective, this approach allows access to data ingestion and syncing tools that remove
the headache of working with third-party APIs, while also providing you with the flexibility that custom
connectors offer, i.e. you choose exactly what data to send.
</p>
<br />
<h4>Benefits of Warehouse-Centric Pipelines</h4>
<figure>
<img src="assets/images/solutions/19_warehousecenter_gif.gif">
<figcaption>Warehouse-centric data integration.</figcaption>
</figure>
<br/>
<p>
Placing a cloud warehouse at the center of your pipeline also provides several other benefits.
</p>
<br/>
<p>Cloud data warehouses:</p>
<ul>
<li>Allow for the storage of large amounts of data with very little infrastructure management and instant scalability.</li>
<li>Serve as a <em>single source of truth</em> if two departments ever had conflicting data.</li>
<li>Enable the use of data modeling and analytics tools, which can aid in making important business intelligence decisions.</li>
<li>Offer the ability to combine and filter data from multiple sources in order to sync with another destination.</li>
</ul>
<br/>
<p>
This warehouse-centric pipeline helps aggregate all of your data into one accessible place so you can create unified models and sync them to the tools your teams need.
</p>
<figure>
<img src="assets/images/solutions/20_rudderstackquote.png" class="case-study-image" alt="">
</figure>
<h3>2.5 Data Pipeline Solutions</h3>
<br />
<p>
When deploying this type of user data pipeline, the two primary options are to use a proprietary hosted
solution, such as the one offered by the company Rudderstack, or to use open-source tools to configure your
own pipeline.
</p>
<figure>
<img src="assets/images/solutions/21_rudderstack_v_selfhosted.png" class="case-study-image" alt="">
</figure>
<p>
The self-hosted option allows for the inclusion of fully open-source tools and grants you full ownership over
your pipeline infrastructure. This means you can customize it any way you like. But this solution doesn’t have
any out-of-the-box features and would require a substantial amount of engineering time.
</p>
<br />
<p>
Rudderstack, on the other hand, only offers open-source event streaming, and also requires you to use their
infrastructure, leaving you with little control. However, they provide a ton of features and also abstract
away all of the infrastructure provisioning and management, making it extremely easy to deploy a user data
pipeline quickly.
</p>
<br />
<h4>Challenges of Pipeline Deployment</h4>
<br />
<p>
Building your own pipeline requires an extraordinary amount of time and effort to set up, provision, and
configure.
</p>
<br />
<p>
You need to make many different decisions about which tools to use for data ingestion and syncing and which
warehouse to select, and that’s not even mentioning all that goes into provisioning and maintaining pipeline
infrastructure. To say the least, this is an extremely complex process.
</p>
<br />
<h2>3 Tapestry's Solution</h2>
<br />
<figure>
<img src="assets/images/solutions/23_tapestry_comparision.png" alt="">
</figure>
<br />
<p>
Tapestry is for developers who want full control over their data infrastructure, but without having to
provision that infrastructure themselves. Tapestry is a completely open-source framework that automates the
entire pipeline deployment process. It does not, however, offer many out-of-the-box features.
</p>
<br />
<p>
Tapestry weaves together all of the necessary resources to create an end-to-end user data pipeline, automates
the setup and configuration, and lets you spend your valuable time doing something more important.
</p>
<br />
<h3>3.1 What Tapestry Automates</h3>
<br />
<p>
So if you are thinking about rolling your own self-hosted solution, but want to simplify the deployment
process, we might be able to help.
</p>
<br />
<figure>
<img src="assets/images/solutions/24_automation_chart.png" class="case-study-image" alt="">
</figure>
<p>
Tapestry automates many steps and creates a number of resources for each phase of the pipeline. As you can
see, deploying your own user data pipeline would require at least 71 steps and the provisioning of 49
resources between AWS and the data warehouse.
</p>
<br />
<!-- Section 4 -->
<h2>4 Tapestry's Architecture</h2>
<br />
<p>
This is what Tapestry’s pipeline looks like once deployed.
</p>
<figure>
<img src="assets/images/architecture/38_Tapestry_Final_Architecture_withheadings.png">
<figcaption>Tapestry's Final Architecture</figcaption>
</figure>
<br />
<p>
Before diving into the specifics of this architecture, let’s quickly revisit the three phases of our pipeline:
Ingestion, Storage & Transformation, and Syncing.
</p>
<figure>
<img src="assets/images/solutions/3_pipelineoverviewcolor.png" class="case-study-image">
<figcaption>A complete user data pipeline.</figcaption>
</figure>
<br />
<p>
The ingestion phase is where data is extracted from various sources and is loaded into a data warehouse. Once
in the warehouse, this raw data is then stored and is available to manipulate or transform in any way needed.
Often transformation is needed at this step so that the data can match the schema of the final destination.
The last phase is syncing this data into external tools that can then perform designated actions.
</p>
<br />
<h3>4.1 Data Ingestion</h3>
<figure>
<img src="assets/images/architecture/40_pipelineoverview_ingestion_color.png" class="case-study-image">
</figure>
<br />
<p>
An effective data extraction tool will contain and manage a library of API connectors, specific to each
source. This management of connectors abstracts away the maintenance required to grab data from ever-changing
API endpoints. In addition, this tool should allow for scheduling data extraction and keeping track of global
state so that only new data is pulled.
</p>
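<br />
<p>
The idea of tracking state so that only new data is pulled can be sketched with a saved cursor. This is a
simplified illustration, not the code of any particular ingestion tool: the <code>updated_at</code> field, the
<code>state</code> dictionary, and the sample records are all assumptions made for the example.
</p>
<pre>
```python
from datetime import datetime


def extract_new_records(records, state, stream):
    """Return records newer than the saved cursor and advance the cursor."""
    cursor = state.get(stream)  # None on the first run: pull everything
    fresh = [r for r in records if cursor is None or r["updated_at"] > cursor]
    if fresh:
        # Remember the newest timestamp seen so the next run skips these rows.
        state[stream] = max(r["updated_at"] for r in fresh)
    return fresh


# Hypothetical records standing in for a source API response.
tickets = [
    {"id": 1, "updated_at": datetime(2021, 1, 1)},
    {"id": 2, "updated_at": datetime(2021, 1, 2)},
]
state = {}
first = extract_new_records(tickets, state, "tickets")

# A new ticket appears at the source before the next scheduled run.
tickets.append({"id": 3, "updated_at": datetime(2021, 1, 3)})
second = extract_new_records(tickets, state, "tickets")

print([r["id"] for r in first], [r["id"] for r in second])
```
</pre>
<p>
The first run returns every record, while the second returns only the ticket added since the saved cursor.
</p>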
<br />
<h4>Flow of Data: ETL vs. ELT</h4>
<br />
<p>
In order to make a decision regarding data ingestion, it’s important to consider the path by which the data
travels.
</p>
<br />
<p>
In the past, storing data was an expensive endeavor. This made it more cost-effective to perform any sort of
data transformations before loading the data into a database or warehouse to reduce the amount of data being
stored. This approach is known as Extract-Transform-Load, commonly referred to as ETL.
</p>
<figure>
<img src="assets/images/architecture/41_ETL.png" class="case-study-image">
</figure>
<br />
<p>
However, with the advent of cloud data warehouses, the costs of storing data have decreased dramatically. This
makes it more feasible to store <em>all</em> of your user data as raw data and to perform any transformations
at the warehouse level to fit a variety of analytic and operational needs. And since transformations aren’t
required first, data can be loaded extremely fast. This approach is known as Extract-Load-Transform, or ELT.
</p>
<figure>
<img src="assets/images/architecture/41_ELT.png" class="case-study-image">
</figure>
<br />
<p>
Since it was vital for our pipeline to have access to all of the raw data, we chose to go with an ELT solution.
</p>
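<br />
<p>
The ordering that defines ELT can be shown in a few lines. In this sketch an in-memory SQLite database stands in
for the cloud warehouse, and the table, view, and column names are hypothetical. The point is only the sequence:
raw rows are loaded unchanged, and the transformation runs afterwards inside the warehouse.
</p>
<pre>
```python
import sqlite3

# An in-memory SQLite database stands in for the cloud warehouse.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: copy source rows into a raw table without changing them.
raw_rows = [
    ("susy@example.com", "Susy", 6),
    ("lee@example.com", "Lee", 1),
]
warehouse.execute(
    "CREATE TABLE raw_customers (email TEXT, name TEXT, umbrellas INTEGER)"
)
warehouse.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", raw_rows)

# Transform: build a model inside the warehouse; the raw table stays intact.
warehouse.execute(
    "CREATE VIEW repeat_buyers AS "
    "SELECT email, name FROM raw_customers WHERE umbrellas >= 2"
)

repeat_buyers = warehouse.execute("SELECT email, name FROM repeat_buyers").fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_customers").fetchone()[0]
print(repeat_buyers, raw_count)
```
</pre>
<p>
Because the raw table is never modified, other models with different needs can be built later from the same
loaded data.
</p>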
<br />
<h4>Data Ingestion Tool: Airbyte</h4>
<figure>
<img src="assets/images/architecture/42_airbyte_card.png" class="case-study-image">
</figure>
<p>
While many data ingestion tools are available, like Fivetran, Stitch, and Meltano, we ultimately went with
Airbyte. We liked that it was open-source, had standardized API connectors, a robust UI, and strong community
support.
</p>
<br />
<h4>Data Ingestion: How We Deploy Airbyte</h4>
<br />
<p>
Using Airbyte, we’re able to extract raw data from many third-party tools through its library of managed API
wrappers, covering the E and L steps of ELT. The Airbyte application and each of its connectors run in their
own Docker containers. Airbyte provides the Docker image to deploy
their application, and Tapestry configures the warehouse as a destination for Airbyte via a series of API
calls.
</p>
<br />
<figure>
<img src="assets/images/architecture/43_dataingestion_dockerconnectors_v2.png">
<br />
<br />
<figcaption>Airbyte runs as a Docker container and creates additional connectors as containers.</figcaption>
</figure>
<br />
<p>
Essentially, the main application’s container needs to be able to create new Docker containers as users set up
more and more connections. Due to this necessity, Airbyte recommends the use of an AWS EC2 instance as a
virtual private server for hosting.
</p>
<figure>
<img src="assets/images/architecture/43_dataingestion_EC2_ALB_purple_v4.png">
<br />
<br />
<figcaption>Airbyte data ingestion tool, deployed on an EC2 Instance with an Application Load Balancer.
</figcaption>
</figure>
<br />
<p>
While we might have preferred to use a container orchestration service to horizontally scale the computing
resources used by each container, an EC2 instance still allows for vertical scaling of the entire instance.
</p>
<br />
<p>
Placing a load balancer in front of Airbyte means traffic cannot reach the EC2 instance directly. Network
traffic must first pass through the load balancer before it’s routed to the Airbyte instance. This allows us
to take advantage of additional security measures and keep the IP address of the actual instance hidden. This
helps protect the instance from port-scanning attacks and also takes advantage of AWS’s built-in protection
against DDoS attacks.
</p>
<br />
<h3>4.2 Data Storage and Transformation</h3>
<br />
<p>
The next phase of a data pipeline is data storage and transformation.
</p>
<figure>
<img src="assets/images/architecture/46_pipelineoverview_warehouse_color.png" class="case-study-image">
</figure>
<br />
<h4>Data Warehouse: Snowflake</h4>
<br />
<p>
We’ve already determined that at the center of our pipeline should sit a warehouse that is capable of handling
large amounts of data from a variety of sources. Given our decision to host our tools on AWS, a
warehouse that could be seamlessly integrated with AWS was preferable. While there are many options for a data
warehouse, such as Google BigQuery, Amazon Redshift, and Azure Synapse Analytics, we chose Snowflake.
</p>
<figure>
<img src="assets/images/architecture/47_snowflakecard.png" class="case-study-image">
</figure>
<br />
<p>
Snowflake runs on the major cloud platforms, providing valuable flexibility. It also separates
storage from compute, also known as query processing. This allows companies to take advantage of cost
savings and to scale those two responsibilities independently. Finally, Snowflake abstracts
away the provisioning and maintenance of all the necessary resources for a cloud data warehouse.
</p>
<br />
<h4>Data Storage: Using a Storage Bucket</h4>
<p>
Initially, we attempted to load data directly into Snowflake from third-party tools, but we found the data
transfer to be particularly slow. This led us to investigate using a staging area with Snowflake and how this
impacts data loading.
</p>
<figure>
<img src="assets/images/architecture/48_stagingbucket_purple2.png">
<figcaption>Tapestry's data storage with a staging bucket.</figcaption>
</figure>
<p>
Without this staging area, Airbyte can only insert one row of data at a time into Snowflake, requiring
numerous SQL INSERT commands to copy over an entire table. With the addition of a staging area, Airbyte can
achieve efficient bulk data loading.
</p>
<br />
<p>
To implement this staging area, we provision an Amazon S3 staging bucket between our Airbyte instance and our
Snowflake data warehouse.
</p>
<br />
<h4>Data Transformation Tool: DBT</h4>
<figure>
<img src="assets/images/architecture/52_dbt_card.png" class="case-study-image">
</figure>
<p>
A data transformation tool should be flexible so you can transform data to meet a variety of analytical and
operational needs.
</p>
<br />
<p>
Ideally, we would like a SQL-based data transformation tool that could be utilized by non-developers to create
data models based on the warehouse and put that data into action more quickly.
</p>
<br />
<p>
Finally, we would like a tool that maintains a history of our data transformations. Documentation about
existing data models and how these models relate to each other can provide better context for how data has
been manipulated over time.
</p>
<br />
<p>
When we weighed these requirements, one option stood out because it offered all the features we wanted and
was free and open-source: DBT, or Data Build Tool. We opted for the cloud version of DBT because of its ease
of use and easy-to-understand UI.
</p>
<br />
<h4>Data Transformation: How We Use DBT</h4>
<figure>
<img src="assets/images/architecture/53_warehousearchitecturev3.png" class="case-study-image-small">
<figcaption>Data aggregation with DBT.</figcaption>
</figure>
<br />
<p>
In particular, Tapestry uses DBT to aggregate data and handle duplicate entries. Other transformations you
might want to perform include changing column names or copying only the particular fields you need from one
table into a new table.
</p>
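<p>
In DBT these transformations are written as SQL models. The Python sketch below only illustrates the logic of
aggregating two sources and collapsing duplicates; the field names and the choice of email as the matching
key are assumptions made for the example, not Tapestry's actual data model.
</p>

```python
def merge_contacts(*sources):
    """Combine records from several sources, keeping one record per email."""
    merged = {}
    for records in sources:
        for record in records:
            # Normalize the key so "Ann@Example.com" and "ann@example.com" match.
            key = record["email"].strip().lower()
            existing = merged.setdefault(key, {"email": key})
            for field, value in record.items():
                # The first non-empty value wins; later sources only fill gaps.
                if field != "email" and value is not None:
                    existing.setdefault(field, value)
    return list(merged.values())


zoom_rows = [{"email": "Ann@Example.com", "first_name": "Ann"}]
salesforce_rows = [
    {"email": "ann@example.com", "company": "Acme"},
    {"email": "bob@example.com", "first_name": "Bob"},
]

contacts = merge_contacts(zoom_rows, salesforce_rows)
```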
<br />
<p>
Because DBT has its own cloud version, Tapestry doesn’t need to provision any resources for it.
</p>
<br />
<h3>4.3 Data Syncing</h3>
<figure>
<img src="assets/images/architecture/54_pipelineoverview_syncing_color.png" class="case-study-image">
</figure>
<br />
<p>
The syncing phase is where we send data back into external tools that can then act on the data.
</p>
<br />
<p>
Much like data ingestion, this requires a library of API connectors, specific to each destination, and the
ability to schedule when you want to transfer your data. However, data syncing is more challenging than
ingestion in that the data must conform to the destination’s schema.
</p>
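<p>
As a rough illustration of what conforming to a destination's schema means, the sketch below checks a record
against a made-up schema before it would be sent. The schema and field names are hypothetical and do not
reflect any real destination's API.
</p>

```python
def schema_errors(record, schema):
    """Return a list of reasons the record does not fit the destination schema."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


# Hypothetical destination schema: field name mapped to required Python type.
destination_schema = {"email": str, "first_name": str}

ok = schema_errors({"email": "ann@example.com", "first_name": "Ann"}, destination_schema)
bad = schema_errors({"email": "ann@example.com", "age": 30}, destination_schema)
```

<p>
A record is only safe to sync when the returned list is empty. Ingestion has no equivalent step because the
warehouse accepts whatever shape the source provides.
</p>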
<br />
<p>
This concept of syncing data back into your tools is relatively new and has recently been dubbed
<strong>reverse ETL</strong>. Recall that ETL and ELT move data from your tools into your warehouse; reverse
ETL moves data out of your warehouse and into your business’s operational tools. Although the term is
becoming quite common, it does not describe the process well, which is why we prefer the term “data
syncing.”
</p>
<br />
<h4>Data Syncing Tool: Grouparoo</h4>
<figure>
<img src="assets/images/architecture/55_grouparoo_card.png" class="case-study-image">
</figure>
<p>
While proprietary tools exist in the data syncing space, like Census and Hightouch, we opted to find one that
was open-source.
</p>
<br />
<p>
We ultimately went with Grouparoo. It allows us to schedule data syncing into external tools and validate that
the data conforms to the destination schema.
</p>
<br />
<h4>Data Syncing: How We Deploy Grouparoo</h4>
<br />
<p>
Grouparoo recommends deploying their web application stack with an application layer and a data layer.
</p>
<figure>
<img src="assets/images/architecture/56_grouparoo_recommended_arch.png" class="case-study-image-small">
<figcaption>Grouparoo's recommended deployment strategy.</figcaption>
</figure>
<br />
<p>
The application layer is where the web server and worker server reside. When a request to sync data with
external tools comes in, it first hits a load balancer, which directs the request to the web server. If the
task will take a long time to complete, the web server off-loads it to the worker server. Running these
slower jobs in the background keeps the web server responsive.
</p>
<br />
<p>
The data layer houses the application database as well as the cache. Grouparoo recommends using Redis as
both the cache and the background-job queue for the worker server, and Postgres as the database where your
user data is stored.
</p>
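<p>
The hand-off between the web server and the worker server can be sketched with Python's standard library.
Here an in-memory queue stands in for the Redis-backed job queue and a thread stands in for the worker
server; none of this is Grouparoo's code, and the function names are hypothetical.
</p>

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the Redis-backed job queue
results = []


def worker():
    """Worker server: pull jobs off the queue and run them in the background."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel value used to stop the worker
            jobs.task_done()
            break
        results.append(f"synced {job}")
        jobs.task_done()


def handle_request(destination):
    """Web server: enqueue the slow work and respond immediately."""
    jobs.put(destination)
    return "accepted"


thread = threading.Thread(target=worker)
thread.start()

response = handle_request("mailchimp")   # returns without waiting for the sync

jobs.put(None)   # ask the worker to stop once the queue is drained
thread.join()
```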
<br />
<p>
This architecture informed how Tapestry chose to deploy Grouparoo to the cloud. Given the distributed nature
of this architecture, we thought it was appropriate to deploy each component as its own Docker container.
</p>
<figure>
<img src="assets/images/architecture/57_datasyncing_docker.png">
<figcaption>Multi-container Docker deployment.</figcaption>
</figure>
<br />
<p>
This way each container has only one concern, and the decoupling of responsibilities makes it easier to
scale horizontally in the future. First, Tapestry provides generic Redis and Postgres Docker images to run
the containers for the data layer. Then, Grouparoo provides a Docker image that can be used to deploy the
web and worker services. Starting with their base image, we add the configuration needed to integrate
Snowflake as Grouparoo’s primary data source.
</p>
<br />
<p>
Of note, Grouparoo stores configuration details in JavaScript or JSON files. Because of this, any
configuration change requires the Grouparoo Docker image to be rebuilt. We therefore push the image we
provide to a repository on AWS’s Elastic Container Registry, giving the user easy, private access for any
future updates.
</p>
<br />
<figure>
<img src="assets/images/architecture/57_datasyncing_ECR_blue.png">
<figcaption>Docker images for web and worker servers stored in Elastic Container Registry.</figcaption>
</figure>
<br />
<p>
Because this is a multi-container deployment, Tapestry had to consider how best to handle container
orchestration. Popular options include Kubernetes, Docker Swarm, and Amazon Elastic Container Service (ECS).
Kubernetes and Docker Swarm are both highly configurable, but their learning curves are steep. We therefore
chose ECS to orchestrate containers for Tapestry’s Grouparoo deployment because it manages and scales
containers automatically. This choice also let us use the recently rolled-out ECS and Docker Compose
integration, which simplified the process even further.
</p>
<figure>
<img src="assets/images/architecture/58_docker_ecs_v2.png" class="large-image">
<br />
<br />
<figcaption>Using Docker Compose and Elastic Container Service integration.</figcaption>
</figure>
<br />
<p>
Docker Compose is a tool that allows developers to define and run multi-container applications via a YAML
file. With this integration, we could seamlessly use this same docker-compose file to deploy the Grouparoo
application and all its dependencies as an ECS cluster. AWS resources are created automatically based on the
specifications in this file. This works because there are built-in mappings between the Docker containers
defined in the file and ECS tasks.
</p>
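<p>
To make the idea of a service-to-task mapping concrete, here is a deliberately simplified Python sketch. It
is not how the integration is implemented; it only shows the shape of the translation. The image names and
the port are placeholders, and the output keeps just a few fields of an ECS task definition.
</p>

```python
# A Compose file loaded into a dictionary (placeholder images and port).
compose = {
    "services": {
        "web": {"image": "example/grouparoo", "ports": ["3000:3000"]},
        "worker": {"image": "example/grouparoo"},
        "redis": {"image": "redis"},
        "db": {"image": "postgres"},
    }
}


def to_task_definitions(compose_file):
    """Map each Compose service to a minimal ECS-style task definition."""
    tasks = []
    for name, service in compose_file["services"].items():
        # "host:container" port strings become container port mappings.
        ports = [
            {"containerPort": int(mapping.split(":")[-1])}
            for mapping in service.get("ports", [])
        ]
        tasks.append({
            "family": name,
            "containerDefinitions": [
                {"name": name, "image": service["image"], "portMappings": ports}
            ],
        })
    return tasks


tasks = to_task_definitions(compose)
```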
<br />
<p>
ECS manages not only these containers but also the servers they run on. This happens via AWS Fargate, a
service that abstracts away server provisioning and handles it entirely on the user’s behalf. We also placed
a load balancer in front of the ECS cluster for the same security reasons we placed one in front of our
ingestion tool. Because we use a load balancer, we are also well positioned to scale horizontally in the
future if needed.
</p>
<figure>
<img src="assets/images/architecture/59_data_syncing_fullv4.png" alt="">
<figcaption>Grouparoo deployed with ECS and Fargate, with the addition of an Application Load Balancer.
</figcaption>
</figure>
<br />
<p>
Once Grouparoo is deployed, you are ready to start pulling data from your warehouse and syncing it into
third-party tools like Mailchimp.
</p>
<br />
<!--Section 5-->
<h2>5 Using Tapestry</h2>
<br />
<h3>5.1 Prerequisites &amp; Installing Tapestry</h3>
<p>
Getting started with Tapestry is pretty simple. You will need the following:
</p>
<ul>
<li>Node and NPM</li>
<li>An AWS account and AWS CLI</li>
<li>Docker</li>
</ul>
<br />
<p>
As a developer, you first need Node and NPM installed, since Tapestry is a Node package. Because Tapestry
provisions several AWS resources, you also need an AWS account and the AWS Command Line Interface configured
on your machine. Finally, you need a Docker account and Docker installed on your machine.
</p>
<br />
<p>
After these preliminary steps, all you would need to do to get started is run <code
class="command">npm i -g tapestry-pipeline</code>, and a host of commands will be provided to you.
</p>
<br />
<h3>5.2 Tapestry Commands</h3>
<figure>
<img src="assets/images/demo/26_commandlist_v2.png">
<figcaption>Tapestry's list of commands.</figcaption>
</figure>
<br />
<p>
As a new user, the first Tapestry command you would run is <code class="command">tapestry init</code>.
</p>
<figure>
<img src="assets/images/demo/27_tapestry_init_computer.png" class="case-study-image-small">
<figcaption>Tapestry provides a CloudFormation template during the init command.</figcaption>
</figure>
<br />
<p>
With <code class="command">tapestry init</code>, you give your project a name, and Tapestry will provision a
project folder along with an AWS CloudFormation template. This template allows you to provision and configure
AWS resources with code. In particular, this template is used to provision resources for the data ingestion
phase of the pipeline. What Tapestry provides for the syncing phase of your pipeline is dependent upon which
command you run next.
</p>
<br />
<figure>
<img src="assets/images/demo/27_init_v2.gif">
<figcaption>Running <em>tapestry init</em> from the command line.</figcaption>
</figure>
<h3>5.3 Deploy vs. Kickstart Commands</h3>
<br />
<p>
Next, you have a choice between the <code class="command">tapestry deploy</code> and <code
class="command">tapestry kickstart</code> commands. Once you make your selection, Tapestry provides all the
necessary configuration files for the data syncing phase.
</p>
<figure>
<img src="assets/images/demo/28_deploy_v_kickstart.png" class="case-study-image">
</figure>
<br />
<p>
Both commands automate the deployment of a fully operational pipeline, but <code
class="command">kickstart</code> includes two pre-configured sources, Zoom and Salesforce, along with one
destination, Mailchimp. These pre-configured third-party tools set up your pipeline to have immediate
end-to-end data flow, beginning with data ingestion and ending with data syncing into these tools. Regardless
of which command you choose, note that a Snowflake account is required for <em>both</em> <code
class="command">deploy</code> and <code class="command">kickstart</code>.
</p>
<br />
<h3>5.4 End-to-End Demo</h3>
<br />
<p>
To better show the full flow of data through a Tapestry pipeline, this section will walk through our <code
class="command">tapestry kickstart</code> command.
</p>
<figure>
<img src="assets/images/demo/29_tapestry_kickstart_computer.png" class="case-study-image-small">
</figure>
<br />
<p>
Prior to execution, you will need accounts with Zoom, Salesforce, and Mailchimp. <code
class="command">kickstart</code> then begins by prompting you with a short series of questions about these
accounts, as well as Snowflake, Tapestry’s data warehouse of choice. Tapestry stores this information in the
AWS Systems Manager (SSM) Parameter Store, which keeps sensitive data safe but still accessible.
</p>
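<p>
For readers unfamiliar with the Parameter Store, the sketch below shows one way a credential could be saved
as an encrypted parameter. It is an assumption about the general approach, not Tapestry's source code: the
helper function and the <code>/project/name</code> naming scheme are hypothetical, while
<code>put_parameter</code> and the <code>SecureString</code> type are part of the SSM API.
</p>

```python
def store_secret(ssm_client, project, name, value):
    """Save one credential as an encrypted SecureString parameter."""
    parameter_name = f"/{project}/{name}"   # hypothetical naming scheme
    ssm_client.put_parameter(
        Name=parameter_name,
        Value=value,
        Type="SecureString",   # encrypted at rest by AWS
        Overwrite=True,        # allow re-running the prompts
    )
    return parameter_name
```

<p>
With the AWS SDK for Python installed and credentials configured, the client argument would be
<code>boto3.client("ssm")</code>. Passing the client in as a parameter also makes the function easy to
exercise with a stand-in object.
</p>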
<figure>
<img src="assets/images/demo/29_kickstart_questions_stars.gif">
<figcaption>Kickstart command prompts user for inputs.</figcaption>
</figure>
<br />
<p>
After your information has been collected, <code class="command">kickstart</code> continues by creating the
necessary databases and tables within your data warehouse for use by both your ingestion and syncing
tools.
</p>
<br />
<p>
Let’s quickly review the infrastructure that this command is provisioning.
</p>
<figure>
<img src="assets/images/demo/new_data_ingestion_v2.gif" class="large-image">
<figcaption>Data ingestion stack created during deployment.</figcaption>
</figure>
<br />
<p>
Tapestry uses the CloudFormation template supplied during the <code class="command">init</code> command to
create a CloudFormation stack, provisioning AWS resources specifically related to your ingestion tool,
Airbyte. These resources include an S3 staging bucket, an EC2 instance for Airbyte to run on, and an
Application Load Balancer to route traffic to that EC2 instance. Airbyte is then configured to extract
certain data from your Zoom and Salesforce accounts and send it over to your warehouse.
</p>
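<p>
To give a sense of what such a template describes, the following sketch builds a heavily trimmed
CloudFormation template as a Python dictionary. It is not the template Tapestry ships: the logical names and
parameter values are placeholders, and networking, security groups, listeners, and target groups are omitted,
so it would not deploy as written.
</p>

```python
import json


def ingestion_template(bucket_name, ami_id, instance_type):
    """Build a skeleton template with the three resource types named above."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "StagingBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": bucket_name},
            },
            "AirbyteInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {"ImageId": ami_id, "InstanceType": instance_type},
            },
            "LoadBalancer": {
                "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
                "Properties": {"Type": "application"},
            },
        },
    }


# Placeholder values; a real template needs a valid AMI ID and instance type.
template = ingestion_template("my-staging-bucket", "ami-EXAMPLE", "EXAMPLE-TYPE")
template_body = json.dumps(template, indent=2)
```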
<figure>
<img src="assets/images/demo/MISC_kickstart_airbyte_v2.gif">
<figcaption>Setup and provisioning of data ingestion stack from the command line.</figcaption>
</figure>
<br />
<p>
You will then be asked to carry out a few steps so the data is transformed in your warehouse using the data
model Tapestry provides for DBT. The raw data will be aggregated from both sources into one transformed table,
filtered for duplicates, and appropriately formatted to be synced to Mailchimp.
</p>
<br />
<figure>
<img src="assets/images/demo/dbt-run.PNG" class="case-study-image">
<figcaption>A successful DBT run, transforming data in the warehouse.</figcaption>
</figure>
<br />
<p>
To complete the pipeline, <code class="command">kickstart</code> creates another CloudFormation stack, this
time spinning up various AWS resources for your syncing tool, Grouparoo.
</p>
<figure>
<img src="assets/images/demo/new_data_syncing.gif" class="large-image">
<figcaption>Data syncing stack created during deployment.</figcaption>
</figure>
<br />
<p>
These resources include an Elastic Container Service cluster to run your Grouparoo application, an Elastic
Container Registry repository that stores your Grouparoo Docker image, and another Application Load Balancer
to route network traffic to your cluster.
</p>
<figure>
<img src="assets/images/demo/kickstart-grouparoo_v3.gif">
<figcaption>Setup and provisioning of data syncing stack from the command line.</figcaption>
</figure>
<br />
<h3>5.5 Tapestry Dashboard</h3>
<br />
<p>
If you are deploying a new pipeline, Tapestry automatically launches your very own local Tapestry Dashboard.
Additionally, anytime you'd like to view the dashboard, you can run the command <code
class="command">tapestry start-server</code> to spin up and launch the UI at http://localhost:7777.
</p>
<figure>
<img src="assets/images/demo/tap_ui_home_v2.png">
</figure>
<br />
<p>
The dashboard contains documentation for how to use Tapestry, along with a page for each section of your
pipeline. Each page displays metrics that give you better insight into the health of that component, along
with links to the UIs for all your data pipeline tools: Airbyte, Snowflake, DBT, and Grouparoo.
</p>
<br />
<figure>
<img src="assets/images/demo/tap_ui_carousel_v2.gif">
</figure>
<br />
<p>Some important metrics we track on the dashboard include:</p>
<ul>
<li>Number of data ingestion sources currently operational</li>
<li>Number of data syncing destinations currently operational</li>
<li>EC2 instance status</li>
<li>ECS cluster status</li>