44 changes: 6 additions & 38 deletions services/libs/tinybird/datasources/activities.datasource
Original file line number Diff line number Diff line change
@@ -7,79 +7,47 @@ DESCRIPTION >
- `type` specifies the activity type (issues-opened, pull-request-opened, etc.) using LowCardinality.
- `timestamp` is when the activity occurred on the source platform.
- `platform` indicates the source platform (github, discord, slack, etc.) using LowCardinality.
- `isContribution` flag indicates whether this activity counts as a contribution (UInt8 boolean).
- `score` is the computed importance/impact score (-1 default indicates no score computed).
- `sourceId` is the unique identifier from the source platform.
- `createdAt` and `updatedAt` are standard timestamp fields for record lifecycle tracking.
- `sourceParentId` is the parent activity identifier from the source platform (empty string if no parent).
- `attributes` contains additional JSON metadata specific to the activity type.
- `title` and `body` contain the activity's text content (empty string if not applicable).
- `channel` contains the repository, channel, or forum where activity occurred (empty string if not applicable).
- `url` is the direct link to the activity on the source platform (empty string if not available).
- Sentiment analysis fields (`sentimentLabel`, `sentimentScore*`) provide sentiment metrics (-1 default for no analysis).
- Git-specific fields (`git*`) track code changes, branch info, and merge status for code-related activities (0 default for non-git).
- `importHash` is a deterministic hash used for idempotent imports and de-duplication; empty string if not set.
- `deletedAt` is the soft-delete timestamp (epoch `0` = not deleted). Used to mark records logically removed/replaced.
- `memberId` is the internal UUID of the actor (normalized member) who performed the activity; `NULL` if not matched.
- `parentId` is the internal UUID of this activity’s parent within our system (distinct from `sourceParentId`); `NULL` if root.
- `tenantId` identifies the workspace/tenant that owns the data (used for multi-tenancy and filtering).
- `createdById` / `updatedById` are internal UUIDs of the process/user that created/last updated the record (auditing).
- `conversationId` groups related activities into the same thread (e.g., Slack thread, GitHub issue/PR conversation); `NULL` if N/A.
- `username` is the actor’s handle on the source platform at event time (e.g., GitHub login), even if `memberId` is `NULL`.
- `objectMemberId` is the internal UUID of the target/subject member (e.g., assignee, reviewer, mentioned user); `NULL` if N/A.
- `objectMemberUsername` is the target member’s handle on the source platform; empty string if unknown.
- `segmentId` is a logical segment key used to scope/bucket activities (e.g., cohort, product segment); empty string if not segmented.
- `organizationId` is the internal UUID of the owning organization/repository/community when known; `NULL` if unknown.
- `member_isBot` flags whether the actor account is a bot (derived from identity resolution).
- `member_isTeamMember` flags whether the actor is an internal project/team member vs. an external community member.

TAGS "Activity preprocessing pipeline"

SCHEMA >
`id` String `json:$.id`,
`type` LowCardinality(String) `json:$.type`,
`timestamp` DateTime `json:$.timestamp`,
`platform` LowCardinality(String) `json:$.platform`,
`isContribution` UInt8 `json:$.isContribution`,
`score` Int8 `json:$.score` DEFAULT -1,
`sourceId` String `json:$.sourceId`,
`createdAt` DateTime64(3) `json:$.createdAt`,
`updatedAt` DateTime64(3) `json:$.updatedAt`,
`sourceParentId` String `json:$.sourceParentId` DEFAULT '',
`attributes` String `json:$.attributes`,
`title` String `json:$.title` DEFAULT '',
`body` String `json:$.body` DEFAULT '',
`channel` String `json:$.channel` DEFAULT '',
`url` String `json:$.url` DEFAULT '',
`sentimentLabel` String `json:$.sentimentLabel` DEFAULT '',
`sentimentScore` Float32 `json:$.sentimentScore` DEFAULT -1,
`sentimentScoreMixed` Float32 `json:$.sentimentScoreMixed` DEFAULT -1,
`sentimentScoreNeutral` Float32 `json:$.sentimentScoreNeutral` DEFAULT -1,
`sentimentScoreNegative` Float32 `json:$.sentimentScoreNegative` DEFAULT -1,
`sentimentScorePositive` Float32 `json:$.sentimentScorePositive` DEFAULT -1,
`gitIsMainBranch` UInt8 `json:$.gitIsMainBranch` DEFAULT 0,
`gitIsIndirectFork` UInt8 `json:$.gitIsIndirectFork` DEFAULT 0,
`gitLines` Int32 `json:$.gitLines` DEFAULT 0,
`gitInsertions` Int32 `json:$.gitInsertions` DEFAULT 0,
`gitDeletions` Int32 `json:$.gitDeletions` DEFAULT 0,
`gitIsMerge` UInt8 `json:$.gitIsMerge` DEFAULT 0,
`importHash` String `json:$.importHash` DEFAULT '',
`deletedAt` DateTime64(3) `json:$.deletedAt` DEFAULT toDateTime64(0, 3),
`memberId` UUID `json:$.memberId` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`parentId` UUID `json:$.parentId` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`tenantId` LowCardinality(String) `json:$.tenantId` DEFAULT '',
`createdById` UUID `json:$.createdById` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`updatedById` UUID `json:$.updatedById` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`conversationId` UUID `json:$.conversationId` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`username` String `json:$.username` DEFAULT '',
`objectMemberId` UUID `json:$.objectMemberId` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`objectMemberUsername` String `json:$.objectMemberUsername` DEFAULT '',
`segmentId` LowCardinality(String) `json:$.segmentId` DEFAULT '',
`organizationId` UUID `json:$.organizationId` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`member_isBot` UInt8 `json:$.member_isBot` DEFAULT 0,
`member_isTeamMember` UInt8 `json:$.member_isTeamMember` DEFAULT 0

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY toYear(createdAt)
ENGINE_SORTING_KEY id
ENGINE_VER updatedAt
`segmentId` LowCardinality(String) `json:$.segmentId` DEFAULT ''

ENGINE "ReplacingMergeTree"
ENGINE_PARTITION_KEY "toYear(createdAt)"
ENGINE_SORTING_KEY "id"
ENGINE_VER "updatedAt"
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
DESCRIPTION >
Backup copy of the raw `activities` datasource. Populated by the
`activities_backup_cleaned_mv` materialized view. Same schema & engine.

SCHEMA >
`id` String,
`type` LowCardinality(String),
`timestamp` DateTime,
`platform` LowCardinality(String),
`score` Int8 DEFAULT -1,
`sourceId` String,
`createdAt` DateTime64(3),
`updatedAt` DateTime64(3),
`attributes` String,
`title` String DEFAULT '',
`body` String DEFAULT '',
`channel` String DEFAULT '',
`url` String DEFAULT '',
`sentimentLabel` String DEFAULT '',
`sentimentScoreMixed` Float32 DEFAULT -1,
`sentimentScoreNeutral` Float32 DEFAULT -1,
`sentimentScoreNegative` Float32 DEFAULT -1,
`sentimentScorePositive` Float32 DEFAULT -1,
`gitIsMainBranch` UInt8 DEFAULT 0,
`gitIsIndirectFork` UInt8 DEFAULT 0,
`gitLines` Int32 DEFAULT 0,
`gitIsMerge` UInt8 DEFAULT 0,
`deletedAt` DateTime64(3) DEFAULT toDateTime64(0, 3),
`tenantId` LowCardinality(String) DEFAULT '',
`createdById` UUID DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`updatedById` UUID DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
`segmentId` LowCardinality(String) DEFAULT ''

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY toYear(createdAt)
ENGINE_SORTING_KEY id
ENGINE_VER updatedAt
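
Reviewer note: the backup schema keeps the sentinel defaults of `activities` (`deletedAt` at epoch 0, all-zero UUIDs) instead of nullable columns, so readers have to compare against the sentinel rather than test for `NULL`. A minimal Python sketch of that check, assuming rows arrive as plain dicts with timezone-aware datetimes; the helper names `is_deleted` and `has_creator` are hypothetical and are not part of this PR:

```python
from datetime import datetime, timezone

# Sentinel values mirroring the column defaults in the schema above.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)      # toDateTime64(0, 3)
ZERO_UUID = "00000000-0000-0000-0000-000000000000"     # default for UUID columns


def is_deleted(row):
    """A row counts as soft-deleted only if deletedAt is later than epoch 0."""
    return row.get("deletedAt", EPOCH) > EPOCH


def has_creator(row):
    """createdById is considered set only if it differs from the all-zero UUID."""
    return row.get("createdById", ZERO_UUID) != ZERO_UUID
```

A missing key is treated like the column default, which matches what an insert without that field would produce.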
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
DESCRIPTION >
- `activities_deduplicated_ds` contains deduplicated raw activity events without relationship data.
- `activities_deduplicated_ds` contains deduplicated raw activity events without relationship data.
- Created via copy pipe from `activities` datasource with deduplication and field selection for performance.
- Since aggregations are mainly done on relationships, `activityRelations_deduplicated_cleaned_ds` should be used for reporting purposes instead.
- Optimized subset of activity fields focused on core analytics needs.
@@ -8,18 +8,12 @@ DESCRIPTION >
- `platform` indicates the source platform (github, discord, slack, etc.) using LowCardinality.
- `type` specifies the activity type (issues-opened, pull-request-opened, etc.) using LowCardinality.
- `channel` contains the repository, channel, or forum where activity occurred.
- `isContribution` flag indicates whether this activity counts as a contribution (UInt8 boolean).
- `sourceId` is the unique identifier from the source platform.
- `sourceParentId` is the parent activity identifier from the source platform.
- `sentimentLabel` and `sentimentScore` provide sentiment analysis results.
- `gitChangedLines` tracks the total lines changed for git activities (Int64).
- `gitChangedLinesBucket` categorizes the size of git changes into buckets.
- `score` is the computed importance/impact score for the activity.
- `score` is the computed importance/impact score for the activity.
- `attributes` contains additional JSON metadata specific to the activity type.
- `body` contains the activity’s textual body/content when applicable; empty string if not applicable.
- `title` contains the activity’s title/subject when applicable; empty string if not applicable.
- `url` is the direct link to the activity on the source platform; empty string if not available.

TAGS "Activity preprocessing pipeline"

SCHEMA >
@@ -28,24 +22,20 @@ SCHEMA >
`platform` LowCardinality(String),
`type` LowCardinality(String),
`channel` String,
`isContribution` UInt8,
`sourceId` String,
`sourceParentId` String,
`sentimentLabel` String,
`sentimentScore` Float32,
`gitChangedLines` Int64,
`gitChangedLinesBucket` String,
`score` Int8,
`attributes` String,
`body` String DEFAULT '',
`title` String DEFAULT '',
`url` String DEFAULT '',
`updatedAt` DateTime64(3)

INDEXES >
idx_body_ngram3 body TYPE ngrambf_v1(3, 2048, 6, 0) GRANULARITY 64
idx_title_ngram3 title TYPE ngrambf_v1(3, 512, 6, 0) GRANULARITY 64
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(timestamp)"
ENGINE_SORTING_KEY "id, platform, channel"

ENGINE MergeTree
ENGINE_PARTITION_KEY toYear(timestamp)
ENGINE_SORTING_KEY id, platform, channel

INDEXES >
idx_body_ngram3 body TYPE ngrambf_v1(3, 2048, 6, 0) GRANULARITY 64,
idx_title_ngram3 title TYPE ngrambf_v1(3, 512, 6, 0) GRANULARITY 64
39 changes: 39 additions & 0 deletions services/libs/tinybird/pipes/activities_backup_cleaned_mv.pipe
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
DESCRIPTION >
Backup mirror of `activities` using a materialized view.
Inserts are appended into `activities_backup`, with logical deduplication
by ReplacingMergeTree (order_by id, version updatedAt).

NODE activities_backup_mv_v1
SQL >
SELECT
id,
type,
timestamp,
platform,
score,
sourceId,
createdAt,
updatedAt,
attributes,
title,
body,
channel,
url,
sentimentLabel,
sentimentScoreMixed,
sentimentScoreNeutral,
sentimentScoreNegative,
sentimentScorePositive,
gitIsMainBranch,
gitIsIndirectFork,
gitLines,
gitIsMerge,
deletedAt,
tenantId,
createdById,
updatedById,
segmentId
FROM activities

TYPE MATERIALIZED
DATASOURCE activities_backup_cleaned
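
Reviewer note: the "logical deduplication" in the description depends on ReplacingMergeTree merges, which run in the background and only within a partition, so duplicate `id` rows can stay visible in the backup until a merge happens. The sketch below is a small Python model of the intended end state (one row per `id`, highest `updatedAt` wins, later insert wins ties). It is an illustration, not Tinybird or ClickHouse code, and the function name `replacing_merge` is hypothetical:

```python
def replacing_merge(rows):
    """Model of ReplacingMergeTree(sorting key: id, version: updatedAt).

    For each id, keep the row with the greatest updatedAt. On equal
    versions the row that appears later in the input (inserted later) wins.
    """
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["updatedAt"] >= current["updatedAt"]:
            latest[row["id"]] = row
    # Return rows ordered by the sorting key, as a merged part would be.
    return sorted(latest.values(), key=lambda r: r["id"])
```

Queries that need this view before merges complete have to deduplicate at read time, for example by picking the latest `updatedAt` per `id`.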
Original file line number Diff line number Diff line change
@@ -11,25 +11,8 @@ SQL >
a.platform,
a.type,
a.channel,
a.isContribution,
a.sourceId,
a.sourceParentId,
a.sentimentLabel,
a.sentimentScore,
(a.gitInsertions + a.gitDeletions) AS gitChangedLines,
multiIf(
(a.gitInsertions + a.gitDeletions) BETWEEN 1 AND 9,
'1-9',
(a.gitInsertions + a.gitDeletions) BETWEEN 10 AND 59,
'10-59',
(a.gitInsertions + a.gitDeletions) BETWEEN 60 AND 99,
'60-99',
(a.gitInsertions + a.gitDeletions) BETWEEN 100 AND 499,
'100-499',
(a.gitInsertions + a.gitDeletions) >= 500,
'500+',
''
) AS gitChangedLinesBucket,
a.score,
a.attributes,
a.body,