Commit a382e84

BUILD: fix linkchecker in 2.x (opensearch-project#3793) (opensearch-project#3795)
* BUILD: fix linkchecker in 2.x
* ignore github url
* add missing doc links
* fix typo

---------

(cherry picked from commit 7268bc5)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent c80e763 commit a382e84

21 files changed

Lines changed: 3101 additions & 1 deletion

.github/workflows/link-checker.yml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:
 id: lychee
 uses: lycheeverse/lychee-action@master
 with:
-args: --accept=200,403,429,999 "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*" "http://localhost*" "https://localhost" "https://odfe-node1:9200/" "https://community.tableau.com/docs/DOC-17978" ".*family.zzz" "https://pypi.python.org/pypi/opensearchsql/" "opensearch*" ".*@amazon.com" ".*email.com" "git@github.com" "http://timestamp.verisign.com/scripts/timstamp.dll" ".*/PowerBIConnector/bin/Release"
+args: --accept=200,403,429,999 "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*|http://localhost.*|https://localhost|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978|.*family.zzz|https://pypi.python.org/pypi/opensearchsql/|opensearch*|.*@amazon.com|.*email.com|.*github.com|https://github.com/.*|http://timestamp.verisign.com/scripts/timstamp.dll|.*/PowerBIConnector/bin/Release"
 env:
 GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
 - name: Fail if there were link errors
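
The updated `--exclude` value joins all patterns with `|` into a single regular expression. The sketch below shows how such an alternation behaves. It uses Python's `re` module as a stand-in for the link checker's own regex engine and assumes unanchored matching; `is_excluded` is a hypothetical helper, not part of lychee.

```python
import re

# Patterns copied from the updated --exclude argument, joined with "|".
# Python's re module stands in for the link checker's regex engine (assumption).
EXCLUDE = re.compile(
    "https://aws.oss.sonatype.*|http://localhost.*|https://localhost"
    "|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978"
    "|.*family.zzz|https://pypi.python.org/pypi/opensearchsql/|opensearch*"
    "|.*@amazon.com|.*email.com|.*github.com|https://github.com/.*"
    "|http://timestamp.verisign.com/scripts/timstamp.dll"
    "|.*/PowerBIConnector/bin/Release"
)


def is_excluded(url: str) -> bool:
    """Return True if any alternative matches somewhere in the URL."""
    return EXCLUDE.search(url) is not None


print(is_excluded("https://github.com/opensearch-project/sql"))
print(is_excluded("https://docs.fluentbit.io/manual/pipeline/outputs/s3"))
```

Under these assumptions, `opensearch*` means "opensearc" followed by zero or more "h", so it matches any URL that contains that substring, such as `https://opensearch.org/docs/latest/`.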

docs/dev/datasource-prometheus.md

Lines changed: 331 additions & 0 deletions
Large diffs are not rendered by default.

docs/dev/datasource-query-s3.md

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
1+
## 1.Overview
2+
3+
In this document, we will propose a solution in OpenSearch Observability to query log data stored in S3.
4+
5+
### 1.1.Problem Statements
6+
7+
Currently, OpenSearch Observability is a collection of plugins and applications that let you visualize data-driven events by using Piped Processing Language to explore, discover, and query data stored in OpenSearch. The major requirements we have heard from customers are:
8+
9+
* **cost**, regarding the hardware cost of setting up an OpenSearch cluster.
10+
* **ingestion performance**, it is not easy to support high-throughput raw log ingestion.
11+
* **flexibility**, an OpenSearch index requires users to know their query patterns before ingesting data, which is not flexible.
12+
13+
We can build a new solution for OpenSearch Observability that leverages S3 as storage. The benefits are:
14+
15+
* **cost efficiency**, compared with OpenSearch, S3 is cheap.
16+
* **high ingestion throughput**, S3 is designed to support high-throughput writes.
17+
* **flexibility**, users do not need to worry about defining index mappings or reindexing. Everything can be defined at the query tier.
18+
* **scalability**, users do not need to optimize their OpenSearch cluster for writes. S3 scales automatically.
19+
* **data durability**, S3 provides 11 nines (99.999999999%) of data durability.
20+
21+
With all these benefits, are there any concerns? The **ability to query S3 in OpenSearch and query performance** are the major concerns. In this doc, we will provide the solution to solve these two major concerns.
22+
23+
## 2.Terminology
24+
25+
* **Catalog**: OpenSearch accesses an external data source through a catalog, for example an S3 catalog. Each catalog has attributes; the most important attribute is the data source access credentials.
26+
* **Table**: To access an external data source, the user should create an external table to describe the schema and location. A table is a virtual concept that does not map to an OpenSearch index.
27+
* **Materialized View**: Users can create views from existing tables. Each view maps 1:1 to an OpenSearch index. There are two types of views:
28+
* (1) Permanent view (default), which is fully managed by the user.
29+
* (2) Transient view, which is maintained by Maximus automatically. The user can decide when to drop the transient view.
30+
31+
## 3.Requirements
32+
33+
### 3.1.Use Cases
34+
35+
#### Use Case - 1: pre-build and query metrics from log on s3
36+
37+
**_Create_**
38+
39+
* Users can ingest logs or events directly to S3 with existing ingestion tools (e.g. [fluentbit](https://docs.fluentbit.io/manual/pipeline/outputs/s3)). The ingested logs should be partitioned, e.g. _s3://my-raw-httplog/region/yyyy/mm/dd_.
40+
* The user configures the S3 catalog in OpenSearch. By default, the S3 connector uses [ProfileCredentialsProvider](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/profile/ProfileCredentialsProvider.html).
41+
42+
```
43+
settings.s3.access_key: xxxxx
44+
settings.s3.secret_key: xxxxx
45+
```
46+
47+
* The user creates a table in OpenSearch for their log data on S3. Maximus will create the table httplog in the s3 catalog.
48+
49+
```
50+
CREATE EXTERNAL TABLE `s3`.`httplog`(
51+
@timestamp timestamp,
52+
clientip string,
53+
request string,
54+
status integer,
55+
size long
56+
)
57+
ROW FORMAT SERDE
58+
'json'
59+
'grok', <timestamp> <className>
60+
PARTITION BY
61+
${region}/${year}/${month}/${day}/
62+
LOCATION
63+
's3://my-raw-httplog/';
64+
```
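
The `PARTITION BY` layout above implies that each region and day maps to one S3 prefix, so a time-bounded query only needs to read the matching prefixes. A minimal sketch of that mapping follows; the helper names `partition_prefix` and `prefixes_for_range` are hypothetical, and zero-padded month and day values are an assumption based on the `yyyy/mm/dd` example.

```python
from datetime import date, timedelta


def partition_prefix(location: str, region: str, day: date) -> str:
    """Build the S3 prefix for one region/day partition under the table location."""
    return f"{location.rstrip('/')}/{region}/{day:%Y}/{day:%m}/{day:%d}/"


def prefixes_for_range(location: str, region: str, start: date, end: date) -> list:
    """List the prefixes a query over [start, end] would need to scan."""
    days = (end - start).days
    return [
        partition_prefix(location, region, start + timedelta(days=i))
        for i in range(days + 1)
    ]


for prefix in prefixes_for_range(
    "s3://my-raw-httplog/", "us-west-2", date(2022, 8, 1), date(2022, 8, 3)
):
    print(prefix)
```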
65+
66+
* The user creates the view *failEventsMetrics*. Note: users can only create views from schema-defined tables in Maximus.
67+
68+
```
69+
CREATE MATERIALIZED VIEW failEventsMetrics (
70+
cnt long,
71+
time timestamp,
72+
status string
73+
)
74+
AS source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(5mins), status
75+
WITH (
76+
REFRESH=AUTO
77+
)
78+
```
79+
80+
* Maximus will (1) create the view *failEventsMetrics* in the default catalog, (2) create the index *failEventsMetrics* in the OpenSearch cluster, and (3) refresh the *failEventsMetrics* index with logs on S3. _Note: creating a view returns success when operations (1) and (2) succeed. Refreshing the view is an async task that is executed in the background._
81+
* The user can describe the view to monitor its status.
82+
83+
```
84+
DESCRIBE/SHOW MATERIALIZED VIEW failEventsMetrics
85+
86+
# Return
87+
status: INIT | IN_PROGRESS | READY
88+
```
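
The three status values suggest a small lifecycle: the view starts in `INIT` once it is created, moves to `IN_PROGRESS` while the background refresh runs, and becomes `READY` when the refresh completes. The sketch below models that lifecycle. The document lists the states but not the allowed transitions, so the `ALLOWED` table and the `MaterializedView` class are assumptions for illustration only.

```python
from enum import Enum


class ViewStatus(Enum):
    INIT = "INIT"
    IN_PROGRESS = "IN_PROGRESS"
    READY = "READY"


# Assumed transitions; the source only names the states.
ALLOWED = {
    ViewStatus.INIT: {ViewStatus.IN_PROGRESS},
    ViewStatus.IN_PROGRESS: {ViewStatus.READY},
    ViewStatus.READY: {ViewStatus.IN_PROGRESS},  # a later refresh starts again
}


class MaterializedView:
    """Hypothetical holder for a view name and its refresh status."""

    def __init__(self, name: str):
        self.name = name
        self.status = ViewStatus.INIT  # view and index created, nothing refreshed yet

    def transition(self, new_status: ViewStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} is not allowed")
        self.status = new_status


view = MaterializedView("failEventsMetrics")
view.transition(ViewStatus.IN_PROGRESS)  # background refresh started
view.transition(ViewStatus.READY)        # refresh finished
print(view.status.value)
```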
89+
90+
_**Query**_
91+
92+
* User could query the view.
93+
94+
```
95+
source=failEventsMetrics time in last 7 days
96+
```
97+
98+
* Users can also query the table. Thunder will rewrite the table query as a view query when optimizing the query.
99+
100+
```
101+
source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(1hour), status
102+
```
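
One reason a rewrite like this can work is that counts are additive: a 1-hour bucket can be computed by summing the 5-minute buckets already stored in the view, without scanning S3 again. The sketch below illustrates that roll-up. It is an assumption about how the rewrite might be evaluated rather than a description of the actual optimizer, and `rollup_to_hour` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime


def rollup_to_hour(rows):
    """Sum 5-minute (bucket_start, status, cnt) rows into 1-hour buckets."""
    totals = defaultdict(int)
    for bucket_start, status, cnt in rows:
        hour = bucket_start.replace(minute=0, second=0, microsecond=0)
        totals[(hour, status)] += cnt
    return dict(totals)


# Rows as they might be stored in the failEventsMetrics view (made-up values).
view_rows = [
    (datetime(2022, 8, 1, 10, 0), "500", 2),
    (datetime(2022, 8, 1, 10, 5), "500", 3),
    (datetime(2022, 8, 1, 11, 0), "500", 1),
]
for (hour, status), cnt in sorted(rollup_to_hour(view_rows).items()):
    print(hour.isoformat(), status, cnt)
```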
103+
104+
_**Drop**_
105+
106+
107+
* Users can drop the table. Maximus will delete the httpLog metadata from the catalog and drop the associated views.
108+
109+
```
110+
DROP TABLE `s3`.`httpLog`
111+
```
112+
113+
* Users can drop the view. Maximus will delete the failEventsMetrics metadata from the catalog and delete the failEventsMetrics indices.
114+
115+
```
116+
DROP MATERIALIZED VIEW failEventsMetrics
117+
```
118+
119+
**Access Control**
120+
121+
* Users cannot configure any permission on a table.
122+
* Users cannot directly configure any permission on a view, but they can configure permissions on the index associated with the view.
123+
![image1](https://user-images.githubusercontent.com/2969395/182239505-a8541ec1-f46f-4b91-882a-8a4be36f5aea.png)
124+
125+
#### Use Case - 2: Ad-hoc s3 query in OpenSearch
126+
127+
**_Create_**
128+
129+
* Similar to the previous example, users can ingest logs or events directly to S3 with existing ingestion tools, then create a catalog and table from the data on S3.
130+
131+
_**Query**_
132+
133+
* Users can query a table without creating a view. At runtime, Maximus will create a transient view and populate it with the required data.
134+
135+
```
136+
source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(5mins), status
137+
```
138+
139+
_**List**_
140+
141+
* Users can list all views of a table no matter how each view was created. Users can query, drop, or describe a temp view created by Maximus in the same way as a user-created view.
142+
143+
```
144+
LIST VIEW on `s3`.`httpLog`
145+
146+
# return
147+
httplog-tempView-xxxx
148+
```
149+
150+
**Access Control**
151+
152+
* Users cannot configure any permission on a table. At query time, Maximus will use the credentials to access S3 data. It is the user's responsibility to configure permissions on the S3 data.
153+
* Users cannot directly configure any permission on a view, but they can configure permissions on the index associated with the view.
154+
![image2](https://user-images.githubusercontent.com/2969395/182239672-72b2cfc6-c22e-4279-b33e-67a85ee6a778.png)
155+
156+
### 3.2.Functional Requirements
157+
158+
#### Query S3
159+
160+
* User could query time series data on S3 by using PPL in OpenSearch.
161+
* For querying time series data in S3, user must create a **Catalog** for S3 with required setting.
162+
* access key and secret key
163+
* endpoint
164+
* For querying time series data in S3, user must create a **Table** of time series data on S3.
165+
166+
#### View
167+
168+
* Support creating materialized views from time series data on S3.
169+
* Support full refresh of materialized views
170+
* Support manual incremental refresh of materialized views
171+
* Support automatic incremental refresh of materialized views
172+
* Support hybrid scan
173+
* Support drop materialized view
174+
* Support show materialized view
175+
176+
#### Query Optimization
177+
178+
* Inject optimizer rule to rewrite the query with MV to avoid S3 scan
179+
180+
#### Automatic query acceleration
181+
182+
* Automatically select view candidate based on OpenSearch - S3 query execution metrics
183+
* Store workload and selected view info for visualization and user intervention
184+
* Automatically create/refresh/drop view.
185+
186+
#### S3 data format
187+
188+
* The time series data could be compressed with gzip.
189+
* The time series data file format could be:
190+
191+
* JSON
192+
* TSV
193+
194+
* If the time series data is partitioned and has a snapshot ID, the query engine could support automatic incremental refresh and hybrid scan.
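
The sketch below shows one way a snapshot ID could drive both features: partitions at or below the last refreshed snapshot are served from the view, and newer partitions are either ingested by the next incremental refresh or read directly from S3 in a hybrid scan. The `Partition` type, the integer snapshot IDs, and `split_for_hybrid_scan` are all assumptions for illustration; the document does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    path: str
    snapshot_id: int  # assumed to increase monotonically with each ingestion batch


def split_for_hybrid_scan(partitions, last_refreshed_snapshot):
    """Split partitions into those already in the view and those only on S3."""
    covered = [p for p in partitions if p.snapshot_id <= last_refreshed_snapshot]
    pending = [p for p in partitions if p.snapshot_id > last_refreshed_snapshot]
    return covered, pending


partitions = [
    Partition("us-west-2/2022/08/01/", 1),
    Partition("us-west-2/2022/08/02/", 2),
    Partition("us-west-2/2022/08/03/", 3),
]
covered, pending = split_for_hybrid_scan(partitions, last_refreshed_snapshot=2)
print([p.path for p in covered])  # answered from the materialized view
print([p.path for p in pending])  # refreshed incrementally or scanned on S3
```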
195+
196+
#### Resource Management
197+
198+
* Support circuit breaker based resource control when executing a query.
199+
* Support task based resource manager
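
A circuit breaker for resource control typically tracks an estimate of the memory a query has reserved and fails the query fast once a limit would be exceeded. The sketch below is a generic illustration of that pattern under assumed names (`MemoryCircuitBreaker`, `CircuitBreakerError`); it is not the OpenSearch circuit breaker API.

```python
class CircuitBreakerError(RuntimeError):
    """Raised when a reservation would exceed the configured limit."""


class MemoryCircuitBreaker:
    """Hypothetical per-query memory accounting with a hard limit."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, n: int) -> None:
        # Check before allocating so the query fails fast instead of exhausting memory.
        if self.used_bytes + n > self.limit_bytes:
            raise CircuitBreakerError(
                f"reserving {n} bytes would exceed the limit of {self.limit_bytes} bytes"
            )
        self.used_bytes += n

    def release(self, n: int) -> None:
        self.used_bytes = max(0, self.used_bytes - n)


breaker = MemoryCircuitBreaker(limit_bytes=1024)
breaker.reserve(512)
try:
    breaker.reserve(1024)
except CircuitBreakerError as err:
    print("query aborted:", err)
```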
200+
201+
#### Fault Tolerance
201+
202+
* For fast query processing, we sacrifice fault tolerance. Queries fail fast in case of hardware failure.
204+
205+
#### Setting
206+
207+
* Support configuring the percentage (x%) of disk that automatically created views may use.
208+
209+
### 3.3.Non-Functional Requirements
210+
211+
#### Security:
212+
213+
There are two types of privileges that are related to materialized views:
214+
215+
* Privileges directly on the materialized view itself
216+
* Privileges on the objects (e.g. table) that the materialized view accesses.
217+
218+
*_materialized view itself_*
219+
220+
* Users cannot directly configure any access control on a view, but they can configure access control on the index associated with the view. In other words, the materialized view inherits all the access control of the index associated with it.
221+
* When **automatically** **refreshing a view**, Maximus will use the backend role. This requires the backend role to have permission to index data.
222+
223+
*_objects (e.g. table) that the materialized view accesses_*
224+
225+
* As with non-materialized views, a user who wishes to access a materialized view needs privileges only on the view, not on the underlying object(s) that the view references.
226+
227+
*_table_*
228+
229+
* Users cannot configure any privileges on a table. *_Note: because a table query could be rewritten as a view query, a user who does not have the required permission to access the view could still get a no-permission exception._*
230+
231+
*_objects (e.g. table) that the table refers to_*
231+
232+
* The S3 access control is only evaluated during S3 access. If the table access is rewritten as view access, the S3 access control will not take effect.
234+
235+
*_Encryption_*
236+
237+
* Materialized view data will be encrypted at rest.
238+
239+
#### Others
240+
241+
* Easy to use: the solution should be designed to be easy to use. It should just work out of the box and provide good performance with minimal setup.
242+
* **Performance**: Use **http_log** dataset to benchmark with OpenSearch cluster.
243+
* Scalability: The solution should scale horizontally.
244+
* **Serverless**: The solution should be designed to be easily deployed as serverless infrastructure.
245+
* **Multi-tenant**: The solution should support multi-tenant use cases.
246+
* **Metrics**: Todo
247+
* **Log:** Todo
248+
249+
## 4.What is, and what is not
250+
251+
* We design and optimize for observability use cases only, not OLTP or OLAP.
252+
* We only support time series log data on S3. We do not support querying arbitrary data on S3.

docs/dev/intro-architecture.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
# OpenSearch SQL Engine Architecture
2+
3+
---
4+
## 1.Overview
5+
6+
The OpenSearch SQL (OD-SQL) project is based on the NLPChina project (https://github.com/NLPchina/elasticsearch-sql), which is now deprecated ([attributions](https://github.com/opensearch-project/sql/blob/main/docs/attributions.md)). Over one year of development, many features have been added to the OD-SQL project on top of the older NLPChina project. The purpose of this document is to explain the current OD-SQL architecture going forward.
7+
8+
---
9+
## 2.High Level View
10+
11+
At a high level, the OD-SQL engine can be divided into four major sub-modules.
12+
13+
* *Parser*: Currently, two Lex&Parsers coexist. The Druid Lex&Parser is the original one from NLPChina. The input AST of the Core Engine comes from the Druid Lex&Parser. The [ANTLR](https://github.com/opensearch-project/sql/blob/main/legacy/src/main/antlr/OpenSearchLegacySqlLexer.g4) Lex&Parser was added by us to customize the verification and exception handling.
14+
* *Analyzer*: The analyzer module takes the output from the ANTLR Lex&Parser and then performs syntax and semantic analysis.
15+
* *Core Engine*: The QueryAction takes the output from the Druid Lex&Parser and translates it to OpenSearch DSL if possible. This is an original NLPChina module. The QueryPlanner Builder was added by us to support JOIN and post-processing logic. The QueryPlanner takes the output from the Druid Lex&Parser and builds the PhysicalPlan.
16+
* *Execution*: The execution module executes the QueryAction or QueryPlanner and returns the response to the client. Unlike the Frontend, Analyzer, and Core Engine, which run on the transport thread and cannot perform any blocking operation, the Execution module runs on the client thread pool and can perform blocking operations.
17+
18+
Other modules are also included in the OD-SQL engine.
19+
20+
* _Documentation_: it is used to auto-generate documentation.
21+
* _Metrics_: it is used to collect OD-SQL related metrics.
22+
* _Resource Manager_: it is used to monitor memory consumption when performing join operations, to avoid impacting OpenSearch availability.
23+
24+
![Architecture Overview](img/architecture-overview.png)
25+
26+
---
27+
## 3.Journey of the query in OD-SQL engine.
28+
29+
The following diagram takes a sample query and explains how the query flows through the different modules.
30+
31+
![Architecture Journey](img/architecture-journey.png)
32+
33+
1. The ANTLR parser auto-generates the AST based on the grammar file (https://github.com/opensearch-project/sql/blob/main/legacy/src/main/antlr/OpenSearchLegacySqlParser.g4).
34+
2. The Syntax and Semantic Analyzer walks the AST and verifies whether the query follows the grammar and is supported by OD-SQL. For example, `SELECT * FROM semantics WHERE LOG(age, city) = 1` will throw an exception with the message `Function [LOG] cannot work with [INTEGER, KEYWORD].` and the sample usage message `Usage: LOG(NUMBER T) → DOUBLE`.
35+
3. The Druid Lex&Parser takes the input query and generates the Druid AST, which is different from the AST generated by ANTLR. This module is the open source library (https://github.com/alibaba/druid) originally used by NLPChina.
36+
4. The QueryPlanner Builder takes the AST as input and generates the LogicalPlan from it, then optimizes the LogicalPlan into a PhysicalPlan (in the current implementation, only the rule-based model is implemented). The major part of PhysicalPlan generation uses NLPChina's original logic to translate the SQL expressions in the AST to OpenSearch DSL.
37+
5. The QueryPlanner executor executes the PhysicalPlan in a worker thread.
38+
6. The formatter will reformat the response data to the required format. The default format is JDBC format.
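
The semantic check in step 2 can be pictured as comparing the argument types of a function call against the function's declared signature. The sketch below reproduces the `LOG` example from that step. The type names come from the error message in the text, while `NUMBER_TYPES`, `SemanticAnalysisError`, and `check_log_call` are hypothetical and are not the analyzer's real classes.

```python
# Assumed set of types that satisfy "NUMBER T" in the usage message.
NUMBER_TYPES = {"INTEGER", "LONG", "FLOAT", "DOUBLE"}


class SemanticAnalysisError(Exception):
    """Raised when a call does not match the function signature."""


def check_log_call(arg_types):
    """Return the result type of LOG(...) or raise with a usage hint."""
    if len(arg_types) == 1 and arg_types[0] in NUMBER_TYPES:
        return "DOUBLE"
    raise SemanticAnalysisError(
        f"Function [LOG] cannot work with [{', '.join(arg_types)}]. "
        "Usage: LOG(NUMBER T) -> DOUBLE"
    )


print(check_log_call(["INTEGER"]))
try:
    # LOG(age, city), where age is INTEGER and city is KEYWORD
    check_log_call(["INTEGER", "KEYWORD"])
except SemanticAnalysisError as err:
    print(err)
```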
39+

docs/dev/intro-v2-engine.md

Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
# SQL Engine V2 - Release Notes
2+
3+
---
4+
## 1.Motivations
5+
6+
The current SQL query engine gives users basic query capability using familiar SQL rather than the complex OpenSearch DSL. On top of NLPchina ES-SQL, many new features have been added, such as the semantic analyzer, semi-structured data query support, Hash Join, etc. However, as we looked into more advanced SQL features, challenges started emerging, especially in terms of correctness and extensibility (see [Attributions](../attributions.md)). After thoughtful consideration, we decided to develop a new query engine to address all the problems met so far.
7+
8+
9+
---
10+
## 2.What's New
11+
12+
With the architecture and extensibility improved significantly, the following SQL features are able to be introduced in the new query engine:
13+
14+
* **Language Structure**
15+
* [Identifiers](../../docs/user/general/identifiers.rst): added support for identifier names with special characters
16+
* [Data types](../../docs/user/general/datatypes.rst): added support for date and interval types
17+
* [Expressions](../../docs/user/dql/expressions.rst): complex nested expression support
18+
* [SQL functions](../../docs/user/dql/functions.rst): more date function support, `ADDDATE`, `DATE_ADD`, `DATE_SUB`, `DAY`, `DAYNAME`, `DAYOFMONTH`, `DAYOFWEEK`, `DAYOFYEAR`, `FROM_DAYS`, `HOUR`, `MICROSECOND`, `MINUTE`, `QUARTER`, `SECOND`, `SUBDATE`, `TIME`, `TIME_TO_SEC`, `TO_DAYS`, `WEEK`
19+
* [Comments](../../docs/user/general/comments.rst): SQL comment support
20+
* **Basic queries**
21+
* [HAVING without GROUP BY clause](../../docs/user/dql/aggregations.rst#having-without-group-by)
22+
* [Aggregate over arbitrary expression](../../docs/user/dql/aggregations.rst#expression)
23+
* [Ordering by NULLS FIRST/LAST](../../docs/user/dql/basics.rst#example-2-specifying-order-for-null)
24+
* [Ordering by aggregate function](../../docs/user/dql/basics.rst#example-3-ordering-by-aggregate-functions)
25+
* **Complex queries**
26+
* [Subqueries in FROM clause](../../docs/user/dql/complex.rst#example-2-subquery-in-from-clause): support arbitrary nesting level and aggregation
27+
* **Advanced Features**
28+
* [Window functions](../../docs/user/dql/window.rst): ranking and aggregate window functions
29+
* [Selective aggregation](../../docs/user/dql/aggregations.rst#filter-clause): by standard `FILTER` function
30+
* **Beyond SQL**
31+
* [Semi-structured data query](../../docs/user/beyond/partiql.rst#example-2-selecting-deeper-levels): support querying OpenSearch object fields on arbitrary level
32+
* OpenSearch multi-field: handled automatically, so users do not need to reference the sub-field themselves, e.g. `text` is converted to `text.keyword` if it is a multi-field
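
The multi-field handling in the last item can be sketched as a lookup in the index mapping: if a `text` field declares a `keyword` sub-field, the engine can substitute the sub-field name. The mapping layout below follows the standard OpenSearch `properties` / `fields` structure; `resolve_field` is a hypothetical helper, not the engine's actual code.

```python
def resolve_field(mapping: dict, field: str) -> str:
    """Return the keyword sub-field for a text multi-field, else the field itself."""
    spec = mapping.get("properties", {}).get(field, {})
    if spec.get("type") == "text":
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") == "keyword":
                return f"{field}.{sub_name}"
    return field


mapping = {
    "properties": {
        "city": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "age": {"type": "integer"},
    }
}
print(resolve_field(mapping, "city"))
print(resolve_field(mapping, "age"))
```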
33+
34+
As for correctness, besides full coverage of unit and integration tests, we developed a new comparison test framework to ensure correctness by comparing with other databases. Please find more details in [Testing](./testing-comparison-test.md).
35+
36+
37+
---
38+
## 3.What's Changed
39+
40+
### 3.1 Breaking Changes
41+
42+
Because the implementation changed internally, you can expect Explain output in a different format. For the query protocol, there are slight changes in the default response format:
43+
44+
* **Total**: The `total` field represented how many documents matched in total, no matter how many were returned (indicated by the `size` field). However, this field becomes meaningless because of post-processing on the DSL response in the new query engine. Thus, for now the total number is always the same as the `size` field.
45+
46+
### 3.2 Fallback Mechanism
47+
48+
For unsupported features, the query will be forwarded to the old query engine by the fallback mechanism. To avoid impact on your side, normally you won't see any difference in a query response. If you want to check whether and why your query falls back to the old SQL engine, please explain your query and check the OpenSearch log for "Request is falling back to old SQL engine due to ...".
49+
50+
For the following features unsupported in the new engine, the query will be forwarded to the old query engine and thus you cannot use new features listed above:
51+
52+
* **Cursor**: request with `fetch_size` parameter
53+
* **JSON response format**: was used to return OpenSearch DSL which is not accessible now. Replaced by default format in the new engine which is also in JSON.
54+
* **Nested field query**: including support for nested field queries
55+
* **JOINs**: including all types of JOIN queries
56+
* **OpenSearch functions**: fulltext search, metric and bucket functions
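
The fallback mechanism described in this section amounts to trying the new engine first and handing the query to the old engine when an unsupported feature is detected, logging the reason along the way. The sketch below illustrates that control flow only. `new_engine`, `legacy_engine`, `UnsupportedFeatureError`, and the JOIN check are hypothetical stand-ins; the log message text is taken from the section above.

```python
import logging

logger = logging.getLogger("sql.fallback")


class UnsupportedFeatureError(Exception):
    """Raised by the (hypothetical) new engine for features it cannot handle."""


def new_engine(query: str) -> dict:
    # Stand-in check: treat any JOIN as unsupported, as listed above.
    if " JOIN " in f" {query.upper()} ":
        raise UnsupportedFeatureError("JOIN is not supported")
    return {"engine": "v2", "query": query}


def legacy_engine(query: str) -> dict:
    return {"engine": "legacy", "query": query}


def execute(query: str) -> dict:
    """Try the new engine first and fall back to the old one if needed."""
    try:
        return new_engine(query)
    except UnsupportedFeatureError as err:
        logger.info("Request is falling back to old SQL engine due to %s", err)
        return legacy_engine(query)


print(execute("SELECT age FROM accounts")["engine"])
print(execute("SELECT * FROM a JOIN b ON a.id = b.id")["engine"])
```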
57+
58+
### 3.3 Limitations
59+
60+
You can find all the limitations in [Limitations](../../docs/user/limitations/limitations.rst).
61+
62+
63+
---
64+
## 4.How it's Implemented
65+
66+
If you're interested in the new query engine, please find more details in [Developer Guide](../../DEVELOPER_GUIDE.rst), [Architecture](./intro-architecture.md) and other docs in the dev folder.
67+
68+
69+
---
70+
## 5.What's Next
71+
72+
As mentioned above, some popular SQL features are still unsupported in the new query engine. In particular, the following items are on our roadmap with high priority:
73+
74+
1. Nested field queries
75+
2. JOIN support
76+
3. OpenSearch functions
