Commit 108ec64

sapienza88 and Selim Soufargi authored
[553] Parquet Schema and Column Stats Converters (#669)
* smaller PR for parquet
* read parquet file for metadataExtractor: compiling, not tested
* cleanups for statsExtractor: compiling, not tested
* refactoring for statsExtractor: compiling, not tested
* added avro dependency
* added tests for SchemaExtractor: int and string primitiveTypes test passes
* fixed some minor bugs in SchemaExtractor
* close fileReader and handle exception
* adjusted fromInternalSchema()
* added a test and adjusted SchemaExtractor
* added test code
* bug fix for SchemaExtractor: groupType
* bug fix for SchemaExtractor
* bug fix for tests
* bug fix for SchemaExtractor and added tests for nested lists support
* bug fix for tests for nested lists support
* bug fix for complex test which now passes!
* added test for Map
* SchemaExtractor refactored
* bug fixed isNullable() schema
* fromInternalSchema: list and map types
* decimal primitive test added
* float primitive + list and map tests for fromInternalSchema
* added tests for primitive types (date and timestamp)
* refactoring for partitionValues extractor
* git build error fixed
* cleanups for SchemaExtractor + refactoring for SchemaExtractor tests + added test code for statsExtractor
* added assertsEqual test for stats + removed partitionFields from the test, TODO check if field is needed in ColumnStats
* bug fixed for stats tests: columnStats + test data are read using FileReader
* bug fixed for stats tests, TODO equality test for two objects
* added compareFiles() in InternalDataFile for the statsExtractor tests to pass: OK
* added custom comparison test for ColumnStat and InternalDataFile, test passes, TODO: other stat types and other schema types testing
* added custom comparison test for ColumnStat (field) and exec spotless apply
* tempDir for parquet stats testing
* binaryStatistics test passes
* added int32 file schema test for statsExtractor
* cleanups + added fields comparison for InternalDataFile
* cleanups + added fixed_len_byte_array primitive type schema file test
* use of genericGetMax instead for stats extraction + cleanups
* boolean schema file test for statsExtractor added
* removed hard-coded path in statsExtractor test
* cleanups + imports
* separate tests for int and binary for stats
* custom equals() not needed for InternalDataFile and ColumnStat
* removed parquet version from core sub-project pom
* statsExtractor tests as a suite, removed comments + run spotless apply
* removed unnecessary classes
* removed unnecessary classes: undo
* undo irrelevant changes
* fixed formatting issues with spotless:apply cmd
* cleanups for test class and fixes for failed build
* tmp file name fixed for failed build
* cleanups
* spotless apply run + assertion internalDataFile equality changed to display errors
* fixes for build, PhysicalPath and BinaryStats
* fixes for build, PhysicalPath and BinaryStats + synced fork
* fixes for build, PhysicalPath and BinaryStats + synced fork
* fixes for build and cleanups
* fixes for build and cleanups
* Parquet dep set as provided to use Spark's
* parquet dep version back to 1.15.1
* parquet-avro moved from core to project's pom
* parquet-avro moved after hadoop-common
* parquet dep scope removed
* run spotless:apply

---------

Co-authored-by: Selim Soufargi <ssoufargi.idealab.unical@gmail.com>
1 parent fc4d6e8 commit 108ec64

14 files changed

Lines changed: 1533 additions & 2 deletions
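For orientation before the file-by-file diff: the converters in this PR work off the Parquet file footer, which carries both the file schema and per-column statistics. Below is a minimal sketch of that footer walk, using the ParquetMetadataExtractor added in this commit plus standard parquet-mr metadata classes; the wrapper class name and file path are hypothetical, and the actual stats converter maps these values into XTable's column-stat model rather than printing them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

import org.apache.xtable.parquet.ParquetMetadataExtractor;

// Illustrative sketch only; not part of the commit.
public class ParquetFooterWalkthrough {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical path to a local Parquet file.
    Path filePath = new Path("/tmp/sample.parquet");

    // Read the footer once; schema and column statistics both come from it.
    ParquetMetadata footer = ParquetMetadataExtractor.readParquetMetadata(conf, filePath);

    // Each row group (block) carries metadata for every column chunk it contains.
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        System.out.printf(
            "%s: values=%d, nulls=%d, min=%s, max=%s%n",
            column.getPath().toDotString(),
            column.getValueCount(),
            column.getStatistics().getNumNulls(),
            column.getStatistics().genericGetMin(),
            column.getStatistics().genericGetMax());
      }
    }
  }
}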

xtable-api/src/main/java/org/apache/xtable/conversion/ExternalTable.java

Lines changed: 4 additions & 0 deletions
@@ -34,12 +34,16 @@
class ExternalTable {
  /** The name of the table. */
  protected final @NonNull String name;
+
  /** The format of the table (e.g. DELTA, ICEBERG, HUDI) */
  protected final @NonNull String formatName;
+
  /** The path to the root of the table or the metadata directory depending on the format */
  protected final @NonNull String basePath;
+
  /** Optional namespace for the table */
  protected final String[] namespace;
+
  /** The configuration for interacting with the catalog that manages this table */
  protected final CatalogConfig catalogConfig;

xtable-api/src/main/java/org/apache/xtable/model/catalog/ThreePartHierarchicalTableIdentifier.java

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ public class ThreePartHierarchicalTableIdentifier implements HierarchicalTableId
   * name varies depending on the catalogType.
   */
  String catalogName;
+
  /**
   * Catalogs have the ability to group tables logically, databaseName is the identifier for such
   * logical classification. The alternate names for this field include namespace, schemaName etc.

xtable-api/src/main/java/org/apache/xtable/model/schema/InternalField.java

Lines changed: 2 additions & 0 deletions
@@ -43,9 +43,11 @@ public class InternalField {
  // The id field for the field. This is used to identify the field in the schema even after
  // renames.
  Integer fieldId;
+
  // represents the fully qualified path to the field (dot separated)
  @Getter(lazy = true)
  String path = createPath();
+
  // splits the dot separated path into parts
  @Getter(lazy = true)
  String[] pathParts = splitPath();

xtable-api/src/main/java/org/apache/xtable/model/schema/InternalSchema.java

Lines changed: 2 additions & 1 deletion
@@ -75,7 +75,8 @@ public enum MetadataKey {

  public enum MetadataValue {
    MICROS,
-    MILLIS
+    MILLIS,
+    NANOS
  }

  public static final String XTABLE_LOGICAL_TYPE = "xtableLogicalType";
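The new NANOS value lines up with Parquet's timestamp logical type, which can carry nanosecond precision. As a hypothetical illustration (not code from this commit), a mapping from parquet-mr's LogicalTypeAnnotation.TimeUnit onto this enum could look like:

import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;

import org.apache.xtable.model.schema.InternalSchema;

// Hypothetical helper, shown only to illustrate where the new NANOS value fits.
public class TimestampPrecisionMapping {

  static InternalSchema.MetadataValue toMetadataValue(TimeUnit unit) {
    switch (unit) {
      case MILLIS:
        return InternalSchema.MetadataValue.MILLIS;
      case MICROS:
        return InternalSchema.MetadataValue.MICROS;
      case NANOS:
        return InternalSchema.MetadataValue.NANOS;
      default:
        throw new IllegalArgumentException("Unsupported time unit: " + unit);
    }
  }
}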

xtable-core/pom.xml

Lines changed: 5 additions & 1 deletion
@@ -56,7 +56,7 @@
      <artifactId>guava</artifactId>
    </dependency>

-    <!-- Avro -->
+
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
@@ -116,6 +116,10 @@
      <artifactId>hadoop-common</artifactId>
      <scope>provided</scope>
    </dependency>
+    <dependency>
+      <groupId>org.apache.parquet</groupId>
+      <artifactId>parquet-avro</artifactId>
+    </dependency>

    <!-- Logging API -->
    <dependency>

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetMetadataExtractor.java

Lines changed: 63 additions & 0 deletions

@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import java.io.IOException;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.ParquetReadOptions;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.exception.ReadException;
+
+public class ParquetMetadataExtractor {
+
+  private static final ParquetMetadataExtractor INSTANCE = new ParquetMetadataExtractor();
+
+  public static ParquetMetadataExtractor getInstance() {
+    return INSTANCE;
+  }
+
+  public static MessageType getSchema(ParquetMetadata footer) {
+    MessageType schema = footer.getFileMetaData().getSchema();
+    return schema;
+  }
+
+  public static ParquetMetadata readParquetMetadata(Configuration conf, Path filePath) {
+    InputFile file = null;
+    try {
+      file = HadoopInputFile.fromPath(filePath, conf);
+    } catch (IOException e) {
+      throw new ReadException("Failed to read the parquet file", e);
+    }
+
+    ParquetReadOptions options = HadoopReadOptions.builder(conf, filePath).build();
+    try (ParquetFileReader fileReader = ParquetFileReader.open(file, options)) {
+      return fileReader.getFooter();
+    } catch (Exception e) {
+      throw new ReadException("Failed to read the parquet file", e);
+    }
+  }
+}
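
A minimal usage sketch for the class above, assuming a hypothetical local file path: read the footer once, then pull the Parquet MessageType out of it (the schema converter in this PR translates that MessageType into the internal schema).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

import org.apache.xtable.parquet.ParquetMetadataExtractor;

// Illustrative usage only; not part of the commit.
public class ParquetSchemaReadExample {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical path to a local Parquet file.
    Path filePath = new Path("/tmp/sample.parquet");

    // The extractor reads the footer and exposes the file schema from it.
    ParquetMetadata footer = ParquetMetadataExtractor.readParquetMetadata(conf, filePath);
    MessageType schema = ParquetMetadataExtractor.getSchema(footer);
    System.out.println(schema);
  }
}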
