feat: add Hive2Namespace implementation for Python#199

Merged
jackye1995 merged 6 commits into lance-format:main from jackye1995:python-hive on Aug 25, 2025

@jackye1995

Collaborator

Summary

  • Implements a Hive2 namespace adapter for the lance-namespace Python client
  • Adds integration with Apache Hive Metastore via the Thrift protocol
  • Enables Lance table management through Hive Metastore

Changes

Core Implementation

  • python/lance_namespace/src/lance_namespace/hive.py: Complete Hive2Namespace implementation with:
    • HiveMetastoreClient helper class for Thrift connection management
    • Full namespace operations (list, create, drop, describe)
    • Complete table operations (register, create, drop, deregister, describe, list)
    • Schema conversion between PyArrow and Hive formats
    • Support for optional UGI authentication
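For reviewers unfamiliar with the schema conversion step, here is a minimal sketch of what an Arrow-to-Hive type mapping can look like. It is not the code in hive.py: the `ARROW_TO_HIVE` table and `to_hive_columns` helper are hypothetical names, the mapping covers only a few primitive types, and it is keyed by the string form of each Arrow type (for example `str(pa.int64())` is `"int64"`) so that it runs without PyArrow installed.

```python
# Hypothetical mapping from Arrow type names to Hive column type names.
# This is an illustrative subset, not the mapping used in this PR.
ARROW_TO_HIVE = {
    "bool": "boolean",
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "float": "float",
    "double": "double",
    "string": "string",
    "large_string": "string",
    "binary": "binary",
    "date32[day]": "date",
}


def to_hive_columns(fields):
    """Convert (name, arrow_type_name) pairs into (name, hive_type) pairs."""
    columns = []
    for name, arrow_type in fields:
        if arrow_type.startswith("timestamp["):
            # Arrow timestamps carry a unit, e.g. "timestamp[us]".
            hive_type = "timestamp"
        else:
            hive_type = ARROW_TO_HIVE.get(arrow_type)
        if hive_type is None:
            raise ValueError(f"unsupported Arrow type for column {name!r}: {arrow_type}")
        columns.append((name, hive_type))
    return columns
```

With the table from the usage example, `to_hive_columns([("col1", "int64"), ("col2", "string")])` yields `bigint` and `string` columns.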

Dependencies

  • python/lance_namespace/pyproject.toml: Added optional hive2 extra with:
    • thrift>=0.13.0
    • hive-metastore-client>=1.0.9
    • Install with: pip install 'lance-namespace[hive2]'
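Because the Hive dependencies are an optional extra, the backend has to fail clearly when they are missing. A common pattern for that is sketched below; `require_optional` is a hypothetical helper and not necessarily how hive.py guards its imports.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this backend. "
            f"Install it with: pip install 'lance-namespace[{extra}]'"
        ) from exc
```

Calling this at construction time rather than at module import keeps `import lance_namespace` working for users who never touch the Hive backend.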

Tests

  • python/lance_namespace/tests/test_hive.py: Comprehensive test suite covering:
    • Initialization with and without dependencies
    • All namespace and table operations
    • Error handling and edge cases
    • Schema conversion utilities

Documentation

  • docs/src/impls/hive.md: Added Python-specific documentation with usage examples
  • python/lance_namespace/README.md: Updated with Hive2 backend instructions

Registration

  • python/lance_namespace/src/lance_namespace/namespace.py: Registered hive2 implementation

Usage Example

import io

import pyarrow as pa

from lance_namespace import CreateTableRequest, ListNamespacesRequest, connect

# Connect to Hive Metastore
namespace = connect("hive2", {
    "uri": "thrift://localhost:9083",
    # Storage root location; a later commit in this PR renamed this key from "warehouse" to "root"
    "root": "/user/hive/warehouse",
    "ugi": "user:group1,group2",  # Optional
})

# List databases
response = namespace.list_namespaces(ListNamespacesRequest())

# Create a Lance table: serialize the Arrow data as an IPC stream first
data = pa.table({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
buf = io.BytesIO()
with pa.ipc.new_stream(buf, data.schema) as writer:
    writer.write_table(data)

request = CreateTableRequest(
    id=["database", "table_name"],
    mode="create",
)
response = namespace.create_table(request, buf.getvalue())
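The `uri` and `ugi` values in the example are plain strings that need to be split before a Thrift connection can be opened. A rough sketch of that parsing follows. Both helpers are hypothetical, and the fallback port 9083 (the conventional Hive Metastore port) is an assumption rather than documented behavior of this implementation.

```python
from urllib.parse import urlparse


def parse_metastore_uri(uri: str, default_port: int = 9083):
    """Split a thrift:// URI into (host, port). The default port is an assumption."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname:
        raise ValueError(f"expected a thrift://host:port URI, got {uri!r}")
    return parsed.hostname, parsed.port or default_port


def parse_ugi(ugi: str):
    """Split a 'user:group1,group2' string into (user, [groups])."""
    user, _, groups = ugi.partition(":")
    return user, [g for g in groups.split(",") if g]
```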

Test Plan

  • Unit tests for all Hive2Namespace operations
  • Mock-based testing for Hive client interactions
  • Integration tests with actual Hive Metastore (requires HMS setup)
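To illustrate the mock-based approach, the sketch below replaces the metastore client with a `unittest.mock.MagicMock`. The function under test, `list_lance_table_names`, is a hypothetical stand-in written for this example and not a method of Hive2Namespace. The client methods `get_all_tables` and `get_table` follow the Hive Metastore Thrift service naming, but since the client is mocked nothing here talks to a real metastore.

```python
from unittest.mock import MagicMock


def list_lance_table_names(client, database):
    """Hypothetical helper: return names of tables whose table_type is 'lance'."""
    names = []
    for name in client.get_all_tables(database):
        table = client.get_table(database, name)
        if str((table.parameters or {}).get("table_type", "")).lower() == "lance":
            names.append(name)
    return names


def make_fake_client():
    """Build a mock client exposing one Lance table and one non-Lance table."""
    tables = {
        "events": MagicMock(parameters={"table_type": "LANCE"}),
        "legacy": MagicMock(parameters={"table_type": "EXTERNAL_TABLE"}),
    }
    client = MagicMock()
    client.get_all_tables.return_value = list(tables)
    client.get_table.side_effect = lambda db, name: tables[name]
    return client
```

A test can then assert both the result and the calls made, for example `client.get_all_tables.assert_called_once_with("db")`.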

🤖 Generated with Claude Code

github-actions bot added the enhancement (New feature or request) and python (Python features) labels on Aug 23, 2025
jackye1995 force-pushed the python-hive branch 2 times, most recently from 2cf68fb to e6d96b3 on August 24, 2025 03:02
jackye1995 and others added 2 commits August 23, 2025 20:05
Implements a Hive2 namespace adapter for lance-namespace Python client
that integrates with Apache Hive Metastore.

Key changes:
- Add optional hive2 dependencies in pyproject.toml
- Implement Hive2Namespace class with full namespace and table operations
- Add shared utils module for PyArrow to JSON schema conversion
- Add comprehensive test suite with mocked Hive client
- Register hive2 implementation in namespace factory

The implementation:
- Connects to Hive Metastore via Thrift protocol
- Manages Lance tables as external tables in Hive
- Supports all namespace operations (list, create, drop, describe)
- Supports all table operations (register, create, drop, query)
- Converts between PyArrow and Hive schemas
- Includes comprehensive docstring with usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Updates the Hive2Namespace implementation to be consistent with the
documented specification in hive.md:

Configuration changes:
- Use 'root' instead of 'warehouse' for storage root location
- Add 'ugi' to configuration properties documentation
- Support 'client.pool-size' and 'storage.*' properties
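A sketch of how a flat properties map with these keys might be split into typed settings is shown below. `split_properties` is a hypothetical helper. The default pool size of 1 and the choice to strip the `storage.` prefix before passing options on are assumptions, not behavior confirmed by this commit.

```python
def split_properties(properties, default_pool_size=1):
    """Separate connection settings from storage.* options (hypothetical helper)."""
    prefix = "storage."
    storage_options = {
        key[len(prefix):]: value
        for key, value in properties.items()
        if key.startswith(prefix)
    }
    return {
        "root": properties.get("root"),
        "ugi": properties.get("ugi"),
        # default_pool_size is an arbitrary placeholder, not the real default.
        "pool_size": int(properties.get("client.pool-size", default_pool_size)),
        "storage_options": storage_options,
    }
```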

Root namespace handling:
- list_namespaces: Only list from root namespace
- describe_namespace: Support describing root namespace
- create_namespace: Reject creating root (already exists)
- drop_namespace: Reject dropping root namespace
- namespace_exists: Root namespace always exists
- list_tables: Return empty list for root namespace
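The root namespace rules above can be sketched with a small in-memory model. `ToyHiveNamespaces` is purely illustrative. It assumes the root is identified by an empty id list and that listing from a non-root parent returns nothing, neither of which is spelled out in this commit message.

```python
class ToyHiveNamespaces:
    """In-memory stand-in for a metastore; the empty id [] is treated as root."""

    def __init__(self):
        self._databases = {}  # database name -> list of table names

    def namespace_exists(self, ns_id):
        # The root namespace always exists.
        return not ns_id or (len(ns_id) == 1 and ns_id[0] in self._databases)

    def create_namespace(self, ns_id):
        if not ns_id:
            raise ValueError("root namespace already exists")
        if len(ns_id) != 1 or ns_id[0] in self._databases:
            raise ValueError(f"cannot create namespace {ns_id!r}")
        self._databases[ns_id[0]] = []

    def drop_namespace(self, ns_id):
        if not ns_id:
            raise ValueError("root namespace cannot be dropped")
        del self._databases[ns_id[0]]

    def list_namespaces(self, parent_id):
        # Databases are listed only from the root (empty result otherwise is an assumption).
        return sorted(self._databases) if not parent_id else []

    def list_tables(self, ns_id):
        # The root holds databases, not tables.
        return [] if not ns_id else list(self._databases[ns_id[0]])
```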

Table metadata:
- Use 'table_type' key (not 'lance.table_type') per spec
- Set 'managed_by' property (default: 'storage')
- Use case-insensitive matching for 'lance' table type
- Include 'version' key for table version tracking
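The case-insensitive table type check can be expressed in a couple of lines. The sketch below uses a hypothetical `is_lance_table` helper operating on a plain dict of Hive table parameters.

```python
def is_lance_table(parameters):
    """Return True if the Hive table parameters mark the table as a Lance table."""
    table_type = (parameters or {}).get("table_type")
    # Match 'lance', 'LANCE', 'Lance', ... and tolerate a missing key.
    return str(table_type or "").lower() == "lance"
```

Note that the older `lance.table_type` key is deliberately not consulted here, in line with the key rename described above.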

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
jackye1995 and others added 2 commits August 25, 2025 15:12
Modified describe_table method to only return Hive metadata without opening
the Lance dataset. This makes the operation more lightweight and faster.

Changes:
- Remove dataset opening logic from describe_table
- Return schema as None (var_schema field)
- Parse version from Hive parameters instead of dataset
- Add test to verify the new behavior
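As a rough illustration of the lighter describe_table path, the sketch below builds a description from Hive metadata alone. `parse_version` and `describe_from_hive` are hypothetical helpers, and the returned dict is a simplification of the real response model.

```python
def parse_version(parameters):
    """Read the table version from Hive parameters; None if absent or malformed."""
    raw = (parameters or {}).get("version")
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def describe_from_hive(location, parameters):
    """Describe a table using only metastore data, without opening the dataset."""
    return {
        "location": location,
        "version": parse_version(parameters),
        "schema": None,  # schema is no longer read from the Lance dataset
    }
```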

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Standardized the input/output formats and SerDe configuration for Lance tables
in Hive Metastore across both Python and Java implementations:

- Set inputFormat to com.lancedb.lance.mapred.LanceInputFormat
- Set outputFormat to com.lancedb.lance.mapred.LanceOutputFormat
- Set serializationLib to com.lancedb.lance.mapred.LanceSerDe

This ensures consistency when tables are registered from either implementation.
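One way to keep these three class names consistent is to define them once and build the storage descriptor fields from them. The sketch below uses plain dicts whose keys mirror the Hive Thrift `StorageDescriptor` and `SerDeInfo` field names; the helper name `lance_storage_descriptor_fields` is hypothetical and the actual code may construct Thrift objects directly.

```python
LANCE_INPUT_FORMAT = "com.lancedb.lance.mapred.LanceInputFormat"
LANCE_OUTPUT_FORMAT = "com.lancedb.lance.mapred.LanceOutputFormat"
LANCE_SERDE = "com.lancedb.lance.mapred.LanceSerDe"


def lance_storage_descriptor_fields(location):
    """Return storage descriptor fields for a Lance table as a plain dict."""
    return {
        "location": location,
        "inputFormat": LANCE_INPUT_FORMAT,
        "outputFormat": LANCE_OUTPUT_FORMAT,
        "serdeInfo": {"serializationLib": LANCE_SERDE},
    }
```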

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
github-actions bot added the java (Java features) label on Aug 25, 2025
jackye1995 and others added 2 commits August 25, 2025 15:26
…ged_by

Updated register_table to follow proper versioning semantics:
- Only set version parameter when managed_by is "impl"
- When managed_by is "storage" (default), version is not tracked in Hive
- Removed unnecessary "EXTERNAL": "TRUE" parameter

This aligns with the specification where version tracking is only needed
when the table is implementation-managed.
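The versioning rule can be sketched as a small function that builds the Hive table parameters. `build_table_parameters` is a hypothetical name, and storing the version as a string is an assumption based on Hive parameters being string-valued.

```python
def build_table_parameters(managed_by="storage", version=None):
    """Build Hive table parameters for a registered Lance table (illustrative)."""
    params = {"table_type": "lance", "managed_by": managed_by}
    # Version is tracked in Hive only for implementation-managed tables.
    if managed_by == "impl" and version is not None:
        params["version"] = str(version)
    return params
```

With the default `managed_by="storage"`, a passed `version` is ignored and no `EXTERNAL` parameter is emitted.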

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jackye1995

Collaborator Author

looks good to me

jackye1995 merged commit 8022160 into lance-format:main on Aug 25, 2025
6 checks passed
Labels

enhancement (New feature or request), java (Java features), python (Python features)