Skip to content

Add getEstimatedSize API to OutputCodec#5593

Merged
kkondaka merged 3 commits into
opensearch-project:mainfrom
kkondaka:output-codec-size
Apr 10, 2025
Merged

Add getEstimatedSize API to OutputCodec#5593
kkondaka merged 3 commits into
opensearch-project:mainfrom
kkondaka:output-codec-size

Conversation

@kkondaka
Copy link
Copy Markdown
Collaborator

Description

Add getEstimatedSize API to OutputCodec.
Returns exact or slightly more size of an event when passed through the OutputCodec.

Issues Resolved

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [X ] New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • [X ] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Copy link
Copy Markdown
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.

At a high-level, I think we can get a better size estimate with the refactoring I suggested because it will account for what is already there.

* @return int size of the serialized event
* @throws IOException throws IOException when invalid input is received or not able to create wrapping
*/
int getEstimatedSize(Event event) throws IOException;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return a long.

writer.write(getParquetRecord(event));
}

private int getSizeInBytes(GenericRecord record) throws IOException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure that this will be accurate. But, I'm not sure there is anyway to really do this correctly in Parquet.

* @return int size of the serialized event
* @throws IOException throws IOException when invalid input is received or not able to create wrapping
*/
int getEstimatedSize(Event event) throws IOException;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need the SinkContext here.

* @return int size of the serialized event
* @throws IOException throws IOException when invalid input is received or not able to create wrapping
*/
int getEstimatedSize(Event event) throws IOException;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could have this as a default implementation and remove the concrete implementations for most of the sub-classes?

ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
start(outputStream, event, sinkContext);
write(event);
return outputStream.toByteArray().length;

* @return int size of the serialized event
* @throws IOException throws IOException when invalid input is received or not able to create wrapping
*/
int getEstimatedSize(Event event) throws IOException;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If you are really ambitious, I've thought about a refactoring in the past, that could also make this more accurate.

The OutputCodec is currently designed as a single codec for all OutputStreams. However, it would make more sense for OutputCodec to be more like a factory.

public interface OutputCodec {
  public interface BoundToStreamCodec {
    void writeEvent(Event event);
    void complete();
    int getEstimatedSize();
  }
  codecFor(final OutputStream outputStream, Event event, final OutputCodecContext codecContext); // replaces start();
}

The callers use it like this:

BoundToStreamCodec codec = outputCodec.codecFor(outputStream, event, codecContext); // replaces .start()
...
for(...) {
  codec.writeEvent(event); // it already has an OutputStream so need to pass again.
}
...
codec.complete();

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
@kkondaka kkondaka merged commit 6fa2d7c into opensearch-project:main Apr 10, 2025
amdhing pushed a commit to amdhing/data-prepper that referenced this pull request Apr 16, 2025
* Add getEstimatedSize API to OutputCodec

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Fixed build failure

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Davidding4718 pushed a commit to Davidding4718/data-prepper that referenced this pull request Apr 25, 2025
* Add getEstimatedSize API to OutputCodec

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Fixed build failure

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Davidding4718 pushed a commit to Davidding4718/data-prepper that referenced this pull request Apr 25, 2025
* Add getEstimatedSize API to OutputCodec

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Fixed build failure

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
@kkondaka kkondaka deleted the output-codec-size branch July 1, 2025 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants