HIVE-29524: Missing num_nulls statistic for partition columns #6410
tanishq-chugh wants to merge 2 commits into
zabetak
left a comment
Thanks for the PR, @tanishq-chugh! I left some refactoring suggestions. Apart from that, it seems that some .q.out files need to be updated.
    private static long getNumNullsForPartCol(PartitionIterable partitions, String partColName, HiveConf conf) {
      long numNulls = 0;
      String defaultPartitionName = HiveConf.getVar(conf, HiveConf.ConfVars.DEFAULT_PARTITION_NAME);
      for (Partition partition : partitions) {
        String partVal = partition.getSpec().get(partColName);
        if (partVal != null && partVal.equals(defaultPartitionName)) {
          Map<String, String> parameters = partition.getParameters();
          if (parameters != null && parameters.get(StatsSetupConst.ROW_COUNT) != null) {
            long rowCount = Long.parseLong(parameters.get(StatsSetupConst.ROW_COUNT));
            if (rowCount > 0) {
              numNulls = safeAdd(numNulls, rowCount);
            }
          }
        }
      }
      return numNulls;
    }
I am wondering if we could take advantage of the existing StatsUtils#getNumRows method to some extent. At the very least, we may be able to reuse some existing classes such as org.apache.hadoop.hive.ql.stats.BasicStats.
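For readers who want to try the counting logic outside of Hive, here is a minimal standalone sketch of what the snippet under review appears to do: sum the row counts of the partitions whose value for the column is the default partition name. It is not the patch itself. `PartitionInfo`, `saturatingAdd`, and `numNullsForPartCol` are hypothetical names; the two string constants are assumed to match Hive's default for hive.exec.default.partition.name and the key behind StatsSetupConst.ROW_COUNT; and `saturatingAdd` only assumes that the `safeAdd` call in the snippet clamps on overflow. Skipping unparsable row counts is an addition of this sketch, not something the snippet does.

```java
import java.util.List;
import java.util.Map;

public class PartitionNullCountSketch {

  // Assumed values, hard-coded for illustration only.
  public static final String DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__";
  public static final String ROW_COUNT = "numRows";

  /** Hypothetical stand-in for a Hive Partition: its spec and its parameters. */
  public static final class PartitionInfo {
    final Map<String, String> spec;
    final Map<String, String> parameters; // may be null

    public PartitionInfo(Map<String, String> spec, Map<String, String> parameters) {
      this.spec = spec;
      this.parameters = parameters;
    }
  }

  /** Adds two non-negative longs, clamping to Long.MAX_VALUE instead of overflowing. */
  public static long saturatingAdd(long a, long b) {
    try {
      return Math.addExact(a, b);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }

  /** Sums the row counts of all partitions whose value for partColName is the default partition. */
  public static long numNullsForPartCol(List<PartitionInfo> partitions, String partColName) {
    long numNulls = 0;
    for (PartitionInfo partition : partitions) {
      // Only the default partition holds rows whose partition column is NULL.
      if (!DEFAULT_PARTITION_NAME.equals(partition.spec.get(partColName))) {
        continue;
      }
      String raw = partition.parameters == null ? null : partition.parameters.get(ROW_COUNT);
      if (raw == null) {
        continue; // no basic stats recorded for this partition
      }
      long rowCount;
      try {
        rowCount = Long.parseLong(raw.trim());
      } catch (NumberFormatException e) {
        continue; // sketch-only choice: ignore malformed values instead of failing
      }
      if (rowCount > 0) { // negative values are treated as "unknown" and skipped
        numNulls = saturatingAdd(numNulls, rowCount);
      }
    }
    return numNulls;
  }

  public static void main(String[] args) {
    List<PartitionInfo> partitions = List.of(
        new PartitionInfo(Map.of("year", "2023"), Map.of(ROW_COUNT, "100")),
        new PartitionInfo(Map.of("year", DEFAULT_PARTITION_NAME), Map.of(ROW_COUNT, "7")),
        new PartitionInfo(Map.of("year", DEFAULT_PARTITION_NAME), Map.of(ROW_COUNT, "5")),
        new PartitionInfo(Map.of("year", DEFAULT_PARTITION_NAME), Map.of(ROW_COUNT, "-1")),
        new PartitionInfo(Map.of("year", DEFAULT_PARTITION_NAME), null));
    System.out.println(numNullsForPartCol(partitions, "year")); // prints 12
  }
}
```

Only the two default partitions with positive row counts contribute (7 + 5); the regular partition, the one with a negative count, and the one without parameters are all skipped.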
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.



What changes were proposed in this pull request?
The num_nulls statistic should be computed for partition columns.
Why are the changes needed?
Currently, the num_nulls statistic is not populated for partition columns and is always zero. This gives the user wrong information, and any estimate that relies on ColStatistics.getNumNulls is inaccurate as well.
Does this PR introduce any user-facing change?
Yes. The num_nulls statistic, which was previously not populated and always defaulted to zero, will now be computed correctly and be visible to the user.
How was this patch tested?
Manual Testing & Qtest