-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use statistics in Faker CTAS #24098
base: master
Are you sure you want to change the base?
Use statistics in Faker CTAS #24098
Conversation
When creating a table in the Faker connector from an existing table, gather column statistics to determine range constraints, and create a view that uses them as predicates.
|
||
public class FakerMetadata | ||
implements ConnectorMetadata | ||
{ | ||
public static final String SCHEMA_NAME = "default"; | ||
public static final String ROW_ID_COLUMN_NAME = "$row_id"; | ||
private static final String VIEW_SUFFIX = "_view"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be configurable, but I need suggestions for config option name, default value, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually it feels like the name user provided should point to view, and the table should have a suffix as it is kinda an implementaiton detail.
Maybe if user does CREATE TABLE xyz AS ...
you can create table xyz$raw
and a view xyz
which references xyz$raw
. Or do you think it will to too much of WTF for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be too implicit. Maybe if we'd use some procedure instead of CTAS
. The view is also supposed to be an example or a starting point - it can be recreated with additional predicates, better projections, etc.
return quoted(name); | ||
} | ||
// distinct values is an approximation | ||
if ((double) distinctValues.get(name) / rowCount.get() < 0.98) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The threshold might be configurable
while (page == null) { | ||
page = pageSource.getNextPage(); | ||
} | ||
checkState(maxDictionarySize <= page.getPositionCount(), "Max dictionary size cannot be greater than %d".formatted(page.getPositionCount())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a @Max()
in the configuration and schema/table property validators?
When creating a table in the Faker connector from an existing table, using column statistics determine unique columns, and use a sequence for them, instead of random values, in the view associated with that table.
When creating a table in the Faker connector from an existing table, using column statistics determine low cardinality columns, and generate values from a randomly generated set.
46c3f8e
to
ae38700
Compare
@@ -297,7 +327,7 @@ public synchronized FakerOutputTableHandle beginCreateTable(ConnectorSession ses | |||
double schemaNullProbability = (double) schema.properties().getOrDefault(SchemaInfo.NULL_PROBABILITY_PROPERTY, nullProbability); | |||
double tableNullProbability = (double) tableMetadata.getProperties().getOrDefault(TableInfo.NULL_PROBABILITY_PROPERTY, schemaNullProbability); | |||
|
|||
ImmutableList.Builder<ColumnInfo> columns = ImmutableList.builder(); | |||
ImmutableList.Builder<ColumnInfo> columnsBuilder = ImmutableList.builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert?
public synchronized Map<SchemaTableName, ConnectorViewDefinition> getViews(ConnectorSession session, Optional<String> schemaName) | ||
{ | ||
SchemaTablePrefix prefix = schemaName.map(SchemaTablePrefix::new).orElseGet(SchemaTablePrefix::new); | ||
return ImmutableMap.copyOf(Maps.filterKeys(views, prefix::matches)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: static import filterKeys
{ | ||
checkArgument(viewProperties.isEmpty(), "This connector does not support creating views with properties"); | ||
checkSchemaExists(viewName.getSchemaName()); | ||
if (views.containsKey(viewName) && !replace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need also to check if such table exists
view.getCatalog(), | ||
view.getSchema(), | ||
view.getColumns().stream() | ||
.map(currentViewColumn -> Objects.equals(columnName, currentViewColumn.getName()) ? new ConnectorViewDefinition.ViewColumn(currentViewColumn.getName(), currentViewColumn.getType(), comment) : currentViewColumn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use columnName.equals
not Objects.equals
|
||
return Optional.empty(); | ||
} | ||
|
||
private synchronized void createViewWithPredicates(SchemaTableName tableName, List<ColumnInfo> columns, Collection<ComputedStatistics> computedStatistics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you call out to standard view creation logic to reuse already-exists checks?
if (!type.equals(BIGINT) || !minimums.containsKey(name)) { | ||
return quoted(name); | ||
} | ||
if ((double) distinctValues.get(name) / rowCount.get() < 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not get why are you using < 1
here.
Should it be rather something like abs(1.0 - (double) distinctValues.get(name) / rowCount.get()) < some_epsilon
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must have been late when I wrote this, this shows I need better test cases.
@@ -425,6 +425,15 @@ public synchronized Optional<ConnectorOutputMetadata> finishCreateTable( | |||
TableInfo info = tables.get(tableName); | |||
requireNonNull(info, "info is null"); | |||
|
|||
if (!computedStatistics.isEmpty() && !info.properties().containsKey(TableInfo.DEFAULT_LIMIT_PROPERTY)) { | |||
ComputedStatistics statistic = computedStatistics.stream().findFirst().orElseThrow(); | |||
long rowCount = (long) getValue(BIGINT, 0, statistic.getTableStatistics().get(TableStatisticType.ROW_COUNT)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: static import ROW_COUNT
long rowCount = (long) getValue(BIGINT, 0, statistic.getTableStatistics().get(TableStatisticType.ROW_COUNT)); | ||
info = info.withProperties(ImmutableMap.<String, Object>builder() | ||
.putAll(info.properties()) | ||
.put(TableInfo.DEFAULT_LIMIT_PROPERTY, rowCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: static import DEFAULT_LIMIT_PROPERTY
longProperty( | ||
TableInfo.MAX_DICTIONARY_SIZE, | ||
""" | ||
Maximum size of randomly generated dictionaries to pick values from, used for columns with low number of approximate distinct values | ||
observed during table creation using existing data. Set to zero to disable using dictionaries""", | ||
null, | ||
maxDictionarySize -> checkProperty(0 <= maxDictionarySize, INVALID_TABLE_PROPERTY, "max_dictionary_size value must be equal or greater than 0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This property looks weird on table level. Should you rather have per-column property saying if the column is dictionary or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows disabling dictionaries for an individual table. Column properties cannot be set when running a CTAS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Added some comments. Feel free to ignore most
Description
Use statistics when using
CREATE TABLE AS SELECT
in the Faker connector to:default_limit
table property to the estimated number of rows from the source tableBIGINT
columns and use sequences for themAdditional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: