Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use statistics in Faker CTAS #24098

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

nineinchnick
Copy link
Member

@nineinchnick nineinchnick commented Nov 11, 2024

Description

Use statistics when using CREATE TABLE AS SELECT in the Faker connector to:

  • set the default_limit table property to the estimated number of rows from the source table
  • create a view to read from the table with range predicates, based on the statistics
  • detect high-cardinality BIGINT columns and use sequences for them
  • detect low-cardinality columns and generate dictionaries to select values from

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Faker
* Use statistics when using `CREATE TABLE AS SELECT` in the Faker connector. ({issue}`issuenumber`)

When creating a table in the Faker connector from an existing table,
gather column statistics to determine range constraints, and create a
view that uses them as predicates.
@cla-bot cla-bot bot added the cla-signed label Nov 11, 2024

public class FakerMetadata
implements ConnectorMetadata
{
public static final String SCHEMA_NAME = "default";
public static final String ROW_ID_COLUMN_NAME = "$row_id";
private static final String VIEW_SUFFIX = "_view";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be configurable, but I need suggestions for config option name, default value, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it feels like the name user provided should point to view, and the table should have a suffix as it is kinda an implementaiton detail.

Maybe if user does CREATE TABLE xyz AS ... you can create table xyz$raw and a view xyz which references xyz$raw. Or do you think it will to too much of WTF for users?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be too implicit. Maybe if we'd use some procedure instead of CTAS. The view is also supposed to be an example or a starting point - it can be recreated with additional predicates, better projections, etc.

return quoted(name);
}
// distinct values is an approximation
if ((double) distinctValues.get(name) / rowCount.get() < 0.98) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold might be configurable

while (page == null) {
page = pageSource.getNextPage();
}
checkState(maxDictionarySize <= page.getPositionCount(), "Max dictionary size cannot be greater than %d".formatted(page.getPositionCount()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add a @Max() in the configuration and schema/table property validators?

When creating a table in the Faker connector from an existing table,
using column statistics determine unique columns, and use a sequence for
them, instead of random values, in the view associated with that table.
When creating a table in the Faker connector from an existing table,
using column statistics determine low cardinality columns, and generate
values from a randomly generated set.
@@ -297,7 +327,7 @@ public synchronized FakerOutputTableHandle beginCreateTable(ConnectorSession ses
double schemaNullProbability = (double) schema.properties().getOrDefault(SchemaInfo.NULL_PROBABILITY_PROPERTY, nullProbability);
double tableNullProbability = (double) tableMetadata.getProperties().getOrDefault(TableInfo.NULL_PROBABILITY_PROPERTY, schemaNullProbability);

ImmutableList.Builder<ColumnInfo> columns = ImmutableList.builder();
ImmutableList.Builder<ColumnInfo> columnsBuilder = ImmutableList.builder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert?

public synchronized Map<SchemaTableName, ConnectorViewDefinition> getViews(ConnectorSession session, Optional<String> schemaName)
{
SchemaTablePrefix prefix = schemaName.map(SchemaTablePrefix::new).orElseGet(SchemaTablePrefix::new);
return ImmutableMap.copyOf(Maps.filterKeys(views, prefix::matches));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import filterKeys

{
checkArgument(viewProperties.isEmpty(), "This connector does not support creating views with properties");
checkSchemaExists(viewName.getSchemaName());
if (views.containsKey(viewName) && !replace) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need also to check if such table exists

view.getCatalog(),
view.getSchema(),
view.getColumns().stream()
.map(currentViewColumn -> Objects.equals(columnName, currentViewColumn.getName()) ? new ConnectorViewDefinition.ViewColumn(currentViewColumn.getName(), currentViewColumn.getType(), comment) : currentViewColumn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use columnName.equals not Objects.equals


return Optional.empty();
}

private synchronized void createViewWithPredicates(SchemaTableName tableName, List<ColumnInfo> columns, Collection<ComputedStatistics> computedStatistics)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call out to standard view creation logic to reuse already-exists checks?

if (!type.equals(BIGINT) || !minimums.containsKey(name)) {
return quoted(name);
}
if ((double) distinctValues.get(name) / rowCount.get() < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not get why are you using < 1 here.
Should it be rather something like abs(1.0 - (double) distinctValues.get(name) / rowCount.get()) < some_epsilon?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must have been late when I wrote this, this shows I need better test cases.

@@ -425,6 +425,15 @@ public synchronized Optional<ConnectorOutputMetadata> finishCreateTable(
TableInfo info = tables.get(tableName);
requireNonNull(info, "info is null");

if (!computedStatistics.isEmpty() && !info.properties().containsKey(TableInfo.DEFAULT_LIMIT_PROPERTY)) {
ComputedStatistics statistic = computedStatistics.stream().findFirst().orElseThrow();
long rowCount = (long) getValue(BIGINT, 0, statistic.getTableStatistics().get(TableStatisticType.ROW_COUNT));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import ROW_COUNT

long rowCount = (long) getValue(BIGINT, 0, statistic.getTableStatistics().get(TableStatisticType.ROW_COUNT));
info = info.withProperties(ImmutableMap.<String, Object>builder()
.putAll(info.properties())
.put(TableInfo.DEFAULT_LIMIT_PROPERTY, rowCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import DEFAULT_LIMIT_PROPERTY

Comment on lines +148 to +154
longProperty(
TableInfo.MAX_DICTIONARY_SIZE,
"""
Maximum size of randomly generated dictionaries to pick values from, used for columns with low number of approximate distinct values
observed during table creation using existing data. Set to zero to disable using dictionaries""",
null,
maxDictionarySize -> checkProperty(0 <= maxDictionarySize, INVALID_TABLE_PROPERTY, "max_dictionary_size value must be equal or greater than 0"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This property looks weird on table level. Should you rather have per-column property saying if the column is dictionary or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows disabling dictionaries for an individual table. Column properties cannot be set when running a CTAS.

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Added some comments. Feel free to ignore most

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants