Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink approach to [ and ] in column names (currently throws error) #329

Closed
simonw opened this issue Sep 23, 2021 · 12 comments
Closed

Rethink approach to [ and ] in column names (currently throws error) #329

simonw opened this issue Sep 23, 2021 · 12 comments
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Sep 23, 2021

I think it's best to still keep [ and ] out of column names though. Transforming them into ( and ) seems reasonable - but should that happen here or in sqlite-utils? I think in sqlite-utils.

Originally posted by @simonw in simonw/datasette-app#121 (comment)

This is a rethinking of the solution to:

@simonw simonw added the enhancement New feature or request label Sep 23, 2021
@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

Quoting #86 (comment) in full:

I'm not sure what to do about this one.

I can't fix it: this bug in Python's sqlite3 module means that even if I write a database out with column names that include [] I won't be able to read them back again.

So... I could do one of the following:

  • Throw an error if a column name includes those characters. That's my preferred option I think.
  • Automatically replace [ in column names with ( and ] with )
  • Do the automatic replacement but show a user-visible warning when I do it
  • Throw an error, but give the user an option to run with e.g. --fix-column-names which applies that automatic fix.

Since this is likely to be an incredibly rare edge-case I think I'd rather minimize the amount of code that deals with it, so my preferred option is to just throw that error and stop.

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

So either I do the automatic replacement, or I let the user request automatic replacement, or a third option: I do automatic replacement but let the user opt to receive an error instead.

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

Here's the code where this happens:

def validate_column_names(columns):
# Validate no columns contain '[' or ']' - #86
for column in columns:
assert (
"[" not in column and "]" not in column
), "'[' and ']' cannot be used in column names"

It's called from three different methods in db.py: create_table_sql(), update() and insert_all().

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

If I add a new parameter for opting in and out of fixing these, what should it be called? A few options:

  • fix_columns=False
  • rename_bad_columns=False
  • fix_column_names=False
  • error_on_invalid_column_names=True

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

I'm inclined to just fix them and not have an option for opting-out of fixing them, since it adds quite a bit of cruft to the overall API design for an option that maybe no-one will ever use.

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

What are my options for replacing those characters?

  • [ becomes ( and ] becomes )
  • [ and ] are removed entirely
  • [ and ] both become _

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

I think I like the underscore option best. I don't like the idea of injecting surprise ( ) parenthesis, and having them vanish entirely could result in things like item[price] becoming itemprice which feels confusing.

item_price_ is a little ugly but I think I can live with it.

@simonw
Copy link
Owner Author

simonw commented Sep 23, 2021

I could even have those replacement characters be properties of the Database class, so uses can sub-class and change them.

class Database:
    left_brace_replace = "_"
    right_brace_replace = "_"
    ...

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

I could even have those replacement characters be properties of the Database class, so uses can sub-class and change them.

I'm not going to do this, it's unnecessary extra complexity and it means the function that fixes the column names needs to have access to the current Database instance.

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

If I replace validate_column_names(row.keys()) with fix_column_names(row) I need to decide what to do about things like pk= and column_order=.

What should the following do?

table.insert({"foo[bar]": 4}, pk="foo[bar]", column_order=["foo[bar]"])

Should it spot the old column names in the pk= and column_order= parameters and pretend that foo_bar_ was passed instead?

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

I'm not going to implement a fix that rewrites the pk and column_order and other parameters - at least not yet. The main thing I'm trying to fix here is what happens when you attempt to import a CSV file with [ ] in the column names, which should be unaffected by that second challenge.

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

I was going to replace all of the validate_column_names() bits with something that fixed them instead, but I think I have a better idea: I'm only going to apply the fix for the various '.insert()` methods that create the initial tables.

I'll keep the validate_column_names() where they are at the moment. Once you've inserted the data and created the tables it will be up to you to use the new, correct column names.

This avoids the whole issue of needing to rewrite parameters, and solves the immediate problem which is consuming CSV files with bad column names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant