
Add enhanced header management utilities #42


Open
wants to merge 3 commits into base: main

Conversation

krishivseth

Enhanced Header Management for Scautable

This PR introduces a comprehensive header management system for Scautable, providing robust utilities for handling CSV headers. The implementation includes:

  1. Header Normalization: Converts headers to valid Scala identifiers, handles spaces/special characters, ensures uniqueness, and prevents conflicts with reserved keywords.

  2. Header Validation: Detects empty/duplicate headers, validates against Scala identifier rules, and provides detailed error messages with severity levels.

  3. Content-Based Column Name Inference: Analyzes column content to suggest meaningful names for numbers, dates, emails, URLs, booleans, and names.

  4. Header Improvement Suggestions: Suggests better names for poor headers, converts to camelCase, and provides explanations for suggestions.

The implementation includes a sealed trait hierarchy for header problems, comprehensive error reporting, and a complete test suite covering all features and edge cases.
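To give a feel for the normalization and problem-reporting pieces, here is a rough sketch of the shape they might take. The trait, object, and method names are illustrative guesses rather than the PR's actual API (only HeaderUtils is referred to elsewhere in this thread):

// Illustrative sketch only; the real implementation in this PR may differ.
sealed trait HeaderProblem
object HeaderProblem:
  case class Empty(index: Int) extends HeaderProblem
  case class Duplicate(name: String, index: Int) extends HeaderProblem
  case class InvalidIdentifier(name: String, index: Int) extends HeaderProblem

object HeaderUtils:
  /** Turn every header into a unique, identifier-friendly name. */
  def normalize(headers: List[String]): List[String] =
    val cleaned = headers.map { h =>
      val trimmed = h.trim
      if trimmed.isEmpty then "column" else trimmed.replaceAll("[^A-Za-z0-9_]", "_")
    }
    // Deduplicate by appending a numeric suffix: col1, col1_1, col1_2, ...
    cleaned
      .foldLeft((List.empty[String], Map.empty[String, Int])) { case ((acc, seen), h) =>
        val n    = seen.getOrElse(h, 0)
        val name = if n == 0 then h else s"${h}_$n"
        (name :: acc, seen.updated(h, n + 1))
      }
      ._1
      .reverse

// HeaderUtils.normalize(List("col1", "col 2", "col1", ""))
// => List("col1", "col_2", "col1_1", "column")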

@Quafadas
Owner

Quafadas commented Apr 8, 2025

Wow - thanks for looking at this.

My request would be to see how we can integrate it into the object which reads the CSV, i.e. pass a test like this:

https://github.com/Quafadas/scautable/blob/f1bebfc0080f16bd3bae240852e70311d0a766bf/scautable/test/jvm/src/testCsv.scala#L228C1-L232C4

 test("reading data") {
    val csv: CsvIterator[("col1", "col2", "col3")] = CSV.absolutePath(Generated.resourceDir0 + "simple.csv")

  }

But for a file which has duplicated headers in? If we can get that to compile, I'd love to merge this.

@krishivseth
Author

Hi Simon,

I've implemented the duplicate headers support. The solution maintains compile-time typing while handling duplicate column names.

I took three approaches:

  1. An auto-detection method that requires no type annotations (detectHeaders)
  2. An explicit method where you provide normalized header types (absolutePathWithDuplicates)
  3. Helper methods for users who need to determine normalized headers first

The CsvIterator now tracks both original and normalized headers (dups get numbered like "col1_1"), with utilities to check the mapping.

I've included tests with a duplicate-header CSV file to demonstrate all approaches. All tests are passing. Let me know if you'd like any adjustments before merging.
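For reference, a minimal, simplified sketch of the original/normalized bookkeeping described above; the constructor and method names here are illustrative, and the real CsvIterator differs:

// Sketch only: how the iterator could carry both header views.
class CsvIterator[K <: Tuple](
    val originalHeaders: List[String],   // e.g. List("col1", "col2", "col1", "col3")
    val normalizedHeaders: List[String]  // e.g. List("col1", "col2", "col1_1", "col3")
):
  /** Map each normalized name back to the original header it came from. */
  def headerMapping: Map[String, String] =
    normalizedHeaders.zip(originalHeaders).toMap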

@Quafadas
Owner

Quafadas commented Apr 9, 2025

This is the part I'm really interested in.

test("access columns by original name") {
    val csv: CsvIterator[("col1", "col2", "col1_1", "col3")] = 
      CSV.absolutePathWithDuplicates(Generated.resourceDir0 + "duplicate_headers.csv")

because this is the "user-facing API" that you are expecting a human to work with and read documentation about.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

@krishivseth
Author

I think we can keep detectHeaders as the main way to read the files (since it avoids the type signature issue), but add more intuitive runtime helper methods for accessing columns, like this:

   // Keep as the recommended way to read
   val csv = CSV.detectHeaders("path/to/duplicates.csv")

   // Add helper methods for easier access:
   val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)
   val thirdColValue = csv.getByPosition(2)             // get column by original position

   // Users can easily find which columns are duplicates:
   val duplicates = csv.getDuplicateColumns() // e.g., Map("col1" -> 2)

The getByOccurrence and getByPosition methods feel more user-friendly. The underlying type-safe access (CSV.column["col1_1"]) is available for those who prefer it or need that level of compile-time checking, but it's not the primary way we'd document accessing duplicates. Please let me know what you think, and I'll make a new commit.

@Quafadas
Owner

I agree that we should have as few entry points as possible.

I'm not going to lie; I don't understand what these are or why I would need them.

   // Add helper methods for easier access:
   val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)
   val thirdColValue = csv.getByPosition(2)             // get column by original position

I'm happy for you to make the case, but it looks to me as though you're writing a custom version of Scala's standard library.

 val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)

Could this be expressed as

csv.drop(1).take(1).head.col1

I'll come out and say that if that is more or less what you are proposing, I am opposed to custom-coding stdlib functionality in the library. A guiding principle here is that if we need standard functionality, we should use Scala's standard collection library - a terrific piece of work that is maintained by the Scala Center.

I am interested in the header deduplication, but I think we need to avoid creating a zoo of entry points. I haven't had the time to review this in great detail, so I could be wrong, but I'm not sure why this deduplication can't be a method on the CsvIterator class that simply deduplicates the headers and returns another CSV iterator with better column headers?

I'm also not sure what the "type signature issue" is? You'll have to lay that out in more detail.

@Dorodri
Contributor

Dorodri commented Apr 11, 2025

Let me jump into this conversation.

I think it is a good idea. In particular, the work on header deduplication is indispensable: the user is effectively blocked from working with the data otherwise.

As far as I understand, named tuples may be constructed with empty and/or duplicated identifiers. Unless the user manipulates the underlying type, they are left with inaccessible columns in these cases.

With this in mind, I would implement this normalization (of duplicated headers) directly on the main entry points for the library (namely absolutePath, resource, etc.), as it is a matter of necessity, not just a style choice.

For the other normalizations, I think it is worth mentioning Scala's backtick-quoted identifiers (see this Stack Overflow answer). The feature by itself makes most normalizations unnecessary.

val `A identifier with spaces` = true
val `1 identifier with a number on the start` = true
val `Backticks are a thing, and they let me use "any string that's accepted by the runtime" as an identifier` = "Awesome, don't you think? It's a shame that this feature is somewhat obscure."

The only exception I can think of is if the column name has a backtick in it, as the specification forbids backticks inside the identifier (see this other question). So, if we opt for using backticks, I don't think the other normalizations are crucial.

Finally, for the suggestions feature: I think they ARE useful, and at the same time I fail to imagine a good use case for including them in the library. So I don't have a strong opinion here.

The getByOccurrence and getByPosition methods feel more user-friendly. The underlying type-safe access (CSV.column["col1_1"]) is available for those who prefer it or need that level of compile-time checking, but it's not the primary way we'd document accessing duplicates.

I'm happy for you to make the case but it looks to me as though you're writing a custom version of scala's standard library.

I interpreted this differently, something like

// suppose a csv that contains the headers: ("dup_col", "middle_col", "dup_col", "unique_col")
//                                              vvvvvvvvv we rename duplicated names
val csv: CsvIterator[("dup_col", "middle_col", "dup_col_1", "unique_col")] = CSV.detectHeaders(...)

val duplicated_first_ref = csv.getByOccurrence("dup_col", 0) // "dup_col"
val duplicated_second_ref = csv.getByOccurrence("dup_col", 1) // "dup_col_1"
val col_by_position = csv.getByPosition(3) // "unique_col"

If this is indeed the case, how would we access values inside map functions, for example? One of the beauties of using named tuples is that we can do things like this:

val csv: CsvIterator[("Name", "Surname", "Age")] =
  CSV.absolutePath(...)
    .mapColumn["Age", Option[Int]](_.toIntOption)

val data: Iterator[(`Full Name`: String, Age: Option[Int])] =
  csv.map(x => (`Full Name` = s"${x.Name} ${x.Surname}", Age = x.Age))

@krishivseth could you enlighten us on this?

@Quafadas
Owner

Quafadas commented Apr 11, 2025

@Dorodri My understanding is that this PR was about header management. I am interested in it and it's definitely a goal.

I think the discussion around these extra methods can be moved to a separate PR unless it's to do with header management? (I don't think it is)
val duplicated_first_ref = csv.getByOccurrence("dup_col", 0) // "dup_col"

At least one concern is where this leads.
CSV.absolutePathWithDuplicates

Are we also going to have
CSV.absolutePathWithDuplicatesAndSemicolonSeparator("/some4thing.csv")

My commentary here stands.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

@Dorodri
Contributor

Dorodri commented Apr 11, 2025

I think the discussion around these extra methods can be moved to a separate PR unless it's to do with header management? (I don't think it is)

Sure! Let's see what krishivseth has to say about this first; maybe we misunderstood the idea.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

I think we share the same concern here.

I don't think it is necessary to add more entry points for this feature, and I would rather have the header deduplication be the default behavior (that is, just be a part of CSV.absolutePath and the other — existing — entry points).

My argument for this is that if the user tries to import a CSV that contains duplicated headers, with the current behavior, they end up with inaccessible fields.*

Having inaccessible fields is surely not a feature, so we can opt to

  1. provide an error message, or
  2. deduplicate the headers.

I think that deduplicating the headers and emitting a warning is a good default behavior. What do you think?

*: For documentation purposes, this behavior of NamedTuples can be tested with this snippet:

import NamedTuple.*

val t = ("1", "2").withNames[("foo", "foo")]
println(t.foo) // output: 1

@Quafadas
Owner

Some notes:

#43

@krishivseth
Author

Hi @Quafadas and @Dorodri,

Thanks for the feedback, and Simon, really appreciate you sharing PR #44 and your notes on the match type complexity. To me it seems that the core constraint is NamedTuple needing unique names at compile time. We have to normalize duplicates somehow to get a valid type.

Seeing the challenges you hit with the runtime .deduplicateHeaders() in #44, and thinking about Dorodri's point about default handling and the goal of fewer entry points, how about we try this:

  1. Normalize by Default, Inside Existing Methods: Rather than adding a new method like in #44 (Header duplication via match type reduction), we modify the existing macros (CSV.absolutePath, CSV.resource, etc.). They'd always run the HeaderUtils normalization logic upfront. This seems necessary anyway, as Dorodri mentioned, to avoid inaccessible columns.

  2. Add Compile-Time Warnings: If normalization actually renames a header (e.g., "col1" becomes "col1_1"), the compiler gives a heads-up: Warning: Duplicate header 'col1' normalized to 'col1_1'. Access using the normalized name. This makes the default behavior visible.

  3. Clean Up Entry Points: We'd remove the extra detectHeaders and absolutePathWithDuplicates methods I currently have in my branch (header-management2), simplifying the API.

  4. Standard Access: Users stick with the normal csv.column["header_name"]. If a name was normalized, the warning tells them which name to use (like csv.column["col1_1"]). This keeps the standard type-safe access.

  5. Keep Discovery Tools: csv.headers and csv.headerNormalizationReport remain so users can always check the final normalized names.
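To make points 1 and 2 concrete, here is a rough sketch of how the macro side could emit that warning once normalization has renamed a header. The helper name and wiring are illustrative only, not the actual macro code:

import scala.quoted.*

// Hypothetical helper, called from the existing CSV.absolutePath / CSV.resource macros
// after running the HeaderUtils normalization.
def warnOnRenames(original: List[String], normalized: List[String])(using Quotes): Unit =
  import quotes.reflect.report
  original.zip(normalized).foreach { (o, n) =>
    if o != n then
      report.warning(s"Duplicate header '$o' normalized to '$n'. Access using the normalized name.")
  }

With something like this in place, a file with duplicate headers still compiles, and the warning tells the user which normalized name (e.g. col1_1) to use for access.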

This feels like it integrates the necessary deduplication more directly into the existing flow Simon is familiar with, makes it default as Dorodri suggested, avoids the runtime method complexity from #44, and keeps the API surface cleaner.
Does this sound like a more practical way forward based on everything we've discussed? Please let me know your thoughts.

@Quafadas
Owner

I've been AFK for a couple of days so apologies.

I had another look at it and updated it again today, and I'm leaning toward #44 actually solving the problem I wanted it to, in that it appears to pass the tests and fits reasonably well with my mental model of how I'd like the API to look.

However, I'm not sure if #44 is the best solution, and I agree with the points you made above. I'd be curious to see if you can indeed keep the API clean and have some nice diagnostics and warnings to go with it. I'm not in a rush to merge anything and am still exploring the solution space, so I'll leave this open a bit longer.
