Support for the Cluster Type

# Feature request

CodeCharta currently supports "nodes" and "edges". Nodes are always grouped in hierarchical clusters, e.g. "app/thing/subthing/file.ts" has the cluster hierarchy app -> thing -> subthing -> file.ts. But there is currently no support for clusters orthogonal to the file hierarchy, e.g. ownership of files, coverage per test, or cohesive clusters. Analyzing and visualizing them would be a huge gain for CodeCharta.

## Description

As an auditor, I want cluster support so that I can inspect clusters orthogonal to the file hierarchy.

The following is not a precise feature description but more of a design document for the .cc.json and for how to visualize this third "cluster" type, which would be in addition to the "nodes" and "edges" that are currently supported. 

It includes different ideas for what to visualize to make the design stronger and more future-proof. If we can have a simple design in the cc.json that can cover a lot of use cases, then we are happier, so to speak.

## Clusters orthogonal to file hierarchy

There are a bunch of clusters orthogonal to the file hierarchy:

- Author ownership in terms of area. Hover over an author to see how much and what they own.
- Issue changes in area. Hover over an issue to see the most recent or largest impact and what it affected.
- Per-test coverage in area. Hover over a test to see how much it covers and how much it covers beyond its intended scope.
- Which commit or feature changed which files. Hover over a commit to see which one was the largest and which files were changed.
- How much an element belongs where it is, in LoC, a.k.a. cluster analysis.

### Authors

Each file has a main author, the one responsible for most of the code that is still there. But right now we cannot show how distributed this knowledge is. If we had clusters, we could show them in a new sidebar, and hovering over an author or author cluster would show how distributed the knowledge is. This data can be gathered from git.

<img width="592" height="233" alt="Image" src="https://github.com/user-attachments/assets/50855aea-43a0-44c9-ab4e-2cb623d35a58" />
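As a rough sketch of the "gathered from git" part: `git blame --line-porcelain <file>` repeats the commit header for every blamed line, including one `author <name>` line, so counting those lines gives the surviving lines per author. The function name `lines_per_author` and the sample text are made up for illustration. Note that this counts all lines, not rloc, so a real analysis would have to filter comments and blank lines itself.

```python
from collections import Counter


def lines_per_author(porcelain: str) -> Counter:
    """Count surviving lines per author in `git blame --line-porcelain` output."""
    counts: Counter = Counter()
    for line in porcelain.splitlines():
        # Header lines look like "author <name>"; "author-mail" etc. use a hyphen,
        # and file content lines always start with a tab, so neither matches.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


# Abbreviated, hand-written sample in the shape of `git blame --line-porcelain`.
sample = "\n".join([
    "1111111111111111111111111111111111111111 1 1 2",
    "author Max Mustermann",
    "author-mail <max@example.com>",
    "\tconst a = 1;",
    "1111111111111111111111111111111111111111 2 2",
    "author Max Mustermann",
    "author-mail <max@example.com>",
    "\tconst author = 'x';",
    "2222222222222222222222222222222222222222 3 3 1",
    "author Erika Musterfrau",
    "author-mail <erika@example.com>",
    "\texport { a };",
])
ownership = lines_per_author(sample)
```

The resulting counts per file would become the `attributes` of that file inside the author's cluster.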

### Issues

Like author ownership, it is also interesting how much is changed per story. Are there stories that affect more than others? Is there a feature that touches everything? The hierarchy theme 1 -> epic 1 -> feature 1 -> stories is just for illustration purposes; the format is not fixed. This data can be gathered from git when commits are tagged with the story they belong to. The issue hierarchy can be extracted from Jira.

<img width="592" height="233" alt="Image" src="https://github.com/user-attachments/assets/8c56bfc1-30b2-4423-9a43-1dc89f8ee357" />
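A minimal sketch of how tagged commits could be mapped to issues, assuming Jira-style keys such as `CC-1234` appear somewhere in the commit message. The regex, the function name `files_per_issue` and the sample commits are assumptions for illustration, not an existing CodeCharta API.

```python
import re
from collections import defaultdict

# Assumed key format: uppercase project prefix, a dash, and a number (e.g. CC-1234).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def files_per_issue(commits):
    """Map issue key -> set of files touched by commits mentioning that key.

    `commits` is an iterable of (message, changed_files) pairs.
    """
    touched = defaultdict(set)
    for message, files in commits:
        for key in ISSUE_KEY.findall(message):
            touched[key].update(files)
    return touched


commits = [
    ("CC-1234 add flows", ["src/a.ts", "src/b.ts"]),
    ("fix typo", ["src/c.ts"]),  # untagged commits are ignored
    ("CC-1234 CC-99 refactor", ["src/b.ts", "src/d.ts"]),
]
by_issue = files_per_issue(commits)
```

Commits without a key simply do not show up in any issue cluster, which is probably something the visualization should make visible as well.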

### Coverage

Coverage is also interesting. Which tests have the most coverage? Are there tests that touch a lot of the code base? Are there tests that touch only very little? Again the hierarchy is up to the user to define. This data can be gathered from coverage reports.

<img width="592" height="233" alt="Image" src="https://github.com/user-attachments/assets/dabacfdc-6759-45e9-bac5-6ebd1d812f55" />

### Commits

Are there commits that were very widespread and touched a lot of files? This is the cousin to issues, just on a per-commit basis. This data can be gathered from git.

<img width="592" height="233" alt="Image" src="https://github.com/user-attachments/assets/d04ef631-2d1e-4703-975d-478b5ba1178b" />
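One way this could be gathered, sketched under assumptions: `git log --numstat --pretty=format:%h` prints an abbreviated hash per commit followed by `added<TAB>deleted<TAB>path` lines. The parser below is illustrative only (the name `changes_per_commit` is made up), it treats added plus deleted lines as the "changed" amount, and it does not handle renames.

```python
def _count(value: str) -> int:
    # numstat prints "-" instead of a number for binary files.
    return 0 if value == "-" else int(value)


def changes_per_commit(log_text: str) -> dict:
    """Map commit hash -> {path: added + deleted lines} from numstat output."""
    result = {}
    current = None
    for line in log_text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and current is not None:
            added, deleted, path = parts
            result[current][path] = _count(added) + _count(deleted)
        elif line.strip():
            # Any other non-empty line is taken to be the next commit hash.
            current = line.strip()
            result[current] = {}
    return result


# Hand-written sample in the assumed output shape.
log_text = (
    "12b345a\n"
    "10\t2\tsrc/app/ui/flows.ts\n"
    "3\t0\tsrc/app/ui/menu.ts\n"
    "\n"
    "9f8e7d6\n"
    "-\t-\tassets/logo.png\n"
    "1\t1\tsrc/app/ui/flows.ts\n"
)
per_commit = changes_per_commit(log_text)
```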

### Belongness

With this we could visualize which clusters there actually are, if we look at the temporal coupling. Generate the temporal coupling, then use k-means clustering or similar to figure out which files actually belong together according to their temporal dependencies. These clusters can then be visualized as well. 

<img width="592" height="233" alt="Image" src="https://github.com/user-attachments/assets/fe15ca35-e493-4d7b-a15d-222c7582bee1" />
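To make the "k-means clustering or similar" step a bit more concrete, here is a deliberately simpler stand-in: treat files as connected when their temporal coupling is above a threshold and take the connected components as clusters. This is not k-means, and the function name, the `(file_a, file_b, strength)` input shape and the threshold value are all assumptions.

```python
def coupling_clusters(couplings, threshold=0.5):
    """Group files into connected components of the thresholded coupling graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unknown files start as their own root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, strength in couplings:
        root_a, root_b = find(a), find(b)  # registers both files either way
        if strength >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for f in list(parent):
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(group) for group in groups.values())


couplings = [
    ("a.ts", "b.ts", 0.9),
    ("b.ts", "c.ts", 0.8),
    ("d.ts", "e.ts", 0.7),
    ("c.ts", "d.ts", 0.1),  # too weak to merge the two groups
]
clusters = coupling_clusters(couplings)
```

Each resulting group could then be written out as one `cluster` entry, whatever clustering method ends up being used.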

## CC.json design

Now how do we represent this information? I think the following would make sense:

```json
{
  "clusters": [
    {
      "name": "author_ownership",
      "description": "How much of the code base is owned per author?",
      "children": [
        {
          "cluster": "/MaibornWolff/Max Mustermann",
          "nodes": [
            {
              "nodeName": "/root/src/app/ui/flows.ts",
              "attributes": {
                "loc": 120,
                "rloc": 89
              }
            }
          ]
        }
      ]
    },
    {
      "name": "issue_changes",
      "description": "How much of the code base is changed by the issues?",
      "children": [
        {
          "cluster": "/Theme 1/Epic 1/Feature 1/Story CC-1234",
          "nodes": [
            {
              "nodeName": "/root/src/app/ui/flows.ts",
              "attributes": {
                "loc": 120,
                "rloc": 89
              }
            }
          ]
        }
      ]
    },
    {
      "name": "test_coverage",
      "description": "How much of the code base do the tests cover?",
      "children": [
        {
          "cluster": "/Unit Tests/Group 1/Subgroup 1/Test-Something.java",
          "nodes": [
            {
              "nodeName": "/root/src/app/ui/flows.ts",
              "attributes": {
                "loc": 120,
                "rloc": 89
              }
            }
          ]
        }
      ]
    },
    {
      "name": "commit_changes",
      "description": "How much of the code base is changed for each commit?",
      "children": [
        {
          "cluster": "/Main/2025-12-02 12b345a",
          "nodes": [
            {
              "nodeName": "/root/src/app/ui/flows.ts",
              "attributes": {
                "loc": 120,
                "rloc": 89
              }
            }
          ]
        }
      ]
    },
    {
      "name": "temporal_belongness",
      "description": "What are the actual clusters in which the files belong?",
      "children": [
        {
          "cluster": "/Cluster 1/SubCluster 1",
          "nodes": [
            {
              "nodeName": "/root/src/app/ui/flows.ts",
              "attributes": {
                "loc": 120,
                "rloc": 89
              }
            }
          ]
        }
      ]
    }
  ]
}
```
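To check that this shape is easy to consume, here is a small sketch that reads one cluster type and indexes it by node, which is roughly what a hover interaction would need. The helper names `cluster_levels` and `index_by_node` are hypothetical; only the field names come from the JSON above.

```python
import json


def cluster_levels(cluster_path: str) -> list:
    """Split "/Theme 1/Epic 1" into its hierarchy levels."""
    return [part for part in cluster_path.split("/") if part]


def index_by_node(cc_json: dict, cluster_type: str) -> dict:
    """Map nodeName -> list of (cluster path, attributes) for one cluster type."""
    index = {}
    for group in cc_json.get("clusters", []):
        if group["name"] != cluster_type:
            continue
        for child in group["children"]:
            for node in child["nodes"]:
                entry = (child["cluster"], node["attributes"])
                index.setdefault(node["nodeName"], []).append(entry)
    return index


doc = json.loads("""
{
  "clusters": [
    {
      "name": "author_ownership",
      "description": "How much of the code base is owned per author?",
      "children": [
        {
          "cluster": "/MaibornWolff/Max Mustermann",
          "nodes": [
            {"nodeName": "/root/src/app/ui/flows.ts",
             "attributes": {"loc": 120, "rloc": 89}}
          ]
        }
      ]
    }
  ]
}
""")
owners = index_by_node(doc, "author_ownership")
levels = cluster_levels("/MaibornWolff/Max Mustermann")
```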

## Calculate the percentages

Most of these examples work with the metric rloc or loc. Say you have a file with 200 rloc.

- When this file is in an author_ownership cluster with rloc:50, then the author owns 25% of the file.
- When this file is in an issue_changes cluster with rloc:50, then the issue changed 25% of the file.
- When this file is in a test_coverage cluster with rloc:50, then the test covers 25% of the file.
- When this file is in a commit_changes cluster with rloc:50, then the commit changed 25% of the file.
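The calculation in the list above is the same for every cluster type, so it could be sketched as a single helper. The name `cluster_share` is made up, and returning 0 for empty files is an assumption.

```python
def cluster_share(cluster_rloc: int, file_rloc: int) -> float:
    """Percentage of a file's rloc that falls into a given cluster."""
    if file_rloc <= 0:
        return 0.0  # assumption: files without rloc contribute nothing
    return 100.0 * cluster_rloc / file_rloc


share = cluster_share(50, 200)  # the example from the list above: 25.0
```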

## Open questions

- author_ownership, issue_changes and commit_changes all work with rloc. We have to be careful how we calculate this:
- - Say an author made lots of changes to a file at some point with commit A: 150 rloc out of a 200 rloc file would be 75%.
- - But then this file is later changed drastically again with commit B, maybe even 200 rloc out of the 200 rloc file, i.e. 100%.
- - The earlier commit changed 75% of the file, but 0% of these changes remain today. Which is correct to display? Both have value: the first shows authors that made lots of changes, the second shows which authors "own" the file today.
- How to aggregate author_ownership, issue_changes, test_coverage and commit_changes?
- - In general it makes sense to show the relative effect on the code base that clusters have, e.g. which authors have the most ownership.
- - But how do we sum that up? We cannot simply sum up the rloc of clusters because these rloc probably overlap. We could average it, but that would mean we only take the overlap. So that means we have to sum in the analysis.
- Temporal belongness is a bit special. While the other examples work on a typical area metric (rloc), temporal belongness works differently. It does not say anything about which lines belong there (rloc) but has its own relative metric "belongness". This metric describes how much the file belongs where it currently is. We can calculate this because we know the temporal_coupling from git. A file that is coupled to lots of files outside of its folder has a very low belongness score. But this score is not based on rloc. This raises the question of whether the rloc representation for the other clusters is the right fit.
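One possible reading of the belongness score, purely as a sketch: take all temporal couplings of a file and compute the share of coupling strength that stays inside the file's own folder. The formula, the function name and the `(file_a, file_b, strength)` input shape are assumptions, not something the analysis provides today.

```python
from pathlib import PurePosixPath


def belongness(file: str, couplings):
    """Share of a file's coupling strength that stays inside its own folder.

    Returns None when the file has no couplings at all.
    """
    folder = PurePosixPath(file).parent
    inside = total = 0.0
    for a, b, strength in couplings:
        if file not in (a, b):
            continue
        other = b if a == file else a
        total += strength
        if PurePosixPath(other).parent == folder:
            inside += strength
    return inside / total if total else None


couplings = [
    ("/root/ui/a.ts", "/root/ui/b.ts", 3.0),
    ("/root/ui/a.ts", "/root/db/c.ts", 1.0),
]
score_a = belongness("/root/ui/a.ts", couplings)
score_c = belongness("/root/db/c.ts", couplings)
```

A score like this is a ratio between 0 and 1 per file, which illustrates the point above: it does not fit into an rloc-shaped attribute.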

In general I think this design is not done. There are lots of rough edges to figure out; maybe the representation in the cc.json has to change, or the visualization, or both. Having rloc as the only cluster attribute and letting the visualization figure out the rest is probably also not the right way (see the problems with aggregating clusters and with the temporal belongness cluster).
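To illustrate the aggregation problem with made-up numbers: if the analysis knows which lines each child cluster touches, a parent cluster can be computed as the union of those lines, whereas adding up the children's rloc counts shared lines twice. The per-line sets below are hypothetical; the proposed cc.json does not carry them.

```python
def aggregate_rloc(children) -> int:
    """Union the touched lines per file across child clusters, then count them."""
    merged = {}
    for lines_by_file in children:
        for path, lines in lines_by_file.items():
            merged.setdefault(path, set()).update(lines)
    return sum(len(lines) for lines in merged.values())


# Two stories of the same feature touching overlapping lines of one file.
story_a = {"/root/src/app/ui/flows.ts": set(range(1, 51))}   # lines 1..50
story_b = {"/root/src/app/ui/flows.ts": set(range(31, 81))}  # lines 31..80

naive_sum = sum(len(lines) for story in (story_a, story_b) for lines in story.values())
feature_rloc = aggregate_rloc([story_a, story_b])
```

Here the naive sum is 100 while the feature really touches 80 lines, which is why the visualization cannot do this from per-cluster rloc alone.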

To figure this out I think we need to add a few more examples and iterate. Any ideas? :)
