Exponential time required to format nested conditional code #52

zaneduffield · 2024-04-11T00:27:21Z

Currently there's a performance issue with parsing files that contain deeply nested conditional directives.

The approach the parser takes right now is to reprocess the entire file for each conditional branch path.
For example, in the following code

{$ifdef A}
Foo({$ifdef B} B {$else} NB {$endif});
{$else}
Bar;
{$endif}

{$ifdef A2}
Foo2({$ifdef B2} B2 {$else} NB2 {$endif});
{$else}
Bar2;
{$endif}

4 passes are performed:

level 0: branch 0 & level 1: branch 0
- ```
Foo(B);
Foo2(B2);
```
level 0: branch 0 & level 1: branch 1
- ```
Foo(NB);
Foo2(NB2);
```
level 0: branch 1 & level 1: branch 0
- ```
Bar;
Bar2;
```
level 0: branch 1 & level 1: branch 1
- ```
Bar;
Bar2;
```

In isolation, this makes sense. It lets the parser explore all of the available code, and generate logical lines that are equivalent to what the a compiler could see after 'preprocessing'.

The problem is that all of the surrounding code in the file also has to be parsed over and over, even if it is unconditional.

A more efficient parser implementation could restrict itself to re-parsing only the parts of the code that are affected by the conditionals. This is easier said than done, but would be important for formatting large files with nested conditional directives.

A pathological case that the current parser will effectively never finish parsing

Foo(
  {$ifdef A}
    {$ifdef B}
      {$ifdef C}
        {$ifdef D}
          {$ifdef E}
            {$ifdef F}
            {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
            {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
            {$endif}
          {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
          {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
          {$endif}
        {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
        {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
        {$endif}
      {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
      {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
      {$endif}
    {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
    {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
    {$endif}
  {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
  {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif} {$elseif}
  {$endif}
)

zaneduffield · 2025-01-06T03:34:05Z

Here's a more realistic case that causes a file to be re-parsed a cool 524,288 times

{$ifdef verXXX}
{$else}
  {$ifdef verXXX}
  {$else}
    {$ifdef verXXX}
    {$else}
      {$ifdef verXXX}
      {$else}
        {$ifdef verXXX}
        {$else}
          {$ifdef verXXX}
          {$else}
            {$ifdef verXXX}
            {$else}
              {$ifdef verXXX}
              {$else}
                {$ifdef verXXX}
                {$else}
                  {$ifdef verXXX}
                  {$else}
                    {$ifdef verXXX}
                    {$else}
                      {$ifdef verXXX}
                      {$else}
                        {$ifdef verXXX}
                        {$else}
                          {$ifdef verXXX}
                          {$else}
                            {$ifdef verXXX}
                            {$else}
                              {$ifdef verXXX}
                              {$else}
                                {$ifdef verXXX}
                                {$else}
                                  {$ifdef verXXX}
                                  {$else}
                                    {$ifdef verXXX}
                                    {$else}
                                    {$endif}
                                  {$endif}
                                {$endif}
                              {$endif}
                            {$endif}
                          {$endif}
                        {$endif}
                      {$endif}
                    {$endif}
                  {$endif}
                {$endif}
              {$endif}
            {$endif}
          {$endif}
        {$endif}
      {$endif}
    {$endif}
  {$endif}
{$endif}

zaneduffield · 2025-01-30T04:30:04Z

@jgardn3r and I discussed a possible rewrite of the parser that could more efficiently handle highly nested conditional code.

The idea is to use a kind of state machine, where a stack of states is maintained to represent the current position in the structure of the source. The machine would process one token at a time, and decide what to do with it based only on the current stack of states.

Where this shines is when it comes to conditional directives. By decoupling the state of the parsing from the code that does the parsing, it becomes trivial to make a copy of the current state before exploring one of many parsing paths.

The only hard thing is to know when to 'return' from the branch of conditional parsing; it isn't sufficient to return once the end of the conditional branch is reached, because the parsing of what comes after the conditional code can depend on what was inside the conditional code. One solution to this is to only return from the branch when a logical line is completed at the same level as the start of the conditional branch.

Unfortunately, implementing something like this would be an almost complete rewrite of the parser. This is a hard sell, not only because the parser is already quite complex, but also because there are simpler mitigations to this performance problem involving skipping the complex conditional sections.

zaneduffield · 2025-03-25T22:42:36Z

@jgardn3r and I discussed an alternative solution to this problem that doesn't involve completely rewriting the parser.

Current Approach

The current approach is based on exploring the conditional code in 'levels'.
It calculates the maximum branching factor at each conditional directive depth and then enumerates the cartesian product of the ranges of branching factors at each depth.

Mathematically, the set of all conditional directive paths we currently explore could be represented as
$$P = \prod_{depth=0}^{maxdepth} { 1, 2, \dots, maxwidth_{\text{depth}} }$$

For example, in the following code

// width 3 at level 0
{$if A} {$elseif B} {$elseif C} {$endif} 
// width 1 at level 0
{$if A} 
  // width 2 at level 1
  {$if A}
  {$else}
    // width 3 at level 2
    {$if A} {$elseif B} {$elseif C} {$endif}
  {$endif}
{$endif}

The maxdepth is 2, and maxwidth_0 is 3, maxwidth_1 is 2, maxwidth_2 is 3. The paths explored are

Problem

The problem with this approach is that it creates many duplicate paths through the file, because it will enumerate all the possibilities inside of a branch that wasn't even taken.

Alternative Approach

Instead of exploring the conditional code in 'levels', we could explore the branches more directly by keeping track of which branches have been fully explored.

In pseudo-code

while exists(branch not fully explored)
- create a new 'pass' of the file
- pass:
- take tokens to the next conditional directive, or mark the current branch (if any) as fully explored and return
- 'take' the first branch of the directive that hasn't yet been fully explored (or the first if all have been explored)
- recurse into pass:

zaneduffield added the performance label Jan 12, 2025

zaneduffield linked a pull request Mar 31, 2025 that will close this issue

Explore conditional directive paths more efficiently #231

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exponential time required to format nested conditional code #52

Exponential time required to format nested conditional code #52

zaneduffield commented Apr 11, 2024

zaneduffield commented Jan 6, 2025

zaneduffield commented Jan 30, 2025

zaneduffield commented Mar 25, 2025

Exponential time required to format nested conditional code #52

Exponential time required to format nested conditional code #52

Comments

zaneduffield commented Apr 11, 2024

zaneduffield commented Jan 6, 2025

zaneduffield commented Jan 30, 2025

zaneduffield commented Mar 25, 2025

Current Approach

Problem

Alternative Approach