Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Using a pure Go parser for Python #2320

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Conversation

linzhp
Copy link
Contributor

@linzhp linzhp commented Oct 21, 2024

Replacing tree-sitting with github.com/go-python/gpython, which is in pure Go. This re-enable the ability to cross-compile Gazelle from other platforms to macOS.

Note that gpython doesn't preserve Python comments, while Python extension read Gazelle directives in Python comments (not just BUILD.bazel file comments), I have to scan the Python file for a second time to find the comments, but this doesn't include inline comments. So this will break existing usage of Gazelle directives inline Python comments, which I think should be very rare. In fact, unless I miss anything, I don't think we should allow Gazelle directives in Python comments in the first place. They are designed to be in BUILD.bazel files.

Fixes #1913

@alexeagle FYI.

@linzhp linzhp requested a review from f0rmiga as a code owner October 21, 2024 20:18
@linzhp linzhp marked this pull request as draft October 21, 2024 20:30
@linzhp linzhp marked this pull request as ready for review October 22, 2024 01:57
@aignas
Copy link
Collaborator

aignas commented Oct 22, 2024

FYI, @dougthor42, you might be using the feature of comments in the source code, would you have time to check this PR please?

@linzhp
Copy link
Contributor Author

linzhp commented Oct 22, 2024

hmm, this only support Python 3.4 or below... Marking it as draft for now

@linzhp linzhp marked this pull request as draft October 22, 2024 04:38
@dougthor42
Copy link
Contributor

dougthor42 commented Nov 13, 2024

@linzhp Any update on this? TBH it kinda looks like github.com/go-python/gpython is defunct. It hasn't had any updates since Feb.

I've run into another issue with the tree-sitter-based parser (#2396) which is why I'm curious.

I don't think we should allow Gazelle directives in Python comments in the first place.

I would be quite sad if gazelle annotations (comments in python files) were removed. That's the only way we've been able to deal with pre-existing circular imports within our code base when migrating to Bazel.

@dougthor42
Copy link
Contributor

I wonder if it would be easier to just make our own go parser for python? I've already done a bit while working on a hack for #2396.

The key thing to note is that Gazelle only needs to know a very small subset of python grammar:

  • imports
  • comments
  • the if __name__ == "__main__": clause

For the "happy path", comments and if __name__ clause are pretty easy, with imports being a little more difficult but not absurd. Comments are # to end of line, if __name__ has only a small number of permutations that are needed to check. Imports have 4 forms that need to be supported, but it's not that difficult.

Things start to get tricky when:

  • imports span multiple lines:
    from foo import (
        bar,
        baz
    )
  • Someone puts a gazelle annotation comment in a string or another comment:
    """Make sure to use '# gazelle:ignore foo.bar' because ..."
    # but don't use '# gazelle:include_dep foo.bar' because ...
  • ??? other edge cases I'm sure.

For the record, here's some very-WIP pure go parser for the happy path above. This is what I'm playing with to address #2396

// file_parser.go

// A really really dumb python parser that's "good enough". Hopefully.
func (p *FileParser) naiveFileParser(code []byte) (*ParserOutput, error) {

	codeStr := string(code[:])

	// Parse imports
	modules, err := p.naiveImportParser(codeStr)
	if err != nil {
		return nil, err
	}
	p.output.Modules = modules

	// Parse comments
	// TODO: find comment-in-string or comment-in-comment cases
	commentPattern := regexp.MustCompile(`(?m)#.+$`)
	commentStrings := commentPattern.FindAllString(codeStr, -1)
	var comments []comment
	for _, s := range commentStrings {
		comments = append(comments, comment(s))
	}
	p.output.Comments = comments

	// Parse 'if __name__' block
	// TODO: Support other similar forms like b == a or single quotes
	p.output.HasMain = strings.Contains(codeStr, `if __name__ == "__main__":`)
	return &p.output, nil
}

func indexOf(element string, data []string) (uint32, error) {
	for k, v := range data {
		if element == v {
			return uint32(k), nil
		}
	}
	err := errors.New(fmt.Sprintf("Did not find %q in data.", element))
	return 0, err
}

// naiveImportParser runs a very naive, mostly regex-based, parse of the python source code
// as a string.
// NOTE: This does not support multi-line imports like
//    from foo import (
//	bar,
//	baz
//    )
// Nor does this _really_ support comma-separated like
//    from collections.abc import Callable, Iterator
func (p *FileParser) naiveImportParser(code string) ([]module, error) {
	// Find `import a`, `from a import b`, `import a as b`, `from a import b as c`
	// The pattern doesn't need to be bullet-proof; we'll remove things like comments later.
	importPattern := regexp.MustCompile(`(?m)^(\w*import .+)|(\w*from .+ import .+)( as .+)?`)
	importStatements := importPattern.FindAllString(code, -1)

	// Also support Windows files >:-(
	lines := strings.Split(strings.ReplaceAll(code, "\r\n", "\n"), "\n")

	var modules []module
	for _, v := range importStatements {
		lineNum, err := indexOf(strings.TrimSpace(v), lines)
		if err != nil {
			return nil, err
		}

		// Remove any comments that might be present.
		statement, _, _ := strings.Cut(v, "#")
		statement = strings.TrimSpace(statement)

		// At this point we have something like "import foo", "import foo as bar",
		// "from foo import bar", or "from foo import bar as baz"
		// So we need to "do the needful" to make the dotted import 'foo.bar'
		dottedImport := ""
		from := ""

		// We don't care about aliases. Drop them.
		statement, _, _ = strings.Cut(statement, " as ")

		// 'from foo import bar' is the same as 'import foo.bar'. Because we don't need
		// to support 'from foo import some_attribute'.
		// dougthor42 is lucky here. A "real" implementation might need to worry about this.
		if strings.HasPrefix(statement, "from ") {
			first, second, _ := strings.Cut(statement[5:], " import ")

			// second might be a list of items "from foo import A, B, c, d"
			// If that's the case, we're (probably) importing Attributes so we don't
			// need any of them.
			// TODO: The more correct thing to do here would be to make N module structs.
			// But for now, I'm lazy.
			if strings.Contains(second, ",") {
				second = strings.Split(second, ",")[0]
			}

			// "from foo import bar" --> "foo.bar"
			dottedImport = fmt.Sprintf("%s.%s", first, second)
			from = first
		} else {
			// "import foo.bar" --> "foo.bar"
			dottedImport = statement[7:]
		}

		mod := module{
			Name:       dottedImport,
			LineNumber: lineNum,
			Filepath:   p.relFilepath,
			From:       from,
		}
		modules = append(modules, mod)
	}

	return modules, nil
}

@linzhp
Copy link
Contributor Author

linzhp commented Nov 13, 2024

Thanks for being interested. My next attempt would be using antlr to parse Python, but I haven't got there yet. Maybe you can give it a try?

I would be quite sad if gazelle annotations (comments in python files) were removed. That's the only way we've been able to deal with pre-existing circular imports within our code base when migrating to Bazel.

Can you use # gazelle:python_ignore_dependencies in BUILD.bazel files instead?

@dougthor42
Copy link
Contributor

My next attempt would be using antlr to parse Python

The readme says that grammar only targets python 3.6 🫤

Can you use # gazelle:python_ignore_dependencies in BUILD.bazel files instead?

Sadly no, as that would apply at the Bazel package level and thus impact all targets. We need target-level (which for us are file-level) ignores # gazelle:ignore foo.bar.baz and includes # gazelle:include_dep //src/some:target.


I see a couple options:

  1. Work on making the go helper cross-compile. Sadly I wouldn't know where to start with this, and I don't have access to a Mac anyway.
  2. Make our own parser in go. This means there's a maintenance burden of keeping it up to date with python grammar changes, but given that we only care about a small subset of grammar, this might not add much overhead.
  3. Revert back to the python-based parser and:
    • work on making it hermetic (don't rely on system interpreter, which was the whole point to migrating to a go parser)
    • Run it in parallel because the old python one was slow. I'm still a novice with go, but it's possible that goroutines could help here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot cross-compile Gazelle extension
3 participants