Commit a103f17
Implements Scanner type for tokenizing nginx configs
Implemented `crossplane.Scanner`, which follows the example of other "scanner" types implemented in the Go stdlib. The existing `Lex` uses concurrency to make tokens available to the caller while managing "state". I think this design cue was taken from Rob Pike's 2011 talk on [Lexical Scanning in Go](https://go.dev/talks/2011/lex.slide). If you look at examples from the Go stdlib, such as the `bufio.Scanner` that `Lex` depends on, you'll find that this isn't the strategy being employed; instead there is a struct that manages the state of the scanner and a method used by the caller to advance the scanner and obtain tokens. After a bit of Internet archeology, I found [this](https://groups.google.com/g/golang-nuts/c/q--5t2cxv78/m/Vkr9bNuhP5sJ) post on `golang-nuts` from Rob Pike himself:

> That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.
>
> I think it's a fine approach to a lexer if you don't care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I'm probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don't care about that for your application, I'm sure.
>
> So: It's pretty and nice to work on, but you'd probably not choose that approach for a production compiler.

An implementation of a "scanner" using the more "traditional" model, where much of the logic is the same as or very close to `Lex`, seems to support the above statement:

```
go test -benchmem -run=^$ -bench "^BenchmarkScan|BenchmarkLex$" github.com/nginxinc/nginx-go-crossplane -count=1 -v
goos: darwin
goarch: arm64
pkg: github.com/nginxinc/nginx-go-crossplane
BenchmarkLex/simple-10                  70982    16581 ns/op   102857 B/op    37 allocs/op
BenchmarkLex/with-comments-10           64125    18366 ns/op   102921 B/op    43 allocs/op
BenchmarkLex/messy-10                   28171    42697 ns/op   104208 B/op   166 allocs/op
BenchmarkLex/quote-behavior-10          83667    14154 ns/op   102768 B/op    24 allocs/op
BenchmarkLex/quoted-right-brace-10      48022    24799 ns/op   103369 B/op    52 allocs/op
BenchmarkScan/simple-10                179712     6660 ns/op     4544 B/op    34 allocs/op
BenchmarkScan/with-comments-10         133178     7628 ns/op     4608 B/op    40 allocs/op
BenchmarkScan/messy-10                  49251    24106 ns/op     5896 B/op   163 allocs/op
BenchmarkScan/quote-behavior-10        240026     4854 ns/op     4456 B/op    21 allocs/op
BenchmarkScan/quoted-right-brace-10     87468    13534 ns/op     5056 B/op    49 allocs/op
PASS
ok      github.com/nginxinc/nginx-go-crossplane    13.676s
```

This alternative to `Lex` is probably a micro-optimization for many use cases, but as the size and number of NGINX configurations that need to be analyzed grows, the speedup matters more, and so does an API that feels familiar to Go developers who might use this tool for their own purposes.

Next steps:

- Use `Scanner` to "parse" NGINX configurations. I think this should be done in place so that the existing API works as is, but we should also expose a way for the caller to provide the scanner (see the usage sketch below).
- Deprecate `Lex` in favor of `Scanner`.
If we leave `Lex` in place then I don't think we would need a `v2` of the crossplane package (yet).
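
For reference, a minimal sketch of the two calling styles side by side. This is illustrative, not part of the commit: it assumes an inline config string, the `Lex` loop mirrors the new benchmark, the `Scan` loop follows scanner.go, and `NgxToken` is the package's existing token type.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"

	crossplane "github.com/nginxinc/nginx-go-crossplane"
)

const conf = "events {}\nhttp { server { listen 80; } }\n"

func main() {
	// Lex: a goroutine pushes NgxToken values over a channel and the
	// caller ranges over it until the channel is closed.
	for tok := range crossplane.Lex(strings.NewReader(conf)) {
		fmt.Println(tok)
	}

	// Scanner: the caller drives the state machine, one token per Scan
	// call, until Scan reports io.EOF.
	s := crossplane.NewScanner(strings.NewReader(conf))
	for {
		tok, err := s.Scan()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Println(tok.Text, tok.Line)
	}
}
```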
1 parent: 7dbe9ae

4 files changed (+386, -13)
lex_test.go (+44, -13)
```diff
@@ -252,22 +252,53 @@ func TestLex(t *testing.T) {
 	}
 }
 
-func TestLex_unhappy(t *testing.T) {
-	t.Parallel()
+var lexToken NgxToken //nolint: gochecknoglobals // trying to avoid return value being optimized away
+
+func BenchmarkLex(b *testing.B) {
+	var t NgxToken
 
-	testcases := map[string]string{
-		"unbalanced open brace":                    `http {{}`,
-		"unbalanced closing brace":                 `http {}}`,
-		"multiple open braces":                     `http {{server {}}`,
-		"multiple closing braces after block end":  `http {server {}}}`,
-		"multiple semicolons":                      `server { listen 80;; }`,
-		"semicolon after closing brace":            `server { listen 80; };`,
-		"open brace after semicolon":               `server { listen 80; {}`,
-		"braces with no directive":                 `http{}{}`,
-		"missing final brace":                      `http{`,
+	for _, bm := range lexFixtures {
+		b.Run(bm.name, func(b *testing.B) {
+			path := getTestConfigPath(bm.name, "nginx.conf")
+			file, err := os.Open(path)
+			if err != nil {
+				b.Fatal(err)
+			}
+			defer file.Close()
+			b.ResetTimer()
+
+			for i := 0; i < b.N; i++ {
+				if _, err := file.Seek(0, 0); err != nil {
+					b.Fatal(err)
+				}
+
+				for tok := range Lex(file) {
+					t = tok
+				}
+			}
+		})
 	}
 
-	for name, c := range testcases {
+	lexToken = t
+}
+
+//nolint:gochecknoglobals
+var unhappyFixtures = map[string]string{
+	"unbalanced open brace":                    `http {{}`,
+	"unbalanced closing brace":                 `http {}}`,
+	"multiple open braces":                     `http {{server {}}`,
+	"multiple closing braces after block end":  `http {server {}}}`,
+	"multiple semicolons":                      `server { listen 80;; }`,
+	"semicolon after closing brace":            `server { listen 80; };`,
+	"open brace after semicolon":               `server { listen 80; {}`,
+	"braces with no directive":                 `http{}{}`,
+	"missing final brace":                      `http{`,
+}
+
+func TestLex_unhappy(t *testing.T) {
+	t.Parallel()
+
+	for name, c := range unhappyFixtures {
 		c := c
 		t.Run(name, func(t *testing.T) {
 			t.Parallel()
```
scanner.go (new file, +230)
```go
package crossplane

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// Token is a lexical token of the NGINX configuration syntax.
type Token struct {
	// Text is the string corresponding to the token. It could be a directive or symbol. The value is the actual token
	// sequence in order to support defining directives in modules other than the core NGINX module set.
	Text string
	// Line is the source starting line number of the token within a file.
	Line int
	// IsQuoted signifies if the token is wrapped by quotes (", '). Quotes are not usually necessary in an NGINX
	// configuration and mostly serve to help make the config less ambiguous.
	IsQuoted bool
}

type scannerError struct {
	msg  string
	line int
}

func (e *scannerError) Error() string { return e.msg }
func (e *scannerError) Line() int     { return e.line }

func newScannerErrf(line int, format string, a ...any) *scannerError {
	return &scannerError{line: line, msg: fmt.Sprintf(format, a...)}
}

// LineNumber reports the line on which the error occurred by finding the first error in
// the errs chain that returns a line number. Otherwise, it returns 0, false.
//
// An error type should provide a Line() int method to return a line number.
func LineNumber(err error) (int, bool) {
	var e interface{ Line() int }
	if !errors.As(err, &e) {
		return 0, false
	}

	return e.Line(), true
}

// Scanner provides an interface for tokenizing an NGINX configuration. Successive calls to the Scan method will step
// through the 'tokens' of an NGINX configuration.
//
// Scanning stops unrecoverably at EOF, the first I/O error, or an unexpected token.
//
// Use NewScanner to construct a Scanner.
type Scanner struct {
	scanner            *bufio.Scanner
	lineno             int
	tokenStartLine     int
	tokenDepth         int
	repeateSpecialChar bool // only '}' can be repeated
	prev               string
}

// NewScanner returns a new Scanner to read from r.
func NewScanner(r io.Reader) *Scanner {
	s := &Scanner{
		scanner:            bufio.NewScanner(r),
		lineno:             1,
		tokenStartLine:     1,
		tokenDepth:         0,
		repeateSpecialChar: false,
	}

	s.scanner.Split(bufio.ScanRunes)

	return s
}

// Scan reads the next token from source and returns it. It returns io.EOF at the end of the source. Scanner errors are
// returned when encountered.
func (s *Scanner) Scan() (Token, error) { //nolint: funlen, gocognit, gocyclo
	var tok strings.Builder

	lexState := skipSpace
	newToken := false
	readNext := true
	esc := false

	var r, quote string

	for {
		switch {
		case s.prev != "":
			r = s.prev
			s.prev = ""
		case readNext:
			if !s.scanner.Scan() {
				if tok.Len() > 0 {
					return Token{Text: tok.String(), Line: s.tokenStartLine, IsQuoted: lexState == inQuote}, nil
				}

				if s.tokenDepth > 0 {
					return Token{}, &scannerError{line: s.tokenStartLine, msg: "unexpected end of file, expecting }"}
				}

				return Token{}, io.EOF
			}

			nextRune := s.scanner.Text()
			r = nextRune
			if isEOL(r) {
				s.lineno++
			}
		default:
			readNext = true
		}

		// skip CRs
		if r == "\r" || r == "\\\r" {
			continue
		}

		if r == "\\" && !esc {
			esc = true
			continue
		}

		if esc {
			esc = false
			r = "\\" + r
		}

		switch lexState {
		case skipSpace:
			if !isSpace(r) {
				lexState = inWord
				newToken = true
				readNext = false // re-eval
				s.tokenStartLine = s.lineno
			}
			continue

		case inWord:
			if newToken {
				newToken = false
				if r == "#" {
					tok.WriteString(r)
					lexState = inComment
					s.tokenStartLine = s.lineno
					continue
				}
			}

			if isSpace(r) {
				return Token{Text: tok.String(), Line: s.tokenStartLine}, nil
			}

			// parameter expansion syntax (ex: "${var[@]}")
			if tok.Len() > 0 && strings.HasSuffix(tok.String(), "$") && r == "{" {
				tok.WriteString(r)
				lexState = inVar
				s.repeateSpecialChar = false
				continue
			}

			// add entire quoted string to the token buffer
			if r == `"` || r == "'" {
				if tok.Len() > 0 {
					// if a quote is inside a token, treat it like any other char
					tok.WriteString(r)
				} else {
					quote = r
					lexState = inQuote
					s.tokenStartLine = s.lineno
				}
				s.repeateSpecialChar = false
				continue
			}

			// special characters treated as full tokens
			if isSpecialChar(r) {
				if tok.Len() > 0 {
					s.prev = r
					return Token{Text: tok.String(), Line: s.tokenStartLine}, nil
				}

				// only } can be repeated
				if s.repeateSpecialChar && r != "}" {
					return Token{}, newScannerErrf(s.tokenStartLine, "unexpected %q", r)
				}

				s.repeateSpecialChar = true
				if r == "{" {
					s.tokenDepth++
				}

				if r == "}" {
					s.tokenDepth--
					if s.tokenDepth < 0 {
						return Token{}, &scannerError{line: s.tokenStartLine, msg: `unexpected "}"`}
					}
				}

				tok.WriteString(r)
				return Token{Text: tok.String(), Line: s.tokenStartLine}, nil
			}

			s.repeateSpecialChar = false
			tok.WriteString(r)
		case inComment:
			if isEOL(r) {
				return Token{Text: tok.String(), Line: s.tokenStartLine}, nil
			}
			tok.WriteString(r)
		case inVar:
			tok.WriteString(r)
			if r != "}" && !isSpace(r) {
				continue
			}
			lexState = inWord
		case inQuote:
			if r == quote {
				return Token{Text: tok.String(), Line: s.tokenStartLine}, nil
			}
			if r == "\\"+quote {
				r = quote
			}
			tok.WriteString(r)
		}
	}
}
```
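
A design note on error handling: `Scan` returns failures as the unexported `scannerError`, which carries a line number, and callers recover it through the exported `LineNumber` helper and `errors.As`. A minimal sketch of that flow, using a repeated `{` (which the scanner rejects) to force an error:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"

	crossplane "github.com/nginxinc/nginx-go-crossplane"
)

func main() {
	// "http {{" repeats "{", and only "}" may be repeated.
	s := crossplane.NewScanner(strings.NewReader("http {{}"))
	for {
		tok, err := s.Scan()
		if errors.Is(err, io.EOF) {
			return
		}
		if err != nil {
			if line, ok := crossplane.LineNumber(err); ok {
				fmt.Printf("scan error on line %d: %v\n", line, err)
				return
			}
			fmt.Println("scan error:", err)
			return
		}
		fmt.Println(tok.Text)
	}
}
```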
