I began writing expandfile in 2002. See expandfile-history.
Expandfile reads text files, expands macros, and writes a new text file. I use it to create HTML pages for web sites, and to perform other text transformation tasks.
My Project
These are my informal notes as I convert the Perl program expandfile to the Go language.
Go was started at Google in 2007 by Ken Thompson, Robert Griesemer, and Rob Pike. The language was released in 2009. The language was designed for large code projects and concurrent processing. It is easy to learn and use. It compiles to machine code. It has static types and garbage collects memory. It is free.
(Oct 2021) I began porting expandfile to Go 1.17.2. (I installed it with Homebrew.)
(Jan 2022) I restarted porting expandfile to Go 1.22.1. (Homebrew now has 1.22.4) I had expandfile mostly running, except for XML. There were performance problems and inconsistencies with the Multics expansions.
(Jun 2024) I restarted porting expandfile to Go 1.22.1.
Background: Expandfile in Perl
I wrote expandfile in Perl. (I have used Perl since 1996.)
The main expandfile application is about 75 lines of Perl. It uses a library module, expandfile.pm, of about 1500 lines of Perl, plus other smaller libraries for SQL and XML processing. expandfile uses about a dozen Perl library modules from CPAN.
In some senses, I wrote expandfile in a style of "PL/I written in Perl." One can write very terse Perl statements that do a lot, but I chose to write less dense code that I would be able to understand later. This made it easier to translate to other languages, such as Python.
expandfile is open source and available without fee from GitHub. Documentation for the program is online.
Expandfile in Go
In Go the main expandfile application is 160 lines of Go. The application uses a library, xflib.go, of about 2523 lines of Go, including SQL processing but not XML. The Go version uses about a dozen Go library modules including SQL.
I tried to make my Go code clear and modular. Initially I tried to translate one line of Perl to one or more lines of Go. See below for some lessons.
I have not yet put the Go version of expandfile on GitHub.
History
(12 Dec 2021) The basic expandfile functions work. testexpandfile runs and most tests pass:
- Basic variable setting and expansion works, including nested evaluation.
- Arithmetic and string builtin functions work.
- Builtin functions that loop and generate output by expanding an iterator block mostly work:
- *ssvloop works.
- *sqlloop works.
- *dirloop lists files, but some file attributes are not set.
- *xmlloop To be written.
- Multics formatting with and without SQL lookup works.
- (20 Dec 2021) Testing expansion of various Multics source files revealed bugs. Fixed them.
Known bugs (15 Jan 2022)
- Function getFirstSentence(), used to create TITLE tags in Multics expansions, returns with "..." where the Perl version does not. It also returns one more word. I need to twiddle with this function to make it identical to the Perl regexp version.
- Some macro expansions return trailing newlines in the Go version where the Perl version does not. For example, macro getcomma3 returns a trailing NL in Go, not in Perl.
Work to be done (19 Jan 2022)
- Write XML handling code for xflib, based on the Perl program readbindxml.pm. The popular XML libraries for Go require the user to compile the XML schema into a Go program -- this won't work for expandfile, which wants to discover the schema by examining the result of an XPath query like "/*/*". So I need to find a better library, or learn how to do this another way.
- (done) Package epm was renamed to xflib.
- Add testmx.htmt, testperlisms.htmt, testbracketerr.htmt, testsetscale.htmt and various Multics expansion tests to the test suite testexpandfile for both Perl and Go.
- Compile pages from multicians.org with Go. Make sure nothing crashes or comes out garbage. If there are differences from Perl, decide what to fix and how. If Go is too slow, fix. (see Performance below)
- Compile the whole Multics site and compare all files with Perl's output. Should be identical. Currently there are differences: the TITLE attribute of glossary references has one more word, and the extra newline characters mentioned above.
- (done) Turn off GO111MODULE and set up the modern build system.
Progress (02 Apr 2024)
Restarted this project.
- Changed regexps to quote the braces.
- Changed regexp package to wasilibs/go-re2.
- Renamed epm to xflib.
- Turned off GO111MODULE and set up go.mod.
Bugs found
- Bug 001 2024-03-04
- (6x) trace: ExpandMulticsBody SQL
- trace: ExpandMulticsBody {[]} nomatch
- re2 syntax is different from Go regexp. Rewrote code to use re2. Regexps that I think should match are not matching. A little regex test program fails.
- See https://cran.r-project.org/web/packages/re2/vignettes/re2_syntax.html
- I decided to rewrite ExpandMulticsBody to not use regular expressions. This should speed things up even more (see below) and be simpler.
Plans (06 Apr 2024)
Rewrite ExpandMulticsBody. Decide if we do the state machine or regexps.
Get test to pass.
Write more tests.
Try testexpandfile.
Measure performance and see if it is adequate.
What about XML?
Progress (16 Jun 2024)
see expandfile4.e.
Performance
(17 Dec 2021) Expanding mx-net.htmx (2126 lines) took 0.889 seconds with Perl and 24.170 seconds with Go -- 27x slower. This was unacceptable. Recompiling the whole Multics site (478 files) would take over 3 hours, instead of about 6 minutes. My first Go version of expandfile used a simple set of functions to simulate Perl lists.
I analyzed what the Perl expandfile was doing. Basically it made 14 passes over the file, doing one transform at a time: block binding, variable and builtin expansion, Multics lookups (4 types), Multics formatting (4 types done twice). Each pass replaced the whole copy of the file with a changed one. Perl is very efficient about this; Go is less so. (Perhaps the Go version of the program invokes the garbage collector a lot, or it is not as fast as Perl's?) Further investigation with the profiler will help understand this. I tried using the Go profiling tools but wasn't able to get them working.
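For reference, a minimal sketch of capturing a CPU profile with the standard runtime/pprof package. The helper profileCPU, the output file name cpu.prof, and the stand-in workload rewriteWholeBuffer are hypothetical names invented for this illustration; none of this is taken from xflib.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"strings"
)

// profileCPU (hypothetical helper) runs work while recording a CPU profile
// into the file at path.
func profileCPU(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run last-in first-out, so the profile is stopped
	// (and flushed) before the file is closed.
	defer pprof.StopCPUProfile()
	work()
	return nil
}

// rewriteWholeBuffer is a stand-in for one expansion pass that replaces
// the whole copy of the text with a changed one.
func rewriteWholeBuffer(s string) string {
	return strings.ReplaceAll(s, "a", "b")
}

func main() {
	text := strings.Repeat("abc\n", 100000)
	err := profileCPU("cpu.prof", func() {
		for i := 0; i < 14; i++ { // 14 passes, as in the Perl design
			text = rewriteWholeBuffer(text)
		}
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Println("wrote cpu.prof")
}
```

The resulting file can be examined with go tool pprof cpu.prof; the top command there lists the functions that used the most CPU time.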
I did some experiments. I found that most of the slowdown came from the "Multics" source constructs that replace a string like "edited by {[VanVleck Tom Van Vleck [THVV]]}" with a hyperlink "edited by Tom Van Vleck [THVV]". This construct looks up an identifier in my local MySQL database and outputs a link.
I investigated whether MySQL access was slower in Go than in Perl by instrumenting the lookupSQL function. The 130 lookup calls in mx-net.htmx averaged a millisecond or two: not enough to explain over 20 seconds' delay.
I rewrote expandMulticsBody() and cleanRef() to perform all the formatting, lookup, and unwrapping operations in a single pass over their inputs, rather than repeated passes. My Perl regular expressions became character loops over a string with a state machine. This reduced the number of passes over the input from 14 to 3. Expanding mx-net.htmx executed in 3.537 seconds, a factor of 7 improvement. Go is about 4x slower than Perl. Compiling the source file with the Multics constructs removed takes about 2 seconds: still 2x as long as Perl.
Furthermore, my translation of regexps to state machines did not handle all cases correctly. Other projects claimed my time and I set the Go version aside.
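A rough sketch of the single-pass idea, assuming a simplified construct of the form {[key text]}. It uses strings.Index to jump between delimiters rather than a character-by-character state machine, and the function name expandRefs and the link callback are hypothetical, not the real expandMulticsBody(). The callback stands in for the SQL lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// expandRefs makes one pass over src, replacing each {[key text]} construct
// with whatever link(key, text) returns. An unterminated construct is
// copied through unchanged.
func expandRefs(src string, link func(key, text string) string) string {
	var out strings.Builder
	i := 0
	for i < len(src) {
		// Copy ordinary text up to the next opening "{[".
		start := strings.Index(src[i:], "{[")
		if start < 0 {
			out.WriteString(src[i:])
			break
		}
		out.WriteString(src[i : i+start])
		body := i + start + 2
		// Collect the body up to the closing "]}".
		end := strings.Index(src[body:], "]}")
		if end < 0 {
			out.WriteString(src[i+start:])
			break
		}
		// The first word is the lookup key; the rest is the visible text.
		key, text, _ := strings.Cut(src[body:body+end], " ")
		out.WriteString(link(key, text))
		i = body + end + 2
	}
	return out.String()
}

func main() {
	// Hypothetical link format; the real program looks the key up in MySQL.
	link := func(key, text string) string {
		return `<a href="` + key + `.html">` + text + `</a>`
	}
	fmt.Println(expandRefs("edited by {[VanVleck Tom Van Vleck [THVV]]}", link))
}
```

Because the output is accumulated in a strings.Builder, the input is never copied and replaced as a whole, which is the cost the multi-pass version paid on every pass.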
(Aug 2022) I learned more about Go and rewrote the list handling routines to use a more object-oriented struct that wraps a container/list instance instead of a fixed array. With this change the Go version was about 4x slower than Perl, a factor of 7 more efficient than before.
I think the remaining performance issues have to do with regular expression caching. epm.go used regexp.MustCompile in 21 places. I built a version that used github.com/umisama/go-regexpcache instead.
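To illustrate what regular expression caching means here, a hand-rolled sketch using only the standard regexp package. This is not the API of go-regexpcache; the names cachedRegexp, reCache, and reMu are invented for the example. Caching matters most for patterns built at run time from variable values, since a constant pattern can simply be compiled once into a package-level variable.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	reMu    sync.Mutex
	reCache = map[string]*regexp.Regexp{}
)

// cachedRegexp compiles pattern on first use and returns the saved
// compiled form on later calls with the same pattern.
func cachedRegexp(pattern string) (*regexp.Regexp, error) {
	reMu.Lock()
	defer reMu.Unlock()
	if re, ok := reCache[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	reCache[pattern] = re
	return re, nil
}

func main() {
	// The pattern is built from a variable, as with Perl interpolation.
	for _, name := range []string{"title", "author", "title"} {
		re, err := cachedRegexp(`%\[` + regexp.QuoteMeta(name) + `\]%`)
		if err != nil {
			fmt.Println("bad pattern:", err)
			continue
		}
		fmt.Println(name, re.MatchString("x %[title]% y"))
	}
	fmt.Println("compiled patterns:", len(reCache))
}
```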
File Setup
This section describes the files on my machine as I develop expandfilego.
- $HOME/go
  - go-mode.el-master/ -- for Emacs
  - howto -- basic how to do it notes
  - other -- misc test programs
- $HOME/goproj
  - expandfile
    - main/
      - main.go
      - main -- result of compiling -- copy this to $HOME/bin/expandfilego
    - xflib/
      - xflib.go
    - go.mod, go.sum -- control files for go build
    - golang.org/ -- library programs imported from Go
    - github.com/ -- library programs imported from others
  - main/
    - tests -- misc small tests.. will develop these into the standard
  - expandfile
Lessons
Here are some of the lessons I learned while writing the Go version of expandfile.
Go Syntax
Go's basic syntax is similar to C's. The big differences from Perl are:
- no semicolons
- variable names do not have sigils $ or % or @
- comments are // instead of #
- if statement does not need parentheses around the condition; braces mandatory
- Perl trailing if and unless not supported
- no if defined()
- for statement is generalized: replaces while as well
- Perl ref becomes a much more general type switch, or use reflect.TypeOf()
- no Perl var $_
- String literals use " (or backquotes for raw strings) -- single quotes denote a single character (a rune)
- no & in front of function
- pointers to an object with * not \
- Perl evaluation of arrays in scalar context gives length, Go uses len()
- Perl hash becomes Go map, strongly typed
- Iterators in Go are powerful
- sort and grep have function syntax
- packages are very different. Cannot reach inside them the way Perl allows.
It is reasonably easy to start by editing Perl source into Go with repeated edit passes, then trying to compile and fixing errors from the compiler. Most Go compiler error messages are clear and tell you what to fix.
Regular expressions
- Regular expressions are used 87 times in expandfile's Perl code. Perl caches the JIT compilation of regular expressions invisibly.
- I needed to make some changes from the Perl version. Because performance sucked using the standard Go regexps, I tried rewriting the regexps from the Perl version into state machines, then decided to just use a modern RE package.
- RE features expandfile uses are: ., variable interpolation, capture groups with (), character classes and negative character classes, \S, \d, \r, \n, \s, * + and ? counts, \ for unspecialing, i and g modifiers, ^ and $ anchors, etc.
- Some features of Perl regular expressions are not used by expandfile and need not be supported in whatever package we pick. For example: \Q...\E, \p, \P, \X, \K, \R, backtracking, groups with {} and a count.
- Since Go regexps allow braces with counts, I had to escape all the braces in my regexps.
- The standard Go regexp library is slow. I chose https://github.com/wasilibs/go-re2, 20-40x faster. If I need to I can add caching, a la umisama/go-regexpcache.
Go Semantics
- Each Go program declares an import section and lists packages that provide language features.
- Packages export those functions whose name begins with uppercase.
- Terminal input and output are in package fmt e.g. fmt.Println
- Go variables are strongly typed. The allocation of each value includes a size.
- Go can infer a variable's type from initialization, e.g. fred := "" declares fred as a string.
Go Types
- Go has types. And Concrete Types. And inner types. And Interface types. I am still learning how to use all these appropriately.
- String values are immutable and include a length.
- Arrays, also subscripted in [], must have a fixed size declared; the size is part of the array's type.
- Go does not have implicit numeric conversions. Fine by me. Good compiler diagnostics help me put in explicit casts when needed.
- An array of bytes is different from string -- my initial version uses strings everywhere.
- Slices
- are a view on underlying array or string, subscripted in [x:y]
- subscripts are zero origin
- upper bound of a slice is tricky.. it is one past the last item included
- Hashes in Perl become maps in Go; each map's keys and values have fixed type.
- The SQL package uses interface{} types which are then cast to string values to put into the values table.
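A short sketch of the points above: slice bounds, arrays versus slices, a typed map, and turning an interface{} value into a string. The helper toString and the sample values are hypothetical; a type switch is one way to do the conversion, not necessarily the way xflib does it.

```go
package main

import "fmt"

// toString converts a value of unknown type, such as a SQL driver might
// hand back, into a string for storage in a map[string]string.
func toString(v interface{}) string {
	switch x := v.(type) {
	case nil:
		return ""
	case string:
		return x
	case []byte:
		return string(x)
	default:
		return fmt.Sprint(x)
	}
}

func main() {
	s := "expandfile"
	fmt.Println(s[0:6]) // bytes 0 through 5; the upper bound is excluded
	fmt.Println(s[6:])  // from byte 6 to the end

	var fixed [3]int   // an array: its size is part of its type
	part := fixed[1:3] // a slice: a view of elements 1 and 2
	part[0] = 42       // writes through to fixed[1]
	fmt.Println(fixed, len(part))

	// A map's key and value types are fixed when it is declared.
	values := map[string]string{}
	values["count"] = toString(int64(130))
	values["name"] = toString([]byte("THVV"))
	fmt.Println(values["count"], values["name"])
}
```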
Resources
Many online resources are available for learning Go.
- Tour of Go
- Go library documentation. Well written and thorough.
- Many Web sites answer questions about Go and provide code snippets. Use Google.
- https://go.dev/doc/effective_go
Tools
The Go compiler and runtime are easy to download and install. On my Mac, I issued the command brew install golang and installation was painless. Currently I have go 1.18.4 installed.
Editing Go programs was tedious, until I installed golang-mode into Emacs. That made editing reasonable.
Testing
My Perl version has a test suite, testexpandfile, that exercises expandfile thoroughly. This was valuable for debugging the Go version. I just changed export EXPAND=expandfile to export EXPAND="go run expandfile.go" and ran the tests, fixed problems, ran it again, until it worked.
Source management and compiling
- Basic scheme: source tree, $HOME/goproj/expandfile/main/main.go, $HOME/goproj/expandfile/xflib/xflib.go
- cd goproj/expandfile/main; go build
- go vet
- copy main $HOME/bin/expandfilego
- expandfilego args...
- Expandstring (and other functions) are in package xflib contained in xflib.go in directory xflib
- There is no way to load packages dynamically in Go. (This is disappointing; one of the problems for people using expandfile was that they had to install the library routines for SQL and XML even if they never used those features, and this process was tricky.)
Numbers and conversions
- If Perl can find a meaning for a statement, it will execute it; Go won't compile a statement that is not just right.
- Perl gets the values of a variable in a "context" -- numeric, scalar, etc and tries to automatically convert the value.
- Example: in Perl print "a"+1; will print 1, because it sees "a" in a numeric context and converts it to 0.
- expandfile depends on the underlying Perl semantics:
- Described in the Camel book
- Contexts: float vs integer, string vs numeric vs dont-care, scalar vs list
- Value interpolation of variables in double quoted strings
- I wrote a Go function perlnum() that takes a string argument and converts it to float, so I can do arithmetic in float and store the value back as a string representation of int or float. If the conversion fails, it returns 0.0.
- I wrote tests for testexpandfile to explicitly check that behaviors that expandfile depends on are followed.
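A minimal sketch of a perlnum()-style conversion, assuming the behavior described above: convert to float, return 0.0 on failure, and store results back as strings. The formatting helper numstr is a hypothetical name. Unlike Perl, this sketch does not use a leading numeric prefix: Perl treats "3abc" as 3, while this version returns 0.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// perlnum converts s to a float64, returning 0.0 when s is not a number.
func perlnum(s string) float64 {
	f, err := strconv.ParseFloat(strings.TrimSpace(s), 64)
	if err != nil {
		return 0.0
	}
	return f
}

// numstr (hypothetical) formats f with the fewest digits needed, so whole
// numbers come out looking like integers.
func numstr(f float64) string {
	return strconv.FormatFloat(f, 'f', -1, 64)
}

func main() {
	fmt.Println(numstr(perlnum("a") + 1))              // like Perl's "a"+1
	fmt.Println(numstr(perlnum("2.5") + perlnum("4"))) // float result
	fmt.Println(numstr(perlnum(" 7 ") * 2))            // whole-number result
}
```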
Packages
Go requires programs to import specific packages from the Go library and from external repositories. xflib.go imports about a dozen Go library modules:
- "github.com/wasilibs/go-re2" instead of "regexp"
- "io/ioutil"
- "net/http"
- "database/sql"
- "github.com/go-sql-driver/mysql"
- "time"
- "bytes"
- "fmt"
- "io"
- "compress/gzip"
- "os"
- "os/exec"
- "strconv"
- "strings"
- "container/list"
I have not found an adequate Go library module for XPath access to XML files. See below.
To fetch packages from GitHub: go get -u github.com/antchfx/xmlquery
Functions
- Perl sub becomes Go func
- sub twodigit { becomes func twodigit(x int) string {
- Go functions must have signatures (Perl added optional signatures to Perl5 but I did not use them)
- funcs can return multiple values: a convention is to return a value and an error (or a boolean ok flag)
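A sketch of both points. The body of twodigit is a guess at what a function with that name does (zero-padding to two digits), and lookup is a hypothetical function invented to show the multiple-return convention.

```go
package main

import (
	"errors"
	"fmt"
)

// twodigit formats x with a leading zero, so 7 becomes "07".
// The parameter and result types are part of the signature.
func twodigit(x int) string {
	return fmt.Sprintf("%02d", x)
}

// lookup returns the value for key and a nil error, or "" and a non-nil
// error when the key is missing.
func lookup(table map[string]string, key string) (string, error) {
	v, ok := table[key] // the second result reports whether key was present
	if !ok {
		return "", errors.New("no value for " + key)
	}
	return v, nil
}

func main() {
	fmt.Println(twodigit(7), twodigit(12))

	table := map[string]string{"month": "06"}
	if v, err := lookup(table, "month"); err == nil {
		fmt.Println("month =", v)
	}
	if _, err := lookup(table, "year"); err != nil {
		fmt.Println("error:", err)
	}
}
```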
Executing shell commands
expandfile's *shell builtin executes a command. The package to import that does this is os/exec. I converted my Perl code to call execCmd := exec.Command(args[0], args[1:]...) and then run and capture its output.
The os/exec package locates the binary executable and calls it directly. That is, the command is not sent to the system shell to launch the target process. Perl's open($fh, "$cmd|") construct launches the command by calling the shell, which then invokes the command, as described in Section 16.3 of the Camel book.
- Existing expandfile expansions sometimes quote their arguments: the shell stripped the quotes off. Go's os/exec doesn't do this, and my utilities were failing to find files because the arguments had quote characters in the filename.
- Existing expandfile expansions also sometimes issued commands like %[*shell,&result,="mysql < inputfile"]%, using the shell angle brackets to redirect input or output, or the vertical bar to run a pipeline. Go's os/exec doesn't provide this functionality.
I rewrote my external command builtin to invoke sh -c commandline with os/exec. This makes it work the way Perl does, and saves me from having to rewrite existing applications of expandfile. This has an efficiency penalty, because it launches a shell process and then the command, but *shell is not used often in my pages.
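The sh -c approach can be sketched as follows. The function name runShell is hypothetical, and the example assumes a Unix-like system where sh and tr are on the PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runShell passes the whole command line to sh -c, so that quoting,
// redirection, and pipelines are handled by the shell, and returns the
// command's standard output.
func runShell(commandline string) (string, error) {
	out, err := exec.Command("sh", "-c", commandline).Output()
	return string(out), err
}

func main() {
	// The shell strips the quotes and sets up the pipeline.
	out, err := runShell(`echo "hello world" | tr a-z A-Z`)
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Print(out)
}
```

Calling exec.Command with the same text split into separate arguments would pass the quote characters and the vertical bar to echo literally, which is the failure described above.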
CSV access
These functions pass basic tests. The Go version can read a local CSV file and expand a template. The *bindcsv test in testexpandfile passes for both local files and remote URLs. The *csvloop test also passes.
MySQL access
(Tried to do this as a separate package, even though there is no current way to load this feature only if needed. Failed to compile with a circular dependency. Merged the package content back into xflib.)
Because the query support does not return the number of rows, I have to count them as I read them with Next(). The row count is therefore set at the end of IterateSQL() instead of the beginning, so the iterator block cannot use the value. I don't think this is an issue.
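A sketch of counting rows while iterating. So that it runs without a database, it is written against a small interface, rowSource, that *sql.Rows satisfies, and it is exercised with a fake result set. The names rowSource, iterateRows, and fakeRows are hypothetical, and the single string column is an assumption made to keep the example short.

```go
package main

import "fmt"

// rowSource is the part of *sql.Rows this sketch needs.
type rowSource interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
}

// iterateRows calls each once per row and returns the number of rows seen.
// The count is only known after the last row has been read.
func iterateRows(rows rowSource, each func(name string)) (int, error) {
	n := 0
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return n, err
		}
		each(name)
		n++
	}
	return n, rows.Err()
}

// fakeRows stands in for a query result.
type fakeRows struct {
	data []string
	pos  int
}

func (f *fakeRows) Next() bool { f.pos++; return f.pos <= len(f.data) }

func (f *fakeRows) Scan(dest ...any) error {
	*(dest[0].(*string)) = f.data[f.pos-1]
	return nil
}

func (f *fakeRows) Err() error { return nil }

func main() {
	rows := &fakeRows{data: []string{"VanVleck", "Corbato", "Saltzer"}}
	n, err := iterateRows(rows, func(name string) { fmt.Println("row:", name) })
	fmt.Println("rows:", n, "err:", err)
}
```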
Three functions are defined: openSQL, lookupSQL, and iterateSQL.
Opening the MySQL database sets ColumnsWithAlias: true in order to include the table name in the column name.
https://pkg.go.dev/database/sql describes the sql interface for Go. http://go-database-sql.org/varcols.html describes how to deal with the query result using reflection.
Interesting: info on connection pooling and error handling. https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/
XML access
Started on this feature. Going to try to use https://github.com/antchfx/xpath, which executes XPath queries. There are difficulties with introspection.