23 Oct 2024

expandfile in Go

I began writing expandfile in 2002. See expandfile-history.

Expandfile reads text files, expands macros, and writes a new text file. I use it to create HTML pages for web sites, and to perform other text transformation tasks.

My Project

These are my informal notes as I convert the Perl program expandfile to the Go language.

Go was started at Google in 2007 by Ken Thompson, Robert Griesemer, and Rob Pike. The language was released in 2009. The language was designed for large code projects and concurrent processing. It is easy to learn and use. It compiles to machine code. It has static types and garbage collects memory. It is free.

(Oct 2021) I began porting expandfile to Go 1.17.2. (I installed it with Homebrew.)

(Jan 2022) I restarted porting expandfile to Go 1.22.1. (Homebrew now has 1.22.4) I had expandfile mostly running, except for XML. There were performance problems and inconsistencies with the Multics expansions.

(Jun 2024) I restarted porting expandfile to Go 1.22.1.

Background: Expandfile in Perl

I wrote expandfile in Perl. (I have used Perl since 1996.)

The main expandfile application is about 75 lines of Perl. It uses a library module, expandfile.pm, of about 1500 lines of Perl, plus other smaller libraries for SQL and XML processing. expandfile uses about a dozen Perl library modules from CPAN.

In some senses, I wrote expandfile in a style of "PL/I written in Perl." One can write very terse Perl statements that do a lot, but I chose to write less dense code that I would be able to understand later. This made it easier to translate to other languages, such as Python.

expandfile is open source and available without fee from GitHub. Documentation for the program is online.

Expandfile in Go

The main expandfile application is 160 lines of Go. The application uses a library, xflib.go, of 2523 lines of Go, including SQL processing but not XML. The Go version uses about a dozen Go library modules, including SQL.

I tried to make my Go code clear and modular. Initially I tried to translate one line of Perl to one or more lines of Go. See below for some lessons.

I have not yet put the Go version of expandfile on GitHub.

History

(12 Dec 2021) The basic expandfile functions work. testexpandfile runs and most tests pass.

Known bugs (15 Jan 2022)

Work to be done (19 Jan 2022)

Progress (02 Apr 2024)

Restarted this project.

Bugs found

Plans (06 Apr 2024)

Rewrite ExpandMulticsBody. Decide if we do the state machine or regexps.

Get test to pass.

Write more tests.

Try testexpandfile.

Measure performance and see if it is adequate.

What about XML?

Progress (16 Jun 2024)

See expandfile4.e.

Performance

(17 Dec 2021) Expanding mx-net.htmx (2126 lines) took 0.889 seconds with Perl and 24.170 seconds with Go, about 27 times slower. This was unacceptable: recompiling the whole Multics site (478 files) would take over 3 hours instead of about 6 minutes. My first Go version of expandfile used a simple set of functions to simulate Perl lists.

I analyzed what the Perl expandfile was doing. Basically it made 14 passes over the file, doing one transform at a time: block binding, variable and builtin expansion, Multics lookups (4 types), Multics formatting (4 types done twice). Each pass replaced the whole copy of the file with a changed one. Perl is very efficient about this; Go is less so. (Perhaps the Go version of the program invokes the garbage collector a lot, or Go's memory management is not as fast as Perl's?) Further investigation with the profiler would help me understand this. I tried using the Go profiling tools but wasn't able to get them working.
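A minimal sketch of collecting a CPU profile with the standard runtime/pprof package, in case it is useful for a later attempt. The names profileCPU and manyPasses are hypothetical, and manyPasses is only a stand-in that rebuilds a string once per pass; it is not the expandfile code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"strings"
)

// profileCPU runs work while recording a CPU profile to the file at path.
func profileCPU(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run last-in first-out, so the profile is stopped
	// (and flushed) before the file is closed.
	defer pprof.StopCPUProfile()
	work()
	return nil
}

// manyPasses is a hypothetical stand-in for multi-pass expansion: each pass
// builds a complete new copy of the text.
func manyPasses(text string, passes int) string {
	for i := 0; i < passes; i++ {
		if i%2 == 0 {
			text = strings.ReplaceAll(text, "e", "E")
		} else {
			text = strings.ReplaceAll(text, "E", "e")
		}
	}
	return text
}

func main() {
	input := strings.Repeat("a line of sample text\n", 2000)
	path := filepath.Join(os.TempDir(), "expand-cpu.prof")
	if err := profileCPU(path, func() { manyPasses(input, 14) }); err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Println("profile written to", path)
}
```

The resulting file can be examined with go tool pprof, for example with its top command. A run this short may record few samples; a real expansion that takes several seconds should give a more useful picture.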

I did some experiments. I found that most of the slowdown came from the "Multics" source constructs that replace a string like "edited by {[VanVleck Tom Van Vleck [THVV]]}" with a hyperlink "edited by Tom Van Vleck [THVV]". This construct looks up an identifier in my local MySQL database and outputs a link.

I investigated whether MySQL access was slower in Go than in Perl by instrumenting the lookupSQL function. The 130 lookup calls in mx-net.htmx averaged a millisecond or two: not enough to explain over 20 seconds' delay.
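The kind of instrumentation described above can be sketched as follows. This is an illustration only: callStats and timedLookup are hypothetical names, and the lookup body just sleeps for a millisecond in place of a real database query.

```go
package main

import (
	"fmt"
	"time"
)

// callStats accumulates the call count and total elapsed time for one
// instrumented function.
type callStats struct {
	calls int
	total time.Duration
}

// record adds one call that began at start.
func (s *callStats) record(start time.Time) {
	s.calls++
	s.total += time.Since(start)
}

// average returns the mean time per call, or zero if nothing was recorded.
func (s *callStats) average() time.Duration {
	if s.calls == 0 {
		return 0
	}
	return s.total / time.Duration(s.calls)
}

var lookupStats callStats

// timedLookup is a hypothetical stand-in for a database lookup.
func timedLookup(key string) string {
	// time.Now() is evaluated when the defer statement executes, so it
	// marks the start of the call; record runs when the function returns.
	defer lookupStats.record(time.Now())
	time.Sleep(time.Millisecond) // pretend to query the database
	return "value-for-" + key
}

func main() {
	for i := 0; i < 130; i++ {
		timedLookup("VanVleck")
	}
	fmt.Printf("%d calls, total %v, average %v\n",
		lookupStats.calls, lookupStats.total, lookupStats.average())
}
```

Multiplying the average by the number of calls gives an upper bound on how much of the run time the lookups can account for.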

I rewrote expandMulticsBody() and cleanRef() to perform all the formatting, lookup, and unwrapping operations in a single pass over their inputs, rather than repeated passes. My Perl regular expressions became character loops over a string with a state machine. This reduced the number of passes over the input from 14 to 3. Expanding mx-net.htmx then took 3.537 seconds, a factor of 7 improvement, which left Go about 4x slower than Perl. Compiling the source file with the Multics constructs removed takes about 2 seconds: still 2x as long as Perl.

Furthermore, my translation of regexps to state machines did not handle all cases correctly. Other projects claimed my time and I set the Go version aside.
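To illustrate the single-pass idea, here is a much simplified sketch, not the actual expandMulticsBody(). It assumes only one construct, {[key text]}, where the first word is the lookup key and the rest is the display text. The function name expandRefs and the link callback are hypothetical, and the callback stands in for the database lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// expandRefs makes one pass over src, replacing each {[key text]} construct
// with link(key, text). All other text is copied through unchanged.
func expandRefs(src string, link func(key, text string) string) string {
	const (
		outside = iota // copying ordinary text
		inside         // collecting the body of a {[ ... ]} construct
	)
	var out, body strings.Builder
	state := outside
	for i := 0; i < len(src); i++ {
		switch state {
		case outside:
			if strings.HasPrefix(src[i:], "{[") {
				state = inside
				body.Reset()
				i++ // skip the '[' as well
			} else {
				out.WriteByte(src[i])
			}
		case inside:
			if strings.HasPrefix(src[i:], "]}") {
				// The first word is the key; the rest is the display text.
				key, text, _ := strings.Cut(body.String(), " ")
				out.WriteString(link(key, text))
				state = outside
				i++ // skip the '}' as well
			} else {
				body.WriteByte(src[i])
			}
		}
	}
	if state == inside {
		// Unterminated construct: emit it literally rather than lose text.
		out.WriteString("{[" + body.String())
	}
	return out.String()
}

func main() {
	// Hypothetical link format; the real program looks the key up in MySQL.
	link := func(key, text string) string {
		return `<a href="` + key + `.html">` + text + "</a>"
	}
	fmt.Println(expandRefs("edited by {[VanVleck Tom Van Vleck [THVV]]}", link))
}
```

Because only the two-character sequence ]} ends the construct, a single ] inside the display text (as in [THVV]) is kept as ordinary text. Real input has more cases than this, which is where a hand-written state machine can go wrong.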

(Aug 2022) I learned more about Go and rewrote the list handling routines to use a more object-oriented struct that wraps a container/list instance instead of a fixed array. With this change the Go version ran about 4x slower than Perl, roughly a factor of 7 faster than my first Go version.
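A struct of that general shape might look like the sketch below. This is an assumption about the design, not the code in xflib.go: the type name stringList and its methods are hypothetical, chosen to mirror Perl's push and shift.

```go
package main

import (
	"container/list"
	"fmt"
)

// stringList wraps a container/list so callers get Perl-like list
// operations on strings without touching list.Element directly.
type stringList struct {
	items *list.List
}

// newStringList builds a list holding the given values in order.
func newStringList(values ...string) *stringList {
	s := &stringList{items: list.New()}
	for _, v := range values {
		s.items.PushBack(v)
	}
	return s
}

// Push appends v to the end, like Perl's push.
func (s *stringList) Push(v string) { s.items.PushBack(v) }

// Shift removes and returns the first item, like Perl's shift.
// The second result is false when the list is empty.
func (s *stringList) Shift() (string, bool) {
	front := s.items.Front()
	if front == nil {
		return "", false
	}
	return s.items.Remove(front).(string), true
}

// Len reports the number of items.
func (s *stringList) Len() int { return s.items.Len() }

// Slice copies the items into an ordinary slice.
func (s *stringList) Slice() []string {
	out := make([]string, 0, s.items.Len())
	for e := s.items.Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(string))
	}
	return out
}

func main() {
	l := newStringList("a", "b")
	l.Push("c")
	first, _ := l.Shift()
	fmt.Println(first, l.Len(), l.Slice())
}
```

A linked list makes removal from the front cheap, which suits code translated from Perl that shifts items off a list in a loop.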

I think the remaining performance issues have to do with regular expression caching. epm.go used regexp.MustCompile in 21 places. I built a version that used github.com/umisama/go-regexpcache instead.
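For comparison, the same idea can be sketched with only the standard library. This is not the go-regexpcache API; cachedRegexp is a hypothetical helper. It assumes the patterns are compiled inside functions that run many times, so that each pattern would otherwise be recompiled on every call.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	reMu    sync.Mutex
	reCache = map[string]*regexp.Regexp{}
)

// cachedRegexp returns a compiled regexp for pattern, compiling it only the
// first time the pattern is seen. Like regexp.MustCompile, it panics if the
// pattern is invalid.
func cachedRegexp(pattern string) *regexp.Regexp {
	reMu.Lock()
	defer reMu.Unlock()
	if re, ok := reCache[pattern]; ok {
		return re
	}
	re := regexp.MustCompile(pattern)
	reCache[pattern] = re
	return re
}

func main() {
	lines := []string{"page 12", "no digits", "line 2126"}
	for _, line := range lines {
		// The pattern is compiled on the first iteration and reused after.
		fmt.Printf("%q -> %q\n", line, cachedRegexp(`\d+`).FindString(line))
	}
}
```

When the patterns are fixed strings, declaring each one once as a package-level variable initialized with regexp.MustCompile has the same effect without a map or a lock.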

File Setup

This section describes the files on my machine as I develop expandfilego.

Lessons

Here are some of the lessons I learned while writing the Go version of expandfile.

Go Syntax

Go's basic syntax is similar to C's. The big differences from Perl are:

It is reasonably easy to start by editing Perl source into Go with repeated edit passes, then trying to compile and fixing errors from the compiler. Most Go compiler error messages are clear and tell you what to fix.

Regular expressions

Go Semantics

Go Types

Resources

Many online resources are available for learning Go.

Tools

The Go compiler and runtime are easy to download and install. On my Mac, I issued the command brew install golang and installation was painless. Currently I have Go 1.18.4 installed.

Editing Go programs was tedious, until I installed golang-mode into Emacs. That made editing reasonable.

Testing

My Perl version has a test suite, testexpandfile, that exercises expandfile thoroughly. This was valuable for debugging the Go version. I changed export EXPAND=expandfile to export EXPAND="go run expandfile.go", then ran the tests and fixed problems, repeating until everything worked.

Source management and compiling

Numbers and conversions

Packages

Go requires programs to import specific packages from the Go library and from external repositories. xflib.go imports about a dozen Go library modules:

I have not found an adequate Go library module for XPath access to XML files. See below.

To fetch packages from GitHub: go get -u github.com/antchfx/xmlquery

Functions

Executing shell commands

expandfile's *shell builtin executes a command. The package to import that does this is os/exec. I converted my Perl code to call execCmd := exec.Command(args[0], args[1:]...) and then run it and capture its output.

The os/exec package locates the binary executable and calls it directly. That is, the command is not sent to the system shell to launch the target process. Perl's open($fh, "$cmd|") construct launches the command by calling the shell, which then invokes the command, as described in Section 16.3 of the Camel book.

I rewrote my external command builtin to invoke sh -c commandline with os/exec. This makes it work the way Perl does, and saves me from having to rewrite existing applications of expandfile. This has an efficiency penalty, because it launches a shell process and then the command, but *shell is not used often in my pages.
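The difference between the two approaches can be sketched as follows. The helper name runShell is hypothetical, and the example assumes a Unix-like system with sh, echo, and tr on the PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runShell hands the whole command line to sh -c, so pipes, redirection,
// and quoting behave as they would at a shell prompt. It returns the
// command's standard output.
func runShell(commandLine string) (string, error) {
	out, err := exec.Command("sh", "-c", commandLine).Output()
	return string(out), err
}

func main() {
	// Direct execution: no shell is involved, so the pipe character is
	// just part of the single argument passed to echo.
	direct, err := exec.Command("echo", "hello | tr a-z A-Z").Output()
	fmt.Printf("direct: %q err=%v\n", direct, err)

	// Through the shell: the pipeline is interpreted by sh.
	viaShell, err := runShell("echo hello | tr a-z A-Z")
	fmt.Printf("shell:  %q err=%v\n", viaShell, err)
}
```

The shell form costs one extra process per command, as noted above, but it keeps existing templates that rely on shell syntax working unchanged.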

CSV access

These functions pass basic tests. The Go version can read a local CSV file and expand a template. The *bindcsv test in testexpandfile passes for both local files and remote URLs. The *csvloop test also passes.
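As an illustration of the reading step, here is a minimal sketch using the standard encoding/csv package. It assumes the first CSV record holds the column names. The function names readCSVRows and parseCSV are hypothetical, and the sketch stops at producing one name-to-value map per row; it does not show how expandfile binds those values or expands a template.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// readCSVRows reads CSV data whose first record is a header and returns one
// map per data row, keyed by column name.
func readCSVRows(r io.Reader) ([]map[string]string, error) {
	// With default settings, ReadAll reports an error if a record has a
	// different number of fields than the first, so rec[i] below is safe.
	records, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(header))
		for i, name := range header {
			row[name] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

// parseCSV is a convenience wrapper for CSV text already held in memory.
func parseCSV(text string) ([]map[string]string, error) {
	return readCSVRows(strings.NewReader(text))
}

func main() {
	rows, err := parseCSV("name,year\nMultics,1965\nUnix,1969\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row["name"], row["year"])
	}
}
```

Because readCSVRows takes an io.Reader, the same function can read from a file opened with os.Open or from the Body of an HTTP response, which covers both the local and the remote case.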

MySQL access

(I tried to do this as a separate package, even though there is no current way to load this feature only if needed. The build failed with a circular dependency, so I merged the package content back into xflib.)

Because the query support does not return the number of rows, I have to count them as I read them with Next(). iterateSQL() therefore sets the row count at the end instead of at the beginning, so the per-row processing cannot use the value. I don't think this is an issue.
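The counting pattern can be sketched as follows. This is not the iterateSQL code: rowSource, iterateRows, and fakeRows are hypothetical names. The small interface lists only the two methods the loop needs, which *sql.Rows provides, so the sketch can be run against a fake instead of a live database.

```go
package main

import (
	"database/sql"
	"fmt"
)

// rowSource is the part of *sql.Rows that the counting loop uses.
type rowSource interface {
	Next() bool
	Err() error
}

// Compile-time check that *sql.Rows satisfies rowSource.
var _ rowSource = (*sql.Rows)(nil)

// iterateRows calls perRow once for each row, passing the 1-based row
// number, and returns the total. The total is known only after the loop.
func iterateRows(rows rowSource, perRow func(rowNum int) error) (int, error) {
	n := 0
	for rows.Next() {
		n++
		if err := perRow(n); err != nil {
			return n, err
		}
	}
	// Err reports any error that ended the iteration early.
	return n, rows.Err()
}

// fakeRows is a stand-in result set that yields a fixed number of rows.
type fakeRows struct{ remaining int }

func (f *fakeRows) Next() bool {
	if f.remaining == 0 {
		return false
	}
	f.remaining--
	return true
}

func (f *fakeRows) Err() error { return nil }

func main() {
	total, err := iterateRows(&fakeRows{remaining: 3}, func(rowNum int) error {
		fmt.Println("processing row", rowNum)
		return nil
	})
	fmt.Println("rows:", total, "err:", err)
}
```

The per-row callback can see the running row number but not the final count, which matches the limitation described above.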

Three functions are defined: openSQL, lookupSQL, and iterateSQL.

Opening the MySQL database sets ColumnsWithAlias: true in order to include the table name in the column name.

https://pkg.go.dev/database/sql describes the sql interface for Go. http://go-database-sql.org/varcols.html describes how to deal with the query result using reflection.

Interesting information on connection pooling and error handling: https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

XML access

Started on this feature. Going to try to use https://github.com/antchfx/xpath which executes XPath queries. There are difficulties with introspection.