Contents

Output
Installation Instructions
Usage
Configuration
Details and Definitions
Extracting Logs
Writing a new report section

New: log4j attacks


Super Webtrax

Updated 08 Nov 2023

This is the help file for Super Webtrax version S24. Your feedback on the help file and the program is welcome.

SWT is open source and can be downloaded from https://github.com/thvv/swt

Brief Summary

sample first page

Super Webtrax (SWT) reads web server logs and produces an HTML page containing a daily web site usage report, covering the previous day's usage. The report has multiple report sections and many options.

Web servers, such as Apache and Nginx, write a log file entry every time they send a file to a user. Once a day, SWT loads a web server log into a MySQL database. SWT expands templates to produce HTML reports with graphs and tables.

I look at the report every day.

Report contents

A visit to your site is a sequence of web page views from the same net address. If SWT hasn't seen this address before, it's a new visitor. Some visits are from humans using a web browser: these are "non-indexer (or NI) visits". The rest are from web indexers such as Google and Yahoo that build search indexes, or from web crawlers mining pages for advertising: these are "indexer visits".

The report web page is divided into report sections by headings in blue bands. Click the little control on the extreme right of the blue band to expand a report section into a more detailed version.

For most people, the Month Summary and the Visit Details sections are the most interesting.

What SWT Does

SWT is a web server log file analysis program. It works best on logs that include the "referrer" and "browser" fields, such as the "NCSA Combined Format."

A typical Apache log file record looks like this:

207.46.13.81 - - [03/Jun/2021:00:01:14 -0400] "GET /mtbs/mtb757.html HTTP/1.1" 301 515 "-" "Mozilla/5.0"

Each log record has nine fields, separated by spaces. If a field might contain spaces, it is enclosed in quotes. The fields are

  [IP Address] [-] [username] [timestamp] [request] [statuscode] [bytes] [referrer] [user_agent]

where [request] is [verb] [resource] [protocol]
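To make the field layout concrete, here is a small Python sketch (SWT's own code is Perl; this regex and these names are illustrative only) that splits such a record into its nine fields:

```python
import re

# One NCSA Combined Format record; quoted fields may contain spaces.
# Illustrative sketch only -- SWT's actual parser is written in Perl.
LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)

def parse_record(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    ip, ident, user, ts, request, status, nbytes, referrer, agent = m.groups()
    verb, resource, protocol = request.split(" ", 2)  # [verb] [resource] [protocol]
    return {
        "ip": ip, "username": user, "timestamp": ts,
        "verb": verb, "resource": resource, "protocol": protocol,
        "status": int(status), "bytes": nbytes,
        "referrer": referrer, "user_agent": agent,
    }
```

Running parse_record on the sample record above yields status 301 and resource /mtbs/mtb757.html.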

SWT reads a web server log file, loads the log file into a MySQL database, and writes over 40 different graphical and tabular report sections summarizing visits and hits.

SWT output is extremely extensible and customizable.

SWT uses programs written in Perl and MySQL (both are free) and is therefore portable to many platforms.

Experience with SWT

Super Webtrax has been used since 2006 on a small number of sites. See the "future work" section below. I have used SWT on web sites that have a few dozen hits per day, and ones that had a million hits per day. I have extended SWT for specialized sites with custom reports that summarize logs generated by server-side applications, and reports that look for particular access patterns, such as a "funnel analysis" that analyzed users' progress through a transaction.

I have used SWT with web server log file extracts that cover a day's worth of accesses, from various ISPs. For example, Pair Networks places a daily log extract in the directory www_logs, named www.yyyyMMdd, if you configure this option.

I have used SWT on Unix and Linux server machines that generate log files covering many days, and occasionally roll over to a new log. For example, I have set up virtual servers on Rackspace, installed Apache and MySQL, and used a program, logextractor2 (supplied with SWT) to extract the previous day's log records into a temporary log file, and fed that file to SWT. I ran logextractor2 like this:

  logextractor2 -day yesterday /var/log/www_log /var/log/www_log.0 > oneday.log

to handle the case where the log might roll over during a day and split usage into two files. Then I fed oneday.log to SWT.
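The day-extraction idea can be sketched in Python as follows (a hedged illustration only: logextractor2 is a separate Perl tool with more options, and the function names here are hypothetical):

```python
import re

# Match the date part of an Apache timestamp such as [03/Jun/2021:00:01:14 -0400].
DATE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):')

def extract_day(lines, day):
    """Yield only the log lines whose [dd/Mon/yyyy:...] timestamp matches day."""
    for line in lines:
        m = DATE_RE.search(line)
        if m and m.group(1) == day:
            yield line

def merge_days(day, *log_contents):
    """Filter several (possibly rolled-over) logs into one day's records."""
    out = []
    for content in log_contents:
        out.extend(extract_day(content.splitlines(), day))
    return out
```

Concatenating the current log and the rolled-over log before filtering is what keeps a day that spans two files in one stream of records.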

I have used SWT to analyze traffic on a group of web server machines, extracting each server's previous-day log data and then modifying and merging the logs into a single stream of records, and producing one combined SWT report for all the servers.

SWT is oriented toward producing a single daily usage report and is not appropriate for real-time traffic monitoring. Sites with very large numbers of hits per day might want to create additional reports that summarize features of their usage.

 

Limitations of Web Logs

SWT can only display information from the logs it's given; some information never gets into the server logs at all (for example, pages served from browser or proxy caches generate no log entry).

SWT ignores these problems. This is reasonable for web sites with light to medium activity, where a burst of accesses from the same IP address is usually the result of one user's actions. SWT does not use web cookies or JavaScript or Flash code to distinguish visitors.

It is up to the reader of SWT reports to interpret patterns of accesses in the report, for instance noticing that pages are requested faster than a human user could read or click, or the sequence of pages read does not follow from the link structure of the site. SWT has a few heuristics for marking some visits as "Indexer," for example if a session begins with a hit on robots.txt or has a user_agent of a known web crawler.
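The robots.txt and user-agent heuristics just described can be sketched like this (the crawler list below is an illustrative stand-in for SWT's wtindexers configuration table):

```python
# Sketch of the indexer-marking heuristics: a visit is flagged "indexer"
# if it begins with a hit on robots.txt, or if any hit's user agent
# matches a known crawler. This list is illustrative only.
KNOWN_CRAWLER_AGENTS = ("Googlebot", "bingbot", "Slurp")

def is_indexer_visit(hits):
    """hits: list of (path, user_agent) tuples, in time order."""
    if hits and hits[0][0] == "/robots.txt":
        return True
    return any(
        crawler in agent for _, agent in hits for crawler in KNOWN_CRAWLER_AGENTS
    )
```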

History

Webtrax was a Perl program originally written by John Callender on best.com about 1995, and substantially enhanced by Tom Van Vleck from 1996 to 2005. Many users contributed suggestions and features. Like any program that grew incrementally over ten years, Webtrax had its share of mistakes and problems. Its large number of options (over 80) and their inconsistent naming and interaction became an embarrassment. The non-modularity of the program made sense when it was little, but became a problem that inhibited further enhancements. Vestigial features that might or might not work littered the code. Perl was a wonderful tool for writing Webtrax, but it was used in a low-level way and its performance and memory consumption limited the size of log that could be processed.

Super Webtrax represents a second generation, begun in 2006. It loads the log data into a temporary MySQL database table and then generates report sections from queries against the database. The new version uses 7-10 times less CPU and substantially less memory (a report on 241,000 hits took less than 30 minutes to create). Because each report section is generated from one or more database queries, adding new report sections and debugging is easier. SWT's totals are more accurate and consistent, and data is more consistently sanitized against XSS attacks. Each report section is generated from a template using expandfile.

The downside of the new implementation is that users need to install more tools in order to run the program, and report developers need to know more (e.g. SQL) to enhance the output.

Features of Super Webtrax

Facilities required

The programs have been tested on FreeBSD, Linux, and Mac OS X.

Future work

Super Webtrax has been used for years by its developer. Click the "info" link in the navigation bar for a list of bugs and changes.

The current version of Super Webtrax assumes a knowledgeable UNIX shell programmer is configuring it. There is no GUI-based installer or configurator: the scripts "configure" and "install" set up the system, and the user can re-run them to make some configuration changes, or edit SQL statements in a text file.

If the SQL database or the computer running SWT crashes during report generation, recovery requires a knowledgeable UNIX shell programmer to look at the partial results and take appropriate action to restart.

Further work could be done to optimize the SQL database structure and queries to improve performance and scalability.

Information Processed

Super Webtrax displays three kinds of information:

SWT loads web server log records into a database table so that they can be queried with SQL, and deletes the hits and detailed derived information when the next day's log is processed. The fundamental design assumption is that the user does not have enough storage to save every log record indefinitely. Instead, the program saves cumulative counts in SQL tables.

Operation

SWT performs the following steps:

  1. Load the web server log into SQL tables, normalizing some data items and detecting visits. (User-specified transforms are applied to the log data before loading.)
  2. Derive per-visit information by examining the rows for all log records.
  3. Perform SQL queries that calculate globally interesting totals.
  4. Create N report sections. For each section,
    1. Perform one or more SQL queries against the database.
    2. Expand an expandfile template that formats the query results into an HTML report section.
  5. Generate and execute SQL statements that update cumulative usage statistics.
  6. Create report sections summarizing cumulative usage statistics.

SWT is driven by a shell script that invokes functions for each step. One function loads the database; another performs SQL queries for various totals and loads their result into environment variables. For report sections, the reporting function invokes the utility expandfile to expand a template. Each template fetches its SQL query from the database, and then invokes SQL and formats the results into HTML. Different templates use different environment variables; a single template can be used to create multiple report sections by changing the SQL tables containing the query and the variables used for labeling the output. For example, there is one template that produces a horizontal bar chart preceded by three columns of numbers. Setting appropriate parameters in the configuration can cause different columns of different tables in the database to be queried and displayed. The configuration variables for SWT are stored in SQL tables.

Output

Super Webtrax produces its output in a series of report sections. Each report section can be enabled or disabled. By default, SWT produces its output in a web page, showing the heading for each report section and a brief summary. Clicking a control on the web page replaces the summary with the full content of the report section. SWT can optionally write additional output files for input to other programs. Below are all the report sections.

(Report section numbers in the list below vary depending on which report sections a site enables.)

Installation Instructions

Super Webtrax requires perl5 and MySQL 4.1 or better because it uses nested queries. The "configure" script will try a sample query to test that this feature works.

Perform the following steps on the machine that will be running Super Webtrax:

  1. Create a directory named /bin in your home directory for personal command line tools, and set your PATH.

          cd $HOME
          mkdir bin
          echo 'export PATH=$HOME/bin:$PATH' >> .bash_profile
          . .bash_profile
        

    (the last line above assumes your shell is bash.)

  2. Make sure you have a reasonably recent version of MySQL installed. You should install it before you install CPAN module DBD::mysql. Set up a database username and password.

    "On Unix, MySQL programs treat the host name localhost specially, in a way that is likely different from what you expect compared to other network-based programs. For connections to localhost, MySQL programs attempt to connect to the local server by using a Unix socket file. This occurs even if a --port or -P option is given to specify a port number. To ensure that the client makes a TCP/IP connection to the local server, use --host or -h to specify a host name value of 127.0.0.1, or the IP address or name of the local server. You can also specify the connection protocol explicitly, even for localhost, by using the --protocol=TCP option."

    Set up the file .my.cnf in your home directory. It should look like

    	[client]
    	user=dbusername
    	password=pass
    	host=domain
    	database=dbname
    	[mysqldump]       
    	user=dbusername
    	password=pass
    	host=domain
    	database=dbname
        

    Execute chmod 600 .my.cnf.

    Make sure the mysql command works.

  3. Make sure you have a reasonably recent version of Perl installed. (On a Mac, see https://formyfriendswithmacs.com/cpan.html).

    Make a link from /usr/local/bin/perl to the Perl you will be using, so that shebang lines will work.

    Set your environment variables, for example if your Perl version is 5.26, to

          export VERSIONER_PERL_PREFER_32_BIT="no"
          export PERL5LIB="$HOME/bin:/opt/local/lib/perl5/5.26"
          export PERL_LOCAL_LIB_ROOT="/opt/local/lib/perl5/5.26"
          export PERL_MB_OPT="--install_base \"/opt/local/lib/perl5/5.26\""
          export PERL_MM_OPT="INSTALL_BASE=/opt/local/lib/perl5/5.26"
        

    Install the CPAN modules LWP::Simple, Term::ANSIColor, DBI, DBD::mysql, XML::LibXML, and XML::Simple. (You have to install MySQL first because DBD::mysql's installation tests access to MySQL.)

  4. Install expandfile to your $HOME/bin. Install the Perl modules expandfile.pm, readbindsql.pm, readbindxml.pm, and readapacheline.pm in your $HOME/bin. (These files are supplied in the /tools subdirectory.)

    Type the command expandfile and you should get a usage message like USAGE: expandfile [var=value]... tpt....

    Create a MySQL database for the log data. (If you wish to produce multiple SWT reports on one machine, you must create a different database for each report.)

  5. If you are going to do GeoIP processing on your log file, using the MaxMind free geolocation database,

    • download and install libmaxminddb from https://github.com/maxmind/libmaxminddb
    • copy /usr/local/include/maxminddb.h and /usr/local/include/maxminddb_config.h into /opt/local/include/
    • install the CPAN modules Try::Tiny, GeoIP2::Database::Reader, and MaxMind::DB::Reader::XS
    • get a license key from MaxMind (see GeoIP Processing)
    • arrange to install the weekly geolocation database. (I use a cron job.)

  6. Create the directory swt/install-swt in your home directory.

    Visit https://github.com/thvv/swt in your browser. Click the green "Code" button. You can choose "Clone" or "Download ZIP." Move the downloaded files into your swt/install-swt directory. This populates the directory install-swt, including subdirectories install-swt/tools and install-swt/live.

Configure

Configure SWT by executing the command

  cd install-swt; ./configure

The first time you run configure, you will be asked for several data items.

Answer these questions.

The result of running configure is a file, CONFIGURE.status, which records the desired configuration. If you run configure when a CONFIGURE.status exists, it asks the questions again but offers the answer recorded in the file as the default, so it is easy to change just a few configuration values and hit RETURN to accept the rest.

The configure script runs simple tests to ensure that mysql works and can create, load, and query SQL database tables with the supplied MySQL server name, database, userid, and password.

The configure script tests to ensure that expandfile works and can access the question answers from the shell environment. It then checks that expandfile can access the database and perform a nested SELECT with the supplied MySQL server name, database, userid, and password.

configure tests to make sure that logextractor2 works and that the MaxMind database is found, if GeoIP processing was selected.

configure uses expandfile to generate shell helper and configuration files that will be used when swt is executed.

Install

Check over the files that result from configuration, and then execute:

  ./install

The new software will be installed in the installation directory you specified. If it appears that this is a COLD install, you will be asked

  reset database???

and if you answer yes, the cumulative databases will be re-initialized.

Tailor the Result of Install

You can tailor your SWT configuration to your local setup by modifying the file swt-user.sql. The configure script sets up an initial version, but you may wish to add more information, such as:

Log File Translation

To set up log file translation, tailor the cron job script created by configure to create your report page once a day.

If the web logs provided by your ISP do not contain referrer and agent, then Super Webtrax will not work well for you. If you control the web server configuration, select the NCSA Combined log format.

Where are your web logs and what are they called? The generated cron job assumes that some other agency places the logs in a specific directory, possibly gzipped. If your web logs must be copied from another machine, you may need to ensure that you can access the logs and handle the case of log rollover. This will require some shell script editing.

Does your web server log contain hits from one day or many?

  1. One day, extracted by your ISP. You may still need to use logextractor2, see below.
  2. Many days. Use logextractor2 to extract one day's hits at a time and process them. This option is not yet handled by configure.

How does your web server log indicate the source of a web transaction?

  1. Contains numeric IPs only. Use logextractor2 to do a reverse DNS lookup on IPs in input log file by specifying the -dns cachefile argument.
  2. Contains domain names as a result of reverse IP lookup by the ISP. You may still want to use logextractor2, see below.

If the cachefile gets to be large, like 16MB, then bad things will happen: truncate it every so often.
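The -dns cachefile behaves like a simple memo table: each address is resolved once and the answer is reused. A minimal sketch (hypothetical names; the resolver is passed in so the sketch runs without network access, where a real run would wrap socket.gethostbyaddr):

```python
# Cached reverse-DNS lookup, in the spirit of logextractor2's -dns cachefile.
# Each IP is resolved at most once; failures fall back to the numeric IP.
def make_cached_resolver(resolve, cache=None):
    cache = {} if cache is None else cache
    def lookup(ip):
        if ip not in cache:
            try:
                cache[ip] = resolve(ip)
            except OSError:
                cache[ip] = ip  # no PTR record: keep the numeric address
        return cache[ip]
    return lookup, cache
```

Persisting the cache dictionary to disk gives the cachefile; as noted above, it should be truncated occasionally before it grows too large.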

GeoIP Processing

Do you want IPs shown with a country name suffix and optional city, e.g. 12.178.27.243[US/Palo Alto CA] or adsl-226-123-174.mem.bellsouth.net[US]? logextractor2 can do this at additional cost in processing time, by using free data from MaxMind. To do this,

Does your web server log directly to SQL? If so, you will need to write a program to extract hit data from the table written by the server, and modify the swt and swtfunct.sh scripts to run it in place of logvisits.

Test

Try running Super Webtrax once from the command line and see what happens.

  ./swt http_log_file

It should produce swtreport.html. Correct any problems.

The installer generates a cron script to run Super Webtrax every night and to move the output files to your web statistics display directory. This job may require hand editing to adapt it to your operating system and account. Because jobs started by cron do not execute your shell startup, you should set $PATH and $PERL5LIB in your crontab. Try running the cron script from the command line to see if it creates a report page that looks right, and correctly moves it into your web space.

When the cron script is ready to install, use the facilities provided by your account to schedule it. On Linux or Unix, this may be the crontab -e command, or some other method provided by your operating system resource manager. Wait till the job runs and check the output. There can be access problems because the environment for cron jobs is not the same as the command line. Once you get a clean run of SWT, it should run without further supervision. I set my cron jobs up so that the program output is mailed to me every day, and glance at the message and delete it.

About once a month, I visit each client machine and delete files that have been processed.

Parts of Super Webtrax

The main parts of SWT are:

Data

Super Webtrax uses MySQL to store its data. It stores three kinds of data in the database:

Usage

Your nightly cron script will obtain a log data file to be processed, and then execute

  ./swt inputfile

If the log file name ends with ".gz", ".z", or ".Z", the file will be read through gzcat to unzip it.
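The suffix check can be sketched as follows (illustrative only: SWT itself pipes such files through gzcat, and true Unix compress ".z"/".Z" files need gzcat or zcat rather than Python's gzip module):

```python
import gzip

def needs_unzip(path):
    # The suffixes SWT treats as compressed.
    return path.endswith((".gz", ".z", ".Z"))

def open_log(path):
    # gzip.open covers the common .gz case; .z/.Z would need external gzcat.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")
```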

The swt script reads the log and generates the report page. The cron script is responsible for

Configuration

The "config" link on the navigation links bar at the top and bottom of the web page goes to a generated configuration web page that displays the current reporting configuration.

General configuration

swtconfig.htmi is a configuration file that contains the location of the database server, the database name, and the user name and password on the server. It is referenced by the swt script. Secure this file correctly if you are on a shared server. You may also want to set up a .my.cnf file (also secured) for use by the mysql command.

mysqlload and mysqlrun are shell scripts that invoke mysql to source a file or run a single command; mysqldumpcum is a shell script that invokes mysqldump to save the cumulative data. They are referenced by the swt script. These files may contain the database password if the user cannot set up a .my.cnf file; secure them correctly.

Global Options

The following values can be set in swt-user.sql in the wtglobalvalues table.

  Name -- Meaning (Default)

  CHECKSWTFILES -- script that makes sure files are present; can be overridden by swt-user.sql ($HOME/swt/tools/checkswtfiles)
  CLEANUP -- file deletion command; for debugging, change rm to echo (rm)
  COMMANDPREFIX -- prefix commands with this command; can be "nice" or null (nice)
  CONFIGFILE -- pathname of database configuration file; should be mode 400, has database password (swtconfig.htmi)
  cumquerybytemin -- min number of bytes to keep query in wtcumquery table (2500)
  cumquerycntmin -- min number of queries to keep query in wtcumquery table (2)
  DASHFILE -- name of output dashboard report (swtdash.csv)
  DATADIR -- directory where data files are kept; can be overridden by swt-user.sql ($HOME/swt)
  ECHO -- change to "true" to shut the program up (echo)
  EXPAND -- template expander command (./tools/expandfile)
  gbquota -- bandwidth quota in GB for this account (0)
  gbquotadrophighest -- Y to drop the highest day of the month in the bandwidth calculation (N)
  glb_bar_graph_height -- height of a bar in horizontal graph, also width of a bar in vertical graph (10)
  glb_bar_graph_width -- width in pixels of horizontal bar graph (500)
  IMPORTANT -- component name of output important visits report (important)
  IMPORTANTFILE -- name of last-7-days report (important.html)
  LOGVISITS -- Perl program (perl ./tools/logvisits3.pl)
  MYSQLDUMPCUM -- how to invoke mysqldump; contains password, must match config file, mode 500 (./tools/mysqldumpcum)
  MYSQLLOAD -- invoke mysql to source a file; contains password, must match config file, mode 500 (./tools/mysqlload)
  MYSQLRUN -- invoke mysql for one command; contains password, must match config file, mode 500 (./tools/mysqlrun)
  ndomhistdays -- number of days to keep in wtdomhist table (366)
  OUTPUTFILE -- name of output report (swtreport.html)
  PATHSFILE -- name of output paths report (paths.dot)
  pieappletheight -- height of pie chart (220)
  pieappletwidth -- width of pie chart (260)
  postamble -- file copied at bottom of report (no default)
  preamble -- file copied at top of report (no default)
  PRINTVISITDETAIL -- Perl program (perl ./tools/printvisitdetail3.pl)
  PROGRAMDIR -- directory where templates are installed; can be overridden by swt-user.sql ($HOME/swt)
  REPORTDIR -- where to move the output report; should be overridden by swt-user.sql ($HOME/swt/live)
  returnurl -- URL of user website, for return link (index.html)
  siteid -- short name for the dashboard (User)
  sitename -- title for the usage report (User Website)
  stylesheet -- name of style sheet (swtstyle.css)
  TOOLSDIR -- directory where tools are kept; can be overridden by swt-user.sql ($HOME/swt/tools)
  urlbase -- absolute URL prefix to live help (http://www.multicians.org/thvv/)
  VISITDATA -- Perl program (perl ./tools/visitdata3.pl)
  visitdata_refspamthresh -- more than this many different referrers in one visit is a spam sign (2)
  WORDLIST -- Perl program (perl ./tools/wordlist3.pl)
  wtversion -- version of this program, for selecting the help file (S24)

Configuration tables

The following configuration tables can be extended or altered in swt-user.sql.

  wtboring -- Boring pages. Discourage display of a visit in the important details.
  wtcolors -- Watched pages. Which pages should be shown in details and what color to display them in.
  wtexpected404 -- Files expected to be not found. Suppress these from the not-found listing.
  wtglobalvalues -- Global constants and configuration, documented above.
  wthackfilenames -- File names that do not exist on the site and that attackers look for. Evidence of hacker attacks.
  wthackfiletypes -- File suffixes that do not exist on the site and that attackers look for. Evidence of hacker attacks.
  wtheadpages -- Which pages are head pages.
  wtindexers -- Web crawlers. Which user agents are web spiders etc.
  wtlocalreferrerregexp -- Local referrer definitions. Defines which domains count as part of the website.
  wtpclasses -- File assignments to visit class.
  wtpiequeries -- Pie chart queries and weights.
  wtpredomain -- Transformations applied to the source domain of each hit before processing.
  wtprepath -- Transformations applied to file paths before processing.
  wtprereferrer -- Transformations applied to referrers before processing.
  wtreferrercolor -- Watched referrers. Which referring pages should be shown in details and in color.
  wtreportoptions -- Report option values, documented with the individual reports.
  wtretcodes -- Return code explanations. Describes the HTTP error codes.
  wtrobotdomains -- Robot domains. Which domains are used only by web crawlers.
  wtshowanyway -- Combinations of referrer and pathname to display even if wtsuffixclass says not to.
  wtsuffixclass -- Suffix classes. Grouping of file suffixes, and display options.
  wtvclasses -- Visit class definitions and color assignments.
  wtvsources -- Visit source definitions and color assignments. These sources are built into visitdata.pl.
  wtwatch -- Domains and browsers to display specially in the details report.

Output Formatting

The web page is formatted using a standard style sheet unless you override the wtglobalvalues.stylesheet configuration item in swt-user.sql. If no style sheet is specified, the following definitions are used:

  <style>
  /* Styles for Super Webtrax report */

BODY {background-color: #ffffff; color: #000000;}
H1, H2, H3, H4 {font-family: sans-serif; font-weight: bold;}
h1 {font-size: 125%;}
h2 {font-size: 110%;}
h3 {font-size: 100%;}
h4 {font-size: 95%;}
th {font-family: sans-serif;}
.headrow {background-color: #ddddff;}
h2 {background-color: #bbbbff;}
h3 {background-color: #ccddff;}
.brow {}
.vc {}
.indexer {font-style: italic;}
.refdom {font-weight: bold;}
.firstrefdom {font-weight: bold; color: blue;}
.authsess {background-color: #ffffaa;}
.newref {color: red;}
.max {color: red;}
.min {color: blue;}
.query {color: green;}
.details {font-size: 80%;}
.details dt {float: left}
.details dd {margin-left: 40px}
.legendbar {font-size: 80%; font-weight: normal;}
.navbar {font-size: 70%;}
.chart {}
.chart td {font-size: 90%; padding-top: 0; padding-bottom: 0; margin-top: 0;
	   margin-bottom: 0; border-top-width: 0; border-bottom-width: 0;
	   line-height: 90%;}
.monthsum {}
.monthsum td {font-size: 80%; padding-top: 0; padding-bottom: 0; margin-top: 0;
	      margin-bottom: 0; border-top-width: 0; border-bottom-width: 0;
	      line-height: 80%;}
.analysis {font-size: 90%;}
.analysis td {font-size: 90%;}
.sessd {}
.pie {}
.fnf {color: gray;}
.cac {color: pink;}
.fbd {color: green;}
.flg {color: purple;} /* flag for 'wtwatch' notes */
.filetype {font-size: 80%;}
.illegal {}
.subtitle {font-size: 80%;}
.fineprint {font-size: 80%;}
.logtime {font-style: italic;}
.logtext {font-style: italic;}

.cpr2 {padding-right: 2em;} /* cell-pad-right */
.cpl2 {padding-left: 2em;} /* cell-pad-left */
.numcol {padding-left: 5px; text-align: right;} /* cell-pad-left-align-right */
.mthsum {padding-right: 10px;}
.alert {background-color: #ffffff; color: red;}

.vhisto {padding: 0 10px 0 0}
.vbar {padding: 0 2px 0 1px; margin: 0 0 0 0; vertical-align: bottom;
       font-family: sans-serif; font-size: 8pt; }

img.block {display: block;}
a:link {color: #0000ff;}
a:visited {background-color: #ffffff; color: #777777;}
a:hover {background-color: #ffdddd; color: black;}
a:active {background-color: #ffffff; color: #ff0000;}

.ctl {font-size: 12pt; float: right;} /* for the [+] control */
.h2title {text-decoration: none; color: black;}
h2 a:link {color: black;}
h2 a:visited {color: black;}
h2 a:hover {background-color: #ffdddd; color: black;}
h2 a:active {color: black;}

.starthidden {display: none;}
.short {font-size: 80%;}
.datatable {margin: 10px; float: left;}
.datatable td {padding-left: 5px;}
.datacanvas{float: left;}

.inred {color: red;}
.inblue {color: blue;}
.ingray {color: gray;}
.ingreen {color: green;}
.inorange {color: orange;}
.inpink {color: pink;}
.inpurple {color: purple;}
.inyellow {color: yellow;}
.inblack {color: black;}
.incyan {color: cyan;}
.indarkblue {color: darkblue;}
.infuchsia {color: fuchsia;}
.ingoldenrod {color: goldenrod;}
.inindigo {color: indigo;}
.inlightgreen {color: lightgreen;}
.inlime {color: lime;}
.inmaroon {color: maroon;}
.innavy {color: navy;}
.inolive {color: olive;}
.insilver {color: silver;}
.inteal {color: teal;}
.inviolet {color: violet;}
.inwhite {color: white;}

  </style>

Details and Definitions

Referrers

If the log being processed includes the referrer string, it indicates what page a visitor's browser was displaying when it generated a request for your file. If the hit was generated by a search engine, the query may be included in the referrer string. (Google has mostly stopped including this: see "Search Engines and Queries" below.) SWT uses the referrer string to drive many of its analyses. Web crawlers sometimes spoof the referrer string; web proxies sometimes remove it.

Search Engines and Queries

SWT detects some queries as coming from search engines. Many popular engines are built in. In 2013, Google changed to use HTTPS security for many search references. This changed the way that it presents links to result sites; the links no longer show the text for a search query in the REFERRER field, so SWT cannot display or summarize such queries.

How can a visit be "local?"

If what appears to be a visit starts with a hit referred by a local page, this may be a sign that the visitor is reading the site very slowly, or that the visitor is accessing your site through a proxy that uses more than one address (microsoft.com seems to do this). Some web servers also seem to write hits to their logs out of order, which can produce the same effect. I have raised the default expire_time to 30 minutes and still see a lot of these on my site. Visits that begin with a "local" hit are marked with "*" in the visit details.

Treating Certain Referrers as local to your site

Configure this in swt-user.sql table wtlocalreferrerregexp

Ignoring certain Visitors

Many people have asked to be able to ignore their own hits on their site.

Toplevel Domains

By default, SWT summarizes hits by toplevel domain, e.g. ".com". For toplevel domains that correspond to a country, the country name is shown.

Remapping File Names, Domains and Referrers

Configure this in swt-user.sql tables wtpredomain, wtprepath, wtprereferrer.

Return Code Summary

SWT produces a table of all transactions logged by the web server, organized by return code. Most of the transactions will have code 200; but code 304 means that a distant proxy was checking to see if the file had changed, so it also counts as a hit. Code 206 means that part of the file was returned; a big file might be requested in chunks. Currently SWT counts all of these transactions as hits, since it can't tell which partial content answers are part of the same request. Other return codes are counted but their transaction is not considered a hit.
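The counting rule described above can be sketched as follows (illustrative Python only; SWT does this with SQL queries):

```python
# Codes that count as hits: 200 OK, 206 partial content, 304 not modified.
HIT_CODES = {200, 206, 304}

def count_hits(status_codes):
    """Tally transactions by return code and count those that are hits."""
    by_code = {}
    hits = 0
    for code in status_codes:
        by_code[code] = by_code.get(code, 0) + 1
        if code in HIT_CODES:
            hits += 1
    return hits, by_code
```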

Platform Summary

SWT produces a chart of total accesses by platform, that is, by operating system, based on the user-agent string. This chart matches patterns against the user-agent string sent by the browser and is not 100% accurate (browsers and crawlers can misrepresent themselves).

Definition of Terms

hit

Each file transmitted by the server to a browser is logged by the web server as a "hit." For example, if a visitor visits an HTML page that refers to three GIFs, a .css file, and a Java applet, this visit would generate at least six hits, one for the HTML file, three for the GIF files, one for the css file, and one (or more) for the applet binary. (Assuming the visitor's browser has Java enabled and is loading images.) SWT can be told to ignore certain hits in various ways.

visit

If there is a sequence of hits from the same domain, these are counted as a single visit. If the hits stop for longer than a certain idle time, and then start again, SWT will see two visits. You can configure the length of the idle interval. (See "How can a visit be 'local?'".)
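The idle-time rule can be sketched like this (an illustration of the splitting logic only; the name `expire_time` follows the text, not SWT's internals):

```python
def split_visits(hit_times, expire_time=30 * 60):
    """Group a sorted list of hit timestamps (seconds) from one address
    into visits: a gap longer than expire_time starts a new visit."""
    visits = []
    for t in hit_times:
        if visits and t - visits[-1][-1] <= expire_time:
            visits[-1].append(t)   # within the idle interval: same visit
        else:
            visits.append([t])     # idle too long (or first hit): new visit
    return visits
```

With the default 30-minute interval, hits at 0, 60, and 120 seconds form one visit, and a hit much later forms a second.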

authenticated visit

If a section of the website requires the visitor to give a userid and password, the entire visit by the user will be marked with the userid given.

visit class

Each visit is classified with an identifier. Visits that appear to be from web indexers are classified as "indexer". Web indexer visits are detected if the visit touches the file "robots.txt" (which must exist for this check to work), or if it comes from a web indexer URL (from table wtrobotdomains) or user agent (from table wtindexers) listed in the configuration.

Non-indexer visits are assigned a class by looking at the class of each file hit, and choosing the most popular.

The table wtpclasses in the configuration defines a class specifier for file pathnames relative to the server root. File names can be specified as a full path name or as a directory prefix; the most specific match is chosen. Directory prefixes should begin and end with a slash. If no matching class is found and the hit pathname contains a directory, the first-level directory name is used as the class name. The class specifier is a comma-separated list of class names, so a file can be declared a member of more than one class. An ambiguous specification with N classes contributes a weight of 1/N to each class.
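The matching and weighting rules above can be sketched as follows (a simplified illustration; `pclasses` stands in for the wtpclasses table, and the example paths are hypothetical):

```python
def classify_hit(path, pclasses):
    """Return {class: weight} for one file hit.  pclasses maps exact
    paths, or directory prefixes beginning and ending with a slash, to a
    comma-separated class specifier; the most specific (longest) match
    wins.  With no match, the first-level directory name is the class."""
    best = None
    for key, spec in pclasses.items():
        if path == key or (key.endswith('/') and path.startswith(key)):
            if best is None or len(key) > len(best[0]):
                best = (key, spec)
    if best:
        classes = best[1].split(',')
    else:
        parts = path.strip('/').split('/')
        classes = [parts[0]] if len(parts) > 1 else []
    # An ambiguous specifier with N classes weighs 1/N for each class.
    return {c: 1.0 / len(classes) for c in classes}

def classify_visit(paths, pclasses):
    """Assign a visit the most popular class among its file hits."""
    totals = {}
    for p in paths:
        for c, w in classify_hit(p, pclasses).items():
            totals[c] = totals.get(c, 0) + w
    return max(totals, key=totals.get) if totals else None
```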

The table wtvclasses in the configuration specifies the color the class will be shown in, and a short explanation of the class.

page

For comparing website activity, HTML page loads (that is, hits on pages with a suffix associated with 'html' in table wtsuffixclass) are more interesting than hits.

hit source

Each hit is classified as to its source; a hit may be the result of a search, the result of a link, generated by a web indexer, a local reference, or unspecified.

engine

Search engines are detected by examining the referring URL, which contains the URL of the search engine's results page and often the query used to search.

query

When a hit appears to come from a search engine, SWT tries to determine what the engine was searching for. It can't always extract the query; some engines, like Gamelan, don't put the query term in the referring URL, and in these cases SWT doesn't show a query. Google encrypts the query, so we don't show those. See Search Engines and Queries above.
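When the engine does put the term in the referring URL, extracting it amounts to parsing the URL's query string. A simplified sketch (the parameter names tried here are common conventions, not SWT's full engine-specific list):

```python
from urllib.parse import urlsplit, parse_qs

def extract_query(referrer, params=('q', 'query', 'p')):
    """Pull the search term out of a referring URL, if present.
    Tries a few common parameter names; returns None when the engine
    does not expose the query (e.g. encrypted referrers)."""
    qs = parse_qs(urlsplit(referrer).query)
    for name in params:
        if name in qs:
            return qs[name][0]
    return None
```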

link

Other web sites' pages can have hyperlinks to yours. If the person browsing your site uses a browser that sends the referrer info, and if your web server puts that information in the log, you can see who links to you and how often those links are used. SWT will summarize the number of links to your pages.

illegal

An illegal hit is a reference to an object on your site (not a source file) from a referrer that is not a source file on your site. One cause for this is people linking to your graphics from their pages. Another possible cause is an incorrect referrer string sent by a browser.
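The rule for flagging illegal hits can be sketched as follows (a simplified illustration; `is_source_file` is a hypothetical stand-in for SWT's suffix-class test):

```python
def is_source_file(path):
    """Hypothetical stand-in for SWT's suffix-class check: treat .html
    pages as source files, everything else as objects."""
    return path.endswith('.html') or path.endswith('.htm')

def is_illegal_hit(path, referrer_path, referrer_is_local):
    """An illegal hit is a non-source object fetched with a referrer
    that is not a source file on this site."""
    if is_source_file(path):
        return False
    return not (referrer_is_local and is_source_file(referrer_path))
```

A GIF fetched with a remote page as referrer is illegal; the same GIF fetched from one of your own HTML pages is not.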

Visits with no HTML

Visits that do not reference any source files are summarized separately. Such visits may result from web crawlers that look only at graphics files or PDF files, or from illegal references to your graphics from others' sites, or from a reference to a graphic, PDF, or whatever on your site in a mail message. These visits are not shown in the visit details section.

Visitor

Each hit comes from a machine identified by its Internet domain name like barney.rubble.com. If the visitor cannot be identified by name, its IP Number is shown. If geoIP processing is performed by logextractor2, the IP will have a country name suffix (and optional city name) in brackets.

DSPV

Days since previous visit.

DSLV

Days since last visit. 0 if visited today.

Toplevel Domain

Toplevel domains are the least specific part of the name, like .com or .de.

Cumulative

SWT accumulates some statistics for a period longer than a day... someday.

Indexers

Search engines work by reading your pages and building a big index on disk. When they do this it creates a sequence of hits. SWT will count these separately if you tell it the names of the search engines' indexers or domains, and if the browser (user agent) name is provided in the log. You can suppress these indexer visits from the visit details by setting an option; if the hits are displayed, they are in the CSS class "indexer", which a custom style sheet can decorate.
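The three detection rules (robots.txt fetch, known indexer domain, known indexer user agent) can be sketched as follows (illustrative only; the real patterns come from the wtrobotdomains and wtindexers tables, and the patterns in the test are hypothetical):

```python
import re

def is_indexer_visit(paths, domain, user_agent, robot_domains, indexer_agents):
    """Class a visit as an indexer if it fetched robots.txt, came from a
    known indexer domain, or sent a known indexer user-agent string."""
    if any(p.endswith('robots.txt') for p in paths):
        return True
    if any(re.search(pat, domain) for pat in robot_domains):
        return True
    if any(re.search(pat, user_agent) for pat in indexer_agents):
        return True
    return False
```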

Visit Details

You can suppress visits with fewer than a specified number of HTML pages: the default is 1. You can suppress visits by indexers.

Here is an example listing:

      16:38 xxx01.xxx.net -- g.html (gamelan:-) 0:01, ga.class 2:25, gv.class [4, 212 KB; MSIE] {code}
    

For each visit, Webtrax shows

  • Time of the visit. If $show_each_hit is "yes", the following are shown:
    • A double hyphen (--)
    • Names of files displayed. GIF files, sounds, etc. are not shown. Files whose path matches a filedisplay option will have their names shown in a specified CSS class. (Using this feature requires that you use a custom CSS style sheet.) If the rettype option is used to specify that transactions with a given server return code are to be shown in a CSS class, the class will be applied. The classes "fnf" (gray) and "cac" (pink) are in the built-in style sheet; others can be provided in a custom style sheet.
    • For all but the last file referenced, the time between references (in mm:ss format).
    • If a reference came from an external URL, it is shown in parentheses. This URL will be made a clickable hyperlink if it looks like doing so would work; it will be colored red the first time SWT sees this referrer. If this is a search engine query, the query parameters will be shown in the URL in green.
  • The number of hits and the number of KB transferred for this visit, in square brackets. If $show_browser_in_details is true, the browser type is included.
  • The visit class of the visit in braces.

Extracting Logs

The program logextractor2 is supplied with SWT. It reads an NCSA [combined] web server log and extracts a day's worth of data. It optionally does reverse DNS lookup on numeric IPs. It also optionally does geoIP lookup on numeric IPs, and Super Webtrax will accept domains with the geoIP lookup already done.

    nice logextractor2 [-dns cachefile] [-geoipcity $HOME/lib/GeoLite2-City.mmdb] -day mm/dd/yyyy filepath ... > outpath
    nice logextractor2 [-dns cachefile] [-geoipcity $HOME/lib/GeoLite2-City.mmdb] -day yyyy-mm-dd filepath ... > outpath
    nice logextractor2 [-dns cachefile] [-geoipcity $HOME/lib/GeoLite2-City.mmdb] -day yesterday filepath ... > outpath
    nice logextractor2 [-dns cachefile] [-geoipcity $HOME/lib/GeoLite2-City.mmdb] -day all filepath ... > outpath

Finds all log entries that occurred on the given day and writes them to stdout.

The program can use the free geolocation database provided by MaxMind Inc. at www.maxmind.com: specify the -geoipcity argument with the path of the binary "city" database. You need a (free) license to download the database.

Merging Logs

Your web server may serve pages for multiple domains. One way to handle this is to map each domain to a separate directory and use the ability to name a visit class after a toplevel directory. However, some servers produce a separate virtual host web usage log for each domain served; these server logs must be merged into a single log, altering the toplevel directory to distinguish the sites. The programs combinelogs and logmerge are supplied with SWT. logmerge reads multiple NCSA [combined] web server logs, merges them, and writes a combined log. combinelogs finds the files to merge and prepares the arguments. This facility should be run before processing with logextractor2.

  nice $BIN/combinelogs combinelogs.conf | sh

where combinelogs.conf looks like

  www -drop /thvv
  lilli.com -add /lilli
  formyfriendswithmacs.com -add /formyfriendswithmacs
  multicians.org

Each line in combinelogs.conf lists the prefix of one log file. combinelogs looks in the current working directory for sets of logs having the same date and invokes logmerge to merge them. Log files are expected to be named e.g. www.20110418.gz, where the .gz suffix is optional; the eight digits are required. If -add or -drop is specified, it is passed through to logmerge to alter the top level file pathname by adding or dropping a prefix. The resulting output is named comb.20110418.gz.
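The file-matching convention described above (a listed prefix, eight required digits, optional .gz suffix) can be sketched as follows (an illustration of the naming rule, not the combinelogs code itself):

```python
import re

def group_logs_by_date(filenames, prefixes):
    """Group log files named <prefix>.YYYYMMDD[.gz] by date, so that
    each date's set of logs can be handed to logmerge."""
    pat = re.compile(r'^(?P<prefix>.+)\.(?P<date>\d{8})(\.gz)?$')
    by_date = {}
    for name in filenames:
        m = pat.match(name)
        if m and m.group('prefix') in prefixes:
            by_date.setdefault(m.group('date'), []).append(name)
    return by_date
```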

Supporting SWT on Report Generating Machines

Each web server computer that prepares an SWT report runs a daily CRON job to invoke SWT once a day: it generates an HTML formatted report on the previous day's usage. These computers can obtain the web logs for the previous day in several ways.

Running More Than One SWT on a Single Server

SWT uses a set of tables with fixed names for each report that it produces. To generate more than one SWT report on a server, you need to create more than one MySQL database. Each SWT instance will have its own directory subtree containing tailored files generated by configure and install. The instance will have its own swt-user.sql containing report tailoring and parameters.

For example, I manage an ISP account server used for client websites that provides virtual hosts for 13 domains. Separate web logs are generated by the ISP for each domain. One of these domains gets its own SWT report; the rest of the domains are combined into a single report.

To set this up, create a MySQL database for each group of sites, and a directory where SWT will be installed. Arrange for separate web logs to be generated for each site. Then install SWT once for each group. I set up the CRON job for daily log processing so that it moves the logs for each group into a different directory, and then runs SWT daily processing twice, from the different install directories. The 11 logs for the combined sites are processed by logmerge to create a combined log, with file names rewritten to include the site ID.

SWT Maintenance on my computer

Here is info on my personal setup that maintains and updates SWT installations on five web server machines. (Some of this needs to be updated.)

Dashboard

At a non-published URL, I have set up a CGI that displays a daily status table. It expands an HTMX template that shows information for the previous day.

I usually check this once a day.

Writing a New report section

Every site will have a different swt-user.sql and cron job. You may need a local copy of the main shell script swt. Most control table changes can be done in swt-user.sql since it overrides swt.sql. New report sections can be provided to all SWT clients, or written specially for a particular client.

To create a new report section:

  1. Write the query first and try it out in mysql. If a new table is needed, figure out how to populate it from the existing tables or other data.
  2. If a new table is needed, write its init file and update statement, and try them out in mysql.
  3. Write the .htmt file. Usually it is best to start with an existing template and adapt it.
    1. From the query, determine the variables that should be passed to the template. The htmt file should not contain any SQL queries or knowledge of what the variable names in the query are: these come from SQL configuration tables. (A one-off report can relax this rule.)
    2. If the bar is to be striped or if names are to be colored or hyperlinked, adapt code from a similar template. For striped bars, the basic trick is to change the GROUP BY in the main query to return a row per stripe, and to switch to a new bar whenever some field changes value; often this requires a self join with the same table in order to get per-row totals as well as per-segment totals. MySQL behavior with GROUP BY is sometimes surprising and this may take tedious experiment.
    3. If the report section heading should contain totals, or if the longest bar in the chart is not the first, additional SQL queries must be run. Decide if these are global queries used by multiple report sections, or local queries for this report section only.
    4. Try to write the .htmt file so that it can be used by multiple report sections with the same layout but different data.
  4. Add the new report section to the table of report sections, wtreports. If this is for a single user, add it in swt-user.sql.
  5. Add the report section's parameters to the table of report section options, wtreportoptions. At least the template and enabled values should be provided.
  6. Add the report section's queries (if any) to the table of queries, wtqueries.
  7. If any new global queries are needed, add them to swt.sql and globalqueries.htmt.
  8. Test the new report section.
  9. Add a sectionrep call to the main shell script swt.
  10. To make the report available on client machines, push the template and swt.sql and swt to those that get it. If necessary, add configuration overrides to the client's version of swt-user.sql and push that also.

Changing the HTML format of a site's report

Changing header formatting

Individual reports are expanded by the "sectionrep" macro in swt, given the section ID as argument, as listed in the "wtreports" table. The "wtreportoptions" table identifies parameters for reports; these values are mapped into environment variables visible to expandfile, e.g. ('rpt_403','template','report403.htmt','') will cause the variable rpt_403_template to be defined with value report403.htmt at runtime. Each report template .htmt file has similar boilerplate in its header, some of which, including the show/hide control, is in rptheader.htmi. If a report has a short and a long form, the show/hide character is shown in the H2: clicking on any text in the H2 switches the report from short to long and back. Tables start open instead of closed if a line like the following is included in swt-user.sql:

  INSERT INTO wtreportoptions VALUES('rpt_domain','start','long','long=start with long report') 
  ON DUPLICATE KEY UPDATE optvalue='long';
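The mapping from wtreportoptions rows to expandfile variables described above can be sketched as follows (illustrative only; the row layout follows the example in the text):

```python
def options_to_vars(rows):
    """Turn (section, option, value, description) rows from
    wtreportoptions into the variable names seen by expandfile:
    ('rpt_403','template','report403.htmt','') -> rpt_403_template."""
    return {f"{section}_{option}": value
            for section, option, value, _desc in rows}
```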