Piwik is an alternative to Google Analytics

Posted on May 5, 2017 in Choose Privacy Week, data mining, libraries, Privacy Awareness, Privacy vs. Library 2.0, Protecting Privacy, student data privacy, Vendor Privacy

By Adam Chandler
Cornell University Libraries
Director, Automation, User Experience, and Post-Cataloging Services

Like many other libraries, Cornell University Library uses Google Analytics (GA) to track website usage. GA, designed to support Google’s primary revenue stream, advertising, has many strengths, especially the fact that it does not cost money. However, given the library tradition of protecting reader privacy, a compelling argument can be made that Google Analytics is inappropriate for libraries. After reviewing alternatives to GA in the wake of Edward Snowden’s revelations, we selected Piwik (piwik.org) as a replacement. Piwik is free, open source, and, perhaps most importantly, supports local data collection. In this brief blog post, I will summarize what some in the library literature say about web analytics tools, explain why we selected Piwik, and describe what is involved in migrating from GA to Piwik.

This blog post is an abridged version of a much longer article I co-authored with Melissa Wallace.1 In researching that article, we found librarian-authored recommendations to use Google Analytics in every year back to 2007. These articles make it clear that librarians like GA; what is odd is the extent to which the authors are disconnected from the reader privacy tradition in libraries. Privacy is occasionally mentioned as a consideration, but never with enough weight to change the recommendation to use Google Analytics. The most explicit statement against the use of GA in libraries that we found is a blog post by Susanna Galbraith, published by the Ontario Library Association. Galbraith writes:

Many of us in the library community who have a responsibility to assess the usage of our library’s websites have become very familiar with the popular Google Analytics. Google Analytics is free and robust, and yet the data it collects belongs to Google and is housed on U.S. servers, where data may be subject to the legislation of that country. While many may see this as inconsequential (hey, Canada.ca uses Google Analytics, why can’t we?), those of us in the library community who wish to uphold the longstanding tradition in our profession of protecting user privacy, may wish to seek other alternatives.2

We agree with Galbraith. For privacy-related reasons alone, Piwik is a better web analytics solution for libraries. It is also a powerful open source web analytics tool that is, feature for feature, on par with GA. The table below is a high-level summary of the two products.

Functionality | Piwik | Google Analytics
------------- | ----- | ----------------
Data storage | Library controlled server | Google controlled server
Data may be collected by JavaScript widget embedded on page | yes | yes
Data may be collected by ingesting Apache log files | yes | no
Command line SQL access to database | yes | no
Aggregate IP addresses to location-based groups defined by library | yes | no
Management of logins | Centralized | Decentralized
API | yes | yes
Real-time data | yes | yes
Event tracking | yes | yes
Segment or filter data | yes | yes
Customizable dashboard | yes | yes
E-commerce support | yes | yes
Goal conversion tracking | yes | yes
Search keywords | yes | yes
Geolocation | yes | yes
Heat mapping | yes | yes
Reporting features (email, export, etc.) | yes | yes
IP and URL exclusion | yes | yes
Plugins/CMS integration | yes | yes

 

Piwik installation was relatively simple: a library systems administrator followed the steps outlined in Piwik’s online documentation. The installation is hosted on a Cornell University server, with the university’s standard security profile in place, including periodic scans and monitoring by Cornell central IT. We chose a user-friendly, product-agnostic URL (webanalytics.library.cornell.edu), at which the installation could be completed through an easy point-and-click process. Beyond the default installation, we set up the recommended automated cron task to process reports periodically; without this task the system would recalculate statistics on the fly and would be considerably slower. Last, we used Piwik’s log import script to parse our Apache logs. This process was also straightforward: once configured, it runs automatically and requires little day-to-day maintenance.
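To give a sense of what importing Apache logs involves, here is a minimal TypeScript sketch. It is not Piwik's log import script; it only illustrates the kind of parsing such a script has to do, assuming the server writes Apache's "combined" log format. The names `LogHit`, `parseCombinedLogLine`, and `isPageView`, as well as the rule for skipping static assets, are assumptions made for this illustration.

```typescript
// Illustrative only: the field names and regular expression are assumptions
// for this sketch, not code taken from Piwik's log import script.
interface LogHit {
  ip: string;
  timestamp: string;
  method: string;
  path: string;
  status: number;
  referrer: string;
  userAgent: string;
}

// Apache "combined" format:
// %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
const COMBINED =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"$/;

// Parse one log line; returns null when the line is not in combined format.
function parseCombinedLogLine(line: string): LogHit | null {
  const m = COMBINED.exec(line.trim());
  if (m === null) return null;
  return {
    ip: m[1],
    timestamp: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    referrer: m[6],
    userAgent: m[7],
  };
}

// Hypothetical filter: count only successful GET requests for pages,
// skipping common static assets such as stylesheets and images.
function isPageView(hit: LogHit): boolean {
  return (
    hit.method === "GET" &&
    hit.status === 200 &&
    !/\.(css|js|png|jpe?g|gif|ico)$/i.test(hit.path)
  );
}
```

Because log files already contain the visitor IP address, timestamp, requested path, and referrer, this route needs no change to the web pages themselves, which is why it can run unattended once configured.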

In addition to collecting data from Apache logs, CUL also collects web statistics via JavaScript. While the JavaScript embed code must be added manually to each website, it allows for greater customization and additional features, such as a real-time map of visitors and the tracking of exit links. The JavaScript option also allows us to collect statistics on sites that are hosted by third parties, such as ILLiad and 360 Link.
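As a rough illustration of what the embed code configures, the TypeScript sketch below builds the list of commands that Piwik's JavaScript tracker reads from its `_paq` queue. The command names (`setTrackerUrl`, `setSiteId`, `trackPageView`, `enableLinkTracking`) are Piwik's; the helper `buildTrackerCommands`, the example host, and the site ID are placeholders invented for this sketch and are not Cornell's actual settings.

```typescript
// Sketch of the command queue that Piwik's JavaScript tracker reads.
// The base URL and site ID passed in are placeholders, not real values.
type TrackerCommand = [string, ...unknown[]];

function buildTrackerCommands(baseUrl: string, siteId: number): TrackerCommand[] {
  // Normalize so the endpoint is always "<base>/piwik.php".
  const base = baseUrl.replace(/\/+$/, "") + "/";
  return [
    ["setTrackerUrl", base + "piwik.php"], // where tracking requests are sent
    ["setSiteId", String(siteId)],         // which Piwik site the hits belong to
    ["trackPageView"],                     // record the current page load
    ["enableLinkTracking"],                // record outlinks (exit links) and downloads
  ];
}

// Example with a hypothetical analytics host and site ID.
const commands = buildTrackerCommands("https://analytics.example.org", 3);
```

In a real page, each command would be pushed onto the global `_paq` array and the tracker script (`piwik.js`) loaded from the same server. Because the tracker URL points at the library's own Piwik server rather than at the site being measured, the same snippet can be placed on pages hosted by a third party.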

We would be remiss if we failed to acknowledge that not every institution has the IT resources of a library like Cornell. Before Piwik can see widespread adoption across libraries, IT support is a gap that might need to be filled by a privacy-sensitive non-profit.


References

1. Adam Chandler and Melissa Wallace, “Using Piwik Instead of Google Analytics at the Cornell University Library,” The Serials Librarian 71, no. 3–4 (November 16, 2016): 173–79, doi:10.1080/0361526X.2016.1245645.

2. Susanna Galbraith, “Piwik: Breaking Away from Google Analytics,” Open Shelf, http://www.open-shelf.ca/160215-piwik/ (accessed February 15, 2016).