ILOG Discovery > Data Sets > Web logs

In this folder, you will find the weblogs.pjd project, which lets you perform Web site analysis. The data is extracted from a one-day log of an internal Web site. As such, it does not show some of the interesting patterns found on public sites, such as webbot activity or long-term usage behaviors. To view your own log files in Discovery with the views defined below, follow the instructions at the end of this page.


What pages are viewed and when?

This view is a regular Map with the following attributes:

What you see

A first glance at this view reveals the structure of the Web site: several top-level directories (doc, websupport, dev, and so on), each containing other directories, such as one per product in the doc subhierarchy.

Next, notice that this representation emphasizes the most viewed pages hierarchically. For instance, the doc hierarchy has the most hits, since it covers the largest area. Within this subhierarchy, the solver44 and solver50 subhierarchies have the most accesses. This view therefore tells you at a glance which parts of your Web site are the most popular.

Finally, the rainbow effect on each page shows how accesses to that page vary over time. Pages with a fairly uniform color were mostly accessed at one point in time, whereas pages that display a rainbow were accessed at regular intervals. A shift toward red (resp. blue) in the rainbow distribution denotes a slowdown (resp. an acceleration) in page accesses. For instance, pages in the ~truong hierarchy were accessed more in the late period than in the early period.
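Outside Discovery, the hit counts behind the surface areas described above can be approximated by counting requests per directory prefix. The following is a minimal Python sketch, not part of the product; the URLs are made up for illustration and the function name hits_by_prefix is hypothetical.

```python
from collections import Counter

# Hypothetical URLs, made up for illustration; not taken from the real log.
urls = [
    "/doc/solver50/index.html",
    "/doc/solver44/index.html",
    "/doc/solver50/intro.html",
    "/websupport/index.html",
    "/dev/index.html",
]

def hits_by_prefix(urls, depth):
    """Count hits per directory prefix made of the first `depth` path segments."""
    counts = Counter()
    for url in urls:
        segments = [s for s in url.split("/") if s]
        # Drop the file name so that only directories are counted.
        directories = segments[:-1]
        counts["/" + "/".join(directories[:depth])] += 1
    return counts

top_level = hits_by_prefix(urls, 1)      # one entry per top-level directory
second_level = hits_by_prefix(urls, 2)   # one entry per product directory

for prefix, hits in top_level.most_common():
    print(prefix, hits)
```

Raising the depth argument corresponds roughly to descending one level in the hierarchy shown by the view.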

What you can do


Who visits the Web site, and when?

This view is similar to the first one, but for the cluster parameter:

What you see

A first glance at this view reveals who accesses your Web site and when. On logs from external Web sites, this usually reveals the webbots and other automatic processes that scan your Web site.

Here, unsurprisingly, most visitors come from the ilog.fr subdomain. Some users, like latitude200, accessed many pages during a short time period; others, like eyquem, accessed the site fairly regularly throughout the day.

As in the first view, the color reveals time trends, but on a per-user basis.

All the operations suggested for the first view can be useful in this view too. To refine the representation, you can also add a cluster level on URL to see which pages have been accessed.


When is the Web site accessed?

What you see

You see Web usage in hourly slices; unsurprisingly, the highest activity occurs between 9am and 6pm. The 11am time slot shows heavy usage by a single user (in pink/red).

This view is very useful on external Web sites for viewing the activity of webbots and indexing engines and understanding their exploration patterns. To refine the representation, you can add a cluster level on URL to see which pages were accessed during each time slot.
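The hourly slices described above can also be approximated outside Discovery by grouping requests by hour and by host. The following is a minimal Python sketch under stated assumptions: the host names and timestamps are hypothetical, and the function name hits_per_hour is made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical (host, timestamp) pairs, made up for illustration.
requests = [
    ("host-a", "31/Aug/2000:09:05:11 +0200"),
    ("host-b", "31/Aug/2000:11:02:40 +0200"),
    ("host-b", "31/Aug/2000:11:15:03 +0200"),
    ("host-b", "31/Aug/2000:11:48:59 +0200"),
    ("host-a", "31/Aug/2000:17:30:00 +0200"),
]

def hits_per_hour(requests):
    """Return {hour: Counter({host: hits})} for the given requests."""
    by_hour = defaultdict(Counter)
    for host, stamp in requests:
        # Timestamps use the day/month-name/year layout of Web log files.
        hour = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").hour
        by_hour[hour][host] += 1
    return by_hour

by_hour = hits_per_hour(requests)
for hour in sorted(by_hour):
    total = sum(by_hour[hour].values())
    print(f"{hour:02d}:00  {total} hits  {dict(by_hour[hour])}")
```

A time slot dominated by a single host, like the 11am slot in this made-up sample, is the kind of pattern that the color coding makes visible in the view.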


Spotting anomalies in Web usage

As usual, a parallel coordinates view is a good starting point for assessing which further classifications and explorations may be worth doing.

This view is directly obtained with Visualization > Parallel Coordinates, then sorted by Host, then by Result to reveal page requests that have led to errors (404 errors, at the top of the view).

What you see

A few indications can readily be obtained from this view. First, the Protocol column is split between HTTP/1.0 and HTTP/1.1 requests, indicating that some users could benefit from upgrading their browsers.

In the Type column, we notice not only GET operations but also POST and HEAD. Is this normal? Clustering on Type and viewing the detailed operations made with POST would give the answer.

Finally, we notice that the Result column shows a significant number of 404 errors (page could not be found). We may want to study this a bit further. To do so, you may want to switch to advanced mode and cluster by Result type, which lets you focus on the 404 errors encountered in the log.
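The observations above amount to looking at the distribution of values in each column. As a rough illustration outside Discovery, the following Python sketch counts the distinct values per column; the rows are hypothetical and the function name column_distributions is made up for this example.

```python
from collections import Counter

# Hypothetical (protocol, type, result) rows, made up for illustration.
requests = [
    ("HTTP/1.0", "GET", 200),
    ("HTTP/1.1", "GET", 200),
    ("HTTP/1.0", "POST", 200),
    ("HTTP/1.1", "HEAD", 404),
    ("HTTP/1.0", "GET", 404),
]

# Column names as they appear in the parallel coordinates view.
COLUMNS = ("Protocol", "Type", "Result")

def column_distributions(requests):
    """Return {column name: Counter of the values found in that column}."""
    return {
        name: Counter(row[index] for row in requests)
        for index, name in enumerate(COLUMNS)
    }

distributions = column_distributions(requests)
for name in COLUMNS:
    print(name, dict(distributions[name]))
```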


Where do errors happen?

To further investigate the causes of errors, we created this view by clustering the data set first by Result, then by URL, and setting the projection's color to the Host name, to show both who produced the errors and for which requests.

Finally, we focus on the 404 section using the navigate tool.

What you see

This view focuses on the requests that have produced 404 errors.

We notice that these errors are spread across the hierarchy and among users. A Web site manager may want to look for pages on the site that contain these erroneous references, or ask the users how they encountered these errors.
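The same grouping, first by Result, then by URL, with the Host attached, can be sketched outside Discovery as follows. This is a minimal Python illustration under stated assumptions: the records are hypothetical and the function name errors_by_url is made up for this example.

```python
from collections import defaultdict

# Hypothetical (host, url, result) records, made up for illustration.
records = [
    ("host-a", "/doc/index.html", 200),
    ("host-a", "/doc/missing.html", 404),
    ("host-b", "/doc/missing.html", 404),
    ("host-b", "/dev/old/page.html", 404),
    ("host-c", "/websupport/index.html", 200),
]

def errors_by_url(records, status=404):
    """Map each URL that returned `status` to the sorted hosts that requested it."""
    hosts = defaultdict(set)
    for host, url, result in records:
        if result == status:
            hosts[url].add(host)
    return {url: sorted(names) for url, names in hosts.items()}

not_found = errors_by_url(records)
for url, names in not_found.items():
    print(url, names)
```

A URL requested by several hosts hints at a broken link on the site itself, whereas a URL requested by a single host may simply be a mistyped address.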


Time series of Web access

This view is a 2D graph showing each user's activity as a time series. It can help reveal regularities in usage patterns.


How to read your own Web log files

Web log files follow a common format with the following structure:

HOST        USER GROUP [DATE]                "METHOD URL                    PROTOCOL" RESULT SIZE
192.168.2.8 - - [31/Aug/2000:00:37:19 +0200] "GET /doc/concert10/index.html HTTP/1.0" 200 560

Because of its varying separators, this format cannot be read with the default reader. To open Web log files produced by Apache and other standards-compliant Web servers, use Data > Open Custom File... then, in the Table Description Editor window, choose Reader: WWW Log instead of the default Text file reader.
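To see why the mixed separators (spaces, brackets, and quotes) need a dedicated reader, the sample line above can be split with a regular expression. The following Python sketch is only an illustration of the format and says nothing about how the WWW Log reader is implemented; the names LOG_LINE and parse_log_line are hypothetical.

```python
import re
from datetime import datetime

# Pattern for the structure shown above:
# HOST USER GROUP [DATE] "METHOD URL PROTOCOL" RESULT SIZE
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) (?P<group>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<result>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The date uses day/month-name/year plus a numeric UTC offset.
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    record["result"] = int(record["result"])
    # Assumption: a "-" in the size field means no body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('192.168.2.8 - - [31/Aug/2000:00:37:19 +0200] '
          '"GET /doc/concert10/index.html HTTP/1.0" 200 560')
record = parse_log_line(sample)
print(record["host"], record["method"], record["url"], record["result"])
```

Lines that do not match the pattern return None, so malformed entries can be counted or skipped rather than stopping the whole run.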

You can reuse all the projections defined in this project as templates for the exploration of your own data.

