AT&T Labs, Inc. - Research

The Yoix® Scripting Language


Yoix / YDATCLF, A YDAT Instantiation for CLF Files

As described elsewhere in the Yoix documentation, YDAT is a highly configurable data visualization and analysis tool. By means of a configuration file, different instantiations of YDAT can be created to handle a wide range of data sets. One particularly ubiquitous class of data resides in common log format or combined log format (CLF) files, more commonly known as weblogs. The Apache site provides a description of the two very similar formats. In short, YDATCLF is a YDAT instantiation for analyzing weblogs.
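To make the two formats concrete, here is a minimal Python sketch that splits one CLF line into its fields. It is not part of YDAT or Yoix; the pattern, the function name parse_clf_line and the sample line are illustrative assumptions based on Apache's default log layout, and requests containing escaped quotes are not handled.

```python
import re
from datetime import datetime

# Common log format:   host ident authuser [timestamp] "request" status bytes
# Combined log format: the same fields followed by "referer" "user-agent"
# (pattern is an illustrative assumption, not taken from YDAT)
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)


def parse_clf_line(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    try:
        # Apache's default timestamp layout, e.g. 10/Oct/2000:13:55:36 -0700
        record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    record["status"] = int(record["status"])
    # A size of "-" means no bytes were returned
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(parse_clf_line(sample))
```

For a common log format line the referer and agent entries come back as None; for a combined log format line they hold the two extra quoted fields.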

If you ran the Yoix installer, the bin folder contains a ydatclf script in addition to the ydat script. You may use either of these to point YDAT at a weblog of interest to you. The distribution also comes with a small sample weblog and a script, called ydatclf_demo1, for running YDATCLF on that file. To run that demo, execute:

ydatclf_demo1
at the command prompt. Figure 1 is a screen shot of the demo.
[Image: YDATCLF demo 1]

Figure 1. YDATCLF displaying the sample weblog data.

If you have a weblog of your own to examine, suppose that you are at the command prompt in the bin directory of the Yoix distribution and that your weblog data file, called weblog.txt, is in that same directory. You can then examine it with the command:

ydatclf weblog.txt
or, since the configuration file is where the ydat script expects it to be, you could use the command-line statement:
ydat -c clf weblog.txt
Alternatively, you could use the -C option to provide the explicit path of the configuration file, which on a Unix-type operating system might look like this:
ydat -C ../ydat/scripts/config_clf.yx weblog.txt

We found a few weblogs that are available online. One is at:

http://www.tec-paris.com/logs/online-access_log
You can look at it in your browser and then save it as a local file, or you can just point the ydatclf script at it:
ydatclf http://www.tec-paris.com/logs/online-access_log
It is not a particularly fascinating log.

A slightly more interesting case can be found at:

http://www.ib.hu-berlin.de/~mayr/wem/
and the three weblog links given there, namely:
http://www.ib.hu-berlin.de/~mayr/wem/sample1.log
http://www.ib.hu-berlin.de/~mayr/wem/sample2.log
http://www.ib.hu-berlin.de/~mayr/wem/sample3.log
You should look at the third sample log first. So, try:
ydatclf http://www.ib.hu-berlin.de/~mayr/wem/sample3.log
One thing you will notice is that the log must have been processed to replace the IP addresses with just the domain information in non-numeric form. Now, if you look at either of the other two sample logs, you will notice a very short, multi-colored spike at the far left and a very tall, multi-colored spike at the far right. The time axis shows that the left spike falls in the 20th century, in 1970 to be precise, while the right spike falls in the 21st century. (1970 is the start of the Unix epoch, so a date there typically signals a timestamp that could not be parsed.) If you sweep out the short spike, you will see a handful of records in the detail table. Clearly, the format of those records became garbled somehow, perhaps in the post-processing that was done to adjust the IP addresses. Let's use YDAT to find those records in the data file. Try the following:
  1. If the data file is not already on your local disk, then in the event plot window's menubar, select File->Save->All to save the complete data set to disk.
  2. In the Log Detail window, deselect all the records in the table. One way to do this is to use the menubar option Show->All Off.
  3. Back in the event plot menubar, select File->Save->Selected.
  4. You can now use the Unix diff command to find the badly formatted records by entering something like the following at a terminal prompt:
    diff dataset_all dataset_selected
    which tells you, for example, that the problems lie at lines 928, 1430, 3561, 5357 and 8493 in the original sample1.log data file.
  5. Incidentally, to view just the correctly formatted records, select the following from the event plot menubar: File->Load->Selected.
Of course, if all we wanted to do was see the problem lines, we could have used File->Save->Unselected in Step 3 above, but we wanted to locate the problem lines within the context of the original data file, so we did a little extra work.
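If diff is not available, the comparison in Step 4 can be approximated with a short Python sketch. It assumes, without confirmation from YDAT itself, that the two saved files hold one record per line, unchanged and in the original order, so that the selected file is the complete file with some lines removed. The function names are hypothetical; dataset_all and dataset_selected are the file names used in the diff command above.

```python
def read_lines(path):
    """Read a saved data set as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


def missing_line_numbers(all_lines, selected_lines):
    """Return 1-based line numbers in all_lines that are absent from selected_lines.

    Assumes selected_lines is all_lines with some lines removed, order preserved.
    """
    missing = []
    position = 0  # index of the next line we expect to match in selected_lines
    for number, line in enumerate(all_lines, start=1):
        if position < len(selected_lines) and line == selected_lines[position]:
            position += 1
        else:
            missing.append(number)
    return missing


# Small in-memory demonstration with made-up records
demo_all = ["record a", "record b", "garbled record", "record c"]
demo_selected = ["record a", "record b", "record c"]
print(missing_line_numbers(demo_all, demo_selected))  # prints [3]
```

With real files, the call would be missing_line_numbers(read_lines("dataset_all"), read_lines("dataset_selected")).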

We recommend looking at your own weblog rather than these examples, since your own data will be more interesting and more meaningful. We mention these examples only because they happen to be publicly available.

Some of you may have noticed the subtle up/down triangle in the upper-right corner of the event plot (and also the dot in the upper center of the event plot), which indicate that there is a split pane in the window. If you click on the down triangle, the split pane will open up, revealing another event plot. For this plot, the values on the y-axis are unimportant; the axis is intended only to indicate the sequence of activity. By filtering down to a single IP address, selecting a short duration of time in the event plot, zooming in on the sequential plot until it fills the window and, finally, sweeping out all the points in the sequential plot so as to populate the detail table, you would end up seeing something like what is shown in Figure 2. By reading down the table, you can determine what one visitor did at your site in sequence. The sequential graph helps to visualize the time spent on each page, while the table quantifies the information.

[Image: YDATCLF demo 2]

Figure 2. YDATCLF displaying the sample weblog data and showing the sequencing plot.
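The idea behind the sequencing view can also be sketched outside YDAT. The following Python fragment, which is an illustration and not how YDAT is implemented, filters a list of records down to one host, orders them by time and reports the gap until that host's next request as a rough stand-in for time spent on each page. The record layout (host, time, request), the function name and the sample data are all hypothetical.

```python
from datetime import datetime


def visitor_sequence(records, host):
    """List (time, request, seconds_until_next_request) for one host, in time order.

    Each record is a (host, time, request) tuple. The last entry has a gap of
    None because there is no later request to measure against.
    """
    visits = sorted((r for r in records if r[0] == host), key=lambda r: r[1])
    sequence = []
    for current, following in zip(visits, visits[1:] + [None]):
        if following is None:
            gap = None
        else:
            gap = (following[1] - current[1]).total_seconds()
        sequence.append((current[1], current[2], gap))
    return sequence


# Made-up records for two hosts, deliberately out of order
records = [
    ("host-a", datetime(2000, 10, 10, 10, 0, 30), "GET /docs.html"),
    ("host-b", datetime(2000, 10, 10, 10, 0, 10), "GET /index.html"),
    ("host-a", datetime(2000, 10, 10, 10, 0, 0), "GET /index.html"),
    ("host-a", datetime(2000, 10, 10, 10, 2, 0), "GET /download.html"),
]

for when, request, gap in visitor_sequence(records, "host-a"):
    print(when.time(), request, gap)
```

The gap is only an approximation of time on a page, since a log records requests and not when a visitor stopped reading.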


Yoix is a registered trademark of AT&T Intellectual Property.