Yoix / YDATCLF, A YDAT Instantiation for CLF Files
As indicated elsewhere,
YDAT is a highly configurable data visualization and analysis tool.
By means of a configuration file, different instantiations of YDAT can be
created to handle wide-ranging classes of data sets.
One particularly ubiquitous class of data resides in common log format or
combined log format (CLF) files.
These files are more commonly known as weblogs.
The Apache site provides a description of the two very similar CLF formats.
In short, YDATCLF is a YDAT instantiation for analyzing weblogs.
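For readers who have not looked inside a weblog, the following is a minimal Python sketch of how one line in either format breaks down into fields. It is an illustration only and is not how YDATCLF itself parses the data. The pattern, the field names and the function name parse_clf_line are our own choices, and the two sample lines follow the style of the examples in the Apache documentation.

```python
import re

# Common log format:   host ident authuser [date] "request" status bytes
# Combined log format: the same, followed by "referer" "user-agent"
# Simplification: quoted fields are assumed not to contain escaped quotes.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)


def parse_clf_line(line):
    """Return a dict of fields, or None if the line matches neither format."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


common = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
combined = (common + ' "http://www.example.com/start.html"'
            ' "Mozilla/4.08 [en] (Win98; I ;Nav)"')

for sample in (common, combined):
    print(parse_clf_line(sample))
```

For a common-format line, the referer and agent fields come back as None, which is the only difference between the two formats that this sketch cares about.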
If you ran the Yoix installer, you will find in the bin
folder, in addition to a ydat script, a ydatclf script.
You may use either of these to point YDAT at a weblog of interest to you.
The distribution also comes with a small sample weblog and a script for running
YDATCLF on that file.
That script is called ydatclf_demo1.
To run that demo, execute:
ydatclf_demo1
at the command prompt.
Figure 1 is a screen shot of the demo.
If you have a weblog of your own to examine, suppose that you
are at the command prompt in the bin directory of the Yoix distribution,
that your weblog data file is called weblog.txt, and
that it is in the same directory.
Then you can examine it with the command:
ydatclf weblog.txt
or, since the configuration file is where the ydat script expects it to
be, you could use the command-line statement:
ydat -c clf weblog.txt
Alternatively, you could use the -C option to provide the explicit
path of the configuration file, which on a Unix-type operating system might
look like this command-line statement:
ydat -C ../ydat/scripts/config_clf.yx weblog.txt
We found a few weblogs that are publicly available on-line.
One is at:
http://www.tec-paris.com/logs/online-access_log
You can look at it in your browser and then save it as a local
file, or you can just point the ydatclf script at it:
ydatclf http://www.tec-paris.com/logs/online-access_log
It is not a particularly fascinating log.
A slightly more interesting case can be found at:
http://www.ib.hu-berlin.de/~mayr/wem/
and the three weblog links given there, namely:
http://www.ib.hu-berlin.de/~mayr/wem/sample1.log
http://www.ib.hu-berlin.de/~mayr/wem/sample2.log
http://www.ib.hu-berlin.de/~mayr/wem/sample3.log
You should look at the third sample log first.
So, try:
ydatclf http://www.ib.hu-berlin.de/~mayr/wem/sample3.log
You may notice that the log must have been post-processed to replace the
IP addresses with non-numeric domain information.
Now, if you look at either of the other two sample logs, you will notice
a very short, multi-colored spike all the way to the left and a very tall,
multi-colored spike all the way to the right.
A look at the time axis indicates that the left spike is so 20th century, 1970
to be precise, while the right spike is 21st century.
If you sweep out the short spike, you will see a handful of records in the
detail table.
Clearly, the format of those records became garbled somehow, perhaps in the
post-processing that was done to adjust the IP addresses.
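The 1970 date is suggestive, because the Unix epoch begins on January 1, 1970, and a timestamp value of zero therefore lands there. The Python sketch below shows how a garbled timestamp could end up in 1970 if a parser fell back to zero on failure. That fallback is purely our assumption for illustration; we are not claiming this is what YDAT does internally. The names clf_time_to_epoch and epoch_to_utc are hypothetical.

```python
from datetime import datetime, timezone

# Timestamp layout used in CLF files, e.g. 10/Oct/2000:13:55:36 -0700
# (%b assumes English month abbreviations, i.e. the default C locale).
CLF_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def clf_time_to_epoch(text):
    """Convert a CLF timestamp to seconds since the Unix epoch."""
    try:
        return int(datetime.strptime(text, CLF_TIME_FORMAT).timestamp())
    except ValueError:
        # Hypothetical fallback, chosen only to illustrate the 1970 effect.
        return 0


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


good = clf_time_to_epoch("10/Oct/2000:13:55:36 -0700")
bad = clf_time_to_epoch("GET /index.html HTTP/1.0")  # not a timestamp

print(epoch_to_utc(good).year)
print(epoch_to_utc(bad).year)
```

Under that assumption, a record whose fields have shifted out of place would plot at the far left of the time axis, decades away from the rest of the data.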
Let's use YDAT to find those records in the data file.
Try the following:
1. If the data file is not already on your local disk, then
   in the event plot window's menubar, select File->Save->All to save
   the complete data set to disk.
2. In the Log Detail window,
   de-select all the records in the table.
   One way to do this is to use the menubar option Show->All Off.
3. Back in the event plot menubar, select File->Save->Selected.
4. You can now use the Unix diff command to find the badly formatted
   records by entering something like the following at a terminal prompt:
   diff dataset_all dataset_selected
   which tells you, for example, that the problems lie at
   lines 928, 1430, 3561, 5357 and 8493 in the original sample1.log data file.
5. Incidentally, to view just the correctly formatted records, select the
   following from the event plot menubar: File->Load->Selected.
Of course, if all we wanted to do was see the problem lines, we could have
done File->Save->Unselected in Step 3 above, but we were looking to find
the problem lines within the context of the original data file so we did
a little extra work.
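If the diff command is not available on your system, the comparison can be approximated with a few lines of Python. The sketch below is an alternative to diff, not part of YDAT. It assumes that the two saved files are plain text with one record per line and that the selected file keeps the surviving records in their original order; we have not verified either point for every YDAT configuration. The function names are our own.

```python
def missing_line_numbers(all_lines, selected_lines):
    """Return the 1-based positions of lines in all_lines that are absent
    from selected_lines, assuming selected_lines is an in-order subset."""
    missing = []
    next_selected = 0
    for number, line in enumerate(all_lines, start=1):
        if (next_selected < len(selected_lines)
                and line == selected_lines[next_selected]):
            next_selected += 1       # line survived; move on to the next one
        else:
            missing.append(number)   # line was dropped from the selection
    return missing


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


# Example with the file names used above (uncomment to run on real files):
# print(missing_line_numbers(read_lines("dataset_all"),
#                            read_lines("dataset_selected")))
```

Because the walk is a single pass over both files, it only gives correct line numbers when the ordering assumption holds; diff is the more robust choice when it is available.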
We recommend looking at your own weblog rather than these examples, since
your own data will be more interesting and more meaningful to you.
We mention these examples only because they happen to be publicly available.
Some of you may have noticed the subtle up/down triangle in the upper-right
corner of the event plot, and also the dot in the upper center of the event
plot. These indicate that there is a split pane in the window.
If you click on the down triangle, the split pane will open up, revealing
another event plot. For this plot, the y-axis is unimportant; the plot is
intended to indicate the sequence of activity.
By filtering down to a single IP address, selecting
a short duration of time in the event plot,
zooming in on the sequential plot until it fills the window and,
finally, sweeping out all the points in the sequential plot so as to populate
the detail table, you would
end up seeing something like what is shown in Figure 2.
By reading down the table, you can determine what one visitor did at your
site in sequence.
The sequential graph helps to visualize the time spent on each page while the
table quantifies the information.
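The same idea can be sketched outside of YDAT in a few lines of Python: pick one host, put its requests in time order, and measure the gap to the next request as a rough stand-in for time spent on each page. This is only an approximation under our own assumptions, since a gap between requests is not necessarily reading time. The function name visitor_sequence, the tuple layout and the sample records are made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in CLF files (English month abbreviations assumed).
CLF_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def visitor_sequence(records, host):
    """records is an iterable of (host, clf_time, request) tuples.
    Return (time, request, seconds_until_next_request) for one host,
    in time order; the last entry has None for the gap."""
    visits = sorted(
        ((datetime.strptime(when, CLF_TIME_FORMAT), request)
         for record_host, when, request in records
         if record_host == host),
        key=lambda visit: visit[0],
    )
    sequence = []
    for index, (when, request) in enumerate(visits):
        if index + 1 < len(visits):
            gap = (visits[index + 1][0] - when).total_seconds()
        else:
            gap = None  # no later request to measure against
        sequence.append((when, request, gap))
    return sequence


# Made-up records for illustration only.
records = [
    ("a.example", "10/Oct/2000:13:55:36 -0700", "GET /index.html HTTP/1.0"),
    ("b.example", "10/Oct/2000:13:55:40 -0700", "GET /other.html HTTP/1.0"),
    ("a.example", "10/Oct/2000:13:56:06 -0700", "GET /about.html HTTP/1.0"),
]

for when, request, gap in visitor_sequence(records, "a.example"):
    print(when.isoformat(), request, gap)
```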
Yoix is a registered trademark of AT&T Intellectual Property.