
180 Park Ave - Building 103
Florham Park, NJ
Firewall Fingerprinting
Dan Pei, Zihui Ge, Jia Wang, Amir R. Khakpour (Michigan State University), Josh Hulst (Michigan State University), Alex X. Liu (Michigan State University)
IEEE INFOCOM 2012,
2012.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in IEEE INFOCOM 2012, 2012-03-22.
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGmetrics 2011 , 2012-03-22.
{Firewalls are critical security devices handling all traffic in and out of a network. Firewalls, like other software and hardware network devices, have vulnerabilities, which can be exploited by motivated attackers. However, because firewalls are usually placed in the network such that they are transparent to the end users, it is very difficult to identify them and use their corresponding vulnerabilities to attack them. In this paper, we study firewall fingerprinting, in which one can use firewall decisions on a sequence of TCP packets with unusual flags and machine learning techniques for inferring firewall implementation.}
ALERT-ID: Analyze Logs of the network Element in Real Time for Intrusion Detection
Zihui Ge, Jie Chu, Richard Huber, Ping Ji, Yung Yu, Jennifer Yates
15th International Symposium on Research in Attacks, Intrusions and Defenses,
2012.
[PDF]
[BIB]
Springer Copyright
The definitive version was published in 2012. , 2012-09-12
{The security of the networking infrastructure (e.g., routers and switches) in large scale enterprise or Internet service provider (ISP) networks is mainly achieved through mechanisms such as access control lists (ACLs) at the edge of the network and deployment of centralized AAA (authentication, authorization and accounting) systems governing all access to network devices. However, a misconfigured edge router or a compromised user account may put the entire network at risk. In this paper, we propose enhancing existing security measures with an intrusion detection system overseeing all network management activities. We analyze device access logs collected via the AAA system, particularly TACACS+, in a global tier-1 ISP network and extract features that can be used to distinguish normal operational activities from rogue/anomalous ones. Based on our analyses, we develop a real-time intrusion detection system that constructs normal behavior models with respect to device access patterns and the configuration and control activities of individual accounts from their long-term historical logs and alerts in real-time when usage deviates from the models. Our evaluation shows that this system effectively identifies potential intrusions and misuses with an acceptable level of false alarms.
}

Rapid Detection of Maintenance Induced Changes in Service Performance
Ajay Mahimkar, Zihui Ge, Jia Wang, Jennifer Yates, Yin Zhang, Joanne Emmons, Brian Huntley, Mark Stockert
ACM CoNEXT (International Conference on emerging Networking EXperiments and Technologies),
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM CoNEXT International Conference on emerging Networking EXperiments and Technologies , 2011-10-31.
{Service quality in operational IP networks can be impacted due to planned or unplanned maintenance. During any maintenance activity, the responsibility of the operations team is to complete the work order and perform a check-up to ensure there are no unexpected service disruptions. Once the maintenance is complete, it is crucial to continuously monitor the network and look for any performance impacts. What operations lack today are effective tools to rapidly detect maintenance-induced performance changes. The large scale and heterogeneity of network elements and performance metrics make the problem extremely challenging.
In this paper, we present PRISM, a new tool for detecting maintenance-induced performance changes in a timely fashion. PRISM uses association between maintenance and the network elements to identify performance metrics for time-series analysis. It uses a new Multiscale Robust Local Subspace algorithm (MRLS) to accurately identify changes in performance even when the baseline is contaminated. We systematically evaluate PRISM using data collected at four large operational networks (a tier-1 backbone, VoIP, IPTV and 3G cellular) and show that it achieves good accuracy. We also demonstrate the effectiveness of PRISM in real operational environments through interesting case study findings.}

Q-score: Proactive Service Quality Assessment in a Large IPTV System
Jia Wang, Zihui Ge, Jennifer Yates, Ajay Mahimkar, Andrea Basso, Min Chen, Han Hee Song, Yin Zhang
ACM Internet Measurement Conference,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Internet Measurement Conference, 2011-11-02.
{In large-scale IPTV systems, it is essential to maintain high service quality while providing a wider variety of service features than typical traditional TV. Thus service quality assessment systems are of paramount importance as they monitor the user-perceived service quality and alert when issues occur. For IPTV systems, however, there is no simple metric to represent user-perceived service quality and Quality of Experience (QoE). Moreover, there is only limited user feedback, often in the form of noisy and delayed customer complaints. Therefore, we aim to approximate the QoE through a selected set of performance indicators in a proactive (i.e., detect issues before customers complain) and scalable fashion.
In this paper, we present a service quality assessment framework, Q-score, which accurately learns a small set of performance indicators most relevant to user-perceived service quality, and proactively infers service quality in a single score. We evaluate Q-score using network data collected from a commercial IPTV service provider and show that Q-score is able to predict 60% of the service problems that customers complain about, with 0.1% false positives. Through Q-score, we have (i) gained insight into various types of service problems causing user dissatisfaction, including why users tend to react promptly to sound issues but slowly to video issues; (ii) identified and quantified the opportunity to proactively detect the service quality degradation of individual customers before severe performance impact occurs; and (iii) observed the possibility of adaptively allocating customer care workforce to potentially troubling service areas.}

Argus: End-to-End Service Anomaly Detection and Localization From an ISP's Point of View
He Yan, Ashley Flavel, Zihui Ge, Alexandre Gerber, Dan Massey, Christos Papadopoulos, Hiren Shah, Jennifer Yates
Infocom 2012,
2011.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in Infocom 2012, 2011-03-25.
{Recent trends in the networked services industry
(e.g., CDN, VPN, VoIP, IPTV) see Internet Service Providers
(ISPs) leveraging their existing network connectivity to provide
an end-to-end solution. Consequently, new opportunities are
available to monitor and improve the end-to-end service quality
by leveraging the information from inside the network. We
propose a new approach to detect and localize end-to-end
service quality issues in such ISP-managed networked services by
utilizing traffic data passively monitored at the ISP side, the ISP
network topology, routing tables and geographic information.
This paper presents the design of a generic service quality
monitoring system "Argus". Argus has been successfully deployed
in a tier-1 ISP to monitor millions of users of its CDN service
and assist operators to detect and localize end-to-end service
quality issues. This operational experience demonstrates that
Argus is effective in accurate, quick detection and localization of
important service quality issues.}

Analyzing IPTV Set-Top Box Crashes
Jia Wang, Zihui Ge, Jennifer Yates, Han Song, Ajay Mahimkar, Yin Zhang
ACM Sigcomm Workshop on Home Networks,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
The definitive version was published in ACM Sigcomm Workshop on Home Networks, 2011-08-15.
{Recent advances in residential broadband access
technology have led to a wave of commercial IPTV deployment.
As IPTV services are rolled out at scale, it is essential
for IPTV systems to maintain ultra-high reliability
and performance. A major issue that disrupts IPTV service
is the crash of the set-top box (STB) software, which directly
resides inside the consumer's home network and provides
the essential interface to both the user and the network to
deliver rich contents that go well beyond traditional TV. To
understand the potential causes of STB crashes, we perform
in-depth statistical analysis on the relationship among STB
crashes, video stream contents, and user activities using logs
collected from a large commercial IPTV system. Our initial
results suggest that (i) impaired video streams may cause
STB to crash, and (ii) continuous usage of STB may gradually
degrade the STB health over time.}

3G Meets the Internet -- Understanding Performance Issues due to Hierarchical Routing in 3G Networks
Seungjoon Lee, Zihui Ge, Wei Dong (University of Texas)
ITC 2011,
2011.
[PDF]
[BIB]
ITC Copyright
The definitive version was published in ITC 2011. , 2011-09-06, http://i-teletraffic.org/fileadmin/ITC23_files/Copyright_form_ITC2011.pdf
{The volume of Internet traffic over 3G wireless networks is sharply rising. In contrast to many Internet services utilizing replicated resources (e.g., content distribution networks), the current 3G standard architecture employs hierarchical routing, where all user data traffic goes through a small number of aggregation points using logical tunnels. In this paper, we investigate the potential system inefficiency and performance issues due to the interplay of the two systems. We first identify a number of aspects affecting system inefficiency and service performance and then quantify the impact by analyzing trace data obtained from a large-scale 3G network and a CDN provider. We find that extra packet headers used for hierarchical routing result in significant byte overhead (around 6%). We also find that the detour due to hierarchical routing can cause a packet to travel an extra distance of up to 1627 km in the average case, which, based on our data analysis, corresponds to around a 45.4% increase in round-trip latency. Furthermore, we identify a pathological case due to DNS caching when a mobile device switches between wireless technologies. In our measurement study on the Internet, we find that this issue can cause up to an order of magnitude throughput degradation (0.9 Mbps vs. 10.8 Mbps). We also study how to achieve performance improvement by deploying system resources more strategically.}

What Happened in my Network? Mining Network Events from Router Syslogs
Jia Wang, Zihui Ge, Dan Pei, Tongqing Qiu, Jun Xu
ACM/USENIX Internet Measurement Conference,
2010.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
The definitive version was published in ACM Internet Measurement Conference , 2010-11-01
{Router syslogs are messages that a router logs to describe
a wide range of events observed by it. They are considered
one of the most valuable data sources for monitoring
network health and for troubleshooting network faults and
performance anomalies. However, router syslog messages
are essentially free-form text with only a minimal structure,
and their formats vary among different vendors and router
OSes. Furthermore, since router syslogs are intended for tracking
and debugging router software/hardware problems, they
are often too low-level from network service management
perspectives. Due to their sheer volume (e.g., millions per
day in a large ISP network), router syslog messages are typically
examined (manually by a network administrator) only
when required by an on-going troubleshooting investigation
or when given a narrow time range and a specific router
under suspicion. In this project, we design a SyslogDigest
system that can automatically transform and compress such
low-level minimally-structured syslog messages into meaningful
and prioritized high-level network events, using powerful
data mining techniques tailored to our problem domain.
These events are three orders of magnitude fewer in number
and have much better usability than raw syslog messages.
We demonstrate that they provide critical input to network
troubleshooting, and network health monitoring and visualization.}

Listen to Me if You can: Tracking user experience of mobile network on social media
Jia Wang, Zihui Ge, Jennifer Yates, Junlan Feng, Jun Xu, Tongqing Qiu
ACM/USENIX Internet Measurement Conference,
2010.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
The definitive version was published in ACM Internet Measurement Conference , 2010-11-01.
{Social media sites like Twitter continue to grow at a fast pace.
People of all generations use social media to exchange messages
and share experiences of their life in a timely fashion.
Most of these sites make their data available. An intriguing
question is whether we can exploit this real-time, giant data flow
to improve business in a measurable way. In this paper, we
are particularly interested in tweets (Twitter messages) that
are relevant to mobile network performance. We compare
tweets with a traditional source of user experience, i.e., customer
care tickets, and correlate both of them with network
incident reports. From our study, we have the following observations.
First, Twitter users and users who call customer
service tend to report different types of performance issues.
Second, users on Twitter are more accurate and faster to report
network problems that impact user experiences. Third,
tweets can show some short-term performance impairments,
which are not recorded in incident reports. These observations
prove that Twitter is a complementary source for monitoring
network performance and its impact on user experiences.}

G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large ISP Networks
Zihui Ge, Jennifer Yates, Lee Breslau, Dan Pei, He Yan, Dan Massey
ACM CONEXT 2010,
2010.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM CoNEXT 2010 , 2010-10-30, http://conferences.sigcomm.org/co-next/2010/.
{As IP networks have become the mainstay of an increasingly diverse set of applications ranging from Internet games and streaming videos, to e-commerce and online banking, and even to mission-critical 911 over VoIP, best effort service is no longer acceptable. This requires a transformation in network management, changing its focus from detecting and replacing individual faulty network elements, such as routers and line cards, to managing the service quality as a whole for end-users.
In this paper we describe the design and development of a Generic Root Cause Analysis platform (G-RCA) for service quality management (SQM) in large IP networks. G-RCA contains a comprehensive service dependency model that includes network topological and cross-layer relationships, protocol interactions, and routing and control plane dependencies. G-RCA abstracts the RCA process into signature identification for symptom and diagnostic events, temporal and spatial event correlation, and reasoning and inference logic. G-RCA provides a simple yet flexible rule specification language that allows operators to quickly customize G-RCA into different RCA tools as new problems need to be investigated and understood. G-RCA is also integrated with the data trending, manual data exploration, and statistical correlation mining capabilities that are tailored for SQM. G-RCA has proven to be a highly effective SQM platform in several different applications and we present results regarding BGP flaps, PIM flaps in Multicast VPN service, and end-to-end throughput drop in CDN service.}
Towards Automated Performance Diagnosis in a Large IPTV Network
Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Qi Zhao, Ajay Mahimkar, Yin Zhang
2009.
[PDF]
[BIB]
Network Management: Fault Management, Performance Management and Planned Maintenance
Jennifer Yates, Zihui Ge
2009.
[LINK]
[BIB]
Troubleshooting Chronic Conditions in Large IP Networks
Jennifer Yates, Aman Shaikh, Jia Wang, Zihui Ge, Cheng Ee, Ajay Mahimkar, Yin Zhang
2008.
[PDF]
[BIB]
Supporting Internet Protocol (IPTV) in Backbone Networks: Design of Robust Routing Strategies
Jennifer Yates, Zihui Ge, Aman Shaikh, Meeyoung Cha, Wanpracha Chaovalitwongse, Sue Moon
2006.
[PDF]
[BIB]
Passive And Comprehensive Hierarchical Anomaly Detection System And Method,
Tue May 28 17:26:40 EDT 2013
A technique for monitoring performance in a network uses passively monitored traffic data at the server access routers. The technique aggregates performance metrics into clusters according to a spatial hierarchy in the network, and then aggregates performance metrics within spatial clusters to form time series of temporal bins. Representative values from the temporal bins are then analyzed using an enhanced Holt-Winters exponential smoothing algorithm.
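The smoothing step mentioned in this abstract can be illustrated with a minimal sketch. It implements the textbook additive Holt-Winters method, not the patent's "enhanced" variant, whose details are not given here. The function names, the smoothing parameters, and the fixed residual threshold are all assumptions made for illustration.

```python
def holt_winters_additive(series, period, alpha=0.5, beta=0.1, gamma=0.1):
    """One-step-ahead forecasts for series[period:] using standard
    additive Holt-Winters (level + trend + seasonal component)."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons to initialise")
    # Initialise from the first two seasons (a common simple choice).
    level = sum(series[:period]) / period
    second_mean = sum(series[period:2 * period]) / period
    trend = (second_mean - level) / period
    season = [x - level for x in series[:period]]

    forecasts = []
    for t in range(period, len(series)):
        s = season[t % period]
        forecasts.append(level + trend + s)
        x = series[t]
        prev_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (x - level) + (1 - gamma) * s
    return forecasts


def flag_anomalies(series, period, threshold):
    """Indices whose value deviates from its forecast by more than
    `threshold` (a hypothetical fixed cutoff)."""
    forecasts = holt_winters_additive(series, period)
    return [t for t, f in zip(range(period, len(series)), forecasts)
            if abs(series[t] - f) > threshold]
```

In this sketch each temporal bin's representative value would be one element of `series`, and `period` would be the number of bins per seasonal cycle (for example, bins per day).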
Methods, Apparatus And Articles Of Manufacture To Perform Root Cause Analysis For Network Events,
Tue Apr 02 17:25:44 EDT 2013
Example methods, apparatus and articles of manufacture to perform root cause analysis for network events are disclosed. An example method includes retrieving a symptom event instance from a normalized set of data sources based on a symptom event definition; generating a set of diagnostic events from the normalized set of data sources which potentially cause the symptom event instance, the diagnostic events being determined based on dependency rules; and analyzing the set of diagnostic events to select a root cause event based on root cause rules.
Method And Apparatus For Finding Critical Traffic Matrices,
Tue Jul 24 16:11:10 EDT 2012
Method and apparatus for determining at least one critical traffic matrix from a plurality of traffic matrices, where each of the plurality of traffic matrices is organized into at least one of a plurality of clusters, for a network is described. In one embodiment, a merging cost is calculated for each possible pair of clusters within a plurality of clusters. A pair of traffic matrices that is characterized by having the least merging cost is then merged. The calculating and the merging steps are subsequently repeated until a predefined number of clusters remains, wherein the remaining clusters are used to determine at least one critical traffic matrix.
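The merge loop described in this abstract resembles agglomerative clustering, and a minimal sketch of that general idea follows. The abstract does not define the merging cost or how a critical matrix is derived from a cluster, so both are assumptions here: each cluster is summarised by the element-wise maximum of its members (called its "head" below), and the cost of merging two clusters is taken to be the L1 distance between their heads. All function names are hypothetical.

```python
from itertools import combinations


def head(cluster):
    """Element-wise maximum over the (flattened) matrices in a cluster."""
    return [max(column) for column in zip(*cluster)]


def merging_cost(a, b):
    """Assumed cost: L1 distance between the two cluster heads."""
    return sum(abs(x - y) for x, y in zip(head(a), head(b)))


def critical_matrices(matrices, k):
    """Repeatedly merge the cheapest pair of clusters until k remain,
    then return one head per remaining cluster."""
    clusters = [[list(m)] for m in matrices]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: merging_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return [head(c) for c in clusters]
```

Each traffic matrix is passed as a flat list of demands. Recomputing every pairwise cost on each pass keeps the sketch short, though a real implementation would likely cache costs.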
Method And Apparatus For Network-Level Anomaly Inference,
Tue Mar 27 16:09:39 EDT 2012
Method and apparatus for network-level anomaly inference in a network is described. In one example, link load measurements are obtained for multiple time intervals. Routing data for the network is obtained. Link level anomalies are extracted using temporal analysis on the link load measurements over the multiple time intervals. Network-level anomalies are inferred from the link-level anomalies.
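The two stages named in this abstract (temporal analysis per link, then inference across links using routing data) can be sketched in a very simplified form. This is an illustrative assumption, not the patented method: a link is treated as anomalous in an interval when its load deviates from that link's median by more than a threshold, and a flow is reported when every link on its route is anomalous in the same interval. The data layout and function names are hypothetical.

```python
from statistics import median


def link_level_anomalies(loads, threshold):
    """loads: {link: [load per interval]}.
    Returns {link: set of interval indices flagged as anomalous}."""
    flagged = {}
    for link, values in loads.items():
        baseline = median(values)
        flagged[link] = {t for t, v in enumerate(values)
                         if abs(v - baseline) > threshold}
    return flagged


def network_level_anomalies(loads, routes, threshold):
    """routes: {flow: [links on its path]}.
    Returns sorted (interval, flow) pairs where all links on the
    flow's route are anomalous in that interval."""
    flagged = link_level_anomalies(loads, threshold)
    found = []
    for flow, links in routes.items():
        common = set.intersection(*(flagged[link] for link in links))
        found.extend((t, flow) for t in common)
    return sorted(found)
```

For example, with loads on links `A-B` and `B-C` both spiking in interval 3 and a flow routed over exactly those two links, the sketch reports that flow at interval 3.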
AT&T Science and Technology Award, 2012.
For technical innovation and leadership in creating platforms for service quality management and network management in IP and mobility networks.