Technical Documents
Logistic methods for resource selection functions and presence-only species distribution models
Steven Phillips, Jane Elith
AAAI conference, 2011.
Association for the Advancement of Artificial Intelligence Copyright
The definitive version was published in AAAI conference, 2011-08-07
{In order to better protect and conserve biodiversity, ecologists
are making increasing use of machine learning and statistical
modeling to understand how species respond to their
environment and to predict how they will respond to future
climate change, habitat loss and other threats. A fundamental
modeling task is to estimate the conditional probability
that a given species is present in (or uses) a site, conditional
on environmental variables such as precipitation and temperature.
For a limited number of species, survey data consisting
of both presence and absence records are available, and
can be used to fit a variety of conventional classification and
regression models. For most species, however, the available
data consists only of occurrence data (locations where the
species has been observed). In two closely-related but separate
bodies of ecological literature, a diversity of special-purpose
models have been developed that contrast occurrence
data with a random sample of available environmental
conditions. The most widespread statistical approaches involve
either fitting an exponential model of species' conditional
probability of presence, or fitting a naive logistic
model in which the random sample of available conditions
is treated as absence data; both approaches have well-known
problems, and in particular, do not necessarily produce
valid probabilities. In this paper, after summarizing
existing methods and their drawbacks, we overcome those
drawbacks by introducing a new scaled binomial loss function
that is straightforward to integrate into existing methods
such as GLM, GAM, and boosted regression trees, in order
to estimate an underlying logistic model of species presence/
absence. Our approach is simpler than the Expectation-
Maximization approach of Ward et al., which has not yet
been used by ecologists despite giving valid probabilities.
Following Ward et al., our method requires an estimate of
population prevalence, since prevalence is not identifiable
from occurrence data alone. We demonstrate that recent approaches
(such as the weighted distribution method of Lele
and Keim) that try to avoid the identifiability issue by making
parametric data assumptions do not typically produce
valid probability estimates. Lastly, we introduce two additional
new methods based on maximum entropy and a Chernoff
bound that both also estimate the underlying logistic
model given an estimate of prevalence.}
"The definitive version is available at onlinelibrary.wiley.com." , Volume 17, 2011-01-01
{MaxEnt is a program for modelling species distributions from presence-only species records. This paper is written for ecologists and describes the MaxEnt model from a statistical perspective, making explicit links between the structure of the model, decisions required in producing a modelled distribution, and knowledge about the species and the data that might affect those decisions. To begin we discuss the characteristics of presence-only data, highlighting implications for modelling distributions. We particularly focus on the problems of sample bias and lack of information on species prevalence. The keystone of the paper is a new statistical explanation of MaxEnt which shows that the model minimizes the relative entropy between two probability densities (one estimated from the presence data and one, from the landscape) defined in covariate space. For many users, this viewpoint is likely to be a more accessible way to understand the model than previous ones that rely on machine learning concepts. We then step through a detailed explanation of MaxEnt describing key components (e.g. covariates and features, and definition of the landscape extent), the mechanics of model fitting (e.g. feature selection, constraints and regularization) and outputs. Using case studies for a Banksia species native to south-west Australia and a riverine fish, we fit models and interpret them, exploring why certain choices affect the result and what this means. The fish example illustrates use of the model with vector data for linear river segments rather than raster (gridded) data. Appropriate treatments for survey bias, unprojected data, locally restricted species, and predicting to environments outside the range of the training data are demonstrated, and new capabilities discussed. Online appendices include additional details of the model and the mathematical links between previous explanations and this one, example code and data, and further information on the case studies. }
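For reference, the relative entropy (Kullback-Leibler divergence) that the abstract refers to has the standard form below. The symbols are notation chosen here for illustration, since the abstract fixes none: z is a point in covariate space, f_1 is the density of covariates at presence locations, and f is the density of covariates across the landscape.

```latex
D(f_1 \,\|\, f) \;=\; \int f_1(z) \, \ln \frac{f_1(z)}{f(z)} \, dz
```

Read this way, the abstract's claim is that MaxEnt picks the estimate of f_1 that makes this quantity as small as possible, subject to the constraints it mentions under model fitting. The quantity is zero exactly when the two densities coincide.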
The definitive version was published in Biological Conservation (Elsevier), Volume 143, 2010-07-01
{Indices for site prioritization are widely used to address the
question: which sites are most important for conservation of
biodiversity? We investigate the theoretical underpinnings of
target-based prioritization, which measures sites' contribution to
achieving predetermined conservation targets. We show a strong
connection between site prioritization and the mathematical theory of
voting power. Current site prioritization indices are afflicted by
well-known paradoxes of voting power: a site can have zero priority
despite having non-zero habitat (the paradox of dummies) and discovery
of habitat in a new site can raise the priority of existing sites (the
paradox of new members). These paradoxes arise because of the razor's
edge nature of voting, and therefore we seek a new index that is not
strictly based on voting. By negating such paradoxes, we develop a set
of intuitive axioms that an index should obey. We introduce a simple
new index, "fraction-of-spare," that satisfies all the axioms. For
single-species site prioritization, the fraction-of-spare(s) of a site
s equals zero if s has no habitat for the species and one if s is
essential for meeting the target area for the species. In-between
those limits it is linearly interpolated, and equals area(s) / (total
area - target). In an evaluation involving multi-year scheduling of
site acquisitions for conservation of forest types in New South Wales
under specified clearing rates, fraction-of-spare outperforms 58
existing prioritization indices. We also compute the optimal schedule
of acquisitions for each of three evaluation measures (under the
assumed clearing rates) using integer programming, which indicates
that there is still potential for improvement in site prioritization
for conservation scheduling.
}
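The single-species fraction-of-spare formula quoted in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the argument names and the toy site areas are hypothetical, and the handling of the case where total habitat does not exceed the target (every site with habitat is treated as essential) is an assumption that the abstract does not spell out.

```python
def fraction_of_spare(site_area, total_area, target_area):
    """Single-species fraction-of-spare for one site.

    Follows the abstract: 0 if the site has no habitat, 1 if the site is
    essential for meeting the target, and area(s) / (total area - target)
    in between.
    """
    if site_area < 0 or site_area > total_area:
        raise ValueError("site_area must lie between 0 and total_area")
    if site_area == 0:
        return 0.0
    spare = total_area - target_area
    # Assumption: with no spare habitat at all, any site that holds
    # habitat is needed, so it gets the maximum priority of 1.
    if spare <= 0 or site_area >= spare:
        return 1.0
    return site_area / spare


# Toy example (invented numbers): three sites, target of 50 area units.
site_areas = {"A": 0, "B": 30, "C": 80}
total = sum(site_areas.values())  # 110
target = 50                       # spare habitat = 110 - 50 = 60
priorities = {
    name: fraction_of_spare(area, total, target)
    for name, area in site_areas.items()
}
print(priorities)  # {'A': 0.0, 'B': 0.5, 'C': 1.0}
```

Site C is essential because losing its 80 units would leave only 30, below the target of 50, while site B sits halfway between the two limits.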
(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
The definitive version was published in ACM GreenMetrics Workshop, Volume 38, Issue 3, 2010-10-10.
{Information and communications technology accounts for
a significant fraction of worldwide electricity consumption.
Given the relentless growth of demand for communications
services, telecommunications providers will need to transition
to more energy-efficient technology in order to limit
their environmental impact. Here we focus on priority-setting
for the transition process. In particular, we introduce a
method for statistically inferring the electricity consumption
of different components of the installed base of telecommunications
equipment, while avoiding the high cost of performing
direct measurements. Our method relies only on
databases of installed equipment in central offices, together
with aggregate electricity consumption per office. It takes
advantage of inter-office variation in installed equipment to
partition per-office electricity consumption by major equipment
type. When applied to a collection of 3,918 central
offices of a major U.S. telecommunications provider, our approach
reveals the (previously unknown) network-wide energy
consumption of each major type of equipment. In particular,
we find that electricity consumption is dominated by
Class-5 telephone switches, which account for 43% of aggregate
consumption, and which should therefore be a primary
target of central office electricity conservation efforts.}
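One simple way to picture this kind of partition is an ordinary least-squares fit of each office's total consumption against its equipment counts, which yields a per-unit consumption for each equipment type. The sketch below is an illustrative assumption, not the estimator used in the paper, which the abstract does not specify. The function names, the two equipment types and all numbers are invented toy values.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("equipment counts are collinear across offices")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def per_unit_consumption(counts, totals):
    """Least-squares fit of totals[i] ~ sum_j counts[i][j] * w[j].

    counts[i][j] is the number of units of equipment type j in office i;
    totals[i] is the aggregate consumption of office i.
    """
    k = len(counts[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in counts) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(counts, totals))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data (invented): columns are [switches, routers] per office.
counts = [[1, 0], [0, 1], [2, 1], [1, 3]]
totals = [5.0, 2.0, 12.0, 11.0]  # aggregate consumption per office

w = per_unit_consumption(counts, totals)

# Network-wide consumption attributed to each equipment type.
installed = [sum(row[j] for row in counts) for j in range(len(w))]
by_type = [n_units * per_unit for n_units, per_unit in zip(installed, w)]
shares = [value / sum(by_type) for value in by_type]
print(w, shares)
```

The fit only works because offices differ in their equipment mix. If every office had the same proportions of each type, the counts would be collinear and the per-type consumption could not be separated, which is why the abstract stresses inter-office variation.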
Systems, Devices, And/Or Methods For Managing Sample Selection Bias,
Tue Oct 16 16:11:58 EDT 2012
A method for managing sample selection bias is disclosed. Embodiments of the method can include automatically determining an unbiased estimate of a distribution from occurrence data via a special purpose processor. The occurrence data that is utilized in determining the estimate may have an occurrence data sample selection bias that is substantially equivalent to a background data sample selection bias associated with background data. Additionally, the occurrence data may be related to the background data, and the background data may be chosen with the background data sample selection bias. Furthermore, the occurrence data may represent a physically-measurable variable associated with one or more physical and tangible objects or substances.
System and apparatus for recognizing speech,
Tue Apr 16 18:07:40 EDT 2002
A continuous, speaker independent, speech recognition method and system for recognizing a variety of vocabulary input signals. A language model which is an implicit description of a graph consisting of a plurality of states and arcs is inputted into the system. An input speech signal, corresponding to a plurality of speech frames is received and processed using a shared memory multipurpose machine having a plurality of microprocessors working in parallel to produce a textual representation of the speech signal.
Assigning and processing states and arcs of a speech recognition model in parallel processors,
Tue Jun 19 18:07:07 EDT 2001
A continuous, speaker independent, speech recognition method and system recognizes a variety of vocabulary input signals. A language model, which is an implicit description of a graph consisting of a plurality of states and arcs, is input into the system. An input speech signal, corresponding to a plurality of speech frames, is received and processed using a shared memory multipurpose machine having a plurality of microprocessors. Threads are created and assigned to processors, and active state subsets and active arc subsets are created and assigned to specific threads and associated microprocessors. Active state subsets and active arc subsets are processed in parallel to produce a textual representation of the speech signal. Embodiments of the invention include a two-level Viterbi search algorithm to match the input speech signals to context dependent units, an on-demand composition of finite state transducers to map context dependent units to sentences, and a determination whether a particular likelihood calculation needs to be performed or recalled from memory. The on-demand composition of finite state transducers is accomplished by multi-threading the calculation in accordance with the parallel processing feature of the system.
Awards
AT&T Science & Technology Medal, 2009.
For invention, leadership and advocacy of maximum entropy methods and Maxent software in conservation biology.