1 Introduction (motivation)

The recent trajectory of research in the fields of information retrieval and data mining demonstrates growing interest in the problem of topic evolution from both academics and data science practitioners.

The accelerated advancement of internet technology and the production of corpora in online environments, from academic archives to social media, ensure the growing importance of topic evolution research as a key tool for navigating, processing, and finding information in textual corpora.

Latent Dirichlet Allocation – the model behind most of the topic modeling

research studied within the scope of this paper – is a powerful tool, de facto

favored by most researchers today over its predecessor – Probabilistic Latent

Semantic Analysis (PLSA). However, the LDA method has serious shortcomings:

primarily in that it does not explicitly allow for modeling of temporal

relationships. In order to obtain results pertaining to the evolution of ideas,

LDA-based models need to be extended.

An important subject of inquiry in topic evolution modeling, and one that sheds light on yet another weakness of the classic LDA-based models, is the application to online environments. Characterized by emergent (as opposed to static) topics as well as by a dynamically sized corpus, where themes are ‘born’ and vanish along the timeline, online environments require specific algorithms to track the emergence of new topics and the disappearance of old ones.

A modern classic in the family of LDA-based models, Blei and Lafferty’s Dynamic Topic Model (2006), has another serious drawback: the model imposes constraints on the time period, penalizing large changes from year to year. Hall et al. (2008) propose an alternative, where post-hoc calculations are performed based on the observed probability of each topic given the current time period.

In this paper we analyze the DTM method and examine the improvements

ensuing from the efforts of Hall et al. We also address the fact that most of the

probabilistic topic evolution algorithms are designed to work with static corpora

with a fixed pre-determined size. We examine the ways this technical challenge

is circumvented in research related to the online corpora topic evolution.

2 Literature review (history of the problem)

The earliest work on topic evolution modeling dates back to DARPA’s 1998

Topic Detection and Tracking (TDT) initiative (Allan 2012), an effort to optimize navigation in and processing of streams of news stories.

The current state-of-the-art for the topic evolution modeling problem is most

accurately regarded as a split into three main types of models:

1) the discrete time topic evolution model,

2) the continuous time topic evolution model (Wang and McCallum 2006; Wang 2008; Kawamae 2011; Dubey 2013), and

3) the online topic evolution model (Canini 2009; Hoffmann 2010).

The clear dominance of probabilistic topic models in the research on topic

evolution is largely due to the two seminal works: Hofmann’s Probabilistic

Latent Semantic Analysis (PLSA) (Hofmann 2001), and Blei’s Latent Dirichlet

Allocation (Blei 2003).

Blei and Lafferty (2006) represent documents distributed in time as generated

from a normal distribution centroid over topics, where the centroid of

each following time frame is generated from that of the preceding period. In a

slightly different approach, Wang and McCallum (2006) assume that each document

chooses its own time stamp based on a topic-specific beta distribution.

Shan and Li (2010) provide a thorough overview of topic evolution models

based on LDA with time (i.e., post-discretizing or pre-discretizing methods).

To track topic number and structure changes over continuous time, Elshamy

(2013) proposed a model based on the online hierarchical Dirichlet process.

Another class, as summarized by Daud et al. (2010), is directed probabilistic topic models: an unsupervised learning paradigm with soft clustering abilities.

Though these studies provided valuable information, they all show relatively

narrow scope: in these works, both corpus and vocabulary size are predetermined,

which renders the methods inapplicable to the online environment.

Another drawback to the DTM model is that it imposes constraints on the

time period, penalizing large changes from year to year. Hall et al. (2008)

propose an alternative, where post-hoc calculations are performed based on the

observed probability of each topic given the current time period.

3 The Dynamic Topic Model method (Blei et al.)

3.1 Theoretical foundations

In Dynamic Topic Models, Blei et al. (2006) attempt to extend the basic topic

model described in Latent Dirichlet Allocation (Blei et al., 2003) to a state

space model. A basic topic model using Latent Dirichlet Allocation (LDA)

is a generative probabilistic model of a corpus of documents, and the typical

inferential problem that one tries to solve is the estimation of the posterior

probability of the hidden variables given a document:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β),

where θ is the topic proportions of the document, z is the topic assignments of the words in the document, w is the distribution of words in the document, α is a vector of K concentration parameters of the Dirichlet distribution from which θ is drawn, and β contains the multinomial distributions of the K topics over words in the vocabulary. Due to the intractability of this distribution, the method of variational inference is used to approximate the above posterior.
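As a point of reference, the generative side of basic LDA (the process that the posterior above inverts) can be sketched as a toy simulation; the sizes and hyperparameters below are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 3, 50, 40                  # illustrative topic count, vocabulary, doc length

alpha = np.full(K, 0.1)                    # Dirichlet concentration parameters
beta = rng.dirichlet(np.full(V, 0.01), K)  # K multinomial topic-word distributions

theta = rng.dirichlet(alpha)               # topic proportions of one document
z = rng.choice(K, size=n_words, p=theta)   # topic assignment for each word
w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
```

Inference runs this process in reverse: given only the words w, it recovers plausible θ, z, and β.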

The model described in Blei et al. expands on the generative model described above by extending it to a state space model. This is done by linking time-consecutive vectors of parameters with a normal distribution. Adopting the notation used in Blei et al., we denote by β_{k,t} the vector of natural parameters for topic k at time t, where the length of the vector, denoted V, equals the number of words in the vocabulary. Although the Dirichlet distribution is typically used to model uncertainty for distributions over words, it has a drawback in not being amenable to sequential modeling. For this reason, Blei et al. elect to link the natural parameters β_{k,t} with a state space model that evolves with Gaussian noise:

β_{k,t} | β_{k,t−1} ∼ N(β_{k,t−1}, σ²I).

A similar connection is made between α_t and α_{t−1}, where α_t is the mean of the logistic normal from which the topic proportions θ are drawn:

α_t | α_{t−1} ∼ N(α_{t−1}, δ²I).

After defining the links between consecutive time state spaces, the generative process for the dynamic topic model described in Blei et al. can be specified at a time slice t as:

1. Draw topics β_{k,t} | β_{k,t−1} ∼ N(β_{k,t−1}, σ²I).

2. Draw α_t | α_{t−1} ∼ N(α_{t−1}, δ²I).

3. For each document:

   (a) Draw η ∼ N(α_t, a²I).

   (b) For each word:

      i. Draw Z ∼ Mult(π(η)).

      ii. Draw W_{t,d,n} ∼ Mult(π(β_{t,z})),

where π maps natural parameters to the mean parameters:

π(x_t)_w = exp(x_{t,w}) / Σ_w exp(x_{t,w}).
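A single time slice of this generative process can be sketched as a toy simulation (the dimensions and the variances σ², δ², a² below are assumed for illustration; no inference is performed):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 1000, 10                    # illustrative vocabulary and topic count
sigma2, delta2, a2 = 0.01, 0.01, 0.1

def softmax(x):
    # pi: map natural parameters to mean (probability) parameters
    e = np.exp(x - x.max())
    return e / e.sum()

def advance_slice(beta_prev, alpha_prev):
    # State space step: drift topics and alpha with Gaussian noise
    beta_t = beta_prev + rng.normal(0.0, np.sqrt(sigma2), size=beta_prev.shape)
    alpha_t = alpha_prev + rng.normal(0.0, np.sqrt(delta2), size=alpha_prev.shape)
    return beta_t, alpha_t

def draw_document(beta_t, alpha_t, n_words=50):
    eta = alpha_t + rng.normal(0.0, np.sqrt(a2), size=K)  # eta ~ N(alpha_t, a^2 I)
    theta = softmax(eta)                                  # topic proportions
    z = rng.choice(K, size=n_words, p=theta)              # Z ~ Mult(pi(eta))
    words = [rng.choice(V, p=softmax(beta_t[k])) for k in z]  # W ~ Mult(pi(beta_{t,z}))
    return z, words

beta, alpha = rng.normal(size=(K, V)), np.zeros(K)  # state at time t-1
beta, alpha = advance_slice(beta, alpha)            # advance to time t
z, words = draw_document(beta, alpha)
```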

While the use of Gaussian models allows for a straightforward interpretation,

it does have a drawback: the non-conjugacy of the Gaussian and multinomial models causes posterior inference to be intractable. To get around this intractability,

Blei et al. make use of two methods: Variational Kalman Filtering and Variational

Wavelet Regression. When applied to a unigram model, the variational

approximations of both methods were able to smooth out the local fluctuations

of the unigram word counts while still preserving the sharp peaks that might


be indicative of a significant change in the corpus at a certain time period. One

advantage that the wavelet regression method seems to have over the Kalman

filter method is that it is able to distinguish peaks that are close together.

3.2 Alternative Methods

In this section we will discuss similar methods that exist in the literature, focusing

on the advantages and drawbacks of each one.

Most non-trivial attempts to solve the problem of dynamic topic modeling

are based on one of two main methods: LDA and Probabilistic Latent Semantic

Analysis (PLSA). PLSA is a latent variable model typically used to infer the mixture of topics of a given document d containing words w from some corpus. PLSA can also be viewed as a precursor to LDA. Indeed, LDA was developed to address the shortcomings of PLSA, in particular the fact that the latter does not properly describe the generative process by which new documents are created.

The Dynamic Topic Model by Blei et al. adapts the typical LDA topic model

to a discrete time state space model that is capable of determining the trends

of different topics in a corpus over time. This type of dynamic topic model, a

discrete dynamic topic model (dDTM), is not without its drawbacks. One of these drawbacks is that time is discretized. This

drawback is addressed in papers that propose alternative models, such as Topics over Time by Wang and McCallum (2006) and Continuous Time Dynamic Topic Models by Wang (2008). In Continuous Time Dynamic Topic Models

the authors explain the reasoning behind using discretized time: the resolution or granularity at which the time scale is chosen could either undermine the assumption of exchangeability between documents in the same time step (if the scale is too large) or cause the number of variational parameters to increase at an unmanageable rate (if the scale is too small). The authors then go on to

propose and develop a replacement model called the continuous dynamic topic

model (cDTM) which can be interpreted as the limit of a dDTM at the smallest

time scale. Although natural parameters are still used in the cDTM to represent the topics, in modeling how a topic changes over time the paper departs from the dDTM's discrete Gaussian process and instead proposes to use Brownian motion:

β_{k,0} ∼ N(m, v₀I),

β_{k,j} | β_{k,i}, s ∼ N(β_{k,i}, v Δ_{j,i} I).

Here i, j are arbitrary time indices such that j > i > 0, Δ_{j,i} is the time elapsed between time stamps s_i and s_j, and the variance is a function of the lag between observations or documents.
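The practical consequence is that a single Gaussian draw whose variance scales with the lag replaces a chain of per-tick draws. A minimal sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
V, v0, v = 100, 1.0, 0.05       # illustrative vocabulary size and variance parameters

beta_0 = rng.normal(0.0, np.sqrt(v0), size=V)  # beta_{k,0} ~ N(m, v0 I), taking m = 0

def draw_beta_after_lag(beta_i, lag):
    # beta_{k,j} | beta_{k,i} ~ N(beta_{k,i}, v * Delta_{j,i} * I):
    # the variance grows with the elapsed time, so the discrete ticks
    # between two observations never have to be represented.
    return beta_i + rng.normal(0.0, np.sqrt(v * lag), size=beta_i.shape)

beta_late = draw_beta_after_lag(beta_0, lag=7.5)  # one draw spans the whole gap
```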

After establishing the process for modeling the natural parameters β_{k,t} and how they evolve through time, Wang goes on to outline the generative process in a fashion similar to that of the dDTM, with two major differences: β_{k,t} is drawn from Brownian motion as a Gaussian process, and the parameters used to determine the distribution of topics in a document, θ_t, are now drawn from Dir(α), where α appears to be constant with respect to time.

The paper explains in further depth that when the dDTM looks at a set of observations over an interval of time, this interval has to be divided into discrete ticks. In the dDTM, a parameter controls the variance at each of these ticks, and the variance over the whole interval is represented by this parameter multiplied by the number of ticks made. This leads to the full distribution over terms being explicitly represented at each tick. As the granularity of the model becomes finer with a greater number of ticks, posterior inference becomes more computationally taxing, even if the documents are sparsely distributed across the time interval. Since in the cDTM the variance is a function of the lag between observations, the probabilities at discrete steps between observations can be ignored. This also allows inference to be handled sparsely, which transforms the problem of selecting granularity from a computational concern into a modeling one. Much like the dDTM, the cDTM uses variational methods for inference, more

specifically the Kalman filter method. Although in theory the Kalman Filter

algorithm can be applied to the cDTM, as mentioned previously the finer the

desired granularity the greater the demand for computational resources to approximate

the posterior inference. To get around this computational concern,

a sparse Kalman filter method is developed. The main idea behind this sparse

version of the Kalman filter method is to not explicitly represent β̂_t of a word w if it has not been observed at time stamp t.

Temporal Text Mining (TTM) is another method for modeling topics in a corpus of documents and how they evolve and change over time. In a paper by

Mei and Zhai (2005), the authors lay out the process by which a TTM model could be built to discover the evolutionary theme patterns (ETP) in a corpus

of documents. They first define a topic or a theme as a probability distribution

of words that share a topic in common across a corpus. A theme span is then

defined as a theme that spans a given time interval (in the paper the terms theme and theme span are used interchangeably). Next, an evolutionary transition is defined as a connection between two topics whose similarity score is above a predefined threshold. After introducing these definitions, Mei and Zhai build a Theme Evolution Graph: a weighted directed graph where each vertex is a theme (span) and the edges between them are the evolutionary transitions. In order to find the probability distribution associated with the theme spans, the Expectation Maximization (EM) algorithm is implemented. To find the evolutionary transitions, they use the KL-divergence between two theme distributions: if the similarity derived from it is above a certain threshold, an edge is drawn between the vertices representing the two theme distributions.
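A minimal sketch of such a transition test, with hypothetical theme distributions and threshold, reading the criterion as requiring small KL-divergence (i.e., high similarity) between the two themes:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two theme (word) distributions
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def has_transition(p, q, max_divergence=0.5):
    # Hypothetical rule: an evolutionary transition (graph edge) exists
    # when the divergence between the two themes is small enough.
    return kl_divergence(p, q) <= max_divergence

theme_a = [0.5, 0.3, 0.2]    # hypothetical theme distributions over 3 words
theme_b = [0.45, 0.35, 0.2]  # close to theme_a  -> edge
theme_c = [0.05, 0.05, 0.9]  # far from theme_a  -> no edge
```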


4 The improved method (Hall and Jurafsky)

4.1 Motivation (the drawbacks of Blei et al.)

In the paper ‘Studying the History of Ideas Using Topic Models’ (2008), Hall, Jurafsky, and Manning applied the idea of modeling topic evolution to the domain of scientific literature, in what remained the state-of-the-art model for the task of topic evolution until 2011 (Thomas et al., 2011).

Focusing on the domain of Computational Linguistics, the authors induce topic clusters from the documents in the ACL Anthology corpus to study

historical trends in the directions of scientific research. Using the estimations of

the strength metric, they identify increase or decline in the popularity of each

topic.

The crux of their method is a straightforward application of unsupervised topic modeling based on the LDA model by Blei et al. (2003). To model temporal relationships, the authors begin by adopting Blei’s Dynamic Topic Modeling approach. As discussed before, the DTM represents each year’s documents as generated from a normal distribution centroid over topics; every t-th year’s centroid is generated from that of year t − 1. The key improvement of Hall et al. over Blei et al. addresses the following drawback of the DTM: the original model imposes constraints on time periods, penalizing large year-to-year changes. (This is also true for the Topics over Time model of McCallum et al., where each document is assumed to choose its own time stamp based on a topic-specific beta distribution that stays relatively inflexible.)

4.2 Theoretical foundations

To address the shortcomings of the model by Blei et al., Hall et al. propose a

model with one novel element: the use of post-hoc calculations based on the observed probability of each topic given the current year. An empirical probability

that an arbitrary paper d written in year y was generated from topic z is defined

as follows:

p̂(z|y) = Σ_{d : t_d = y} p̂(z|d) p̂(d|y)

       = (1/C) Σ_{d : t_d = y} p̂(z|d)

       = (1/C) Σ_{d : t_d = y} Σ_{z′_i ∈ d} I(z′_i = z),

where t_d denotes the date a document was written, and the probability p̂(d|y) is

taken to be a constant 1/C. Hall et al. perform parameter estimation using

collapsed Gibbs sampling (Griffiths and Steyvers, 2004).
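A minimal sketch of this post-hoc computation, assuming the per-document distributions p̂(z|d) have already been estimated (the input format and values below are hypothetical):

```python
from collections import defaultdict

def topic_strength_by_year(docs):
    # docs: list of (year, p_z_given_d) pairs, where p_z_given_d is the
    # per-topic probability vector of one document (hypothetical format).
    by_year = defaultdict(list)
    for year, p_z in docs:
        by_year[year].append(p_z)
    strength = {}
    for year, dists in by_year.items():
        C = len(dists)              # p_hat(d|y) is taken to be the constant 1/C
        K = len(dists[0])
        strength[year] = [sum(d[k] for d in dists) / C for k in range(K)]
    return strength

docs = [(2006, [0.7, 0.3]), (2006, [0.5, 0.5]), (2007, [0.2, 0.8])]
```

Plotting strength[year] for each topic over the years gives the rise-and-decline curves that Hall et al. examine.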

To examine the change in strength of each topic over time, the authors begin

by fitting a model of 100 topics from which the supposed generative process

draws. The authors, however, keep only the 36 topics that they judge relevant to the corpus, proceeding to manually select seed words for an additional 10 topics. The resulting 46 topics were then used as priors for a new run of fitting

100 topics to the corpus. (The authors maintain transparency in explicitly

marking the topics derived from manual seeds in their paper.)

The remainder of the investigation by Hall et al. is purely applied and of little direct interest with respect to the subject of this overview: the authors examine

trends in the probability mass of a given topic over time, visualized as a plot of

a smoothing of ˆp(z|y).

4.3 Corpus entropy and cross-corpora convergence

A separate purely practical application explored by Hall and Jurafsky in the

paper is a model that examines topic diversity in the proceedings of several scientific

conferences. Interested in cross-conference comparison of trend dynamics

that exist within a given conference proceedings corpus, the authors define the

empirical distribution of a topic z at a conference c as follows:

p̂(z|c) = Σ_{d : c_d = c} p̂(z|d) p̂(d|c)

       = (1/C) Σ_{d : c_d = c} p̂(z|d)

       = (1/C) Σ_{d : c_d = c} Σ_{z′_i ∈ d} I(z′_i = z),

which they then condition on the year for each conference, transitioning to p̂(z|y, c).

Following that, the authors introduce the second novelty in their paper: they

propose to study the breadth of a conference output by introducing the topic

entropy:

H(z|c, y) = − Σ_{i=1}^{K} p̂(z_i | c, y) log p̂(z_i | c, y).

The conditional entropy of the topic distribution of a conference is a measure of the average amount of information expressed by each assignment to a random variable. A higher topic entropy value signifies that the probability mass is more evenly distributed across topics.
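A small sketch of the entropy computation, with hypothetical distributions illustrating the contrast between evenly spread and concentrated probability mass:

```python
import math

def topic_entropy(p):
    # H(z|c,y) for one conference-year topic distribution p_hat(z_i|c,y)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly  -> high entropy (broad venue)
peaked  = [0.97, 0.01, 0.01, 0.01]  # mass concentrated   -> low entropy (narrow venue)
```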

To investigate whether the topic distributions of the conferences were converging,

the authors examined the Jensen-Shannon divergence between each

pair of conferences.
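The Jensen-Shannon divergence is the mean KL-divergence of the two distributions to their midpoint, which makes it symmetric. A sketch with hypothetical conference topic distributions:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint of p and q
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

conf_a = [0.4, 0.4, 0.2]     # hypothetical topic distributions of two conferences
conf_b = [0.35, 0.45, 0.2]
```

A shrinking JS divergence between a pair of conferences over the years is the convergence signal the authors look for.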

4.4 Drawbacks of the model

An illustration of some of the drawbacks of the model described above can be

seen when studying literature that applies the Hall model to software engineering repositories (Linstead et al.; Thomas et al.). From Thomas et al. (2011):

“The Hall model was initially designed for corpora that have different properties than source code histories. The Hall model typically operates on corpora in which each version is completely different (e.g., conference proceedings) from the previous. Files in source code histories, however, are typically incrementally updated between versions (e.g., main.c version 34 versus main.c version 35), resulting in partial or full duplications of files between successive versions. After applying the Hall model to source code histories, we have found that this duplication effect causes unintended results that reduces its effectiveness.” Indeed, the Hall model applies LDA to all versions at the same time. “To combat these issues, we propose the Diff model, an extension to the Hall model that operates only on the changes between versions of the repository.”

5 Further research developments

In Thomas et al. (2011), the Hall model is adopted for the study of topics

in versions of the same software development documents. The intuitive and

simple application of LDA to all versions of a document makes the Hall model

attractive for use on such datasets: during the post-processing phase, topic

metrics are computed by first isolating all documents in a particular version.

Assignment of a topic z_a at version V_j is computed as the summation of the topic’s memberships across all documents in V_j:

A(z_a, V_j) = Σ_{i=1}^{|V_j|} θ_{d_{i,j}, a}.

However, since most changes are very small, the word co-occurrences that LDA operates on will largely remain the same, skewing the results in the direction of the duplicated files.
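The per-version summation can be sketched directly, assuming the per-document topic memberships θ are given (the values below are hypothetical):

```python
def topic_assignment(theta_docs, a):
    # A(z_a, V_j): sum topic a's membership over all documents in version V_j.
    # theta_docs: per-document topic membership vectors for the version.
    return sum(theta[a] for theta in theta_docs)

# Hypothetical memberships for two topics across three files of one version
version_34 = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
```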

The authors thus propose a model that adds a “diff” step to the Hall model to isolate the changes between successive versions of each document (illustrated in the authors’ paper).

In comparison with the Diff model, the Hall model should find less distinct topics (each topic having multiple concepts confounded), and new documents will tend to be matched to fewer topics, resulting in fewer spikes in the evolution.

Significant developments over the initial LDA model were made in the direction of online inference. In the online LDA model (OLDA) (AlSumait et al. 2008), an algorithm for mining text streams, an empirical Bayes method creates a

new model whenever a text item is added to the corpus.


Iwata et al.’s Multiscale Dynamic Topic Model (MDTM) (2010) employs

multiple timescales to analyze topic evolution in online document datasets. In this

model, the multiscale word distributions of the former epoch are assumed to

generate current topic-specific distributions. In a unique approach, the MDTM

detects topic evolution in multiple time scales as well as respective topic popularity,

using the Dirichlet-multinomial distribution to describe the topic evolution.

In a rare detour from the reliance on the Dirichlet distribution, Nallapati

et al. (2010) model topic evolution using a non-homogeneous Poisson process in their Multiscale Topic Tomography (MTTM) method.

Both MTTM and MDTM have a drawback in treating the number of topics

as well as time interval and scale as fixed. A method that clusters news articles

into storylines to extract popular topics and key ‘entities’ within each story,

as proposed by Ahmed et al. (2008), combines LDA with clustering. The

clustering is organized hierarchically: relevant news are gathered into a story,

and topic models are applied to each story separately. The strength of each

storyline is governed by a non-parametric model based on a generalization of

the Chinese Restaurant Process algorithm (which can be naturally interpreted

as a hierarchical Dirichlet process).

Both OLDA and MDTM are LDA-based models characterized by a relatively

low algorithm time complexity, allowed in part by treating the number of topics

as fixed (in contrast to the non-parametric Bayesian model by Ahmed et al.).

In all three online inference models discussed, an EM algorithm or a Gibbs

sampling method is employed.

6 Open questions and future research directions

One of the critical issues in topic modeling is the posterior inference for individual

data instances. This problem is particularly important in streaming online

environments, and some existing work (Than et al., 2015) suggests the use of

the Frank-Wolfe algorithm for recovering sparse solutions to posterior inference.

Another significant drawback of most literature expanding on the PLSA and LDA models is the tendency to neglect auxiliary information (citation networks, authors, etc.). Yet augmenting the LDA approach with auxiliary

metadata during the process of topic extraction could potentially aid the interpretation

of the topic model.

A major challenge pertains to the evaluation of experimental results. In

practice, there is no coherence between model perplexity as evaluated on the

held-out test set and the semantic importance of topic evolution results. According

to Chang et al. (2009), topic evolution models that show better held-out performance may actually yield fewer semantically meaningful topic evolution results. There exist no unified criteria for the evaluation of topic evolution models, making quantitative comparison of experimental results among different models either infeasible or, when proxied via qualitative methods, subjective. Establishing a standard metric for the evaluation of topic evolution is an important task for future work.


Another problematic statistical assumption pertaining to the probabilistic topic models discussed is the “bag of words” principle. In natural language, the semantic relationships among words, as well as their order, appear to matter for the evolution of topics, and their omission is a relaxation that may come at the cost of semantic meaningfulness.

There exist models (Wallach, 2006) that propose to replace words with phrases as basic semes. We also believe that natural language processing (NLP) technology

is applicable to topic evolution research and may yield more practical

and meaningful findings.

In conclusion, 15 years since the introduction of Latent Dirichlet Allocation, topic modeling is a mature area of research, but the more complex niche within it, topic evolution modeling, has been developing without concrete methods for cross-model comparison, albeit in a coherent and definite direction.

Overcoming the challenges outlined above would yield a strong unification of

the scientific efforts towards more semantically meaningful results.