Massive mobility data clustering,
some experiments and problems

Etienne Côme COSYS/GRETTIA
Ifsttar

Labex Bezout seminar, 14 May 2019

Etienne Côme

Ifsttar,

urban and mobility data

data science, unsupervised learning

Background

Massive mobility data

Mobility




Leaves a lot of digital footprints

Sensors everywhere ...


For whom, for what?

Quite voluminous! Must be analysed

→ communicate results

reappropriation

Smart-card data

to handle payment, but also ...

smart-card data vs. surveys


Interests for transport operators...


... and for the public stakeholders

Visualize and analyze Vélib' data

Two data sources:

Stocks

available as open data (real-time apps)

Flows

Origins / Destinations: sometimes available as open data (London, New York, Boston, ...), but frequently not

Animated bike stocks

Stock data: vlsstat

Discriminative Functional Mixture Model

\begin{eqnarray} X(t)=\sum_{j=1}^{p}\gamma_{j}(X)\psi_{j}(t),\label{eq:X} \end{eqnarray}
where $\gamma=(\gamma_{1}(X),...,\gamma_{p}(X))$ is a random vector in $\mathbb{R}^{p}$,
\begin{eqnarray} p(\gamma)=\sum_{k=1}^{K}\pi_{k}\phi(\gamma;U\mu_{k},U^{t}\Sigma_{k}U+\Xi). \end{eqnarray}

Estimation: EM + (fPCA) step = FunFEM
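The basis expansion $X(t)=\sum_j\gamma_j(X)\psi_j(t)$ can be illustrated with a minimal sketch. It assumes a Fourier basis for the $\psi_j$ and an hourly sampling over a 24-hour period; neither choice is stated on the slide, and the function names `fourier_basis`, `project` and `reconstruct` are hypothetical. It only covers the projection of one curve onto the basis, not the mixture model or the FunFEM estimation.

```python
import math

def fourier_basis(p, t, period=24.0):
    """Values of the first p Fourier functions psi_1..psi_p at time t."""
    values = [1.0]
    m = 1
    while len(values) < p:
        w = 2.0 * math.pi * m * t / period
        values.append(math.cos(w))
        if len(values) < p:
            values.append(math.sin(w))
        m += 1
    return values

def project(curve, p, period=24.0):
    """Least-squares coefficients gamma_1..gamma_p of one sampled curve.

    Assumes the curve is sampled on a uniform grid covering one period and
    that the highest frequency used is below len(curve) / 2: the basis
    functions are then orthogonal on the grid, so each coefficient is a
    ratio of inner products.
    """
    n = len(curve)
    basis = [fourier_basis(p, period * i / n, period) for i in range(n)]
    gamma = []
    for j in range(p):
        num = sum(curve[i] * basis[i][j] for i in range(n))
        den = sum(basis[i][j] ** 2 for i in range(n))
        gamma.append(num / den)
    return gamma

def reconstruct(gamma, t, period=24.0):
    """X(t) = sum_j gamma_j * psi_j(t)."""
    basis = fourier_basis(len(gamma), t, period)
    return sum(g * psi for g, psi in zip(gamma, basis))

# Hypothetical hourly loading profile built from known coefficients.
true_gamma = [0.5, 0.2, -0.1, 0.05, 0.0]
hourly = [reconstruct(true_gamma, hour) for hour in range(24)]
estimated = project(hourly, p=5)
```

The vector `estimated` plays the role of $\gamma$: the clustering model is then defined on these coefficients rather than on the raw curves.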

Clustering of stocks

The Discriminative Functional Mixture Model for the Analysis of Bike Sharing Systems [preprint]

Functional clustering of stock data

Urban dynamics through the observed flows

http://www.comeetie.fr/galerie/velib/

Model


Simple generative model:

$$Z_s\sim\mathcal{M}(1,\pi)$$
$$X_{sdt}|\{Z_{sk}=1,W_{dl}=1\}\sim\mathcal{P}(\alpha_s\lambda_{klt})$$

+ constraints $\sum_{l,t}D_l\lambda_{klt}=DT, \forall k \in\{1,...,K\}$,
with $D_l$ the number of days in $l$.

Model


Likelihood:

$$L_c(\mathbf{\Theta};\mathbf{X},\mathbf{Z},\mathbf{\alpha},\mathbf{W})=\sum_{s,k}Z_{sk}\log\left(\pi_{k}\prod_{d,t,l}po(X_{sdt};\alpha_s\lambda_{klt})^{W_{dl}}\right)$$

where $po(\cdot;\lambda)$ is the Poisson probability mass function with rate $\lambda$.

Estimation by EM, plus an extension that uses weather data
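The complete-data log-likelihood above can be evaluated directly once the assignments are known. The sketch below is an illustration under assumptions, not the estimation code used for the talk: it reads $s$ as a station, $d$ as a day, $t$ as a time slot and $l$ as a day type, and it encodes the indicators $Z_{sk}$ and $W_{dl}$ as index lists `z` and `w`. All function and argument names are hypothetical.

```python
import math

def poisson_logpmf(x, rate):
    """Log of the Poisson probability mass function po(x; rate)."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)

def complete_loglik(X, z, w, pi, alpha, lam):
    """Complete-data log-likelihood of the Poisson mixture.

    X[s][d][t] : observed count for station s, day d, time slot t
    z[s]       : cluster index k such that Z_sk = 1
    w[d]       : day-type index l such that W_dl = 1
    pi[k]      : cluster proportions
    alpha[s]   : station scaling factor
    lam[k][l][t] : cluster / day-type / time-slot intensity
    """
    total = 0.0
    for s, days in enumerate(X):
        k = z[s]
        total += math.log(pi[k])
        for d, counts in enumerate(days):
            l = w[d]
            for t, x in enumerate(counts):
                total += poisson_logpmf(x, alpha[s] * lam[k][l][t])
    return total

# Tiny hypothetical example: one station, one day, two time slots.
value = complete_loglik(
    X=[[[0, 1]]], z=[0], w=[0], pi=[1.0], alpha=[2.0], lam=[[[0.5, 0.5]]]
)
```

In an EM algorithm the hard indices `z` would be replaced by posterior probabilities in the E step; the sketch keeps hard assignments to stay close to the formula.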

Urban dynamics through the observed flows

http://www.comeetie.fr/galerie/velib/

Crossing with contextual data (socio-economic)

Cluster        inhab./ha   jobs/ha   services/ha   shops/ha
*                    162       237           4.2        3.7
Leisure (1)          367       189           6.3        4.4
Leisure (2)          261       322           7.7        6.9
Parks                172        90           2          1.7
Stations             209       206           2.4        1.8
                     375       108           3.8        2.7
Jobs (1)             138       409           4.5        2.8
Jobs (2)             157       456           5.7        5.6
Average              301       163           3.8        2.8

Latent Dirichlet Allocation

for dynamic O/D matrix analysis

Local stationarity of bike sharing system (BSS) behaviour / OD
Small bags of successive trips $\approx$ stationarity of OD
Documents (bags of words) = bags of successive trips (5000)
With :

Latent Dirichlet Allocation

for dynamic origin-destination matrix analysis


For each latent activity $a$, draw its template: $\Lambda_a\sim\mathcal{D}(\beta)$
For each bag of trips $t$:
    Draw the proportions of activities: $\pi_t \sim \mathcal{D}(\alpha)$
    For each trip:
        Draw its activity: $A \sim \mathcal{M}(1,\pi_t)$
        Draw an OD pair using the activity template: $D \sim \mathcal{M}(1,\Lambda_A)$
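The generative process above can be simulated in a few lines. This is a minimal sketch assuming symmetric Dirichlet priors and a flattened list of OD pairs as the vocabulary; the function names and the default hyperparameter values are hypothetical, and the code only samples from the model (it does not perform inference).

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def sample_bags(n_bags, trips_per_bag, n_activities, n_stations,
                alpha=0.5, beta=0.5, seed=0):
    """Sample activity templates and bags of trips from the LDA model."""
    rng = random.Random(seed)
    # The "words" are origin-destination pairs.
    od_pairs = [(o, d) for o in range(n_stations) for d in range(n_stations)]
    # One template Lambda_a per latent activity: a distribution over OD pairs.
    templates = [dirichlet([beta] * len(od_pairs), rng)
                 for _ in range(n_activities)]
    bags = []
    for _ in range(n_bags):
        # Proportions of activities pi_t for this bag of trips.
        pi_t = dirichlet([alpha] * n_activities, rng)
        trips = []
        for _ in range(trips_per_bag):
            activity = rng.choices(range(n_activities), weights=pi_t)[0]
            trips.append(rng.choices(od_pairs, weights=templates[activity])[0])
        bags.append(trips)
    return templates, bags

templates, bags = sample_bags(n_bags=3, trips_per_bag=50,
                              n_activities=2, n_stations=3)
```

Each bag plays the role of a document and each trip the role of a word, which is what allows standard LDA inference to be reused on trip data.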

OD (Tensor) decomposition results

Model selection with perplexity analysis
(clear drop for K=5)

Latent activity template $\Lambda_a$

Draw $N_{dep}$ trips using $\Lambda_a$:
$$OD\sim\mathcal{M}(N_{dep},\Lambda_a)$$
Compute the balance (incoming bikes - leaving bikes) for a station $s$:
$$B_s=\sum_jOD_{js}-\sum_jOD_{sj}$$
Compute the expectation for each station:
$$\mathbb{E}[\mathbf{B}]=N_{dep}(\Lambda_a^t-\Lambda_a)\mathbf{1}$$
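The expected balance follows from $\mathbb{E}[OD]=N_{dep}\Lambda_a$: column sums give expected arrivals and row sums expected departures. A minimal sketch, assuming the template is stored as a nested list whose entry `[i][j]` is the probability of a trip from station `i` to station `j` (the function name and the toy template are hypothetical):

```python
def expected_balance(template, n_dep):
    """Expected balance (arrivals - departures) per station.

    template[i][j] is the probability that a trip goes from station i
    to station j; all entries sum to 1.
    """
    n = len(template)
    incoming = [sum(template[j][s] for j in range(n)) for s in range(n)]
    outgoing = [sum(template[s][j] for j in range(n)) for s in range(n)]
    return [n_dep * (incoming[s] - outgoing[s]) for s in range(n)]

# Toy template with two stations: station 0 mostly emits trips towards station 1.
toy_template = [[0.0, 0.5],
                [0.25, 0.25]]
balance = expected_balance(toy_template, n_dep=1000)
```

The balances always sum to zero, since every trip leaves one station and arrives at another (or the same) one.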

Station balances: home→work

Station balances: work→home

Station balances: beginning of the evening

Gravity-LDA

LDA extension that takes station context into account

Replace the O/D matrix templates $\Lambda_k$ by a parametric form which depends on departure-station features $\mathbf{x_{u}}$, arrival-station features $\mathbf{x_{v}}$ and station-pair features $\mathbf{x_{uv}^{da}}$:
\begin{equation} \Lambda_{\,uv}(\Theta_k) \,=\, \frac{ \exp(\mathbf{\theta_{k}^{d}}^\top \mathbf{x_{u}} + \mathbf{\theta_{k}^{a}}^\top \mathbf{x_{v}} + \mathbf{\theta_{k}^{da}}^\top \mathbf{x_{uv}^{da}}) } {\sum\limits _{u,v}\; \exp(\mathbf{\theta_{k}^{d}}^\top \mathbf{x_{u}} + \mathbf{\theta_{k}^{a}}^\top \mathbf{x_{v}} + \mathbf{\theta_{k}^{da}}^\top \mathbf{x_{uv}^{da}}) } \end{equation}

Inspiration from gravity models for O/D matrices
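The parametric template is a softmax over all OD pairs of a linear score. The sketch below evaluates it for one activity $k$, assuming features are given as plain lists; the function names, argument layout and the toy features (a single attractiveness value per station, a single distance-like value per pair) are hypothetical and only serve the illustration.

```python
import math

def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gravity_template(x, x_pair, theta_d, theta_a, theta_da):
    """OD template Lambda_uv for one activity, as a softmax over all pairs.

    x[u]         : feature vector of station u
    x_pair[u][v] : feature vector of the pair (u, v), e.g. a distance
    theta_d, theta_a, theta_da : departure, arrival and pair coefficients
    """
    n = len(x)
    scores = [[dot(theta_d, x[u]) + dot(theta_a, x[v])
               + dot(theta_da, x_pair[u][v])
               for v in range(n)] for u in range(n)]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    weights = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

stations = [[1.0], [2.0]]
pairs = [[[0.0], [1.0]],
         [[1.0], [0.0]]]
uniform = gravity_template(stations, pairs, [0.0], [0.0], [0.0])
penalised = gravity_template(stations, pairs, [0.0], [0.0], [-1.0])
```

With all coefficients at zero the template is uniform; a negative coefficient on the pair feature lowers the probability of pairs with a larger value of that feature, as in a gravity model with distance decay.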

Estimation by Collapsed CEM

Conclusion on OD analysis

Extension

Bonus: city portraits at night

Transit networks analysis

Smart-card data form


Open data? (e.g. "Île-de-France Mobilités", in aggregated form)


A particular field : user id

A massive dataset


2 years of data

With user ids ?

Without user ids ?

Analyzing in-flow volumes

Demand profiles with spatio-temporal variations

Between-day variations are clearly visible (hierarchical agglomerative clustering)

Which is mainly explained by calendar effects


Which can be used to detect outliers

#Rennes #metro #Star chairs thrown onto the elevated metro line at Villejean. Significant damage. Traffic interrupted for 2 hours?

Samuel Nohra (@SamuelNohra), 29 March 2016

Or perform mid-term predictions


User ids for

data enrichment

Enables the reconstruction of a significant portion of the destinations

Data enrichment


72% of reconstructed destinations


→ Aggregation and analysis per origin/destination pair
→ Multimodal exchange hub analysis (C. Richer)
→ Dynamic OD matrices or Line graph of load
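The slides do not detail how destinations are reconstructed from user ids. One common heuristic, used here purely as an assumption, is trip chaining: within a day, the destination of a trip is taken to be the boarding station of the same user's next trip, and the last trip of the day is assumed to return to the first boarding station. The record layout and the function name below are hypothetical.

```python
from collections import defaultdict

def infer_destinations(validations):
    """Trip-chaining heuristic (an assumption, not the method of the talk).

    validations: iterable of (user_id, day, time, boarding_station).
    Returns a list of (user_id, day, time, origin, destination), where
    destination is None when it cannot be inferred.
    """
    by_user_day = defaultdict(list)
    for user, day, time, station in validations:
        by_user_day[(user, day)].append((time, station))

    trips = []
    for (user, day), taps in by_user_day.items():
        taps.sort()
        for i, (time, origin) in enumerate(taps):
            destination = None
            if len(taps) > 1:
                # Next boarding of the day, wrapping to the first one.
                candidate = taps[(i + 1) % len(taps)][1]
                if candidate != origin:
                    destination = candidate
            trips.append((user, day, time, origin, destination))
    return trips

sample = [("u1", 1, 8.0, "A"), ("u1", 1, 18.0, "B"), ("u2", 1, 9.0, "C")]
trips = infer_destinations(sample)
share = sum(t[4] is not None for t in trips) / len(trips)
```

Users with a single validation in the day, like `u2` here, are the typical cases where no destination can be inferred, which is why only a portion of the destinations is recovered.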

User clustering

for user centric analysis

Objectives


Methodology

Commuter patterns

Mean profile of a cluster with 4.55% of users
Mean profile of a cluster with 12.54% of users
Mean profile of a cluster with 3.6% of users

But other forms emerge

Mean profile of a cluster with 15.13% of users
Mean profile of a cluster with 6.44% of users
Mean profile of a cluster with 8.64% of users

Methodology

Using a continuous-time description

Generative model for continuous-time user clustering

Some comments on the results


Integrated classification likelihood

Model selection and Bayesian regularisation for discrete latent variable models

$$ICL_{ex}(\mathbf{X},\mathbf{Z})=\int_{\mathbf{\theta}}p(\mathbf{X},\mathbf{Z};\mathbf{\theta})p(\mathbf{\theta})d\mathbf{\theta}$$
$$\hat{\mathbf{Z}}=\arg\max_{\mathbf{Z}}ICL_{ex}(\mathbf{X},\mathbf{Z})$$

With $\mathbf{\theta}=(\mathbf{\beta},\mathbf{\pi})$ and independent priors, the integral factorises:

$$ICL_{ex}(\mathbf{X},\mathbf{Z})=\int_{\mathbf{\beta}}p(\mathbf{X}|\mathbf{Z};\mathbf{\beta})p(\mathbf{\beta})d\mathbf{\beta}\int_{\mathbf{\pi}}p(\mathbf{Z}|\mathbf{\pi})p(\mathbf{\pi})d\mathbf{\pi}$$
$$\mathbf{\pi} \sim \mathcal{D}(\alpha)$$
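With a Dirichlet prior on the proportions, the integral over $\pi$ has a closed form by Dirichlet-multinomial conjugacy. The sketch below computes its logarithm, assuming a symmetric prior $\mathcal{D}(\alpha,\dots,\alpha)$ over $K$ clusters (the symmetry is an assumption, and the function name is hypothetical). The integral over $\beta$ depends on the observation model and is not covered here.

```python
import math
from collections import Counter

def log_partition_evidence(labels, n_clusters, alpha=1.0):
    """log of the integral of p(Z | pi) p(pi) over pi, pi ~ Dirichlet(alpha).

    Closed form:
      log Gamma(K alpha) - K log Gamma(alpha)
      + sum_k log Gamma(n_k + alpha) - log Gamma(n + K alpha)
    where n_k is the number of items assigned to cluster k.
    """
    counts = Counter(labels)
    n = len(labels)
    value = math.lgamma(n_clusters * alpha) - n_clusters * math.lgamma(alpha)
    value += sum(math.lgamma(counts.get(k, 0) + alpha)
                 for k in range(n_clusters))
    value -= math.lgamma(n + n_clusters * alpha)
    return value

split = log_partition_evidence([0, 1], n_clusters=2)
merged = log_partition_evidence([0, 0], n_clusters=2)
```

With two items, two clusters and a uniform prior, the two values can be checked by hand: the integral of $\pi(1-\pi)$ over $[0,1]$ is $1/6$ and the integral of $\pi^2$ is $1/3$.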

Graph clustering: SBM



$$Z_i \,\sim\, \mathcal{M}(1,\pi)$$ $$X_{ij}|Z_{ik}Z_{jl}=1\,\sim\, \mathcal{B}(\beta_{kl})$$
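The stochastic block model above can be simulated directly. A minimal sketch, assuming a directed graph without self-loops (the slide does not specify either point); the function name and arguments are hypothetical.

```python
import random

def sample_sbm(n, pi, beta, seed=0):
    """Sample cluster labels and an adjacency matrix from an SBM.

    pi[k]      : probability that a node belongs to cluster k
    beta[k][l] : probability of an edge from a node of cluster k
                 to a node of cluster l
    """
    rng = random.Random(seed)
    z = rng.choices(range(len(pi)), weights=pi, k=n)
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Bernoulli draw with the block probability; no self-loops.
            if i != j and rng.random() < beta[z[i]][z[j]]:
                x[i][j] = 1
    return z, x

# Extreme case: edges only within clusters, so the blocks are fully visible.
z, x = sample_sbm(n=20, pi=[0.5, 0.5], beta=[[1.0, 0.0], [0.0, 1.0]])
```

Replacing the Bernoulli draw by a Poisson draw with rate $\theta_i\beta_{kl}\theta_j$ would give the degree-corrected variant.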

Graph clustering: SBM

degree correction



$$Z_i \,\sim\, \mathcal{M}(1,\pi)$$ $$X_{ij}|Z_{ik}Z_{jl}=1\,\sim\, \mathcal{P}(\theta_i\beta_{kl}\theta_j)$$

Graph clustering: SBM

degree correction and directed graphs



$$Z_i\,\sim\, \mathcal{M}(1,\pi)$$ $$X_{ij}|Z_{ik}Z_{jl}=1 \,\sim\, \mathcal{P}(\theta^{out}_i\beta_{kl}\theta^{in}_j)$$

SBM

simulated data

dc-SBM, real data

Political blogs (US)

Conclusion


Current works

Thanks for your attention

and to all my colleagues

Latifa Oukhellou

Mohamed El Mahrsi

Anne Sarah Briand

Florian Toqué

Cyprien Richer

Nicolas Coulombel

@comeetie