
Revisit time discretization for path rescaling #467

@nspope

Description


Rescaling intervals in tsdate are set by taking quantiles of "mutational area" or "mutational path length", i.e. dividing time into bins so that each bin contains an equal amount of area/path length. In the path-length case, it's possible to cook up scenarios where these quantiles get heavily skewed towards older times (basically when there are a ton of polytomies and a ton of samples), in which case the adjustment is too coarse in recent times.
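Roughly, the quantile scheme amounts to something like the following (a numpy sketch with illustrative names, not the actual tsdate implementation):

```python
import numpy as np

def quantile_breakpoints(node_ages, path_length, num_intervals):
    """Breakpoints such that each interval carries an equal share of the
    total mutational path length. Inputs are per-node arrays (illustrative)."""
    order = np.argsort(node_ages)
    ages = node_ages[order]
    cum_mass = np.cumsum(path_length[order], dtype=float)
    cum_mass /= cum_mass[-1]                               # cumulative fraction of mass
    targets = np.linspace(0, 1, num_intervals + 1)[1:-1]   # interior quantiles
    inner = np.interp(targets, cum_mass, ages)             # invert the cumulative mass
    return np.concatenate([[0.0], inner, [np.inf]])
```

If most of the mass sits on a few deep bins (lots of polytomies, lots of samples), the interpolated breakpoints all end up at old times and the recent past gets a single coarse interval.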

For example, here is a case with a bunch of artefactual polytomies and 40k samples, using the default settings (x-axis: true node ages, y-axis: inferred node ages):

[Image: true vs. inferred node ages, default number of rescaling intervals]

where it's clear that a single rescaling interval spans 0-100 generations. Upping the number of intervals by 10x gives:

[Image: true vs. inferred node ages, 10x the default number of rescaling intervals]
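For reference, the 10x run corresponds to turning up the rescaling-interval count, along these lines (the argument name is from memory of the current API, so check the docs for the installed version):

```python
import tsdate

# Hedged usage sketch: `inferred_ts` is a tsinfer-inferred tree sequence, and
# `rescaling_intervals` is my reading of the relevant argument (default 1000,
# I believe), so this is the "10x" setting shown above.
dated_ts = tsdate.variational_gamma(
    inferred_ts,
    mutation_rate=1.29e-8,        # illustrative value
    rescaling_intervals=10_000,
)
```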

One solution would be to use a fixed logarithmic grid, collapsing out bins with zero mutational mass (but this needs some thought, as it might blow up in the lower tail); a rough sketch is below. Further, I'm not sure this is actually a problem with real data (the examples above are pathological by design, and tsinfer makes nowhere near that many polytomies), but I worry it could become one at UKB numbers of samples. So it'd be worth seeing what the time discretization looks like on UKB.
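A rough sketch of the log-grid idea, merging empty bins into the following bin (names and the lower bound `t_min` are placeholders; the lower tail is exactly the part that needs thought):

```python
import numpy as np

def log_grid_breakpoints(node_ages, path_length, num_intervals, t_min=1e-2):
    """Fixed log-spaced breakpoints; bins with zero mutational mass are
    merged into the following bin. All arguments are illustrative."""
    t_max = node_ages.max()
    edges = np.concatenate([[0.0], np.geomspace(t_min, t_max, num_intervals)])
    mass, _ = np.histogram(node_ages, bins=edges, weights=path_length)
    keep = np.concatenate([[True], mass > 0])   # drop the right edge of empty bins
    keep[-1] = True                             # always keep the oldest breakpoint
    return edges[keep]
```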
