Functional genomics experiments such as RNA-seq measure gene expression under different conditions, for example disease versus normal, and this information is not specific to any one individual. Researchers are eager to share these data, but privacy concerns often preclude sharing the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and gene expression levels are released instead. Projects such as GTEx and ENCODE share these summaries because they ostensibly do not leak much identifying information.
Yale University researchers test this assumption by measuring how much information about genomic deletions leaks from signal profiles. They present information-theoretic measures of how well these deletions can be genotyped. They then develop practical genotyping approaches and demonstrate how these can be used in a linking attack to identify an individual within a large cohort. Finally, the researchers present an anonymization method that removes much of the leakage from signal profiles.
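To make the idea of an information-theoretic measure concrete, the sketch below computes the self-information (in bits) of a set of deletion genotypes, a generic way to express how strongly genotypes single out one person. It is not necessarily the measure used in the study: the function name `genotype_information`, the assumption that deletions are independent, and the genotype frequencies are all illustrative.

```python
import math

def genotype_information(genotypes, freqs):
    """Total self-information in bits: sum of -log2 P(genotype) over deletions.

    Assumes the deletions are independent (an illustrative simplification).
    """
    return sum(-math.log2(f[g]) for g, f in zip(genotypes, freqs))

# Hypothetical population frequencies of carrying 0, 1 or 2 deleted copies.
freqs = [
    {0: 0.5, 1: 0.25, 2: 0.25},
    {0: 0.75, 1: 0.125, 2: 0.125},
]

# An individual with two deleted copies of the first deletion and one of the second.
bits = genotype_information([2, 1], freqs)

# Roughly, b bits are enough to single out one person among 2**b.
cohort_size = 2 ** bits
```

Rare genotypes contribute more bits, so even a handful of correctly genotyped rare deletions can be enough to distinguish individuals in a sizeable cohort.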
Illustration of the attack scenario
a The adversary starts the attack with a signal profile dataset (S). This dataset contains a genome-wide signal profile and sensitive information (e.g., HIV status) for each individual. The names are anonymized into IDs, as shown in the blue shaded column. The adversary uses an SV panel (P_S) in the attack. This panel can be obtained from an outside source (1), or the adversary can discover it from the genome-wide signal profiles (2), as denoted by the shaded red arrows. The adversary then genotypes the SVs in the panel (3) and builds the genotyped SV dataset (G̃).
b The adversary acquires a second SV panel (P_G) and a genotype dataset (G), which contains the genotypes of the SVs in that panel for a large number of individuals. To link the genotyped SV dataset (G̃) to the genotype dataset (G), the adversary compares the SV panel P_S with the SV panel P_G and, for the matching SVs, compares the genotypes. The individuals in G whose genotypes are close to those of a signal profile individual, as measured by genotype distance, are linked to that individual, as indicated by the matching colored columns. This linking reveals the HIV status of the individuals in the genotype dataset.
c This example shows a large deletion in the NA12878 individual and how it affects signal profiles. A 70-kb region is deleted in NA12878, and the drop in the signal profiles shows the loss of signal along the deletion.
d This schematic shows how large and small deletions are manifested in signal profiles. Large deletions produce a large decrease in the signal, while small deletions have much smaller footprints.
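The genotyping step (3) and the linking in panel b can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the researchers' implementation: the depth-ratio thresholds, the flank size, the absolute-difference genotype distance, and all names (`genotype_deletion`, `link`, the toy profiles and reference genotypes) are hypothetical.

```python
from statistics import mean

def genotype_deletion(signal, start, end, flank=50):
    """Call 0, 1 or 2 deleted copies from read depth inside vs. next to the region."""
    inside = mean(signal[start:end])
    flanks = signal[max(0, start - flank):start] + signal[end:end + flank]
    baseline = mean(flanks) if flanks else 0.0
    ratio = inside / baseline if baseline > 0 else 1.0
    if ratio < 0.25:   # almost no signal: both copies deleted (assumed threshold)
        return 2
    if ratio < 0.75:   # about half the signal: one copy deleted (assumed threshold)
        return 1
    return 0

def genotype_distance(a, b):
    """Sum of absolute genotype differences over the shared SVs."""
    return sum(abs(x - y) for x, y in zip(a, b))

def link(genotyped, reference):
    """Map each anonymized profile ID to the closest individual in the genotype dataset."""
    return {
        pid: min(reference, key=lambda name: genotype_distance(g, reference[name]))
        for pid, g in genotyped.items()
    }

# Toy SV panel: two deletions given as (start, end) positions in the profile.
panel = [(100, 200), (300, 350)]

# Toy signal profiles with a flat depth of 10 outside the deletions.
id1 = [10.0] * 500
id1[100:200] = [0.0] * 100   # both copies of the first deletion missing
id2 = [10.0] * 500
id2[300:350] = [5.0] * 50    # one copy of the second deletion missing
profiles = {"ID1": id1, "ID2": id2}

genotyped = {
    pid: [genotype_deletion(sig, s, e) for s, e in panel]
    for pid, sig in profiles.items()
}

# Hypothetical genotype dataset with known identities.
reference = {"alice": [2, 0], "bob": [0, 1]}
linked = link(genotyped, reference)
```

Once `linked` maps an anonymized ID to a named individual, any sensitive attribute stored alongside that ID in the signal profile dataset is exposed for that person. Large deletions give a clear depth ratio, whereas small deletions with weaker footprints would be harder to call with thresholds like these.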
Availability – The source code for linking attacks and anonymization can be obtained from http://privasig.gersteinlab.org.