1_1_dfb
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| 1_1_dfb [2026/04/14 00:33] – [Oncoarray Consortium Wiki: Dartmouth Edition] 93.95.115.235 | 1_1_dfb [2026/04/15 23:57] (current) – [Integrarray QC Guidelines – November 25, 2025] 93.95.115.235 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ======Oncoarray Consortium Wiki: Dartmouth Edition====== | + | ======Integrarray QC Guidelines – November 25, 2025====== |
| - | These pages are for sharing information among the consortia genotyping the Illumina Oncoarray chip. | ||
| - | There are various [[Oncoarray: | + | ==== 1. Genotype Calling ==== |
| - | ---- | + | Genotypes were called for many samples at the Center for Inherited Disease Research (CIDR). The genotyping platform used was the Global Screening Plus Custom Array (here called the GSA array but also referenced as product by BPM Amos_Custom_20032937X382854_A1, |
| + | ===== 2. Sample QC ===== | ||
| - | ===== Genotype Calling ===== | ||
| - | Information regarding genotype calling at Dartmouth can be found {{: | + | ==== 2.1 Initial call rate filtering (by groups) ==== |
| - | ---- | + | We propose to use less stringent QC filters than those implemented by CIDR. |
| + | Outlined below are the QC steps that will ensure adequate alignment with these processes. | ||
| + | |||
| + | * Exclude samples with call rate <80%. | ||
| + | * Then, exclude SNPs with call rate <80%. | ||
| + | * Then, exclude samples with call rate <95%. | ||
| + | * Then, exclude SNPs with call rate <95%. | ||
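The four filtering steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used: the nested-dictionary input, the function names `call_rate` and `staged_call_rate_filter`, and the use of `None` for a missing call are all assumptions made for the example. In practice these filters would typically be run with a tool such as PLINK.

```python
def call_rate(calls):
    """Fraction of non-missing calls; None marks a missing call (assumed encoding)."""
    return sum(c is not None for c in calls) / len(calls) if calls else 0.0


def staged_call_rate_filter(genotypes, thresholds=(0.80, 0.95)):
    """Apply a sample filter, then a SNP filter, at each threshold in turn.

    genotypes: dict mapping sample ID -> dict mapping SNP ID -> call or None
               (a hypothetical in-memory layout chosen for this sketch).
    Returns (kept_samples, kept_snps) as sorted lists.
    """
    samples = set(genotypes)
    snps = {snp for calls in genotypes.values() for snp in calls}
    for threshold in thresholds:
        # Sample step: call rate is computed over the SNPs still retained.
        samples = {
            s for s in samples
            if call_rate([genotypes[s].get(v) for v in snps]) >= threshold
        }
        # SNP step: call rate is computed over the samples still retained.
        snps = {
            v for v in snps
            if call_rate([genotypes[s].get(v) for s in samples]) >= threshold
        }
    return sorted(samples), sorted(snps)
```

Running the lenient 80% pass first means that a handful of very poor samples or SNPs do not drag otherwise acceptable ones below the 95% threshold in the second pass.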
| - | ===== Positions/ | + | ==== 2.2 Duplicate calling concordance |
| - | This {{: | + | |
| - | The SNPs that have been matched | + | If possible, consortia should share at least 10, |
| - | Descriptions of retrieving your requesting markers | + | For duplicated samples |
| - | The List of Probe Sequences can be found in this file | ||
| - | {{ : | ||
| - | ---- | ||
| - | ===== Quality Control Components ===== | ||
| - | ==== QC Process | + | ==== 2.3 Duplicate concordance |
| - | The QC process is documented {{: | + | |
| - | The number of SNPs excluded at each step is summarized in this {{: | + | Identify duplicates within study. |
| - | === SNP QC Exclusion Lists === | + | Verify expected duplicates: if the duplicate samples |
| - | == Duplicate Probes == | + | |
| - | There are a number of variants on the chip with the same probe in the same position (or a few with the same alleles but the sequence from the opposite strand.) The probe with the worse QC scores and call rate is chosen for exclusion. | + | |
| - | ---- | + | Identify unexpected duplicates within studies: Liaise with study data-managers to attempt to resolve any discrepancies, |
| - | ===== Tracking of Incoming Data ===== | + | |
| - | ---- | + | This step has already been completed by CIDR for samples run by it. |
| - | ===== Link To Proposals ===== | + | |
| - | ---- | + | ==== 2.4 Sex checks ==== |
| + | Check that the reported sex of each participant aligns with the genetically inferred sex. The --check-sex option in PLINK will be used to assess consistency between reported and inferred sex. The pseudoautosomal regions need to be excluded first using the --split-x function. | ||
| + | Males with low Y-intensity can be retained. When there are irreconcilable differences between the reported and inferred sex, we will drop those individuals as they may indicate errors in sample processing. The inbreeding coefficient (F value) derived from X-chromosome heterozygosity, | ||
| - | === SNPs for Sex Checking === | ||
| - | For checking the sex of samples use the 300 Y markers confirmed to work in males and to have a non-autosomal pattern in their cluster plots, seen in this [[http:// | ||
| - | ---- | + | We will include XO, XXY, and XYY karyotypes when they occur but mark these unusual sex chromosome patterns. These karyotypes were likely called correctly in GenomeStudio; |
| + | ==== 2.4 Ancestry ==== | ||
| - | ==== SNPs to Exclude before Imputation ==== | + | A set of ~ 10,000 uncorrelated markers will be selected for ancestry inference. These markers will be used to classify individuals as European/ |
| - | The list used in Cambridge | + | |
| - | ---- | ||
| + | When performing PCA for ancestry inference, we usually exclude the 6p (HLA) region and regions of 8p23 and 17q21 that show inversions and strong LD patterns. There are additional regions with strong LD patterns that should be considered for exclusion. Many of these regions have inversion polymorphisms driving the strong LD patterns (PMID: [[https:// | ||
| - | ==== Completed Analyses | + | ==== 2.5 Heterozygosity |
| - | ---- | + | Exclude samples with heterozygosity <5% or > 40%, or with heterozygosity deviation if p< |
| + | ==== 2.6 Relatives ==== | ||
| - | ==== Lung and Head and Neck Oncoarray project ==== | + | To identify relatives, it is important to establish a set of markers with low fixation index (FST) across major populations. The UNM group will provide a set of these markers from the GSA component of the array. We use KING to identify relatives. For fixed-effects analysis, relatives who share an identity-by-descent (IBD) coefficient greater than 0.125 will be removed to prevent inflation of type I error. For the mixed-effect model, all individuals will be retained as the model accounts |
| - | {{: | + | |
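As an illustration of removing relatives for a fixed-effects analysis, the sketch below drops samples until no retained pair exceeds the 0.125 threshold. It is a hedged example only: the `(id1, id2, coefficient)` input is a hypothetical format rather than KING's actual output layout, the function name `prune_relatives` is made up for the example, and the greedy "drop the most-connected sample first" rule is one common heuristic, not necessarily the rule used by the analysis team.

```python
from collections import defaultdict


def prune_relatives(pairs, threshold=0.125):
    """Return the set of sample IDs to drop so that no retained pair is related.

    pairs: iterable of (id1, id2, coefficient) tuples (hypothetical format).
    A pair counts as related when its coefficient is greater than threshold.
    """
    neighbours = defaultdict(set)
    for a, b, coef in pairs:
        if coef > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    dropped = set()
    while True:
        # Only samples that still have a related partner are candidates.
        candidates = {s: n for s, n in neighbours.items() if n}
        if not candidates:
            break
        # Drop the sample with the most relatives; ties go to the smallest ID
        # so that the result is reproducible.
        worst = max(sorted(candidates), key=lambda s: len(candidates[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        dropped.add(worst)
    return dropped
```

Dropping the most-connected sample first tends to retain more individuals than removing one member of every related pair independently.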
| - | ---- | + | ==== 2.7 Cross study duplicates ==== |
| + | It is useful if there are some shared duplicates across groups. If HapMap controls were studied, we can provide a list of those that were genotyped by CIDR. | ||
| + | |||
| + | ===== 3. SNP QC across groups ===== | ||
| + | |||
| + | |||
| + | ==== 3.1 Call rate ==== | ||
| + | |||
| + | |||
| + | * Exclude SNPs zeroed by the cluster file with no genotypes. | ||
| + | * Exclude samples with call rate <80% | ||
| + | * Exclude SNPs with call rate <80% | ||
| + | * Exclude samples with call rate <95% | ||
| + | * Exclude SNPs with call rate <95% | ||
| + | ==== 3.2 Hardy-Weinberg ==== | ||
| + | Check Hardy-Weinberg: | ||
| + | |||
| + | ===== 4. SNP QC Exclusions Combined Across Groups ===== | ||
| + | |||
| + | ==== 4.1 Combine list of failures ==== | ||
| + | |||
| + | |||
| + | All consortia should exclude SNPs that fail call rate or HWE thresholds in any participating group. | ||
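Combining the per-group failure lists amounts to taking their union. A minimal sketch follows, assuming each group supplies a collection of failing SNP IDs; the group names, SNP IDs, and the function name `combine_failures` are hypothetical. It also records which groups flagged each SNP, which can help when a marker fails in only one group.

```python
def combine_failures(failures_by_group):
    """Map each failing SNP ID to the sorted list of groups that flagged it.

    failures_by_group: dict mapping group name -> iterable of SNP IDs that
                       failed call rate or HWE in that group (assumed input).
    The keys of the returned dict form the combined exclusion list.
    """
    flagged = {}
    for group, snps in failures_by_group.items():
        for snp in snps:
            flagged.setdefault(snp, set()).add(group)
    return {snp: sorted(groups) for snp, groups in flagged.items()}
```

For example, if one group flags `rs1` and `rs2` and another flags `rs2` and `rs3`, all three SNPs end up on the combined exclusion list.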
| + | |||
| + | |||
| + | ==== 4.2 Duplicate probes ==== | ||
| + | |||
| + | |||
| + | There are a number of variants on the chip sharing the same probe position (or, in a few cases, the same alleles but with the sequence from the opposite strand). I will ask CIDR whether they identified any duplicated probes. Note that while the standard GenTrain algorithm excludes triallelic SNPs, this is not strictly necessary, and we may choose to retain these. | ||
| + | |||
| + | ===== 5. Additional Steps Before Imputation ===== | ||
| + | |||
| + | ==== 5.1 Rare SNPs with poor call rate ==== | ||
| + | |||
| + | Exclude SNPs with call rate below 95% and MAF <0.001 in any group from the imputation input files. However, genotyped calls for these SNPs will remain available for analysis. | ||
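The rule above can be sketched as a simple per-group check. This example assumes that both conditions (call rate below 95% and MAF below 0.001) must hold within the same group for a SNP to be excluded from the imputation input; that reading, the `(call_rate, maf)` tuple layout, and the function name `imputation_exclusions` are assumptions for illustration.

```python
def imputation_exclusions(stats_by_group, min_call_rate=0.95, min_maf=0.001):
    """Return SNP IDs to leave out of the imputation input files.

    stats_by_group: dict mapping group name -> dict mapping SNP ID ->
                    (call_rate, maf) tuple (hypothetical layout).
    A SNP is excluded if, in any single group, it is both poorly called
    and rare. Its genotyped calls are not touched by this function.
    """
    excluded = set()
    for stats in stats_by_group.values():
        for snp, (call_rate, maf) in stats.items():
            if call_rate < min_call_rate and maf < min_maf:
                excluded.add(snp)
    return excluded
```

Because the function only returns a list of IDs, the original genotype calls remain available for direct analysis, as the text requires.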
| + | |||
| + | ==== 5.2 Non-ideal cluster plots ==== | ||
| + | |||
| + | SNPs flagged as Possible (P) or Subset interference (S) in the second round of cluster plot checking will be excluded. These are either rare SNPs where there is no clear heterozygote cluster or SNPs with more than three clouds due to interference from other SNPs or potential copy number variation. | ||
| + | |||
| + | |||
| + | ==== 5.3 Imputation standard ==== | ||
| + | |||
| + | Imputation preparation for variants from chromosomes 1–22 and X was performed using the preparation tool (version 4.3.0) available at https:// | ||
| + | |||
| + | ===== 6. Principal Components ===== | ||
| + | |||
| + | PCs will be defined for the Integrarray and validated against some consortium-specific PC definitions. | ||
| + | |||
| + | The figure below describes either i) classification of ancestry using PCA (shown by most likely descent ellipses) or ii) assignment of continental origin to individuals based on the closest location on the continental ancestry triangle. We prefer the latter approach as the ancestry can then be used as a covariate in analyses or for subsequent selection. | ||
| + | {{ : | ||
| + | |||
| + | (All lists will become available on the Integrarray wiki: https:// | ||
| + | |||
| + | For now, please email ciamos@salud.unm.edu to request the information.) | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | ===== Analytical plans ===== | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | ===== Main Integral Site ===== | ||
| + | The Integral analytical team has developed and implemented procedures using a mixed-effect model with the SAIGE program (https:// | ||
| + | |||
| + | KING to estimate the relationships among all the individuals in the sample and so ensure that test statistics for association are not inflated due to inclusion of related individuals (which can reduce standard errors and inflate test statistics). Second, one has to include principal components to adjust for ancestry and other factors that may also cause correlations among samples. To control possible noise, we will include the first 10 principal components. Because we will use data from several different studies in a mega-analysis, | ||
| + | |||
| + | |||
| + | ===== Collaborating sites ===== | ||
| + | |||
| + | |||
| + | We expect that collaborating centers will instead use a fixed-effects model. With fixed-effects modelling, the first step is to remove related individuals. The same approach of identifying markers with low FST across ethnic populations should be applied; otherwise, within-population similarity in allele frequencies will inflate the IBD estimates derived from KING or similar software, leading to an unnecessary reduction in sample size. Second, individuals showing IBD values greater than 0.10 should be removed to reduce false positives caused by relatives decreasing the standard errors. Third, Fastpop or similar programs should be used to classify individuals into disjoint populations that show similar allele frequencies to avoid confounding between allele frequencies and case/ | ||
| + | |||
| + | |||
| + | ===== Covariates and stratifying variables. ===== | ||
| + | |||
| + | |||
| + | We have consistently found that different genetic factors are prominent by histological subgroups. Therefore, we will perform analyses for overall lung cancer and also according to major strata that include adenocarcinoma, | ||
| + | |||
| + | |||
| + | ===== Transcriptome-wide association study (TWAS): ===== | ||
| + | |||
| + | |||
| + | There has been an increasing emphasis on understanding the mechanisms by which genetic factors influence lung cancer risk. In particular, single-cell TWAS has emerged as a very prominent approach for understanding how specific genes and SNPs influence cancer development. We anticipate that a post-GWAS analysis will include both bulk and single-cell TWAS to help with the interpretation of findings from this study. If this is not feasible, we will perform bulk RNA-based TWAS analysis. | ||
| + | |||
| + | |||
| + | ===== Post-GWAS annotation of findings. ===== | ||
| + | |||
| + | |||
| + | Our Harvard collaborators have developed FAVOR (https:// | ||
| + | |||
| + | |||
| + | ===== Molecular validation of findings. ===== | ||
| + | |||
| + | |||
| + | If possible, we will establish collaboration with molecular biologists to validate some of the top new findings from this study. We will provide suggestions for validations to collaborators once the main analysis has been completed so that the molecular biologists have sufficient time to pursue studies. | ||
| + | |||
| + | |||
| + | |||
| + | Status of pipeline: 11/25/2025 | ||
| + | |||
| + | Overall PCA for independent markers: | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | The QC steps prior to imputation were refined based on Chris’ suggestions. | ||
| + | |||
| + | The main change is that unrelated samples, together with 89 CEU samples, were clustered and visually inspected. | ||
| + | |||
| + | Samples within the circle around the CEU centroid (highlighted in red) represent a stringent cutoff for defining a homogeneous European cluster. | ||
| + | |||
| + | The distance to the centroid is somewhat arbitrary, but is intentionally conservative in the current setting. | ||
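The centroid-based cutoff described above can be illustrated with a short sketch: compute the centroid of the reference (CEU) samples in the space of the first two principal components, then keep samples whose Euclidean distance to it is within a chosen radius. The use of two PCs, the coordinate layout, the radius value, and the function names are assumptions for the example; the source does not state the radius actually used.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of (pc1, pc2) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def within_radius(samples, reference_points, radius):
    """Return sorted IDs of samples lying within radius of the reference centroid.

    samples: dict mapping sample ID -> (pc1, pc2) coordinates (assumed layout).
    reference_points: list of (pc1, pc2) for the reference samples, e.g. CEU.
    radius: cutoff distance; a smaller value gives a more stringent cluster.
    """
    cx, cy = centroid(reference_points)
    return sorted(
        sample_id
        for sample_id, (x, y) in samples.items()
        if math.hypot(x - cx, y - cy) <= radius
    )
```

Because the radius is a free parameter, it is worth plotting the selected samples against the full PCA, as done in the figure, before settling on a value.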
| + | |||
| + | |||
| + | No samples were dropped; all samples will be included in the imputation. The marker lists for all steps can be found in the file: {{ : | ||
| + | |||
| + | * **735,370** markers in the original release (chr1–chr23) | ||
| + | * **942** markers with ≥3 discordant calls among 149 duplicate pairs (missing genotypes were not counted) | ||
| + | * **3,552** markers with HWE p < 1×10⁻⁷ in 14,537 unrelated Caucasian controls | ||
| + | * **2,052** markers with HWE p < 1×10⁻¹² in 9,895 unrelated Caucasian cases | ||
| + | * **23,112** markers with call rate < 0.95 | ||
| + | * **27,119** total unique markers affected by the discordant-call, HWE, and call-rate filters | ||
| + | * **708,672** markers remaining after removing the filtered markers | ||
| + | * **632,677** markers retained after running the McCarthy Group Tool workflow against the TOPMed R3 reference panel | ||
| - | ==== Other Pages Containing Information ==== | ||
| - | * [[Lung AffymetrixArray|Lung AffymetrixArray]] | ||
| - | * [[TRICL]] | ||
