Ancestry prediction with clinical exome #202

Hi all!

I was trying to run GenoTools (1.3.2) for ancestry prediction on clinical exomes, so I built my own reference panel from the variants that overlap between the clinical exomes and the WGS data available from gnomAD [HGDP + 1KG](https://gnomad.broadinstitute.org/data#v3-hgdp-1kg-tutorials), then ran GenoTools with the following command:
```
genotools --pfile pf_exome_all_chrs_merged_par \
          --out /net/beegfs-hpc/work/fangz/GP2/pf_exomes/glnexus_joint_calling/genotools/ancestry/pf_exome \
          --full_output True \
          --ancestry \
          --ref_panel /net/beegfshpc/work/fangz/GP2/pf_exomes/glnexus_joint_calling/genotools/ref_panel/genotools_inputs/hgdp_tgp_ref_panel_with_var_id \
          --ref_labels /net/beegfshpc/work/fangz/GP2/pf_exomes/glnexus_joint_calling/genotools/ref_panel/genotools_inputs/hgdp_tgp_ref_panel_labels.txt
```
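For context, the overlap step described above amounts to intersecting variant IDs between the exome call set and the reference panel. The following is only a minimal sketch of that idea, not the script actually used here: the helper names, the one-ID-per-line exome list, and the ID sitting in the third column of a `.pvar`-style reference file are all assumptions made for illustration.

```python
def read_variant_ids(lines, id_column=0):
    """Collect variant IDs from whitespace-delimited lines.

    Lines starting with '#' (e.g. .pvar header lines) and blank lines are skipped.
    `id_column` is an assumption about where the ID lives in each file.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split()[id_column])
    return ids


def overlapping_variants(exome_lines, ref_lines, exome_col=0, ref_col=0):
    """Return the sorted variant IDs present in both inputs."""
    exome_ids = read_variant_ids(exome_lines, exome_col)
    ref_ids = read_variant_ids(ref_lines, ref_col)
    return sorted(exome_ids & ref_ids)


# Toy data standing in for the real files (hypothetical IDs).
exome = ["chr1:1000:A:G", "chr1:2000:C:T", "chr2:500:G:A"]
ref = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1000\tchr1:1000:A:G\tA\tG",
    "chr2\t500\tchr2:500:G:A\tG\tA",
    "chr3\t42\tchr3:42:T:C\tT\tC",
]

shared = overlapping_variants(exome, ref, exome_col=0, ref_col=2)
print(shared)
```

With real files, the same functions can be fed open file handles instead of lists, and the resulting ID list could then be used to subset both datasets (for example with plink2 `--extract`). This only works if both datasets use the same variant ID scheme.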

I ran into the error below: 
```
Labeled Reference Ancestry Counts:
label
AFR    753
EAS    727
SAS    683
EUR    573
AMR    395
MID    137
FIN     92
OCE     27
MDE     13
Name: count, dtype: int64

Getting Common SNPs
/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/utils.py:414: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  bim1 = pd.read_csv(f'{geno_path1}.bim', sep='\t', header=None)
Training Balanced Accuracy: 0.8438636213357675
Training Balanced Accuracy; 95% CI: (0.8170917613141708, 0.8706354813573642)
Best Parameters: {'umap__a': 1.0, 'umap__b': 0.75, 'umap__n_components': 15, 'umap__n_neighbors': 5, 'xgb__lambda': 0.001}
Balanced Accuracy on Test Set: 0.9691176470588235
Balanced Accuracy on Test Set, 95% Confidence Interval: (0.956114602315253, 0.982120691802394)
Traceback (most recent call last):
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/bin/genotools", line 8, in <module>
    sys.exit(handle_main())
             ^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/__main__.py", line 157, in handle_main
    out_dict['ancestry'] = execute_ancestry_predictions(args_dict['geno_path'], args_dict['out'], args_dict, ancestry, tmp_dir)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/pipeline.py", line 106, in execute_ancestry_predictions
    ancestry_dict = ancestry.run_ancestry()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/ancestry.py", line 1129, in run_ancestry
    pred = self.predict_ancestry_from_pcs(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/ancestry.py", line 620, in predict_ancestry_from_pcs
    projected = self.predict_admixed_samples(projected, train_pca)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/genotools/ancestry.py", line 887, in predict_admixed_samples
    birch.fit(cas_train_cluster[['PC1','PC2','PC3']])
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/sklearn/cluster/_birch.py", line 524, in fit
    return self._fit(X, partial=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/sklearn/cluster/_birch.py", line 530, in _fit
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/beegfs-hpc/home/fangz/miniforge3/envs/GenoTools/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1082, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required by Birch.
```

I pushed the pfiles of clinical exomes [here](https://console.cloud.google.com/storage/browser/gtserver-eu-west4-gp2-release-terra/clinical_exome_test) and the variant list [here](https://storage.cloud.google.com/gtserver-eu-west4-gp2-release-terra/clinical_exome_test/clinical_exome_variants.txt).

Note that I made the mistake of not renaming MID to MDE.
However, according to Dan, I am also missing CAS, even if I leave out OCE.
Let me know if there's any file I need to upload to help with troubleshooting!
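The traceback ends with `birch.fit(cas_train_cluster[['PC1','PC2','PC3']])` receiving zero rows, which is consistent with no reference sample carrying the CAS label. A quick pre-flight check on the labels file can surface this before a full run. The sketch below assumes a whitespace-delimited file with no header and the label in the last column; the `required` tuple lists only the labels named in this issue and is not the full set GenoTools expects.

```python
from collections import Counter


def count_labels(lines, label_column=-1):
    """Count ancestry labels, assuming the label is in `label_column` of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[label_column]] += 1
    return counts


def missing_labels(counts, required):
    """Return the required labels that have no samples in the reference panel."""
    return [label for label in required if counts.get(label, 0) == 0]


# Toy rows standing in for the real labels file (hypothetical sample IDs).
toy_labels = [
    "0\tsample_a\tEUR",
    "0\tsample_b\tAFR",
    "0\tsample_c\tMID",
    "0\tsample_d\tMDE",
]

counts = count_labels(toy_labels)
absent = missing_labels(counts, required=("CAS", "MDE"))
print(dict(counts))
print("Missing labels:", absent)
```

On the real file, passing an open handle to `count_labels` should reproduce the "Labeled Reference Ancestry Counts" table from the log, so an unexpected label such as MID, or an absent one such as CAS, stands out immediately.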

Thanks!
Zih-Hua
