Standardize and append a batch of data
Here, we’ll learn how to standardize a less well-curated dataset and how to append it to the growing versioned collection.
import lamindb as ln
import bionty as bt
ln.context.uid = "ManDYgmftZ8C0000"
ln.context.track()
→ connected lamindb: testuser1/test-scrna
→ notebook imports: bionty==0.48.3 lamindb==0.76.2
→ created Transform('ManDYgmftZ8C0000') & created Run('2024-08-26 16:57:15.116189+00:00')
Let’s now consider a less well-curated dataset:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
We are still working with human data, and can globally set an organism:
bt.settings.organism = "human"
curate = ln.Curate.from_anndata(adata, var_index=bt.Gene.symbol, categoricals={adata.obs.cell_type.name: bt.CellType.name})
• 3 non-validated categories are not saved in Feature.name: ['n_genes', 'percent_mito', 'louvain']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
Standardize & validate genes
Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is non-unique: when a symbol matches several IDs, the first match is kept because the keep parameter of .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
• standardized 749/765 terms
! found 5 symbols in Bionty: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
please add corresponding Gene records via `.from_values(['ENSG00000262074', 'ENSG00000233276', 'ENSG00000276168', 'ENSG00000291237', 'ENSG00000254709'])`
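To make the keep="first" behavior concrete, here is a minimal plain-Python sketch. It is not bionty's implementation: the helper standardize_keep_first, the symbols, and the IDs are all made up for illustration, and leaving unmatched symbols unchanged is an assumption of the sketch.

```python
def standardize_keep_first(symbols, candidates):
    """Map each symbol to the first candidate ID; keep symbols without candidates as-is."""
    result = []
    for symbol in symbols:
        ids = candidates.get(symbol)
        result.append(ids[0] if ids else symbol)
    return result


# Hypothetical lookup table: GENE2 maps to two IDs (a non-unique mapping).
candidates = {
    "GENE1": ["ID_0001"],
    "GENE2": ["ID_0002", "ID_0003"],
}
known_ids = {"ID_0001", "ID_0002", "ID_0003"}

standardized = standardize_keep_first(["GENE1", "GENE2", "GENE9"], candidates)

# Boolean mask analogous to a validation step: True where the value is a known ID.
validated = [value in known_ids for value in standardized]
```

In this toy example, GENE9 has no candidate ID, so it fails the membership check. This mirrors why the cell above validates after standardizing and keeps only the validated genes.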
Here, we’ll also keep .raw, subset to the validated genes and re-indexed by Ensembl ID:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curate = ln.Curate.from_anndata(adata_validated, var_index=bt.Gene.ensembl_gene_id, categoricals={"cell_type": bt.CellType.name})
• 3 non-validated categories are not saved in Feature.name: ['n_genes', 'percent_mito', 'louvain']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
curate.validate()
✓ var_index is validated against Gene.ensembl_gene_id
• mapping cell_type on CellType.name
! 9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
→ save terms via .add_new_from('cell_type')
False
curate.add_validated_from_var_index()
Standardize & validate cell types
Since none of the cell types are validated, let’s search for the cell type names in the public ontology and add each name found in the AnnData object as a synonym of its top match.
bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
! CellType records from source (cl, 2024-05-15) are already in the database!
→ pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
! CellType records from source (cl, 2024-05-15) are already in the database!
→ pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✓ loaded 1 CellType record matching ontology_id: 'CL:0000860'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
! CellType records from source (cl, 2024-05-15) are already in the database!
→ pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002051'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000952'
! CellType records from source (cl, 2024-05-15) are already in the database!
→ pass `update=True` to update the records
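The loop above trusts the top search hit for each name. As a rough illustration of that idea, the sketch below mimics it with Python's standard difflib module. It is not how bionty's search() ranks results; the list of ontology names and the helper top_match are hypothetical and chosen only for the example.

```python
import difflib

# Hypothetical stand-in for a list of public ontology names.
ontology_names = ["dendritic cell", "B cell", "natural killer cell", "monocyte"]


def top_match(query, names):
    """Return the name most similar to `query` (case-insensitive), or None if `names` is empty."""
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[hits[0]] if hits else None


name_mapper = {}  # original name -> top-matching ontology name
synonyms = {}     # ontology name -> original names recorded as synonyms
for raw_name in ["Dendritic cells", "CD19+ B"]:
    match = top_match(raw_name, ontology_names)
    name_mapper[raw_name] = match
    synonyms.setdefault(match, []).append(raw_name)
```

A top hit is not guaranteed to be the right term, especially for marker-style names such as 'CD19+ B', so it is worth reviewing name_mapper before saving any synonyms.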
We can now standardize cell type names using the search-based mapper:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
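One thing to keep in mind: with a dict mapper, pandas' Series.map returns NaN for any value that has no key in the mapping. A minimal pre-check can be sketched in plain Python; the helper unmapped_names is hypothetical (not part of lamindb), and the labels and mapping targets are illustrative only.

```python
def unmapped_names(values, mapper):
    """Return the distinct values that have no key in `mapper`, in first-seen order."""
    missing = []
    for value in values:
        if value not in mapper and value not in missing:
            missing.append(value)
    return missing


# Illustrative labels and mapping; the targets are placeholders, not curated results.
labels = ["CD19+ B", "CD34+", "CD19+ B", "unknown"]
mapper = {"CD19+ B": "B cell", "CD34+": "progenitor cell"}

missing = unmapped_names(labels, mapper)
```

If the returned list is empty, applying the mapper will not introduce missing values. In the workflow above, name_mapper is built from the same unique values it is applied to, so every name has a key.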
Now, all cell types are validated:
curate.validate()
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
True
Register
artifact = curate.save_artifact(description="10x reference adata")
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/HPUFoORkUU7F3gh80000.h5ad')
✓ storing artifact 'HPUFoORkUU7F3gh80000' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/HPUFoORkUU7F3gh80000.h5ad'
• parsing feature names of X stored in slot 'var'
✓ 749 terms (100.00%) are validated for ensembl_gene_id
✓ linked: FeatureSet(uid='NCr6WEtpoXCU2w1mARr7', n=749, dtype='float', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4FwgA', created_by_id=1, run_id=2)
• parsing feature names of slot 'obs'
✓ 1 term (25.00%) is validated for name
! 3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✓ linked: FeatureSet(uid='UiOFXFFS4F6BkgOAvWCr', n=1, registry='Feature', hash='czSO5pGKlZp1hI1QB8ftGQ', created_by_id=1, run_id=2)
✓ saved 2 feature sets for slots: 'var','obs'
artifact.view_lineage()
Append the dataset to the collection
Query the previous collection:
collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")
Create a new version of the collection by sharding it across the new artifact
and the artifact underlying version 1 of the collection:
collection_v2 = ln.Collection(
    [artifact, collection_v1.artifacts.all()[0]],
    revises=collection_v1,
).save()
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding artifact ids [1] as inputs for run 2, adding parent transform 1
If you want, you can label the collection’s version by setting .version:
collection_v2.version = "2"
collection_v2.save()
Collection(uid='scwQcBg7xRPKgDuD0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='dBJLoG6NFZ8WwlWqnfyFdQ', visibility=1, created_by_id=1, transform_id=2, run_id=2, updated_at='2024-08-26 16:57:38 UTC')
Version 2 of the collection covers significantly more conditions.
collection_v2.describe()
Collection(uid='scwQcBg7xRPKgDuD0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='dBJLoG6NFZ8WwlWqnfyFdQ', visibility=1, updated_at='2024-08-26 16:57:38 UTC')
Provenance
.created_by = 'testuser1'
.transform = 'Standardize and append a batch of data'
.run = '2024-08-26 16:57:15 UTC'
Feature sets
'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
'obs' = 'donor', 'tissue', 'cell_type', 'assay'
View data lineage:
collection_v2.view_lineage()