API: Preprocessing
sc.pp.filter_cells()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.filter_cells.html#scanpy.pp.filter_cells
Filter cells based on the number of expressed genes and/or the number of UMIs
sc.pp.filter_cells(data=, min_genes=, max_genes=, min_counts=, max_counts=, inplace=)
data=: the AnnData object used
min_genes=: minimum number of expressed genes required to pass the filter; default is "None"
max_genes=: maximum number of expressed genes allowed to pass the filter; default is "None"
min_counts=: minimum number of UMIs required to pass the filter; default is "None"
max_counts=: maximum number of UMIs allowed to pass the filter; default is "None"
inplace=: whether to modify the data in place rather than returning the result; default is "True"
sc.pp.filter_genes()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.filter_genes.html#scanpy.pp.filter_genes
Filter genes based on the number of cells expressing them and/or the number of UMIs
sc.pp.filter_genes(data=, min_cells=, max_cells=, min_counts=, max_counts=, inplace=)
data=: the AnnData object used
min_cells=: minimum number of cells expressing the gene required to pass the filter; default is "None"
max_cells=: maximum number of cells expressing the gene allowed to pass the filter; default is "None"
min_counts=: minimum number of UMIs required to pass the filter; default is "None"
max_counts=: maximum number of UMIs allowed to pass the filter; default is "None"
inplace=: whether to modify the data in place rather than returning the result; default is "True"
sc.pp.calculate_qc_metrics()
calculate quality-control metrics for cells and genes
sc.pp.calculate_qc_metrics(adata, qc_vars=, percent_top=, log1p=, inplace=)
qc_vars=: the .var column name(s) used to compute per-category metrics (e.g. 'mt' for mitochondrial genes)
percent_top=: the numbers of top-ranked genes for which to compute the cumulative expression percentage
log1p=: whether to also compute log1p-transformed metrics; default is True
inplace=: whether to store the metrics in the AnnData object instead of returning them; default is False
The computed metrics are stored as new columns in .obs (per cell) and .var (per gene)
sc.pp.normalize_total()
normalize the expression data (expression matrix)
sc.pp.normalize_total(adata, target_sum=, inplace=True)
target_sum=: the total count each cell is scaled to after normalization; target_sum=1e6 yields CPM; with target_sum=None, each cell is scaled so that its total equals the median of per-cell UMI counts before normalization; default=None
inplace=: whether to replace the original data with the normalized data; default=True
sc.pp.log1p()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.log1p.html
apply a log1p transformation (log(1 + x)) to the expression data
sc.pp.log1p(data=)
sc.pp.scale()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.scale.html#scanpy.pp.scale
scale the data to Z-scores (zero mean, unit variance per gene)
sc.pp.scale(adata, zero_center=, layer=, max_value=)
zero_center=: whether to shift the mean of each gene to 0; default=True
layer=: the matrix to scale; default=None, in which case .X will be scaled
max_value=: after scaling, values above this cutoff will be clipped; default=None (no clipping)
sc.pp.highly_variable_genes()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html
identify highly variable genes based on expression dispersion (normalized variance); a normalized and logarithmized matrix is expected for the calculation
sc.pp.highly_variable_genes(adata, layer=, flavor=, n_top_genes=, min_mean=, max_mean=, min_disp=, max_disp=)
layer=: the expression matrix to use; default is None, in which case AnnData.X is used for the calculation
flavor=: the method used to compute expression dispersion; default='seurat'; other choices include 'cell_ranger', 'seurat_v3' and 'seurat_v3_paper'
n_top_genes=: the number of top highly variable genes to select; when this parameter is set, all the following cutoff parameters are ignored
min_mean=: the minimum cutoff of mean expression; default=0.0125
max_mean=: the maximum cutoff of mean expression; default=3
min_disp=: the minimum cutoff of normalized dispersion; default=0.5
max_disp=: the maximum cutoff of normalized dispersion; default=inf
After calculation, the results are stored in AnnData.var
Note: for flavor='seurat'/'cell_ranger', normalized and log1p-transformed data are required
sc.pp.regress_out()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.regress_out.html#scanpy.pp.regress_out
Regress out unwanted sources of variation
sc.pp.regress_out(adata, layer=, keys=)
layer=: the expression matrix to use; default=None, in which case .X will be used
keys=: the .obs features to regress out; e.g. ['total_counts', 'pct_counts_mt']
sc.pp.pca()
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.pca.html#scanpy.pp.pca
PCA, linear dimensional reduction
sc.pp.pca(data=, n_comps=, layer=, svd_solver=)
n_comps=: the number of PCs to compute
layer=: the expression matrix to use; default=None, in which case .X will be used
svd_solver=: the SVD solver to use
The results are stored in .obsm; use .obsm['X_pca'] to extract them
sc.pp.neighbors()
https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.neighbors.html#
compute the K-nearest-neighbor (KNN) graph
This KNN graph is widely used in tSNE/UMAP, clustering and pseudotime trajectory inference
sc.pp.neighbors(adata, n_pcs=, n_neighbors=, method=, use_rep=)
n_pcs=: the number of top PCs to use; default=None
n_neighbors=: the size of the local neighborhood; default=15
method=: the method used to compute connectivities; default='umap'
use_rep=: the field of .obsm used to compute the KNN graph; default=None, in which case .obsm['X_pca'] is used; for integrated data sets, .obsm['X_pca_harmony'] should be used by assigning use_rep='X_pca_harmony'
The results are stored in .uns and .obsp
Single data set: sc.pp.neighbors(scanpy_object, n_neighbors=10, n_pcs=40)
Integrated data sets: sc.pp.neighbors(scanpy_object, n_neighbors=10, n_pcs=40, use_rep='X_pca_harmony')
sc.pp.scrublet()
https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.scrublet.html
Predict doublets using Scrublet
This step is best run on the raw expression matrix, so it should be performed before normalization
sc.pp.scrublet(adata, expected_doublet_rate=, batch_key=)
expected_doublet_rate=: the expected proportion of doublets, typically 5%~10%; default=0.06
batch_key=: the .obs column defining batches; the function is run separately for each batch
After the run, .obs gains a doublet_score column (the higher the score, the more likely the cell is a doublet), along with Scrublet's per-cell doublet call in the predicted_doublet column (True = doublet, False = singlet)


