Parameters
Network parameters
In the example, the network operates on blocks at two levels:
- The feature level: the number of genes \(n_g\), the number of ADTs \(n_a\), and the number of peaks \(n_p\). The related parameters are `dim_input_arr` and `uni_block_names`, which have the same length.
- The subconnection level: the number of genes (\(n_g\)), the number of ADTs (\(n_a\)), the number of peaks in chromosome 1 (\(n_p^1\)), the number of peaks in chromosome 2 (\(n_p^2\)), etc. The related parameters include `dist_block`, `dim_block_embed`, `dim_block_enc`, `dim_block_dec`, and `block_names`.
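As a rough illustration of the two levels, the sketch below builds the feature-level and block-level lists from made-up feature counts. The counts, the label strings, and the variable `chunk_atac` values are assumptions chosen for illustration (the per-block size list is the `dim_block` parameter described further down). Plain Python lists are used here; they can be wrapped in NumPy arrays where the model expects them.

```python
# Hypothetical feature counts, for illustration only.
n_genes, n_adts = 2000, 200
chunk_atac = [1500, 1200, 900]  # hypothetical number of peaks per chromosome

# Feature level: one entry per modality.
dim_input_arr = [n_genes, n_adts, sum(chunk_atac)]
uni_block_names = ["rna", "adt", "atac"]  # hypothetical labels

# Subconnection level: one entry per block (genes, ADTs, one block per chromosome).
dim_block = [n_genes, n_adts] + chunk_atac
block_names = ["rna", "adt"] + ["atac"] * len(chunk_atac)  # hypothetical labels
dist_block = ["NB", "NB"] + ["Bernoulli"] * len(chunk_atac)

# Feature-level lists share one length; block-level lists share another.
print(len(dim_input_arr), len(uni_block_names))
print(len(dim_block), len(block_names), len(dist_block))
```

Both levels describe the same features, so the two size lists should sum to the same total.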
The parameters are explained below:
- `dim_input_arr` represents the size of the input features. In the example, it is simply \([n_g, n_a, n_p]\).
- `dim_block` represents the number of features in each subconnected block across all modalities (assuming that the features have been rearranged accordingly). In the example, it is \([n_g, n_a, n_p^1, n_p^2, \ldots]\).
- `dist_block`: Four distributions are implemented: ‘NB’, ‘ZINB’, ‘Bernoulli’, and ‘Gaussian’, for negative binomial, zero-inflated negative binomial, Bernoulli, and Gaussian, respectively. However, only ‘NB’ and ‘Bernoulli’ were tested and used to generate the results for the paper. ‘Bernoulli’ is used for ATAC-seq data, and ‘NB’ is used for genes and proteins.
- `dim_block_embed` represents the embedding dimension of the binary mask. For example, `dim_block_embed = [1, 2, 3, ...]` means the mask will be embedded into a continuous vector of dimension 1 for block 1, dimension 2 for block 2, and so on.
- `dim_block_enc` represents the structure of the first latent layer of the encoder. Using skip-connection helps reduce memory and computational complexity. In the example, `dim_block_enc = np.array([256, 128] + [16 for _ in chunk_atac])` means that the genes will be embedded into a vector of dimension 256, the ADTs into a vector of dimension 128, and each block of peaks into a vector of dimension 16. For block `i`, we have a sub-network that takes both the features of size `dim_input_arr[i]` and the mask embedding of size `dim_block_embed[i]` and outputs a vector of size `dim_block_enc[i]`. After that, the embedding vectors of all blocks are concatenated into a single vector that serves as the input to the encoder.
- Similarly, `dim_block_dec` represents the structure of the last latent layer of the decoder. For block `i`, we have a sub-network that takes latent features of size `dim_block_dec[i]` and outputs a vector (the predicted features) of size `dim_input_arr[i]`.
- `dimensions` and `dim_latent` specify the network structure in the middle. For example, `dimensions = [256, 128]` and `dim_latent = 32` mean that we have a network \(n_{in}-256-128-32-128-256-n_{out}\), where \(n_{in}\) is the sum of `dim_block_enc` and \(n_{out}\) is the sum of `dim_block_dec`.
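To make the \(n_{in}-256-128-32-128-256-n_{out}\) layout concrete, the following sketch derives the layer widths from the block-level settings. It is only bookkeeping, not the model code: `chunk_atac` holds made-up peak counts, and `dim_block_dec` is assumed to mirror `dim_block_enc`, which the text does not state.

```python
chunk_atac = [1500, 1200, 900]  # hypothetical number of peaks per chromosome

dim_block_enc = [256, 128] + [16 for _ in chunk_atac]
dim_block_dec = [256, 128] + [16 for _ in chunk_atac]  # assumption: mirrors the encoder
dimensions = [256, 128]
dim_latent = 32

# The per-block embeddings are concatenated, so the widths simply add up.
n_in = sum(dim_block_enc)
n_out = sum(dim_block_dec)

# Encoder widths, the latent layer, then the decoder widths in reverse order.
layer_widths = [n_in] + dimensions + [dim_latent] + dimensions[::-1] + [n_out]
print("-".join(str(w) for w in layer_widths))  # 432-256-128-32-128-256-432
```

With three peak blocks of width 16 each, \(n_{in} = 256 + 128 + 3 \times 16 = 432\); adding more chromosomes grows \(n_{in}\) and \(n_{out}\) by 16 per block under these settings.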
Hyperparameters
Some of the important hyperparameters are:
- `beta_unobs` represents the weight for unobserved features. `beta_unobs=0.5` by default.
- `beta_reverse` represents the weight for the reverse prediction loss (using unobserved features to predict observed ones). `beta_reverse=0` by default.
- `beta_kl` represents the weight for the KL divergence loss. `beta_kl=2` by default.
- `skip_conn` represents whether to use skip connections between the encoder and decoder, which is useful for imputation but may hurt latent representation learning. `skip_conn=False` by default.
- `p_feat` represents the probability of masking individual features. A larger value of `p_feat` encourages imputation ability but also requires more training epochs to reach good performance. Its influence is small when training for enough epochs, so we recommend fixing `p_feat` at any reasonable value, e.g. 0.2. `p_feat=0.2` by default.
- `p_modal` represents the probability of masking out one modality. It is set to a uniform distribution by default.
- `mean_vals`, `min_vals`, and `max_vals`. By default, for Gaussian features, `mean_vals` is the observed means, `min_vals` is \(\min\{\text{minimum of observed values of peptide } i,\ \text{mean} - 3\sigma \text{ of observed values of peptide } i\}\), and `max_vals` is defined analogously. For Poisson and negative binomial features, `mean_vals` is not used, `min_vals` is zero, and `max_vals` is the observed maximum of the corresponding block.
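The default bounds for Gaussian features can be sketched as follows. This is a minimal illustration of the rule described above, not the package's own code: the helper `gaussian_bounds` is hypothetical, unobserved entries are marked with `None`, the population standard deviation is assumed for sigma, and "defined analogously" is read as taking the maximum of the observed maximum and mean + 3 * sigma.

```python
from statistics import fmean, pstdev


def gaussian_bounds(columns):
    """Return (mean_vals, min_vals, max_vals) for a list of per-feature value lists.

    Hypothetical helper: None marks an unobserved entry and is ignored.
    """
    mean_vals, min_vals, max_vals = [], [], []
    for col in columns:
        obs = [v for v in col if v is not None]
        mu = fmean(obs)
        sigma = pstdev(obs)  # assumption: population standard deviation
        mean_vals.append(mu)
        min_vals.append(min(min(obs), mu - 3 * sigma))
        max_vals.append(max(max(obs), mu + 3 * sigma))
    return mean_vals, min_vals, max_vals


# Two made-up features with one missing entry each.
columns = [[1.0, 2.0, None, 3.0], [10.0, None, 10.0, 14.0]]
mean_vals, min_vals, max_vals = gaussian_bounds(columns)
print(mean_vals, min_vals, max_vals)
```

By construction the bounds always enclose every observed value, and they extend beyond the observed range whenever mean ± 3 * sigma lies outside it.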
The values used in the example are reasonable starting points, but the following parameter requires some care depending on your data:
- `beta_modal` represents the importance of each modality. Run the model on your dataset for a few epochs and pick `beta_modal` such that the likelihoods (which are printed during training) of all modalities are roughly of the same order of magnitude. Notably, the number of peaks is generally very large, so its likelihood has a larger value. That is why it has a small weight of 0.01 in the example, where `beta_modal = [0.14, 0.85, 0.01]`.
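One possible way to turn this advice into numbers is to weight each modality inversely to the magnitude of its printed likelihood and normalize the weights to sum to one. This is a heuristic sketch, not a procedure prescribed by the package: the helper `suggest_beta_modal` is hypothetical, the printed values are assumed to be unweighted, and the numbers below are invented (picked so that the result resembles the example's weights), not real training output.

```python
def suggest_beta_modal(likelihoods):
    """Weights inversely proportional to each modality's likelihood magnitude.

    Hypothetical helper; the weights are normalized to sum to 1.
    """
    inverse = [1.0 / abs(v) for v in likelihoods]
    total = sum(inverse)
    return [w / total for w in inverse]


# Invented per-modality likelihood magnitudes (genes, ADTs, peaks).
printed = [600.0, 100.0, 8000.0]
beta_modal = suggest_beta_modal(printed)

# After weighting, every modality contributes the same amount.
weighted = [b * v for b, v in zip(beta_modal, printed)]
print([round(b, 2) for b in beta_modal])  # [0.14, 0.85, 0.01]
```

In practice it is enough for the weighted values to agree in order of magnitude, so rounding the suggested weights or adjusting them by hand is fine.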