graphico package

Submodules

graphico.cli module

graphico.graphico module

Created on Wed Jan 8 13:37:56 2020 @author: karliskanders Last updated on 01/04/2020

class graphico.graphico.ConsensusClustering(graph, N=20, N_consensus=10, verbose=True, seed=None, edge_bootstrap=False)[source]

Bases: object

Class for determining a stable clustering of data using a 3-step process. First, an ensemble of clustering results is generated by repeatedly applying a clustering algorithm (step 1). Then, the ensemble is used to define new edge weights between the graph nodes based on how often data points are clustered together. These weights are used to generate another “consensus ensemble”, which in practice is very stable and exhibits only minor variations between clustering runs (step 2). To decide which partition in the “consensus ensemble” should be designated as the final consensus partition, we use adjusted mutual information to compare all partitions within the consensus ensemble and choose the one that agrees best with all the other partitions (step 3). Presently, we use the Leiden community detection algorithm to cluster the graph into communities; however, this class can be adapted to use other graph-based clustering algorithms. The consensus clustering approach used here is an adapted version of the intuitively simple but well-performing “Ensemble Clustering for Graphs” method by Poulin & Theberge (see https://arxiv.org/abs/1809.05578).

property COOC

Co-clustering occurrence matrix; element (i,j) of this matrix indicates how many times nodes i and j were clustered together.

consensus_communities()[source]

Method for finding the consensus clustering partition, i.e., for steps 2 and 3 of the clustering procedure.

property consensus_ensemble

List of consensus clustering results (pertaining to step 2 of the clustering procedure) where each clustering result is a list of integers. These integers correspond to cluster labels.

property consensus_partition

Final consensus partition of the clustering procedure

static cooccurrence_matrix(ensemble)[source]

Create the co-clustering occurrence matrix (also called the ‘cooccurrence matrix’). This can be quite slow for large graphs with ~10K nodes and could probably be optimized, e.g., with numba.

Parameters

ensemble (list of lists of int) – List of clustering results, where each clustering result is a list of integers. These integers correspond to cluster labels.
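The counting behind such a matrix can be sketched as follows. This is a pure-Python illustration using plain lists rather than numpy arrays, not graphico's implementation; the function name `cooccurrence_counts` is hypothetical, and counting each node as co-clustered with itself (so the diagonal equals the ensemble size) is an assumption.

```python
def cooccurrence_counts(ensemble):
    """Count how often each pair of nodes receives the same cluster label."""
    n_nodes = len(ensemble[0])
    cooc = [[0] * n_nodes for _ in range(n_nodes)]
    for labels in ensemble:
        for i in range(n_nodes):
            for j in range(n_nodes):
                # Assumption: a node always co-clusters with itself,
                # so the diagonal ends up equal to len(ensemble).
                if labels[i] == labels[j]:
                    cooc[i][j] += 1
    return cooc


# Two clustering results for three nodes
ensemble = [[0, 0, 1], [0, 1, 1]]
cooc = cooccurrence_counts(ensemble)
for row in cooc:
    print(row)
```

The triple loop makes the quadratic cost per clustering result explicit, which is consistent with the note above that the computation can be slow for graphs with around 10K nodes.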

create_ensemble(N=None, weights='weight')[source]

Generates an ensemble of clustering partitions by repeatedly applying a clustering algorithm.

Parameters

  • N (int OR None) – Ensemble size for the first clustering step. If N is None, the class property self.N is used.

  • weights (string OR None) – Edge property to use for the community detection.

Returns

List of clustering results, where each clustering result is a list of integers. These integers correspond to cluster labels.

Return type

ensemble (list of lists of int)

static describe_partition(partition, verbose=True)[source]

Describes the number of clusters and the number of nodes in each cluster
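A minimal sketch of this kind of summary, assuming a partition is a flat list of integer cluster labels as elsewhere on this page. The helper name `cluster_sizes` is hypothetical, and the actual output format of describe_partition is not shown in this reference.

```python
from collections import Counter


def cluster_sizes(partition):
    """Map each cluster label to the number of nodes carrying it."""
    return dict(sorted(Counter(partition).items()))


partition = [0, 0, 1, 2, 2, 2]
sizes = cluster_sizes(partition)
print(f"Number of clusters: {len(sizes)}")
for label, size in sizes.items():
    print(f"Cluster {label}: {size} nodes")
```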

property ensemble

List of clustering results (pertaining to step 1 of the clustering procedure), where each clustering result is a list of integers. These integers correspond to cluster labels.

static ensemble_AMI(P, v=True)[source]

Calculates pairwise adjusted mutual information (AMI) scores across the clustering ensemble.

Parameters

  • P (list of lists of int) – Clustering ensemble, i.e., a list of clustering results, where each clustering result is a list of integers. These integers correspond to cluster labels.

  • v (boolean) – Determines whether information about the results is printed.

Returns

  • ami_avg (float) – Average adjusted mutual information across the ensemble

  • ami_matrix (numpy.ndarray) – The complete matrix with adjusted mutual information scores between all pairs of clustering results
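Given a pairwise score matrix like ami_matrix, step 3 of the class description (choosing the partition that agrees best with all the others) could look roughly like the sketch below. The function name `most_representative` and the example scores are made up for illustration, and excluding the diagonal from the average is an assumption rather than documented behaviour.

```python
def most_representative(ami_matrix):
    """Return the index of the partition with the highest mean score
    against all other partitions, plus the per-partition mean scores."""
    n = len(ami_matrix)
    mean_scores = []
    for k in range(n):
        # Assumption: skip the diagonal (each partition compared with itself)
        others = [ami_matrix[k][m] for m in range(n) if m != k]
        mean_scores.append(sum(others) / len(others))
    best = max(range(n), key=lambda k: mean_scores[k])
    return best, mean_scores


# Illustrative (made-up) pairwise scores for three partitions
ami_matrix = [
    [1.0, 0.9, 0.7],
    [0.9, 1.0, 0.8],
    [0.7, 0.8, 1.0],
]
best, mean_scores = most_representative(ami_matrix)
print(best, mean_scores)
```

The sketch needs at least two partitions, since a single partition has nothing to be compared against.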

load_ensemble(ensemble, consensus=False)[source]

This method can be used to load an external ensemble. For example, you might have stored an ensemble of clustering results from a previous analysis and would now like to recalculate the consensus partition.

Parameters

  • ensemble (list of lists of int) – List of clustering results, where each clustering result is a list of integers. These integers correspond to cluster labels.

  • consensus (boolean) – Determines whether the ensemble should be treated as the initial ensemble (from step 1) or the consensus ensemble (from step 2).

graphico.graphico.build_graph(similarity_matrix, kNN=None, self_connections=False)[source]

Builds an igraph.Graph from the provided similarity matrix and, optionally, an adjacency matrix.

Parameters

  • similarity_matrix (numpy.ndarray) – Matrix with similarity values for each pair of nodes/data points (diagonal is assumed to be 1).

  • kNN (int OR numpy.ndarray OR None) – If kNN is an int, this method builds a k-nearest neighbour graph with k=kNN; if kNN is a matrix, it is assumed to be the adjacency matrix of the graph; if kNN is None, we allow all possible connections when creating the graph.

  • self_connections (boolean) – Determines whether self connections are included as part of the k-nearest neighbours.

Returns

Undirected graph where edges have property ‘weight’ corresponding to the values of the ‘similarity_matrix’.

Return type

g (igraph.Graph)

graphico.graphico.build_kNN_matrix(similarity_matrix, kNN, self_connections=False)[source]

Method for building a k-nearest neighbour adjacency matrix.

Parameters

  • similarity_matrix (numpy.ndarray) – Matrix with similarity values for each pair of nodes/data points.

  • kNN (int) – Number of nearest neighbours for each node/data point.

  • self_connections (boolean) – Determines whether self connections are included as part of the k-nearest neighbours.

Returns

Binary matrix with 1s for each node’s k-nearest neighbours and 0s otherwise.

Return type

kNN_matrix (numpy.ndarray)
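The idea of a binary k-nearest neighbour matrix can be illustrated with the pure-Python sketch below (plain lists instead of numpy arrays). It is not graphico's code: the name `knn_matrix` is hypothetical, ties are broken by node index, and the result is left unsymmetrised because this reference does not say how asymmetric neighbour relations are handled.

```python
def knn_matrix(similarity, k, self_connections=False):
    """Mark, for every node, its k most similar nodes with 1s."""
    n = len(similarity)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Optionally exclude the node itself from its own neighbour list
        candidates = [j for j in range(n) if self_connections or j != i]
        # Highest similarity first; sorted() is stable, so ties keep index order
        nearest = sorted(candidates, key=lambda j: similarity[i][j], reverse=True)[:k]
        for j in nearest:
            adjacency[i][j] = 1
    return adjacency


similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.3],
    [0.1, 0.8, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
]
adjacency = knn_matrix(similarity, k=1)
for row in adjacency:
    print(row)
```

In this example node 2 picks node 1 as its nearest neighbour, while node 1 picks node 0, so the matrix is not symmetric.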

graphico.graphico.cluster_affinity_matrix(M, cluster_labels, symmetric=True, plot=True, cmap='Blues')[source]

Calculate each cluster’s affinity to other clusters based on their constituent nodes’ affinities to the different clusters.

Parameters

  • M (numpy.ndarray) – Node affinity matrix.

  • cluster_labels (list of int) – Clustering partition with integers denoting cluster labels.

  • symmetric (boolean) – If True, ensures that the cluster affinity matrix is symmetric.

  • plot (boolean) – Determines whether the cluster affinity matrix is displayed.

Returns

Cluster affinity matrix, where element (k,l) indicates the average co-clustering occurrence of cluster k nodes with the nodes of cluster l.

Return type

C (numpy.ndarray)
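One plausible way to aggregate node affinities into cluster affinities is sketched below in pure Python. The function name `cluster_affinity`, the assumption that the columns of M follow the sorted cluster labels, and symmetrising by averaging the matrix with its transpose are all assumptions made for illustration; the source does not state how symmetry is enforced.

```python
def cluster_affinity(M, cluster_labels, symmetric=True):
    """Average the node affinity rows of each cluster's member nodes."""
    clusters = sorted(set(cluster_labels))
    C = []
    for k in clusters:
        members = [i for i, label in enumerate(cluster_labels) if label == k]
        # Mean affinity of cluster k's nodes to every cluster (column of M)
        row = [sum(M[i][c] for i in members) / len(members)
               for c in range(len(clusters))]
        C.append(row)
    if symmetric:
        # Assumption: symmetrise by averaging with the transpose
        n = len(C)
        C = [[(C[a][b] + C[b][a]) / 2 for b in range(n)] for a in range(n)]
    return C


# Node affinity matrix: 3 nodes (rows) x 2 clusters (columns)
M = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.4, 0.6],
]
labels = [0, 0, 1]
C = cluster_affinity(M, labels)
C_raw = cluster_affinity(M, labels, symmetric=False)
print(C)
```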

graphico.graphico.collect_subclusters(l, fpath, session_name, n_total=None)[source]

Method that collects the sub-clusters of clustering hierarchy level l into one table, thus yielding the clusters of clustering hierarchy level l+1. Note that the tables describing the sub-clusters need to follow a specific naming convention.

Parameters

  • l (int) – Level of the clustering hierarchy whose sub-clusters we wish to collect.

  • fpath (string) – File path where the clustering results are saved.

  • session_name (string) – Name of the clustering session that is being analysed.

  • n_total (int OR None) – Total number of data points/nodes in the dataset.

Returns

Final partition of the whole dataset of the level l+1 of the clustering hierarchy. Contains two columns ‘id’ and ‘cluster’ where ‘id’ are the original node IDs and ‘cluster’ is the cluster label. This table is also stored as a CSV table in the same folder as the sub-cluster tables.

Return type

partition (pandas.DataFrame)

graphico.graphico.list_cluster_stability(C, cluster_labels=None)[source]

Prints out the diagonal values of cluster affinity matrix, which can be used as a measurement of cluster stability.

graphico.graphico.node_affinity(cooc_matrix, cluster_labels, normalise=True)[source]

Estimate each node’s affinity to the different clusters based on the ensemble clustering results (if normalise==True, then this can be interpreted as the probability of a node belonging to a particular cluster).

Parameters

  • cooc_matrix (numpy.ndarray) – Co-clustering occurrence matrix (see also ConsensusClustering.COOC).

  • cluster_labels (list of int) – Clustering partition with integers denoting cluster labels.

  • normalise (boolean) – Determines whether node affinities to clusters are normalised by the sum of rows.

Returns

Node affinity matrix with rows corresponding to nodes and columns to clusters. Matrix elements (i,c) indicate the average co-clustering occurrence value between node i and all other nodes in the cluster c (in terms of either absolute or normalised values).

Return type

M (numpy.ndarray)
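The description above (average co-clustering occurrence between node i and all other nodes in cluster c, optionally normalised by the row sum) can be sketched in pure Python as follows. The name `node_affinity_sketch` is hypothetical, columns are assumed to follow the sorted cluster labels, and returning 0 for a cluster that contains no node other than i is an assumption.

```python
def node_affinity_sketch(cooc, cluster_labels, normalise=True):
    """Average co-clustering occurrence of each node with each cluster."""
    clusters = sorted(set(cluster_labels))
    M = []
    for i in range(len(cooc)):
        row = []
        for c in clusters:
            # All nodes of cluster c except node i itself
            others = [j for j, label in enumerate(cluster_labels)
                      if label == c and j != i]
            row.append(sum(cooc[i][j] for j in others) / len(others)
                       if others else 0.0)
        if normalise:
            total = sum(row)
            row = [value / total if total else 0.0 for value in row]
        M.append(row)
    return M


# Co-clustering counts for 4 nodes over 10 clustering runs (made-up values)
cooc = [
    [10, 8, 2, 0],
    [8, 10, 4, 2],
    [2, 4, 10, 6],
    [0, 2, 6, 10],
]
labels = [0, 0, 1, 1]
M_abs = node_affinity_sketch(cooc, labels, normalise=False)
M = node_affinity_sketch(cooc, labels)
print(M_abs)
print(M)
```

With normalisation, each row sums to 1, which is what allows the probability-like reading mentioned in the description.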

graphico.graphico.node_affinity_plot(M, cluster_labels, aspect_ratio=0.002, return_matrix=False)[source]

Plot the node affinity matrix created using the node_affinity() method.

Parameters

  • M (numpy.ndarray) – Node affinity matrix.

  • cluster_labels (list of int) – Clustering partition with integers denoting cluster labels.

  • aspect_ratio (float) – Needs to be adjusted for properly displaying the node affinity matrix.

  • return_matrix (boolean) – Determines whether the function returns the sorted affinity matrix.

graphico.graphico.plot_confusion_matrix(y_true, y_pred, true_labels=None, pred_labels=None, normalize_to=None, plot=True, return_handle=False)[source]

Compares two different partitionings of data, i.e., two separate clusterings, using a matrix where the entry (k,l) indicates the correspondence between cluster k of the first partition and cluster l of the second partition. For example, if normalize_to==None, the (k,l) entry shows how many points were assigned to both cluster k and cluster l. Another popular use case is a classification task, where we would compare the true labels (first partition) and predicted labels (second partition).

Parameters

  • y_true (list of int) – First partition; a list of integers denoting cluster labels.

  • y_pred (list of int) – Second partition; a list of integers denoting cluster labels.

  • true_labels – Text labels describing the clusters in the y_true partition.

  • pred_labels – Text labels describing the clusters in the y_pred partition.

  • normalize_to (int OR None) – Can take the values 0, 1 or None; determines whether the values of the confusion matrix are normalised with respect to the total number of points in clusters of the y_true partition (normalize_to=1) or y_pred partition (normalize_to=0), or if the values are not normalised at all (normalize_to=None).

Returns

Confusion matrix where the entry (k,l) indicates the correspondence between cluster k of the first partition and cluster l of the second partition.

Return type

(numpy.ndarray)
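The unnormalised case, and normalisation by the y_true cluster sizes, can be illustrated with this pure-Python sketch. The helper names `confusion_counts` and `normalise_rows` are hypothetical, and placing y_true clusters on the rows follows the (k,l) description above; it is not taken from graphico's code.

```python
def confusion_counts(y_true, y_pred):
    """Entry (k, l) counts points in cluster k of y_true and cluster l of y_pred."""
    rows = sorted(set(y_true))
    cols = sorted(set(y_pred))
    matrix = [[0] * len(cols) for _ in rows]
    for t, p in zip(y_true, y_pred):
        matrix[rows.index(t)][cols.index(p)] += 1
    return matrix


def normalise_rows(matrix):
    """Divide each row by its sum, i.e., by the size of the y_true cluster."""
    return [[value / sum(row) for value in row] for row in matrix]


y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
counts = confusion_counts(y_true, y_pred)
fractions = normalise_rows(counts)
print(counts)
print(fractions)
```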

graphico.graphico.plot_sorted_matrix(C, cluster_labels)[source]

Sorts the rows/columns of a matrix with respect to a set of cluster labels
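Reordering a square matrix so that nodes with the same cluster label sit next to each other can be sketched as below. This is a pure-Python illustration with a hypothetical function name `sort_by_labels`; it covers only the sorting and omits the plotting that the function name suggests.

```python
def sort_by_labels(C, cluster_labels):
    """Reorder rows and columns so that equal labels are adjacent."""
    # Stable sort: nodes within a cluster keep their original relative order
    order = sorted(range(len(cluster_labels)), key=lambda i: cluster_labels[i])
    sorted_matrix = [[C[i][j] for j in order] for i in order]
    return sorted_matrix, order


C = [
    [1, 0, 5],
    [0, 1, 0],
    [5, 0, 1],
]
labels = [0, 1, 0]
sorted_C, order = sort_by_labels(C, labels)
for row in sorted_C:
    print(row)
```

After sorting, nodes 0 and 2 (both in cluster 0) occupy the top-left block, which makes block structure along the diagonal easier to see.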

graphico.graphico.subcluster_nodes(W, l, clusters, nearest_neighbours, fpath, session_name, N_nn, N, N_consensus, random_state=None, edge_bootstrap=False)[source]

Method that selects a particular partition (clustering), takes this partition’s clusters and sub-clusters them further (i.e., splits them into smaller clusters) using the ConsensusClustering class defined above. Note that the first level of the hierarchy that will be sub-clustered needs to be set up manually. We normally refer to it as the 0-th level and assign all data points to the same cluster with the label 0. However, the 0-th level can also be used to exclude some points from the clustering analysis (e.g., the most central skills nodes). Consult the provided workflow examples on how to set up the whole pipeline from a similarity matrix to a hierarchical set of clusters.

Parameters

  • W (numpy.ndarray) – Similarity matrix which is used to construct the graph.

  • l (int) – Level of the clustering hierarchy that is sub-clustered further; this is used to find the file with the particular partitioning.

  • clusters (list of int OR 'all') – If this is a list of integers, only the clusters whose integer label is contained in this list are sub-clustered. If clusters==’all’, then all clusters of level l are sub-clustered further.

  • nearest_neighbours (list of int OR ['all']) – If this is a list of integers, then the method constructs k-nearest neighbour graphs using each of the integers in this list as k, and detects communities in these graphs. Then, the results with different k values are pooled together for the consensus clustering step. If nearest_neighbours=[‘all’], then all non-zero values of W are used to construct the graph.

  • fpath (string) – File path where to save the clustering results.

  • session_name (string) – Name of this clustering session to be used when saving output files.

  • N_nn (int) – Number of clustering runs for each nearest neighbour value in ‘nearest_neighbours’; normally we set N_nn = N//len(nearest_neighbours).

  • N (int) – See the description of ‘N’ in ConsensusClustering.__init__()

  • N_consensus (int) – See the description of ‘N_consensus’ in ConsensusClustering.__init__()

  • random_state (int) – See the description of ‘seed’ in ConsensusClustering.__init__()

  • edge_bootstrap (boolean) – See the description of ‘edge_bootstrap’ in ConsensusClustering.__init__()

Returns

The method does not return anything but instead saves output files with the results in the designated location. The following outputs are saved:

    1. CSV table with the complete ensemble of clusterings (from step 1 of the clustering procedure; see ConsensusClustering class description).

    2. CSV table with the complete ensemble of consensus clusterings (from step 2 of the clustering procedure; see ConsensusClustering class description).

    3. CSV table with the sub-clusters for each processed cluster of Level l. The table has two columns ‘id’ and ‘cluster’, where ‘id’ is the original node id and ‘cluster’ are the new cluster labels.

    4. NPY file that stores the co-clustering occurrence matrix derived from the clustering ensemble from step 1 of the clustering procedure (can be useful for subsequent stability assessments; note, however, that it can be on the order of GBs for large graphs).

Module contents

Top-level package for graphico.