Hdbscan: Crash when allow_single_cluster used with cluster_selection_epsilon

Created on 21 Apr 2020  ·  13Comments  ·  Source: scikit-learn-contrib/hdbscan

When I use cluster_selection_epsilon=x where x > 0 and allow_single_cluster=True together, HDBSCAN crashes.

I am using those two options together to try and get the no_structure toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only use cluster_selection_epsilon, multiple clusters appear in that square. When I only use allow_single_cluster=True, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.

Code

import numpy as np
import hdbscan

if __name__ == '__main__':
    no_structure = np.random.rand(1500, 2)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15, cluster_selection_epsilon=3, allow_single_cluster=True)
    clusterer.fit(no_structure)

Expected behavior

HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.

Actual behavior

HDBSCAN crashes with the following traceback:

Traceback (most recent call last):
  File "/home/home/PycharmProjects/sandbox/crash_example.py", line 7, in <module>
    clusterer.fit(no_structure)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 632, in hdbscan
    return _tree_to_labels(X,
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 59, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(condensed_tree,
  File "hdbscan/_hdbscan_tree.pyx", line 645, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 631, in hdbscan._hdbscan_tree.epsilon_search
IndexError: index 0 is out of bounds for axis 0 with size 0

All 13 comments

I'm also experiencing this issue. only happens when usingcluster_selection_method='eom' .

In ._hdbscan_tree.pyx the cluster_tree is empty because all of the children of the root node have child size == 1. 'leaf' method can get away with this because 'leaves' is set to empty as well.

However, 'eom_clusters' contains the root node and epsilon_search() relies on cluster_tree only searching child nodes. I think we want to disallow the root node from being in the 'eom_clusters' list?

@danielzgtg please see my pull request for a possible fix to your problem. I am not positive that it does not affect the underlying algorithm, but I did my best to find a solution and I'm pretty confident.

So after reading the HDBSCAN* paper by Malzer and Baum, I think that probably my fix for 370 was not algorithmically correct; the fix allows us to run with that parameter, but does not actually allow the algorithm to return exactly one single cluster.

The description under section 4.3 states that describes the merging based on epsilon "If we find one _before reaching the root_, we select it as a cluster and unselect all of its descendants".

@lmcinnes or @cmalzer would you be able to confirm that for the combination of "allow_single_cluster" and "cluster_selection_epsilon" that we want to _allow reaching the root node_ during the upwards traversal of the hierarchy? Or if that has unintended consequences? This seems like the correct move.

I have made a new commit that I think implements it correctly, and fixes some incorrect behavior I was seeing. I apologize for not doing more thorough research before making my initial commit.

Hi, thanks for pointing that out - I forgot to implement cluster_selection_epsilon for the "allow_single_cluster" option. (The algorithm described in the paper only considered the case where we want to partition the data set)

But I had a look at the latest commit and returning the root in "traverse_upwards" if allow_single_cluster is set to True should indeed fix the issue. Thank you!

Thanks for the quick reply and for your work in designing this algorithm!

@lmcinnes Unfortunately, this issue is not resolved.

The second pull request I made passed a unit test that I was using and allows single clusters in most cases, but there is an edge case with the code provided by @danielzgtg.

I tried to do some debugging and the problem lies appears to be when cluster_tree['lambda_val] is empty on _hdbscan_tree.pyx:638 . I'm having trouble debugging the cython though (still pretty new to cython). If anyone has time to help investigate this, I would greatly appreciate it.

I just had a look at it. The problem is indeed that cluster_tree is empty at this point. Therefore, cluster_tree['lambda_val'][cluster_tree['child'] == leaf] is an empty array and you will get an IndexError when trying to access the first element ([0]).

The reason why cluster_tree is empty is because of the line "cluster_tree = tree[tree['child_size'] > 1]". The tree only contains a parent node with children of size 1.

The code works for cluster_selection_method="leaf" because the list of leaves is empty. For "eom", the parent cluster is passed as "leaf".

This could be solved by checking whether cluster_tree is empty before doing the epsilon_search. For example, "if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0: ... "

I just tested it with your suggestion of cluster_tree.shape[0] > 0 in the conditional , but that fails. After doing a little bit of printf debugging, the cluster_tree had ~26 elements in it, and eom_clusters has exactly one, which is the root node. So, when we evaluate cluster_tree['child'] == leaf on line 637, leaf is the root node and it matches no child entries.

How about this?

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      

    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          

    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if allow_single_cluster:                                                  
                selected_clusters.append(leaf)                                        
                break                                                                 
            continue                                                                  
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:                                           
            if leaf not in processed:                                                 

Honestly I'm a little confused. When we want to allow single cluster, does that mean we want to allow it to be just the noise cluster? In that case it would be :

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      

    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          

    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if not allow_single_cluster:                                              
                selected_clusters.append(leaf)                                        
            break                                                                     
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:  

Seems like I tested with the wrong number of random data points, and then got the same error, but with a different cause.
There are several ways how this can be fixed; it might actually be a little bit faster to do the check outside of epsilon_search, for example something like

if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0:

            eom_clusters = [c for c in is_cluster if is_cluster[c]]
            if not (len(eom_clusters) == 1 and eom_clusters[0] == cluster_tree['parent'].min()):
                selected_clusters = epsilon_search(set(eom_clusters), cluster_tree, cluster_selection_epsilon, allow_single_cluster)

(Sorry for the many edits)

Thanks for the fast response! That sounds fine. I'll put in a pull request tonight.

@cmalzer

I put up a pull request with your suggestion.

The reason I had the check inside the epsilon_search() method is because I was unsure whether or not we could guarantee that when the root node is present in eom_clusters that it is the only item in that list. Otherwise if we only check that one case, the root node will still cause an error in epsilon_search() if it is anywhere present in the eom_clusters list with a mix of other leaves.

If we can't make that assumption then I will change the conditional to

if (cluster_tree['parent'].min() in eom_clusters)

because if the root is present at all in the list and we're allow it as an output then it will by default become the only epsilon child returned from the search.

If HDBSCAN selects a cluster, then it also unselects all of its descendants. Therefore, it actually should not happen that the root cluster AND other clusters are in the list. However, your suggestion of "if .... in eom_clusters" is much easier to read than my version, so I would actually prefer that line of code - although it might be a little bit faster to check the length first because otherwise we search the whole list of clusters for the root node every time we use the epsilon parameter, even though that is a rare case. But I'm not sure how much of a performance difference that makes... (maybe checking the length takes just as long anyway?) I'm fine with either way.

Thanks for clarifying! I think maybe as a middle ground I'll leave it as is for performance and put a comment above.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

eyaler picture eyaler  ·  12Comments

esvhd picture esvhd  ·  7Comments

Berowne picture Berowne  ·  3Comments

arunmarathe picture arunmarathe  ·  4Comments

s0j0urn picture s0j0urn  ·  10Comments