I'm getting "Mean of empty slice" runtime warnings.
When I print out my variables (numpy arrays), several of them contain nan values. The RuntimeWarning points at line 58 as the issue. What can I change to make it work?
Sometimes the program will run with no issues. Most times it does not.
This is a K-Means from scratch algorithm that clusters the iris data set. It first prompts the user for the number of centroids (clusters) they want. It then randomly generates that many centroids, with each coordinate drawn from the range between the smallest and largest values in the loaded text file.
I have the break statement in the else branch to prevent infinite loops.
Is it because some numbers go below zero when I subtract the data points in the file from the centroids?
Error I get when I run:
How Many Centrouds? 3
Dimensionality of Data: (150, 4)
Starting Centroiuds:
[[ 1.4 7.9 0.2 3.4]
[ 7.8 0.2 4.3 1.4]
[ 5.7 6.9 3. 6.6]]
t0 :
[[[-3.7 4.4 -1.2 3.2]
[ 2.7 -3.3 2.9 1.2]
[ 0.6 3.4 1.6 6.4]]
[[-3.5 4.9 -1.2 3.2]
[ 2.9 -2.8 2.9 1.2]
[ 0.8 3.9 1.6 6.4]]
[[-3.3 4.7 -1.1 3.2]
[ 3.1 -3. 3. 1.2]
[ 1. 3.7 1.7 6.4]]
...,
[[-5.1 4.9 -5. 1.4]
[ 1.3 -2.8 -0.9 -0.6]
[-0.8 3.9 -2.2 4.6]]
[[-4.8 4.5 -5.2 1.1]
[ 1.6 -3.2 -1.1 -0.9]
[-0.5 3.5 -2.4 4.3]]
[[-4.5 4.9 -4.9 1.6]
[ 1.9 -2.8 -0.8 -0.4]
[-0.2 3.9 -2.1 4.8]]]
Warning (from warnings module):
File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 59
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 68
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
---------------
Starting Centroids:
[[ 1.4 7.9 0.2 3.4]
[ 7.8 0.2 4.3 1.4]
[ 5.7 6.9 3. 6.6]]
Starting NewMeans:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
Starting Centroids Now:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
NewMeans now:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
Python Code:
import numpy as np
from pprint import pprint
import random
import sys
import warnings
arglist = sys.argv
#UNCOMMENT BELOW IN FINAL PROGRAM
'''
NoOfCentroids = int(arglist[2])
dataPointsFromFile = np.array(np.loadtxt(sys.argv[1], delimiter = ','))
'''
dataPointsFromFile = np.array(np.loadtxt('iris.txt', delimiter = ','))
NoOfCentroids = input('How Many Centrouds? ')
dataRange = ([])
#UNCOMMENT BELOW IN FINAL PROGRAM
'''
with open(arglist[1]) as f:
    print 'Points in data set: ',sum(1 for _ in f)
'''
dataRange.append(round(np.amin(dataPointsFromFile),1))
dataRange.append(round(np.amax(dataPointsFromFile),1))
dataRange = np.asarray(dataRange)
dataPoints = np.array(dataPointsFromFile)
print 'Dimensionality of Data: ', dataPoints.shape
randomCentroids = []
data = ([])
templist = []
i = 0
while i<NoOfCentroids:
    for j in range(len(dataPointsFromFile[1,:])):
        cat = round(random.uniform(np.amin(dataPointsFromFile),np.amax(dataPointsFromFile)),1)
        templist.append(cat)
    randomCentroids.append(templist)
    templist = []
    i = i+1
centroids = np.asarray(randomCentroids)
def kMeans(array1, array2):
    ConvergenceCounter = 1
    keepGoing = True
    StartingCentroids = np.copy(centroids)
    print 'Starting Centroiuds:\n {}'.format(StartingCentroids)
    while keepGoing:
        #--------------Find The new means---------#
        t0 = StartingCentroids[None, :, :] - dataPoints[:, None, :]
        print 't0 :\n {}'.format(t0)
        t1 = np.linalg.norm(t0, axis=-1)
        t2 = np.argmin(t1, axis=-1)
        #------Push the new means to a new array for comparison---------#
        CentroidMeans = []
        for x in range(len(StartingCentroids)):
            CentroidMeans.append(np.mean(dataPoints[t2 == [x]], axis=0))
        #--------Convert to a numpy array--------#
        NewMeans = np.asarray(CentroidMeans)
        #------Compare the New Means with the Starting Means------#
        if np.array_equal(NewMeans,StartingCentroids):
            print ('Convergence has been reached after {} moves'.format(ConvergenceCounter))
            print ('Starting Centroids:\n{}'.format(centroids))
            print ('Final Means:\n{}'.format(NewMeans))
            print ('Final Cluster assignments: {}'.format(t2))
            for x in xrange(len(StartingCentroids)):
                print ('Cluster {}:\n'.format(x)), dataPoints[t2 == [x]]
            for x in xrange(len(StartingCentroids)):
                print ('Size of Cluster {}:'.format(x)), len(dataPoints[t2 == [x]])
            keepGoing = False
        else:
            print 15*'-'
            ConvergenceCounter = ConvergenceCounter +1
            print 'Starting Centroids:\n'
            print StartingCentroids
            print '\n'
            print 'Starting NewMeans:\n'
            print NewMeans
            StartingCentroids =np.copy(NewMeans)
            print 'Starting Centroids Now:\n'
            print StartingCentroids
            print '\n'
            print 'NewMeans now:'
            print NewMeans
            break
kMeans(centroids, dataPoints)
I assume the warning comes up in the np.mean call inside the loop that builds CentroidMeans.
If t2 == [x] is all False (no match between t2 and x), then dataPoints[...] will be an empty array, resulting in the mean warning.
I think you need to be more careful with that test. Maybe even skip the mean if the masked array is empty.
== tests with floating values are unpredictable. You need to use something like np.isclose or np.allclose to test equivalence with a tolerance.
The second warning comes from later in the mean calc, presumably when trying to divide by 0, the number of elements.
The full mean code can be found in numpy.core._methods.py.
In sum, don't try to take the mean of an empty array.
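As a rough sketch of that advice, here is one way the update step could guard against empty clusters and compare means with a tolerance. This is not the asker's code: the function name update_centroids, the toy data, the iteration cap, and the choice to leave a centroid where it is when no point is assigned to it are all assumptions made for illustration. Re-seeding an empty centroid from a randomly chosen data point would be another reasonable policy.

```python
import numpy as np


def update_centroids(points, centroids):
    """One k-means update; a centroid with an empty cluster is left unchanged."""
    # Distance from every point to every centroid, shape (n_points, n_centroids).
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    labels = np.argmin(distances, axis=-1)

    new_centroids = centroids.astype(float)  # astype returns a copy
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) > 0:  # never take the mean of an empty selection
            new_centroids[k] = members.mean(axis=0)
    return new_centroids, labels


# Toy data, made up for illustration: the third centroid is far from every
# point, so no point is ever assigned to it.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [9.0, 9.0], [100.0, 100.0]])

for _ in range(100):  # arbitrary safety cap instead of an unconditional break
    new_centroids, labels = update_centroids(points, centroids)
    # Compare with a tolerance rather than exact floating-point equality.
    converged = np.allclose(new_centroids, centroids)
    centroids = new_centroids
    if converged:
        break

print(centroids)
print(labels)
```

With this guard no nan values enter the centroid array, so the "Mean of empty slice" and "invalid value encountered" warnings should not be triggered by the update step.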