I have an excel sheet with over 1000 columns and 11000 rows - all with numeric data. Within the data, there are missing values represented with '*'.
I would like to replace all of the '*' values with the average of the column that it is in.
Doing this manually would take a long time, so is there a formula that would achieve this?
Thanks so much in advanced for any help.
As you have mentioned machine learning I thought I would introduce you to how you could do this with Azure Machine Learning Studio (AML) using a free account.
By using AML you gain access to a number of methods for replacing missing values which are extremely quick. AML has a Clean Missing Data module which exposes methods of replacement such as Multivariate Imputation using Chained Equation, Mean, Median and several others. The great thing here is you can visualize the dataset columns by right clicking on the dataset and see which columns have skew. You can then select on a column by column basis which replacement method to use. If you have heavily skewed columns you might use median instead for instance. This also offers great opportunities for data normalization (scale and reduce). You also gain access to using Python and R with your dataset.
I don't know if there is a method for directly treating
"*"
as missing values, I am trying to find that out, but if you do a little processing in advance of load then all is fine. The step before loading requires:"~*"
for Find and leave Replace blankThen login into AML and click the + New at the bottom of the screen
Select New >
DATASET > FROM LOCAL FILE
and select your fileWhen selecting type ensure to select CSV with no header if you data has no header row or with header if it does:
Your dataset will start uploading as shown by progress bar at bottom of screen and then appear in the
SAVED DATASETS
collection.Click the + New button again and select
EXPERIMENT > BLANK EXPERIMENT
Drag and drop your saved dataset onto the canvas on the right:
In the Search experiment items box on the right, type:
Clean Missing Data
then drag the module that appears onto the canvasJoin the 2 boxes by clicking the dot at the bottom of the top box and dragging to the other box
Select the bottom box and then input the following parameters on the right (here is where you can choose which method to apply for missing values e.g. replace missing with mean, or perhaps median if your column data is skewed.
Right click the bottom module and select
Run selected
Right click again and select
Cleaned dataset > Save as Dataset
The progress bar at the bottom will inform you when complete
Type in the Search experiment items box again:
convert to csv
and drag that onto the canvas and connect the left hand side bottom of the second module to the top of the newly added third:Select the bottom module and right click >
Run selected
Wait for the progress bar to complete.
Right-click the bottom module and hit
Download
. Done.I can give you a three sheet solution Sam?:
Sheet 2: Cell A1=
Paste that along the top row for each of 1000 columns in sheet 2.
Sheet 3: Cell A1=
Copy that and then paste it into the entire worksheet 3 (i.e., that top left corner symbol that allows you to do that). It's gonna take a while to update but will deliver what you want!