I am working on an image segmentation machine learning project and I would like to test it out on Google Colab.
For the training dataset, I have 700 images, mostly 256x256, that I need to load into a Python NumPy array for my project. I also have thousands of corresponding mask files to upload. They currently exist in a variety of subfolders on Google Drive, but I have been unable to upload them to Google Colab for use in my project.
So far I have attempted using Google Fuse, which seems to have very slow upload speeds, and PyDrive, which has given me a variety of authentication errors. I have been using the Google Colab I/O example code for the most part.
How should I go about this? Would PyDrive be the way to go? Is there code somewhere for uploading a folder structure or many files at a time?
You can put all your data into your Google Drive and then mount the drive. This is how I have done it. Let me explain in steps.
Step 1:
Transfer your data into your Google Drive.
Step 2:
Run the following code to mount your Google Drive.
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
# Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()
# Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
# Create a directory and mount Google Drive using that directory.
!mkdir -p Drive
!google-drive-ocamlfuse Drive
!ls Drive/
# Create a file in Drive.
!echo "This newly created file will appear in your Drive file list." > Drive/created.txt
Step 3:
Run the following line to check whether you can see your desired data in the mounted drive.
!ls Drive
Step 4:
Now load your data as follows. In my case, I had Excel files containing my train, CV, and test data, so I read them with pandas:
import pandas as pd

train_data = pd.read_excel(r'Drive/train.xlsx')
test = pd.read_excel(r'Drive/test.xlsx')
cv = pd.read_excel(r'Drive/cv.xlsx')
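Since the question is about image files rather than spreadsheets, here is a minimal sketch for reading images from the mounted drive into a single NumPy array. The folder name Drive/images/ and the .png extension are assumptions; adjust them to your own layout, and note that np.stack requires all images to share the same dimensions.
# Sketch: collect every .png under a hypothetical Drive/images/ folder
# (including subfolders) and stack them into one NumPy array.
import glob
import numpy as np
from PIL import Image

image_paths = sorted(glob.glob('Drive/images/**/*.png', recursive=True))
images = np.stack([np.asarray(Image.open(p)) for p in image_paths])
print(images.shape)  # e.g. (700, 256, 256, 3)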
I hope it helps.
Edit
For copying data from the Colab notebook environment into your Drive, you can run the following code.
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Create & upload a file.
uploaded = drive.CreateFile({'title': 'data.xlsx'})
uploaded.SetContentFile('data.xlsx')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
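Conversely, to pull a file from Drive into the Colab environment with the same PyDrive client, a sketch along the following lines should work (the file ID is a placeholder you would copy from the file's shareable link):
# Sketch: download a Drive file by its ID into the local Colab filesystem.
# 'YOUR_FILE_ID' is a placeholder, not a real ID.
downloaded = drive.CreateFile({'id': 'YOUR_FILE_ID'})
downloaded.GetContentFile('data.xlsx')  # saves the file next to the notebook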
Here are a few steps to upload a large dataset to Google Colab:
1. Upload your dataset to free cloud storage like Dropbox, Openload, etc. (I used Dropbox).
2. Create a shareable link to your uploaded file and copy it.
3. Open your notebook in Google Colab and run this command in one of the cells:
!wget your_shareable_file_link
That's it!
You can compress your dataset into a zip or rar file and later unzip it after downloading it in Google Colab with this command:
!unzip downloaded_filename -d destination_folder
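Put together, a sketch with a hypothetical Dropbox link could look like this (for Dropbox, changing the trailing ?dl=0 of a shared link to ?dl=1 turns it into a direct download):
# Hypothetical link; replace it with your own shareable URL.
!wget -O dataset.zip "https://www.dropbox.com/s/abc123/dataset.zip?dl=1"
!unzip -q dataset.zip -d dataset/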
Zip your file first, then upload it to Google Drive.
See this simple command to unzip:
!unzip {file_location}
Example:
!unzip drive/models.zip
Step 1: Mount the Drive by running the following command:
from google.colab import drive
drive.mount('/content/drive')
This will output a link. Click on the link, hit Allow, copy the authorization code, and paste it into the box in the Colab cell with the text "Enter your authorization code:" above it.
This step just gives Colab permission to access your Google Drive.
Step 2: Upload your folder (zipped or unzipped, depending on its size) to Google Drive.
Step 3: Now work your way through the Drive directories and files to locate your uploaded folder/zipped file.
This process may look something like this:
The current working directory in colab when you start off will be /content/
Just to make sure, run the following command in the cell:
!pwd
It will show you the current directory you are in. (pwd stands for "print working directory")
Then use the commands like:
!ls
to list the directories and files in the directory you are in
and the command:
%cd /directory/name/of/your/choice
to move into the directories and locate your uploaded folder or the uploaded .zip file. (Use the %cd magic rather than !cd: a ! command runs in a temporary subshell, so !cd does not actually change the notebook's working directory.)
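As a rough end-to-end sketch (the paths below are illustrative, not taken from the question; with drive.mount('/content/drive') your files appear under /content/drive/MyDrive, or /content/drive/My Drive in older Colab versions):
!pwd                                   # -> /content
!ls /content/drive/MyDrive             # list what you uploaded to Drive
%cd /content/drive/MyDrive/my_dataset  # illustrative folder name
!unzip -q images.zip -d /content/data  # extract the (hypothetical) zip to fast local storage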
And just like that, you are ready to get your hands dirty with your Machine Learning model! :)
Hopefully, these simple steps will keep you from spending too much unnecessary time figuring out how Colab works, so you can spend the majority of your time on the machine learning model itself, its hyperparameters, pre-processing...
You may want to try the kaggle-cli module, as discussed here.