Load S3 Data into AWS SageMaker Notebook

I've just started experimenting with AWS SageMaker and would like to load data from an S3 bucket into a pandas DataFrame in my SageMaker Python Jupyter notebook for analysis.

I could use boto to grab the data from S3, but I'm wondering whether the SageMaker framework offers a more elegant way to do this in my Python code.

Thanks in advance for any advice.

Tags: python amazon-web-services amazon-s3 machine-learning amazon-sagemaker

6 Answers

一夜七次

Reply #2 · 2020-01-29 06:06

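In the simplest case you don't need to call boto3 yourself, because you are only reading data. pandas can read an s3:// path directly (it uses the s3fs package for this, so s3fs must be installed), which makes it even simpler: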

import pandas as pd

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# Keep the result in a variable so the DataFrame can be analysed afterwards
df = pd.read_csv(data_location)

But, as Prateek stated, make sure your SageMaker notebook instance is configured to have access to S3. This is done at the configuration step, under Permissions > IAM role.


Root（大扎）

Reply #3 · 2020-01-29 06:09

Do make sure the Amazon SageMaker role has a policy attached that grants it access to S3. This can be done in IAM.
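
As a rough illustration of what such a policy might contain, the sketch below builds a minimal read-only policy document as a Python dict. The helper name s3_read_policy and the bucket name my-bucket are assumptions made for this example; the actual policy attached to your role may be broader (for instance the managed AmazonS3ReadOnlyAccess policy) or scoped differently.

```python
import json


def s3_read_policy(bucket):
    """Build a minimal IAM policy document allowing read access to one bucket.

    Hypothetical helper for illustration; adjust actions and resources to your needs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = s3_read_policy("my-bucket")  # "my-bucket" is a placeholder name
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the notebook's execution role.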


别忘想泡老子

Reply #4 · 2020-01-29 06:14

You can also use AWS Data Wrangler https://github.com/awslabs/aws-data-wrangler:

import awswrangler as wr

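# awswrangler 1.0 and later expose this as wr.s3.read_csv;
# older 0.x releases used wr.pandas.read_csv instead
df = wr.s3.read_csv(path="s3://...")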


小情绪 Triste *

Reply #5 · 2020-01-29 06:23

If you have a look here, it seems you can specify this in the InputDataConfig. Search for "S3DataSource" (ref) in the document. The first hit is even in Python, on pages 25 and 26.
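
For orientation, an InputDataConfig entry with an S3DataSource might look roughly like the sketch below. Note that this structure belongs to a training job request (it tells the job where its input channel lives) and does not by itself load anything into pandas inside the notebook. The helper name s3_input_channel, the channel name train, and the S3 URI are assumptions chosen for the example.

```python
import json


def s3_input_channel(channel_name, s3_uri, content_type="text/csv"):
    """Build one InputDataConfig entry pointing a training channel at an S3 prefix.

    Hypothetical helper for illustration only.
    """
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",  # treat S3Uri as a key prefix
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# Placeholder bucket and prefix; replace with your own location
input_data_config = [s3_input_channel("train", "s3://my-bucket/train/")]
print(json.dumps(input_data_config, indent=2))
```

A list like this would be passed as the InputDataConfig argument when creating a training job, alongside the other required settings.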


够拽才男人

Reply #6 · 2020-01-29 06:27

You could also access your bucket as if it were a file system, using s3fs:

import s3fs
from IPython.display import display
from PIL import Image

fs = s3fs.S3FileSystem()

# List the first 5 entries under a prefix in a bucket you can access
fs.ls('s3://bucket-name/data/')[:5]

# Open an object directly, as if it were a local file
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
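
Since the question asks for a DataFrame, the same file-system interface can also feed pandas. The following is a minimal sketch, assuming every object matching a glob pattern is a CSV file with the same columns; the function name read_csv_files and the example pattern are hypothetical. It accepts any object with glob and open methods, such as an s3fs.S3FileSystem instance.

```python
import pandas as pd


def read_csv_files(fs, pattern):
    """Read every CSV matching `pattern` through `fs` and concatenate the results.

    `fs` is assumed to offer glob(pattern) and open(path, mode),
    as s3fs.S3FileSystem does. Hypothetical helper for illustration.
    """
    paths = sorted(fs.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no objects match {pattern!r}")
    frames = []
    for path in paths:
        # Each object is opened like a local file and parsed by pandas
        with fs.open(path, "rb") as f:
            frames.append(pd.read_csv(f))
    return pd.concat(frames, ignore_index=True)


# Usage in a notebook (placeholder bucket and prefix):
# import s3fs
# df = read_csv_files(s3fs.S3FileSystem(), "bucket-name/data/*.csv")
```

Passing the file system in as an argument keeps the function easy to try out against a local or in-memory stand-in before pointing it at a real bucket.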


贪生不怕死

Reply #7 · 2020-01-29 06:28

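import boto3
import pandas as pd
from sagemaker import get_execution_role

# The execution role is what grants the notebook access to S3;
# neither `role` nor boto3 is used directly by the read below
role = get_execution_role()
bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# pandas resolves the s3:// path itself (via s3fs) using the notebook's credentials
df = pd.read_csv(data_location)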
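
For comparison, the explicit boto3 route mentioned in the question can be sketched as follows. This is one possible shape, not the only one: the function name read_csv_from_s3 is hypothetical, and it assumes the object is small enough to be read fully into memory. The client is passed in as an argument, so in a notebook you would create it with boto3.client("s3").

```python
import io

import pandas as pd


def read_csv_from_s3(s3_client, bucket, key, **read_csv_kwargs):
    """Download one object through an S3 client and parse it as CSV.

    Hypothetical helper; `s3_client` is expected to behave like boto3.client("s3").
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a stream; read it fully, then hand the bytes to pandas
    body = response["Body"].read()
    return pd.read_csv(io.BytesIO(body), **read_csv_kwargs)


# Usage in a notebook (placeholder bucket and key):
# import boto3
# df = read_csv_from_s3(boto3.client("s3"), "my-bucket", "train.csv")
```

Compared with passing an s3:// path to pandas, this avoids the s3fs dependency at the cost of a few more lines.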

