
S3 storage system

S3 is a widely used object storage system. Many non-AWS storage providers expose the same API, which is why the custom endpoints configured below work.

Recommended practice:

  • Use s5cmd for CLI operations (examining files, simple upload/download, making buckets, ...).
  • Use megfile for Python operations (batch-uploading files, reading files for a dataset, ...); see the sketch after this list.
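
As a quick preview of the megfile side, a minimal sketch (the bucket and paths are hypothetical; configuration is covered in the Megfile section below):

from megfile import smart_sync, smart_glob, smart_open

# batch upload: recursively sync a local folder to S3
smart_sync('local_dataset/', 's3://my-bucket/dataset/')

# read files for a dataset: glob remote files and read them as bytes
for path in smart_glob('s3://my-bucket/dataset/*.jpg'):
    with smart_open(path, 'rb') as f:
        raw = f.read()  # bytes; decode as needed (see the Python API section)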

AWS CLI

The official CLI tool for AWS. Not recommended! Use s5cmd instead.

Install

pip install awscli

First, configure it with your access key (AK) and secret key (SK):

aws configure

This creates a credentials file at ~/.aws/credentials:

[default]
aws_access_key_id=<default access key>
aws_secret_access_key=<default secret key>

To use an endpoint other than Amazon's, configure an alias in your .bashrc:

alias aws="aws --endpoint-url=http://<ip>:<port>"

Usage

# list S3 buckets under your access key
aws s3 ls

# debug (e.g., to check your endpoint ip)
aws s3 ls --debug

# create (make) a bucket
aws s3 mb s3://myBucket

# delete a bucket (!)
aws s3 rb s3://myBucket

# list files in a bucket
aws s3 ls s3://myBucket

# upload local file to a bucket
aws s3 cp local_file s3://myBucket/remote_file

# download remote file from bucket
aws s3 cp s3://myBucket/remote_file local_file
aws s3 cp --recursive s3://myBucket/remote_folder local_folder

# delete remote file
aws s3 rm s3://myBucket/remote_file

s5cmd

A better and faster S3 CLI.

Install

Download the binary from the s5cmd releases page and add it to your PATH.

The simplest way to configure credentials is through environment variables:

export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export S3_ENDPOINT_URL=https://xxx

Alternatively, use ~/.aws/credentials:

[$PROFILE_NAME_YOU_WANT]
region=xxx
aws_access_key_id=xxx
aws_secret_access_key=xxx
s3=
 endpoint_url=https://xxx
s3api=
 endpoint_url=https://xxx
 payload_signing_enabled=true

Usage

# ls all buckets
s5cmd ls

# download to local
s5cmd cp s3://bucket/file .

# upload
s5cmd cp file s3://bucket/
s5cmd cp dir s3://bucket/dir # no -r needed

# delete
s5cmd rm s3://bucket/file

# delete a bucket 
s5cmd rm s3://bucket/* # first delete all files in it, no -r needed
s5cmd rb s3://bucket

# exclude and include
s5cmd rm --exclude "*.log" --exclude "*.txt" 's3://bucket/logs/2020/*'
s5cmd cp --include "*.log" 's3://bucket/logs/2020/*' .

# du
s5cmd du --humanize s3://bucket/

Megfile

Megfile is both a CLI and a Python SDK for S3 and other storage systems.

Install

pip install megfile

Configure

# use env to config s3 access key
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_ENDPOINT_URL=https://xxx # NOTE: different from s5cmd!

# or config with CLI (will modify ~/.aws/credentials)
megfile config s3 <AK> <SK> --endpoint-url http://xxx

Megfile also supports configuring multiple S3 endpoints:

# use env vars (not recommended: the profile name must be written in capital letters)
export PROFILE1__AWS_ACCESS_KEY_ID=xxx
export PROFILE1__AWS_SECRET_ACCESS_KEY=xxx
export PROFILE1__AWS_ENDPOINT_URL=https://xxx

# config with CLI
megfile config s3 <AK> <SK> --endpoint-url http://xxx --profile-name profile1

# these named profiles can be used as `s3+profile1://bucket/file`

CLI

megfile ls s3://bucket
megfile cat s3://bucket/file
megfile cp s3://bucket/file local_file

# named profile
megfile ls s3+profile1://bucket/file

Unfortunately, you cannot perform bucket-level operations with the megfile CLI:

  • list all buckets: use s5cmd ls.
  • create a new bucket: use s5cmd mb s3://new_bucket.

Python API

It supports both local paths and remote S3 paths, so you can use it to read/write files in a unified way.
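
For example, the same calls work on a local path and an S3 path; a minimal sketch (the bucket and file names are hypothetical):

from megfile import smart_open, smart_copy, smart_exists

# write and check the same content on both backends
for path in ('notes.txt', 's3://my-bucket/notes.txt'):
    with smart_open(path, 'wb') as f:
        f.write(b'hello')
    print(path, smart_exists(path))  # True for both

# copy between local and S3 without special-casing either side
smart_copy('notes.txt', 's3://my-bucket/copy_of_notes.txt')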

In-memory S3 operations always work on bytes, so text files must be encoded/decoded and image files converted to/from bytes.

from megfile import smart_open, smart_glob

# glob files
files = smart_glob('s3://bucket/*.jpg')

# read image file
import cv2
import numpy as np
with smart_open('s3://bucket/image.jpg', 'rb') as f:
    raw = f.read() # type(raw) == bytes
image = cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_UNCHANGED).astype(np.float32) / 255

# write image file
image = (image * 255).astype(np.uint8)
with smart_open('s3://bucket/image.jpg', 'wb') as f:
    f.write(cv2.imencode('.jpg', image)[1].tobytes())

# read text file
with smart_open('s3://bucket/file.txt', 'rb') as f:
    text = f.read().decode() # decode bytes to utf-8 str

# write text file
with smart_open('s3://bucket/file.txt', 'wb') as f:
    f.write('test'.encode()) # encode str to bytes

# directly download from a URL to memory and write to s3 (no intermediate steps on local disk)
import requests
raw = requests.get('https://example.com/image.jpg').content
with smart_open('s3://bucket/image.jpg', 'wb') as f:
    f.write(raw)

# directly download a tar file to s3
url = "https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/objaverse_tar/1/19993.tar"
raw = requests.get(url).content
with smart_open('s3://bucket/1/19993.tar', 'wb') as f:
    f.write(raw)

# read this tarfile and extract an image in it
import tarfile
with smart_open('s3://bucket/1/19993.tar', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r') as tar: # tarfile can directly read s3 fetcher, no need for f.read()
        with tar.extractfile('19993/campos_512_v4/00036/00036.png') as img:
            image = cv2.imdecode(np.frombuffer(img.read(), dtype=np.uint8), cv2.IMREAD_UNCHANGED).astype(np.float32) / 255

# read glb mesh from s3
import trimesh
with smart_open('s3+pbss_obj://bucket/mesh.glb', 'rb') as f:
    raw = f.read() # S3 fetcher to raw bytes
    mesh = trimesh.load(file_obj=trimesh.util.wrap_as_stream(raw), file_type='glb', force='mesh')

# write mesh to s3
with smart_open('s3://bucket/mesh.ply', 'wb') as f:
    mesh.export(file_obj=f, file_type='ply')

boto3

The official Python SDK for AWS.

#TODO
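
In the meantime, a minimal sketch of common S3 operations with boto3 (the endpoint, bucket, and key names are placeholders, not specific to this setup):

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://<ip>:<port>',  # omit to use the default AWS endpoint
    aws_access_key_id='xxx',
    aws_secret_access_key='xxx',
)

# list buckets
print([b['Name'] for b in s3.list_buckets()['Buckets']])

# upload / download a file
s3.upload_file('local_file', 'myBucket', 'remote_file')
s3.download_file('myBucket', 'remote_file', 'local_file')

# read an object directly into memory
raw = s3.get_object(Bucket='myBucket', Key='remote_file')['Body'].read()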