Parse a string of multipart data

I have a string (base64 decoded here) that looks like this:

----------------------------212550847697339237761929
Content-Disposition: form-data; name="preferred_name"; filename="file1.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE1}
----------------------------212550847697339237761929
Content-Disposition: form-data; name="to_process"; filename="file2.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE212341234}
----------------------------212550847697339237761929--

I generate this on a simple webpage that uploads a couple of files to an AWS Lambda script through a PUT request via API Gateway. Note that what I get from API Gateway is a Base64 string, which I then decode into the string above.

The string above is the data that my Lambda script receives from API Gateway. I would like to parse this string with Python 2.7 to retrieve the data it contains. I've experimented with the cgi module and its cgi.parse_multipart() function, but I cannot find a way to convert a string into the arguments it requires. Any tips?

Tags: python multipartform-data aws-api-gateway

2 Answers

Aperson

Reply #2 · 2019-06-15 07:50

Comment: is it robust and spec compliant?

As long as your data meets the following preconditions:

The first line is the boundary
The header that follows is terminated by an empty line
Each message part is terminated by the boundary

Comment: What if the content is binary like a JPEG stream?

This is likely to break, because string methods are used and the content is read with .readline(), which depends on newlines.
Therefore, decoding from Base64 and then unpacking the multipart data is the wrong approach!
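
One way around the newline dependence, sketched here for Python 3 rather than the asker's 2.7, is to keep the Base64-decoded data as bytes and split it on the delimiter bytes instead of reading lines. The function name split_multipart is hypothetical, and the sketch assumes the three preconditions listed above plus one more: the delimiter bytes never occur inside a part's content.

```python
import base64
import re

# Hypothetical helper, not part of any library.
_FILENAME_RE = re.compile(rb'filename="([^"]*)"')

def split_multipart(raw):
    """Split raw multipart bytes into (filename, content) pairs.

    The content is never decoded to text, so binary payloads survive.
    Assumes the first line is the delimiter and that the delimiter
    bytes do not occur inside any part's content.
    """
    first_line = raw.partition(b"\n")[0]
    delimiter = first_line.rstrip(b"\r")
    parts = []
    for chunk in raw.split(delimiter)[1:]:
        if chunk.startswith(b"--"):
            break  # closing delimiter ("<delimiter>--") reached
        # Headers end at the first empty line (CRLF or bare LF).
        blank = re.search(rb"\r?\n\r?\n", chunk)
        if blank is None:
            continue
        headers = chunk[:blank.start()]
        content = chunk[blank.end():]
        # Drop the single line break that precedes the next delimiter.
        if content.endswith(b"\r\n"):
            content = content[:-2]
        elif content.endswith(b"\n"):
            content = content[:-1]
        match = _FILENAME_RE.search(headers)
        filename = match.group(1).decode("utf-8") if match else None
        parts.append((filename, content))
    return parts
```

With a Base64 string body_b64 as received from the gateway, this would be called as split_multipart(base64.b64decode(body_b64)).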

Comment: If there's a version reusing a common library

If you are able to provide your data as a standard MIME message, you can use the following:

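import email

# mimeHeader is assumed to hold the MIME headers that precede the body, e.g. a
# "Content-Type: multipart/form-data; boundary=..." line followed by an empty line.
# data is the multipart string from the question.
msg = email.message_from_string(mimeHeader+data)
print('is_multipart:{}'.format(msg.is_multipart()))

for part in msg.walk():
    # Skip the multipart container itself; only the leaf parts carry files.
    if part.get_content_maintype() == 'multipart':
        continue

    filename = part.get_filename()
    payload = part.get_payload(decode=True)
    print('{} filename:{}\n{}'.format(part.get_content_type(), filename, payload))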

Output:

is_multipart:True
application/rtf filename:file1.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
application/rtf filename:file2.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
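
The snippet above does not show where mimeHeader comes from. As a rough sketch for Python 3, and assuming the first line of the body is "--" followed by the boundary, the header could be rebuilt from the body itself. The names build_mime_header and parse_with_email are hypothetical; in a real handler the Content-Type header of the incoming request would be the more reliable source for the boundary.

```python
import email

def build_mime_header(body):
    """Rebuild a Content-Type header from the body's first line.

    Assumption: the first line is the delimiter, i.e. "--" + boundary.
    """
    first_line = body.split(b"\n", 1)[0].rstrip(b"\r")
    boundary = first_line[2:]  # strip the leading "--"
    return (b"MIME-Version: 1.0\r\n"
            b'Content-Type: multipart/form-data; boundary="'
            + boundary + b'"\r\n\r\n')

def parse_with_email(body):
    """Return a list of (filename, payload bytes) for each leaf part."""
    msg = email.message_from_bytes(build_mime_header(body) + body)
    files = []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # skip the container, keep only the leaf parts
        files.append((part.get_filename(), part.get_payload(decode=True)))
    return files
```

Here body is the Base64-decoded request body as bytes. Using message_from_bytes instead of message_from_string avoids a text decoding step before parsing.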

Question: Parse a string of multipart data

Pure Python Solution, for instance:

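import io
import re

# data is assumed to be the decoded multipart text from the question (a str).
with io.StringIO(data) as fh:
    parts = []
    part_line = []     # lines of the part currently being read
    part_fname = None
    new_part = None    # boundary line, taken from the first line of the data
    robj = re.compile('.+filename="(.+)"')

    while True:
        line = fh.readline()
        if not line:
            break

        if not new_part:
            new_part = line[:-1]  # first line without its trailing newline

        if line.startswith(new_part):
            # A boundary closes the previous part, if there is one.
            if part_line:
                parts.append({'filename': part_fname, 'content': ''.join(part_line)})
                part_line = []

            # Read the part headers up to the empty line and pick up the filename.
            while line and line != '\n':
                _match = robj.match(line)
                if _match:
                    part_fname = _match.group(1)
                line = fh.readline()
        else:
            part_line.append(line)

for part in parts:
    print(part)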

Output:

{'filename': 'file1.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
{'filename': 'file2.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)

Tested with Python: 3.4.2


劫难

Reply #3 · 2019-06-15 07:50

If you are working with an API, it is better to use JSON-formatted data. You can use the requests module to send a PUT request to the API; it returns a response object from which you can easily retrieve the JSON data by calling response.json().
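
As a rough illustration of what JSON-formatted file data might look like, the sketch below packs file contents into a JSON string and unpacks them again, using only the standard library. The function names and the payload layout (file name mapped to Base64 text) are assumptions made for this example, not something the API in the question defines, and the actual PUT request is left out.

```python
import base64
import json

def files_to_json(files):
    """Encode a dict of {file name: bytes} as a JSON string.

    File contents are Base64-encoded because JSON cannot carry raw bytes.
    """
    encoded = {name: base64.b64encode(data).decode("ascii")
               for name, data in files.items()}
    return json.dumps(encoded)

def json_to_files(payload):
    """Reverse files_to_json: return a dict of {file name: bytes}."""
    return {name: base64.b64decode(text)
            for name, text in json.loads(payload).items()}
```

The sending side would call files_to_json before the request and the Lambda side would call json_to_files on the received body.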
