Transferring Binary data through HTTP

Problem Statement:

We have an HTTP service (written in PHP) that stores and fetches attachment/media files and handles more than 15 million requests daily.
Because it serves files rather than small pieces of raw data, the volume of data transferred is always high and is a constant concern.

The client needs both the metadata and the content from the service. This means the server has to return an array of data in the following format:
    Array ( "name" => "name", "filename" => "filename",  "extension" => "doc",  "content" => "BINARY DATA")
Since you cannot return PHP arrays from a REST-based service, this data first has to be serialized before it is sent back to the client.

Earlier approach:
  • JSON was used to serialize the data on the server side.
  • The response contains binary data, which breaks JSON serialization, so the data first had to be encoded.
  • The data was UTF-8 encoded and then serialized with JSON.
  • The client decoded the JSON, decoded the UTF-8 data, built a PHP array and returned it to the app.

Problems with this approach:
  • With UTF-8 encoding the data size increases (around 35%-50%, sometimes even more than 100%).
  • More processing on both the server and the client side.
  • High network bandwidth consumption.
  • Scaling the service requires far more server resources than should be needed.
  • Since the data is encoded on the server side (by PHP in our case), a service client is required on the app side to decode it, which makes the service language dependent.
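
To get a feel for where the extra bytes come from, here is a minimal sketch of the old approach. It assumes that "UTF-8 encoding" meant treating every byte as a Latin-1 character and converting it to UTF-8, which is a guess about the original setup. The helper latin1ToUtf8() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: treat each byte as a Latin-1 code point and emit UTF-8.
// This is an assumption about what "UTF-8 encoding" meant in the old approach.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b);                       // ASCII stays one byte
        } else {
            // Bytes 0x80-0xFF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// 256 bytes covering every possible byte value, standing in for file content.
$binary = implode('', array_map('chr', range(0, 255)));

$encoded = latin1ToUtf8($binary);
$json = json_encode(['name' => 'name', 'content' => $encoded]);

echo strlen($binary), " raw bytes\n";
echo strlen($encoded), " bytes after UTF-8 encoding\n";
echo strlen($json), " bytes after JSON serialization\n";
```

For evenly distributed byte values the encoded string is 384 bytes against 256 raw bytes, i.e. 50% larger, and JSON escaping adds more on top of that.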

The solution:
We needed a protocol/mechanism to serialize the data that neither increases the size of the response nor hampers performance.

For this we came across the multipart/form-data content type approach.

What is multipart/form-data:
 
It is an encoding type that allows files to be sent through a POST request. With it, no characters are encoded, which is why this type is used when a form has to upload binary data, such as the contents of a file, to a server.

We can reuse this file-upload behaviour, that is, replicate the way files are uploaded from an HTML form.

Suppose we have an HTML form with the following fields:
  •     2 text fields, id and email
  •     1 file field

When we submit the above form, we get the POST request below (as seen in Firebug):

-----------------------------1306373535267935191105972702
Content-Disposition: form-data; name="id"

1
-----------------------------1306373535267935191105972702
Content-Disposition: form-data; name="email"

a@a.gmail.com
-----------------------------1306373535267935191105972702
Content-Disposition: form-data; name="file"; filename="CV.docx"
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

**binary data**
-----------------------------1306373535267935191105972702
Content-Disposition: form-data; name="submit"

Submit
-----------------------------1306373535267935191105972702--



How it works:
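  • The request header Content-Type: multipart/form-data; boundary=---------------------------1306373535267935191105972702 sets the content type and declares the boundary string that separates the fields. In the body, each delimiter line is that boundary prefixed with two extra hyphens, and the final delimiter also ends with two hyphens.
  • Every field gets its own part headers before its data: Content-Disposition: form-data; with the field name (plus the filename and a Content-Type for file fields), then a blank line, followed by the data.
  • On the server side the body is split (exploded) on the boundary string.
  • The boundary must be unique, meaning it must not occur anywhere in the data. Because of that, no encoding of the data is necessary and binary data is sent as is.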
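
The uniqueness requirement in the last point can be sketched as follows. pickBoundary() is a hypothetical helper, not something from our service: it generates a random boundary and retries in the unlikely case that the string already occurs in one of the values.

```php
<?php
// Hypothetical helper: generate a random boundary and retry until it does
// not occur in any of the values that will be sent.
function pickBoundary(array $values): string
{
    do {
        // 8 fixed characters plus 32 random hex characters.
        $boundary = 'boundary' . bin2hex(random_bytes(16));
        $clash = false;
        foreach ($values as $value) {
            if (strpos($value, $boundary) !== false) {
                $clash = true;   // boundary appears in the data, try again
                break;
            }
        }
    } while ($clash);

    return $boundary;
}

// Sample fields, mirroring the form above; the file content is made up.
$fields = [
    'id'    => '1',
    'email' => 'a@a.gmail.com',
    'file'  => "\x00\xFF raw bytes",
];
$boundary = pickBoundary($fields);
echo $boundary, "\n";
```

Scanning every value costs one pass over the data, so for large files a long random boundary without the scan may be an acceptable trade-off.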


How did we use it:
 
We can use the multipart/form-data format described above to transfer binary data through HTTP services. The server sends the data in this format with a unique boundary string, which makes it easy to parse on the client side.

At server side:

Suppose we have the following response array on the server side:
    Array ( "name" => "name", "filename" => "filename",  "extension" => "doc",  "content" => "BINARY DATA")

We can pick a unique boundary string and set it in the header like this:
    header('Content-type: multipart/form-data; boundary=----mainhoonboundary');

Then we can convert the above data into multipart/form-data format like this:
 
----mainhoonboundary
Content-Disposition: form-data; name="name"; filename="name"

name
----mainhoonboundary
Content-Disposition: form-data; name="filename"; filename="filename"

filename
----mainhoonboundary
Content-Disposition: form-data; name="extension"; filename="extension"

doc
----mainhoonboundary
Content-Disposition: form-data; name="content"; filename="content"

BINARY DATA
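
A minimal sketch of how such a body could be built from the response array is shown below. buildMultipart() is a hypothetical helper, not our production code. It follows the standard delimiter layout (each delimiter is the boundary prefixed with two hyphens, plus a closing delimiter at the end) and leaves out the filename attribute, so it differs slightly from the listing above.

```php
<?php
// Hypothetical helper: turn an associative array into a multipart/form-data body.
function buildMultipart(array $fields, string $boundary): string
{
    $body = '';
    foreach ($fields as $name => $value) {
        $body .= "--{$boundary}\r\n";
        $body .= "Content-Disposition: form-data; name=\"{$name}\"\r\n";
        $body .= "\r\n";            // blank line ends the part headers
        $body .= $value . "\r\n";   // value is written as is, no encoding
    }
    return $body . "--{$boundary}--\r\n"; // closing delimiter
}

// Sample response array; the binary content is made up for illustration.
$response = [
    'name'      => 'name',
    'filename'  => 'filename',
    'extension' => 'doc',
    'content'   => "\x00\x01\xFE\xFF BINARY DATA",
];
$boundary = 'mainhoonboundary';
$body = buildMultipart($response, $boundary);

// In a real request handler the header goes out before the body.
if (PHP_SAPI !== 'cli') {
    header("Content-Type: multipart/form-data; boundary={$boundary}");
}
echo strlen($body), " bytes in body\n";
```

The content bytes go into the body untouched, which is the whole point: the only overhead is the delimiter and header lines around each field.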



At client side:
 
We read the content type from the response header. If it is multipart/form-data, we extract the boundary string from it and then parse the multipart data with a regex based on that boundary string.
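
The client-side steps can be sketched roughly as below. parseMultipart() is a hypothetical helper: it assumes standard delimiters (the boundary prefixed with two hyphens, plus a closing delimiter) and CRLF line endings, and it uses explode() for the split with small regexes only for the header values.

```php
<?php
// Hypothetical helper: split a multipart/form-data response into name => value.
function parseMultipart(string $contentType, string $body): array
{
    // Pull the boundary out of the Content-Type header value.
    if (!preg_match('/boundary="?([^";]+)"?/i', $contentType, $m)) {
        throw new InvalidArgumentException('no boundary in Content-Type');
    }
    $delimiter = '--' . $m[1];

    $fields = [];
    // The first chunk is the (empty) preamble and the last one is the "--"
    // left over from the closing delimiter, so both are skipped.
    $parts = explode($delimiter, $body);
    foreach (array_slice($parts, 1, -1) as $part) {
        // Headers and data are separated by the first blank line.
        [$headers, $data] = explode("\r\n\r\n", $part, 2);
        if (preg_match('/\bname="([^"]*)"/', $headers, $nm)) {
            // Drop only the CRLF that precedes the next delimiter.
            $fields[$nm[1]] = substr($data, 0, -2);
        }
    }
    return $fields;
}

// Sample response, made up for illustration; the content contains a blank
// line on purpose to show that the data itself is left untouched.
$contentType = 'multipart/form-data; boundary=mainhoonboundary';
$body = "--mainhoonboundary\r\n"
      . "Content-Disposition: form-data; name=\"extension\"\r\n\r\n"
      . "doc\r\n"
      . "--mainhoonboundary\r\n"
      . "Content-Disposition: form-data; name=\"content\"\r\n\r\n"
      . "\x00\xFF\r\n\r\nraw\r\n"
      . "--mainhoonboundary--\r\n";

$fields = parseMultipart($contentType, $body);
echo $fields['extension'], "\n";        // prints "doc"
echo strlen($fields['content']), "\n";  // prints 9
```

Nothing in this parsing is specific to PHP, so a client in any other language can do the same split.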

What we achieved:
  • Response data size reduced by approximately 20%
  • Network bandwidth consumption reduced by approximately 3.7 MBps
  • No need for any external encoding
  • No extra processing