Simple Storage Service (You’ve Come a Long Way, Baby)

It’s taken a few weeks of cooking, but my Simple Storage Service is ready to come out of the oven to be eaten (and possibly spat out) by the world at large. There are still a couple of things on my TODO list, but nothing massive: URL authentication of requests (needs some thought), postObject (I need to read the docs), virtual hosting of buckets (a lot of thought) and some tiny changes and bugs that I’ll fix over the next few days. So what has changed since my last post:

  • Anonymous requests can now be made where permission to do so has been set.
  • The AuthenticatedUsers/AllUsers groups and ACL gets and sets have been implemented.
  • All REST calls have been implemented (except postObject)*
  • Exception handling matches the S3 documentation (with some guesswork)
  • The REST layer was completely rewritten using test driven development
  • phpDocumentor comments are being added to the code, so docs can be generated
  • I’ve created a web form to help you create new users to the service
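To make the anonymous and group permissions above a bit more concrete, here is a minimal sketch of how such a check could work. It is written in Python purely for illustration and is not the project's actual code: the group names follow the S3 ACL model, but `is_allowed` and the shape of the `grants` dictionary are assumptions made for this example.

```python
# Group names follow the S3 ACL model; everything else is hypothetical.
ALL_USERS = "AllUsers"
AUTHENTICATED_USERS = "AuthenticatedUsers"


def is_allowed(grants, user, permission):
    """Return True if `user` holds `permission` under `grants`.

    `grants` maps a grantee (a user id or a group name) to a set of
    permissions such as {"READ"}. `user` is None for an anonymous request.
    """
    # Everyone, signed in or not, is a member of AllUsers.
    grantees = [ALL_USERS]
    if user is not None:
        # Any authenticated user also belongs to AuthenticatedUsers.
        grantees += [AUTHENTICATED_USERS, user]
    for grantee in grantees:
        granted = grants.get(grantee, set())
        if permission in granted or "FULL_CONTROL" in granted:
            return True
    return False


# A bucket that anyone may read, but only "alice" may write to.
public_read = {ALL_USERS: {"READ"}, "alice": {"FULL_CONTROL"}}
print(is_allowed(public_read, None, "READ"))      # anonymous read
print(is_allowed(public_read, None, "WRITE"))     # anonymous write
print(is_allowed(public_read, "alice", "WRITE"))  # write by a named grantee
```

The anonymous case is the interesting one: a request with no user only ever matches grants made to AllUsers, so it succeeds only where that permission has been set explicitly.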

So what’s next …? I guess I’ll polish what’s been completed so far and add some documentation to make it simpler to deploy. As I’m off snowboarding from Saturday, I’ll wait to see what sort of feedback I get before getting started on the SOAP section, which should be easier now that I’ve got a good testing setup. I’ll also be looking for a new job, as I’ll be leaving mine soon! I’ve also found that the most popular PHP client for S3 (from a Google search) is missing some useful functionality, so I’m pondering rewriting it and making several optimizations so it can stream downloads from S3, etc.

The best use for this software, apart from academic curiosity and mocking, is probably as a failover/backup service in case S3 goes down (which it has done). This would work best if you are using a CNAME record that maps to s3.amazonaws.com: as the record is under your DNS control, I believe it is fairly trivial to map it to another host.

Other than that I’ll write a blog post on how to set it up using XAMPP on Windows and MacPorts on a Mac (when my MacBook Pro arrives)…

You can check out the latest code from here: http://svn.magudia.com/s3server

UPDATE: SVN has been broken since I moved to Slicehost; you can download the code here: http://projects.magudia.com/s3server.zip

* As this service hasn’t been developed to meet Amazon’s data consistency model, I implemented getBucketLocation but it essentially does nothing. Although in theory I could use MySQL clustering to implement this, I’m not going to unless someone wants to pay me, and I also don’t have a global server network to play with 😉

Specifications

I recently read one of Joel’s blog posts on how difficult it still is to reverse engineer a Microsoft Office document, even though Microsoft have now released their specifications for the formats. Now the problems I’ve been facing are nowhere near the order of magnitude faced by a developer attempting to reverse engineer one of Microsoft’s Office documents, but as some of you may know I’ve been attempting (mostly with success – more tomorrow on that) to create a clone of the Amazon S3 service from their freely published documentation.

The problem is that it’s quite easy to replicate the ‘happy path’ of the specification as that’s been quite clearly documented, but when you try to recreate how and when different errors are thrown from just the documentation, things become a little bit more murky. Say the document states that the service throws different errors depending on whether the Content-MD5 or the Content-Length doesn’t match what the service actually received; how do you know which will get sent first, as it’s quite likely that if one condition fails then the other will also fail? The specification doesn’t answer this, but my view is that it’s probably best that it doesn’t, and these sorts of questions are best left to developer forums, as sometimes a specification can be so detailed that no-one ever reads it!
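Here is a small sketch of that ambiguity, in Python rather than PHP and not taken from the project. `BadDigest` and `IncompleteBody` are the error codes the S3 documentation lists for these two conditions; the function names and the idea of passing the checks in as an ordered list are assumptions made for the example. Which order the real service uses is exactly what the documentation doesn't say.

```python
import base64
import hashlib


class S3Error(Exception):
    """Carries an S3-style error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def check_length(headers, body):
    # The client promised more (or fewer) bytes than actually arrived.
    if int(headers["Content-Length"]) != len(body):
        raise S3Error("IncompleteBody")


def check_md5(headers, body):
    # Content-MD5 is the base64 encoding of the raw MD5 digest.
    actual = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    if headers["Content-MD5"] != actual:
        raise S3Error("BadDigest")


def first_error(headers, body, checks):
    """Run the checks in order and return the first error code, or None."""
    try:
        for check in checks:
            check(headers, body)
    except S3Error as error:
        return error.code
    return None


# The headers describe b"hello world", but the upload was cut short.
full = b"hello world"
headers = {
    "Content-Length": str(len(full)),
    "Content-MD5": base64.b64encode(hashlib.md5(full).digest()).decode("ascii"),
}
truncated = b"hello"

print(first_error(headers, truncated, [check_length, check_md5]))
print(first_error(headers, truncated, [check_md5, check_length]))
```

Both orderings are defensible readings of the documentation, yet a client sees a different error code for the same broken upload depending on which one the implementer picked.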

Then today I was thinking on my way to my parents’ house that maybe I was wrong to create the back end database layer first and I should have stuck with a contract first approach, but later on my way home I remembered the reason I didn’t: the Amazon S3 REST service doesn’t have a contract, it has documentation – which simply isn’t the same. The S3 SOAP service does have a contract of sorts – its WSDL, but even that doesn’t help you recreate/describe the ‘unhappy path’ of the underlying service. The only real way you can do this is to write tests against the real service and hope they (the people who own the service) don’t change it much and that your tests map out most of the potential paths which exist. Even better, if the specification came with a downloadable set of software tests (JUnit et al) then that would make building a client even easier … a baseline reference implementation of sorts.

Simply put, contract first development works well when you own the software behind the contract and the contract itself. I’m not fully convinced it works as well when you have neither and you’re trying to clone a service. I could write tests against S3, but that would mean signing up and possibly breaking the T&C’s, and this project wasn’t meant to threaten S3 or get me sued, but to understand it and the fundamental principles of well behaved web based services it bases itself on. I guess I’m someone who likes to take things apart to see how they work and that’s what I’ve done.

Also from my current experience it’s harder to develop a REST service than it is ‘in theory’ a SOAP service; BUT I think a REST service is easier for clients to consume than SOAP. That’s simply because SOAP has massive interoperability problems between toolkits, as the SOAP specification is itself ambiguous and is in small parts incompatible with several languages. REST has none of this because it is based on the great HTTP RFC 2616, which the entire web is based on (including the majority of SOAP based services).

I have no solutions, just more questions and that generally isn’t a bad thing!

Simple Storage Service – Very Alpha Release

So after reading about the unscheduled downtime of Amazon S3 yesterday I thought that I should probably release what I’ve done so far, although most of the work has been focused on the storage layer and writing many, many tests for it. So last night I spent a few hours hacking functionality into what will be the REST layer of the service, mostly taken from a PHP S3 client, to provide a very basic service that shows what I’ve been doing. The responses are mostly handcrafted, although I’m probably going to use the pecl http extension to handle most of this in the future.

This isn’t really up to what I’d call alpha ‘quality’ in any respect, but it’s just a sneak peek with many, many caveats, i.e.

  • Anonymous authentication doesn’t work at all (you need an authenticated user for all method calls)
  • Only putBucket, deleteBucket, putObject, getObject and deleteObject have been partially implemented, although most methods are implemented at the storage layer
  • Many, many things need to be refactored
  • Exception handling isn’t fully implemented yet
  • The REST layer has no tests and the SOAP layer hasn’t been started yet
  • You need the (PECL) PDO MySQL extension added to PHP (and probably some other PEAR libraries like Crypt/HMAC)
  • No documentation yet, but I’m willing to help with any questions
  • You need to be able to edit the httpd.conf for Apache to enable the PUT and DELETE HTTP verbs*
  • If you’re running PHP as CGI then you may need to modify my .htaccess (well maybe?)
  • You need to create your own user using createUser in the storage class (but I’ll add a script to Subversion to help with this)
  • Security hasn’t been tested and the code is not optimized in any way
  • Plus some other stuff that I may have forgotten because I’m tired
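Since every call currently needs an authenticated user, and the Crypt/HMAC library is there for the request signature, here is a rough sketch of the HMAC-SHA1 signing scheme the S3 REST documentation describes. It is Python for illustration only, not the project's PHP; the function names, the access key id and the secret key are all made up, and details like canonicalising the x-amz- headers are left to the caller.

```python
import base64
import hashlib
import hmac


def string_to_sign(verb, resource, date, content_md5="", content_type="",
                   amz_headers=""):
    """Build the string that gets signed.

    `amz_headers` is the already canonicalised x-amz-* block, where each
    header line ends with a newline (empty if there are none).
    """
    return "\n".join([verb, content_md5, content_type, date]) + "\n" \
        + amz_headers + resource


def sign(secret_key, to_sign):
    """Base64-encoded HMAC-SHA1 of the string to sign."""
    digest = hmac.new(secret_key.encode("utf-8"), to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(secret_key, to_sign, claimed_signature):
    """Server side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(secret_key, to_sign), claimed_signature)


# Hypothetical credentials and request.
to_sign = string_to_sign("GET", "/mybucket/photo.jpg",
                         "Tue, 27 Mar 2007 19:36:42 +0000")
signature = sign("not-a-real-secret", to_sign)
print("Authorization: AWS EXAMPLEKEYID:" + signature)
print(verify("not-a-real-secret", to_sign, signature))
```

The server never sees the secret key on the wire; it looks up the secret for the access key id in the Authorization header, rebuilds the same string from the request and checks that the signatures match.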

You may have got the impression that I’m not entirely satisfied with this code yet and you’d be right. I’m only releasing this as *some* people *may* find it interesting. And one final thing: I don’t have an Amazon S3 account. I’ve basically cobbled this together from the documentation (which can be inconsistent), because I read the T&C’s and I wasn’t sure if Amazon would sue me if I agreed to them, so I didn’t!

Also you’ll need to create a MySQL database; note that the database details are hardcoded into the src/s3/lib/storage.php file, and into test/AbstractTest.php for the unit tests.

So … blah, blah … it might not work … blah, blah … give me a break and I’ll help you ….. blah, blah …. I won’t be able to do any more work on this for one week before I start again … so here is the SVN URL ….

http://svn.magudia.com/s3server/

On the positive side of things, when I do get time next week to continue working on this project, the hardest parts have either been thought about or already been completed, so implementing the REST and SOAP layers shouldn’t take as long as implementing the storage layer did.

* You need to modify your httpd.conf to allow the PUT and DELETE HTTP verbs by including these commands in your htdocs <Directory> tag (Apache doesn’t allow the PUT or DELETE HTTP verbs by default for sensible security reasons):

Script PUT /workspace/s3server/src/index.php

Script DELETE /workspace/s3server/src/index.php

Here the path to index.php should match where you checked out the code (relative to your htdocs path).
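Once Apache hands PUT and DELETE requests to index.php, the script has to work out which operation was meant from the verb and the URL. A minimal sketch of that dispatch, assuming path-style URLs of the form /bucket/key, might look like the following. It is Python for illustration, not the project's PHP; the operation names are the five listed in the caveats, while `route` and its return value are made up for the example.

```python
# (verb, target) -> operation name; only the five operations that the
# alpha release partially implements are listed.
OPERATIONS = {
    ("PUT", "bucket"): "putBucket",
    ("DELETE", "bucket"): "deleteBucket",
    ("PUT", "object"): "putObject",
    ("GET", "object"): "getObject",
    ("DELETE", "object"): "deleteObject",
}


def route(method, path):
    """Map an HTTP verb and a path-style URL to (operation, bucket, key)."""
    # "/photos/2008/snow.jpg" -> bucket "photos", key "2008/snow.jpg"
    bucket, _, key = path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("no bucket in path: " + path)
    target = "object" if key else "bucket"
    try:
        operation = OPERATIONS[(method.upper(), target)]
    except KeyError:
        raise ValueError("unsupported: %s on a %s" % (method, target))
    return operation, bucket, key


print(route("PUT", "/photos"))
print(route("GET", "/photos/2008/snow.jpg"))
```

Everything after the first path segment is treated as the key, so keys may themselves contain slashes.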

Agile and PHP

So since my last post I’ve actually started to write my SimpleStorageService project, and as I’m an agile developer I decided to write it with the agile skills I’ve picked up over the last few years with Java, .NET, scrum-master training et al and check out how easy it actually is to ‘do agile’ with PHP.

So…. where should I begin….

Unit Testing (Test Driven Development):

Firstly, PHP has had unit testing for quite some time with PHPUnit, which, after using unit testing in Java and C#, was actually quite straightforward. Although there are other testing frameworks like SimpleTest, I decided to go with PHPUnit as it seems more comprehensive. I did find that SimpleTest has a better mocking implementation than PHPUnit, but for now I’m sticking with PHPUnit.

Also PHPUnit can integrate with Selenium and has a partial implementation of DbUnit, but that’s not complete yet – hopefully this will be complete by PHPUnit 4.

Continuous Integration:

Now I didn’t think PHP had anything like this, so when I was looking into testing and found the phpUnderControl project it literally knocked my coding socks off. It’s a PHP wrapper for CruiseControl, but with a cool interface and extra PHP goodies: project code metrics, a Java-like Checkstyle which defaults to the PEAR coding standard, and phpDoc generation, as well as the normal CruiseControl stuff.

I was so impressed by this project that at the time (early January) I set it up on my Mac mini, although I did have to use MacPorts to replace the crippled default build of PHP that is bundled with OS X (please fix this Apple!). I initially installed version 0.20 of phpUnderControl, but I’m currently upgrading my install to the recently released version 0.30 which has a neat JavaScript metrics view – which is nice.

Finally, phpUnderControl neatly integrates with PHPUnit, which is another reason why I’m using it, and the project is now hosted alongside PHPUnit, so I hope to see more integration between the projects in the future.

Integrated Development Environment:

Although this is by no means needed to practise Agile, a good IDE helps you write better code faster. I used to use DreamWeaver for all my PHP web development work, but as my SimpleStorageService is by definition a service project I didn’t need any HTML editing functionality. Anyway here was my IDE shortlist:

Ignoring DreamWeaver and TextPad as being out of date and inappropriate for the project, I began with Eclipse (with PDT), but I quickly found several problems with this, mainly SVN integration amongst other things. Then I gave Aptana a go, which was in beta at the time and did fix my SVN issues, but in the final version this was removed from the free edition (grrr!). So just when I thought that PHP didn’t have a good IDE, I literally stumbled upon Zend Studio Neon, which ticked nearly every box I wanted from an IDE for PHP … PHPUnit, phpDoc, SVN, code coverage, code formatting, real time error checking, IntelliSense-style code completion and much, much more. The downsides are a bug where it doesn’t understand the PDO class when unit-testing (well it is beta!) and that the final version isn’t free, so I’m using a time trial version which runs out in just over two weeks. Anyone want to buy me a copy 😉

Source Control:

It still surprises me how many people don’t use or even understand the point of source-control, but I’ve been a big user for many years. Firstly with CVS and then once Subversion (SVN) was more stable I moved to that and didn’t look back. I know there are many other choices here, but as SVN is integrated into phpUnderControl and Zend Studio it was simply a no brainer. My DreamHost account includes SVN so all my code can be committed ‘off site’ and I can create an abstraction between my IDE and continuous integration environment.

Conclusion:

The state of Agile in PHP is good and much, much better than it was even six months ago. I think once PHPUnit 4 is released, phpUnderControl reaches stability and Eclipse with PDT catches up with Zend Studio (adding unit testing and SVN projects), then Agile in PHP should be excellent and easy to accomplish. One thing I haven’t looked at is whether PHP has any good scrum management products (but I guess this doesn’t necessarily have to be in PHP).

Simple Storage Service for PHP

I’ve started writing a version of Amazon S3 for PHP … Why you ask? Because I think it’s a challenge and to be honest I haven’t got anything else better to do at the moment. I guess it could be used for mocking or testing without billing?

I started a week or so ago, mostly reading the documentation for the service and figuring out in my head how I was going to write it, but I’ve only really started coding in the last few days. So far I’ve created the database structure and pseudo-coded some REST and SOAP classes/functions… hopefully I’ll get it up on my Subversion server once it’s in a nearly usable state for other people to play with.

I do intend this to be an almost full implementation of the service, apart from maybe logging, definitely billing (obviously) and, depending on my host setting up a wildcard subdomain, bucket aliasing i.e. bucket.s3.magudia.com/key. Although someone could use some rewrite rules to fix this, as long as their host supports it!
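For what it's worth, the bucket aliasing itself is mostly a matter of reading the Host header. Here is a rough sketch of the idea in Python (for illustration only; nothing like this exists in the project yet). The `split_virtual_host` name is made up, and the base domain is just the example from above.

```python
def split_virtual_host(host, path, base_domain="s3.magudia.com"):
    """Return (bucket, key) for either addressing style.

    bucket.s3.magudia.com/key  -> the bucket comes from the Host header
    s3.magudia.com/bucket/key  -> the bucket is the first path segment
    """
    host = host.split(":", 1)[0].lower()  # drop any port number
    rest = path.lstrip("/")
    suffix = "." + base_domain
    if host.endswith(suffix):
        # Virtual-hosted style: everything before the base domain is the bucket.
        return host[: -len(suffix)], rest
    # Path style: fall back to splitting the bucket off the path.
    bucket, _, key = rest.partition("/")
    return bucket, key


print(split_virtual_host("bucket.s3.magudia.com", "/key"))
print(split_virtual_host("s3.magudia.com", "/bucket/key"))
```

A rewrite rule on a host with a wildcard subdomain would do the same job one layer earlier, turning the first form into the second before the PHP code ever sees the request.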