For such a large amount of data, I wouldn't store it all in a searchable datastore. I would pre-process it to extract metadata, store that metadata in a datastore, and keep the bulk data on a filesystem. That's Google's approach with GFS, and a common practice on AWS with S3 as the filesystem and SimpleDB as the metadata store. If you're not on Amazon's platform, I would consider Riak as the metadata store.

Don't use SimpleDB unless all of your processing happens inside AWS; otherwise network latency and bandwidth will kill you. And don't even try to store the data itself in SimpleDB, or at least read http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/SDBLimits.html?r=3544 first to understand why it will be a nightmare on the ops side.

For the filesystem, you can use a distributed filesystem such as GlusterFS, MooseFS, Ceph or MogileFS; you could use S3 as a service if you don't query it often; or you could simply use a pool of storage nodes and use the metadata to find out which node(s) hold which piece of data.
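
To make the split concrete, here is a minimal Python sketch of that pattern, under my own assumptions: the `STORAGE_NODES` paths, the `metadata_store` dict, and the `put_blob`/`get_blob` helpers are hypothetical stand-ins (the dict plays the role Riak or SimpleDB would, the local directories stand in for remote storage nodes or S3 prefixes). The point is just that only small, queryable attributes go into the metadata store, while the bulk data lands on whichever storage node the content hash maps to, and reads always consult the metadata first to find the data.

    import hashlib
    import json
    import os

    # Hypothetical pool of storage nodes, simulated with local directories.
    # In practice these would be remote servers or S3 buckets/prefixes.
    STORAGE_NODES = ["./node-a", "./node-b", "./node-c"]

    # Stand-in for the metadata store (Riak, SimpleDB, ...): key -> JSON document.
    metadata_store = {}


    def put_blob(key: str, data: bytes) -> None:
        """Write the blob to a node chosen by hashing its content,
        then record where it lives plus searchable attributes as metadata."""
        digest = hashlib.sha256(data).hexdigest()
        node = STORAGE_NODES[int(digest, 16) % len(STORAGE_NODES)]
        os.makedirs(node, exist_ok=True)
        with open(os.path.join(node, digest), "wb") as f:
            f.write(data)
        # Only small, queryable attributes go into the metadata store.
        metadata_store[key] = json.dumps(
            {"node": node, "sha256": digest, "size": len(data)}
        )


    def get_blob(key: str) -> bytes:
        """Look up the metadata first, then fetch the blob from the node it names."""
        meta = json.loads(metadata_store[key])
        with open(os.path.join(meta["node"], meta["sha256"]), "rb") as f:
            return f.read()


    if __name__ == "__main__":
        put_blob("report-2011-03", b"...large binary payload...")
        print(json.loads(metadata_store["report-2011-03"]))
        assert get_blob("report-2011-03") == b"...large binary payload..."

Swapping the dict for a real Riak or SimpleDB client, and the local directories for S3 or a distributed filesystem, doesn't change the shape of the code: the metadata record is what tells you where to look.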