# Reading Compressed Objects from S3 ## Summary Users would like the ability to ingest compressed objects from S3, where the objects are compressed using gzip. ## Goals Users can create an S3 source, where the S3 bucket contains objects compressed using the gzip algorithm, and Materialize can ingest those objects. ## Non-Goals We are not planning to unify how S3 object and file objects are handled. While we will seek to reuse code where possible, unifying S3 objects and file objects is beyond the scope for this feature. ## Description When the user creates a source, the user will be required to specify the compression algorithm manually via the `COMPRESSION` specification: CREATE SOURCE ... FROM S3 OBJECTS ... COMPRESSION {NONE | GZIP | AUTO} See [CREATE SOURCE ... FROM FILE](https://materialize.com/docs/sql/create-source/text-file/#syntax) for an example of how `COMPRESSION` is specified for files. ### Compression "NONE" Materialize will not try to decompress any objects downloaded from this S3 source. Files that cannot be parsed, either due to being compressed or for another reason, will result in errors in the error stream for the source. The `Content-Encoding` field will be ignored for functional purposes but verified. If the `Content-Encoding` is not `identity`, a debug message will be generated indicating the mismatch. ### Compression "GZIP" Materialize will use the gzip algorithm to decompress all objects downloaded from the S3 source. Any files that fail to decompress will result in errors in the error stream for the source. The `Content-Encoding` field will be ignored for functional purposes but verified. If the `Content-Encoding` is not `gzip`, a debug message will be generated indicating the mismatch. A mismatch on `Content-Encoding` cannot be a strict error, as there are many producers out there that will uploaded compressed data without setting the `Content-Encoding` header, or the `Content-Type` header, correctly. ### (Optional) Compression "AUTO" Materialize will use the `Content-Encoding` field from object metadata to determine how to handle the object. #### Content-Encoding "identity" The object will be treated as if it were specified with `COMPRESSION NONE`. #### Content-Encoding "gzip" The object will be treated as if it were specified with `COMPRESSION GZIP`. #### No Content-Encoding Specified Materialize will attempt to decompress the object using gzip, and if that fails, will attempt to treat the file as uncompressed. Files that cannot be parsed, even after attempting decompression, will result in errors in the error stream for the source. ## Additional Testing Testing will be done via testdrive. See #6048 for an example on how to test object compression with S3. ## Alternatives ### Using Content-Encoding The first implementation of this feature tried to use the `Content-Encoding` header to automatically determine which compression algorithm to use. However, it appears that few applications set the header correctly ([Confluent S3 Connector](https://github.com/confluentinc/kafka-connect-storage-cloud/blob/e2c032b7976e28bafbef594b761c905c8f46ee21/kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3OutputStream.java#L198) does not, for example). As such, we need to support adding flags to the source. ### Using Content-Type Detect which compression algorithm based on the `Content-Type` field from the object metadata. For example, `application/x-gzip` would be a gzip compressed object and `text/plain` would be an object that is not compressed. This solution is not ideal, as [Type](https://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1) is meant to specify the media of the underyling data (such as `text/csv` or `text/plain`) and not the compression algorithm applied. It should be noted that it is still possible that `Content-Type` can be a media type that needs to be decompressed (see `application/x-gzip` example above). ### Using an Object Manifest - #5502 Require that the customer provide an object (or set of objects?) that indicate how each object should handled. ### Using Object Name Suffix Detect the compression algorithm based on the filename suffix, such as `.gz`. ### Reading Compression Headers / Magic Bytes We could use automatic filetype detection, ala the `file` unix command, to determine how to decode the file. At least one [magic crate](https://docs.rs/magic/0.12.2/magic/#usage-example) exists. Magic works quite well for standard filetypes but would it work sufficiently for our purposes? ## Open questions 1. Do we try and add the `AUTO` compression mechanism or defer that until later?