Users would like the ability to ingest objects from S3 that have been compressed with gzip.
Users can create an S3 source whose bucket contains objects compressed with the gzip algorithm, and Materialize can ingest those objects.
We are not planning to unify how S3 objects and file objects are handled. While we will seek to reuse code where possible, unifying S3 objects and file objects is beyond the scope of this feature.
When the user creates a source, they will be required to specify the compression algorithm manually via the `COMPRESSION` specification:

```sql
CREATE SOURCE ... FROM S3 OBJECTS ... COMPRESSION {NONE | GZIP | AUTO}
```
See `CREATE SOURCE ... FROM FILE` for an example of how `COMPRESSION` is specified for files.
With `COMPRESSION NONE`, Materialize will not try to decompress any objects downloaded from this S3 source. Files that cannot be parsed, whether because they are compressed or for some other reason, will result in errors in the error stream for the source.

The `Content-Encoding` field will be ignored for functional purposes but verified: if the `Content-Encoding` is not `identity`, a debug message will be generated indicating the mismatch.
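For illustration, a minimal sketch of this verification step in Rust; the function name is hypothetical, the `log` crate stands in for whatever logging Materialize actually uses, and treating an absent header as `identity` is an assumption based on HTTP semantics:

```rust
/// Illustrative sketch, not Materialize's actual internals: verify the
/// Content-Encoding of an object ingested under COMPRESSION NONE.
fn verify_content_encoding_none(key: &str, content_encoding: Option<&str>) {
    match content_encoding {
        // Assumption: an absent Content-Encoding header means "identity".
        None | Some("identity") => {}
        // Log at debug level only; the mismatch must not fail ingestion.
        Some(other) => log::debug!(
            "object {} has Content-Encoding {}, but the source specifies COMPRESSION NONE",
            key,
            other
        ),
    }
}
```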
With `COMPRESSION GZIP`, Materialize will use the gzip algorithm to decompress all objects downloaded from the S3 source. Any files that fail to decompress will result in errors in the error stream for the source.

The `Content-Encoding` field will be ignored for functional purposes but verified: if the `Content-Encoding` is not `gzip`, a debug message will be generated indicating the mismatch. A mismatch on `Content-Encoding` cannot be a strict error, as there are many producers that will upload compressed data without setting the `Content-Encoding` header, or the `Content-Type` header, correctly.
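As a sketch of the decompression step itself, assuming the object has been buffered in memory and using the `flate2` crate (a common gzip implementation in Rust; whether Materialize uses it is an assumption), with a `Result` standing in for forwarding the failure to the source's error stream:

```rust
use std::io::Read;

use flate2::read::GzDecoder;

/// Illustrative sketch: decompress one gzip-compressed object. A failure
/// would be forwarded to the source's error stream rather than returned.
fn decompress_gzip(bytes: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoded = Vec::new();
    GzDecoder::new(bytes)
        .read_to_end(&mut decoded)
        .map_err(|e| format!("gzip decompression failed: {}", e))?;
    Ok(decoded)
}
```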
With `COMPRESSION AUTO`, Materialize will use the `Content-Encoding` field from the object metadata to determine how to handle each object:

- If the `Content-Encoding` is `identity`, the object will be treated as if it were specified with `COMPRESSION NONE`.
- If the `Content-Encoding` is `gzip`, the object will be treated as if it were specified with `COMPRESSION GZIP`.
- For any other value, Materialize will attempt to decompress the object using gzip and, if that fails, will treat the file as uncompressed. Files that cannot be parsed, even after attempting decompression, will result in errors in the error stream for the source.
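A minimal sketch of this dispatch, reusing the hypothetical `decompress_gzip` helper from the previous sketch; again, treating an absent header as `identity` is an assumption based on HTTP semantics:

```rust
/// Illustrative sketch of the COMPRESSION AUTO dispatch described above.
fn decode_auto(bytes: &[u8], content_encoding: Option<&str>) -> Result<Vec<u8>, String> {
    match content_encoding {
        // Absent or "identity": treat as COMPRESSION NONE.
        None | Some("identity") => Ok(bytes.to_vec()),
        // "gzip": treat as COMPRESSION GZIP.
        Some("gzip") => decompress_gzip(bytes),
        // Anything else: try gzip first, then fall back to the raw bytes.
        Some(_) => decompress_gzip(bytes).or_else(|_| Ok(bytes.to_vec())),
    }
}
```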
Testing will be done via testdrive. See #6048 for an example of how to test object compression with S3.
The first implementation of this feature tried to use the `Content-Encoding` header to automatically determine which compression algorithm to use. However, it appears that few applications set the header correctly (the Confluent S3 Connector does not, for example). As such, we need to support adding flags to the source.
Detect the compression algorithm based on the `Content-Type` field from the object metadata. For example, `application/x-gzip` would be a gzip-compressed object and `text/plain` would be an object that is not compressed.

This solution is not ideal, as `Content-Type` is meant to specify the media type of the underlying data (such as `text/csv` or `text/plain`) and not the compression algorithm applied. It should be noted that it is still possible for `Content-Type` to be a media type that implies decompression is needed (see the `application/x-gzip` example above).
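A sketch of what this detection could look like; the enum is hypothetical, and the mapping covers only the media types mentioned above, plus `application/gzip`, the IANA-registered counterpart of `application/x-gzip`:

```rust
/// Illustrative sketch of Content-Type-based detection.
enum Compression {
    None,
    Gzip,
    Unknown,
}

fn compression_from_content_type(content_type: &str) -> Compression {
    match content_type {
        // application/gzip is the IANA-registered type; application/x-gzip
        // is a widely used legacy form.
        "application/gzip" | "application/x-gzip" => Compression::Gzip,
        "text/plain" => Compression::None,
        _ => Compression::Unknown,
    }
}
```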
Require that the customer provide an object (or set of objects?) that indicates how each object should be handled.
Detect the compression algorithm based on the filename suffix, such as `.gz`.
We could use automatic filetype detection, à la the Unix `file` command, to determine how to decode the file. At least one magic crate exists.
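For gzip alone, the sniffing would not even need a crate, since gzip streams always begin with the magic bytes `0x1f 0x8b`; a general solution would need a table of such signatures, which is what the magic crates provide. A minimal sketch:

```rust
/// Illustrative sketch: detect gzip by its two magic bytes (0x1f, 0x8b).
fn looks_like_gzip(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0x1f, 0x8b])
}
```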
Magic works quite well for standard filetypes but would it work sufficiently for our purposes?
Should we implement the `AUTO` compression mechanism now, or defer it until later?