GregR
"What exactly is a data lake? I keep hearing it mentioned in some large, very slow project at my company but nobody seems to know what they are actually building."

A big database (Hadoop is probably the most common platform) that you put a lot of your data into, so it's all accessible in one place. The data also stays in its native format rather than being forced into a data model. I don't even know the number of other databases we ingest from into our data lake, but it's easily several dozen, maybe a lot more than that.
For instance, our data that came from Oracle, SQL Server, and other sources is now stored in the data lake as flat files. Essentially comma-delimited, though we use control characters instead of commas to avoid having to deal with escaping. The data is "schema on read," meaning you just have a flat file and you come up with the table structure you want the data to be treated as being in when you read it. That's as opposed to "schema on write," where you make a data model up front and reformat the data to fit it when you first write it into the lake. There are technologies (Hive, Spark, etc.) that let you apply whatever schema you want at query time, so the data behaves more like a relational database without having to be stored in that form.
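To make "schema on read" concrete, here's a minimal sketch in Python. The file contents, column names, and the \x01 delimiter choice are all made up for illustration; the point is that the file on disk is just delimited text, and the schema only exists in the code that reads it.

```python
import csv
import io

# Pretend this string is a flat file landed in the lake, delimited with
# the \x01 control character instead of commas (so no escaping needed).
raw = "1\x01Alice\x012023-01-15\n2\x01Bob\x012023-02-20\n"

# The "schema" lives in the reader, not in the file. A different consumer
# could read the same bytes with different column names or types.
schema = ["customer_id", "name", "signup_date"]

rows = [dict(zip(schema, rec))
        for rec in csv.reader(io.StringIO(raw), delimiter="\x01")]

print(rows[0]["name"])  # Alice
```

Hive and Spark do the same thing at a larger scale: a `CREATE EXTERNAL TABLE` or a DataFrame schema is just this mapping, declared once and applied every time the files are read.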
By not forcing the data into other structures you keep the maximum amount of usability for it. The tradeoff is that it can make integrating data from various sources tougher, so we do a little of both. The data is kept in native form, but for some of it we also build integration data models and transformations that make it easier to consume for users who don't need the original form. Data scientists often want the native data, though.
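The "little of both" approach above can be sketched like this. The source systems, field names, and mappings are hypothetical; the idea is that the native records are kept untouched while a small transformation derives a common "integration model" view for consumers who want one.

```python
# Native records, as landed from two different (hypothetical) source systems.
# These stay in the lake as-is for anyone who wants the original form.
oracle_rec = {"CUST_ID": "1", "CUST_NM": "Alice"}
sqlserver_rec = {"CustomerID": "2", "CustomerName": "Bob"}

def to_integration_model(rec, mapping):
    """Rename source-specific fields into a shared schema."""
    return {common: rec[native] for native, common in mapping.items()}

# One mapping per source system, maintained by the integration layer.
oracle_map = {"CUST_ID": "customer_id", "CUST_NM": "customer_name"}
sqlserver_map = {"CustomerID": "customer_id", "CustomerName": "customer_name"}

unified = [to_integration_model(oracle_rec, oracle_map),
           to_integration_model(sqlserver_rec, sqlserver_map)]
# Consumers who need joined-up data query `unified`; data scientists can
# still go back to oracle_rec / sqlserver_rec untouched.
```

Because the transformation is derived rather than destructive, nothing about the native data is lost if the integration model later needs to change.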