Fantasy Football - Footballguys Forums

Data Guys, Aggregate!

What exactly is a data lake?  I keep hearing it mentioned in some large, very slow project at my company but nobody seems to know what they are actually building.
A big data store (Hadoop probably being the most common platform) that you put a lot of your data into, so it's all accessible in one place.  The data is also kept in its native format rather than being reshaped into a data model.  I don't even know how many other databases we ingest from into our data lake, but it's easily several dozen, maybe a lot more than that.

For instance, our data that came from Oracle, SQL Server, and other sources is now stored in the Data Lake as flat files.  Essentially comma-delimited, though we use control characters instead of commas to avoid having to deal with escaping.  The data is "schema on read," meaning you just have a flat file and you decide what table structure the data should be treated as having when you read it.  That's as opposed to "schema on write," which means you build a data model up front and reformat the data to fit it when you first write it into the store.  There are technologies (Hive, Spark, etc.) that let you define the schema you want so the data behaves more like a relational database, without requiring it to actually be stored in that form.
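To make that concrete, here's a rough sketch in Python of what "schema on read" looks like. The column names, data, and delimiter choice are all made up for illustration; the point is that the file itself carries no schema, and the reader supplies one when it consumes the bytes:

```python
import io

# Hypothetical raw extract: a "flat file" using the ASCII unit separator
# (control character \x1f) as the delimiter instead of commas, so field
# values never need escaping. No schema is stored with the data itself.
RAW = ("1001\x1fSmith\x1f2023-04-01\n"
       "1002\x1fJones\x1f2023-04-02\n")

# Schema on read: the column names and types live in the reader, not the
# file. A different consumer could apply a different schema to the same bytes.
SCHEMA = [("customer_id", int), ("last_name", str), ("signup_date", str)]

def read_with_schema(f, schema, sep="\x1f"):
    """Apply a schema to a schemaless delimited file at read time."""
    for line in f:
        fields = line.rstrip("\n").split(sep)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(read_with_schema(io.StringIO(RAW), SCHEMA))
```

Tools like Hive do essentially this at scale: the `CREATE TABLE` statement is just a schema declaration layered over files that stay in their native form.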

By not forcing it into other structures you keep the maximum amount of usability for it.  But that can also make it tougher to integrate data from various sources, so we do a little of both.  The data lands in native form, but for some data we also have integration data models and transformations to make it easier to consume for users who don't need to start from the original form.  Data scientists often want the native data, though.
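An "integration transformation" might be as simple as mapping two source systems' differing conventions onto one consumer-friendly shape. This is a toy sketch with invented column names and status codes, not our actual models:

```python
# Hypothetical native rows: two source systems representing the same idea
# (an active customer) in different ways. The native extracts stay in the
# lake untouched; only the integrated view is reshaped.
ORACLE_ROW = {"CUST_ID": "1001", "STAT_CD": "A"}
SQLSERVER_ROW = {"CustomerId": "2001", "IsActive": "1"}

def integrate_oracle(row):
    # Map Oracle's conventions to the integrated model.
    return {"customer_id": int(row["CUST_ID"]),
            "active": row["STAT_CD"] == "A"}

def integrate_sqlserver(row):
    # Map SQL Server's conventions to the same integrated model.
    return {"customer_id": int(row["CustomerId"]),
            "active": row["IsActive"] == "1"}

integrated = [integrate_oracle(ORACLE_ROW), integrate_sqlserver(SQLSERVER_ROW)]
```

Analysts query the uniform `integrated` shape; data scientists who want the raw source conventions still have the native rows.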

 
@GregR

When you say you're storing DBs as a flat file... is each table from each DB a separate flat file?  Or are you storing everything in a single file?

The latter seems space prohibitive, so I'm guessing it's got to be the former?

 
Very helpful and clear explanation, thank you.  

 
Each table from a DB ends up as potentially multiple flat files within a folder structure.  Generally each folder is a different table (so if you create, say, a Hive table on top, you just tell Hive that everything in the folder is data for that table).  If we're doing a full refresh, we'd generally have just one flat file for the table.  But if you're doing incremental updates, you might end up with one file per refresh containing just the changes.
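Here's a small sketch of that folder-per-table layout and how a reader merges a full refresh with an incremental file. File names, keys, and the "later files win" rule are illustrative assumptions, not our exact setup:

```python
import glob
import os
import tempfile

# Folder-per-table layout: everything under customers/ is data for the
# "customers" table. File names here are hypothetical.
root = tempfile.mkdtemp()
table_dir = os.path.join(root, "customers")
os.makedirs(table_dir)

with open(os.path.join(table_dir, "part-000.txt"), "w") as f:
    f.write("1001\x1fSmith\n1002\x1fJones\n")   # full refresh
with open(os.path.join(table_dir, "part-001.txt"), "w") as f:
    f.write("1002\x1fJones-Smith\n")            # incremental: one changed row

# Reading the table = reading every file in the folder (as Hive would).
# For incremental loads, a later file's row replaces the earlier one
# for the same key.
table = {}
for path in sorted(glob.glob(os.path.join(table_dir, "*.txt"))):
    with open(path) as f:
        for line in f:
            key, name = line.rstrip("\n").split("\x1f")
            table[key] = name
```

After the merge, customer 1002 reflects the incremental change while 1001 keeps its full-refresh value.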

Unstructured data (Word documents, Excel, images, sound files, etc.) pretty much just gets copied into a folder structure mirroring where it came from.

A Data Lake can be very useful.  But it's also the kind of thing a CIO seizes on as a hot technology for big data and analytics, which so many companies want to get into.  It's easy to spend a lot of time building one without a clear vision for how to get the benefit out of it.  I think my company is probably only about the middle of the spectrum on getting value out of ours.  But I'm hopeful that some things we're doing to build data science experience within the company are steps in the right direction.

 
The other thing I could have highlighted there: a focus of these technologies is massive scale and parallel processing, being able to handle petabytes of data and process it efficiently.  Most enterprise relational databases weren't made for those volumes.
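The scaling trick is conceptually simple even if the engineering isn't: split the data into partitions, aggregate each partition independently, then combine the partial results. This toy Python version (made-up numbers, threads standing in for a cluster of machines) shows the split-apply-combine pattern that Hadoop and Spark apply across thousands of nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset; in a real lake this would be petabytes spread over many files.
data = list(range(1, 1001))

# Split into 4 partitions, one per "worker" (a cluster would use thousands).
partitions = [data[i::4] for i in range(4)]

def partial_sum(part):
    # Each worker aggregates only its own partition, independently.
    return sum(part)

# Apply in parallel, then combine the partial results into the final answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, partitions))
```

Because each partition is processed without looking at the others, adding more workers scales the aggregation almost linearly, which is exactly what traditional single-node relational databases struggle to do.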

 
So I've been working with what amounts to our data lake infrastructure and ingestion group.  It's mainly a keep-it-running and build-data-pipelines group.  I'm more of a data guy, though, and as one of the only people on the team with much domain knowledge, I'd kind of carved out a niche on projects where I was the technical interface with the business users, doing transformations and data integration, even though that's not our focus.  I'm going to try to move back to a more data-centric role, as I still often feel like a data fish in a pond of CS majors.  We're really growing our data analytics positions, so I'm tooling up for that.  Taking some Python courses at present via Coursera; I can get the first couple for free via work.

Anyone built up their statistics skills for data science, and have any recommendations there?
I am taking the Python For Everybody course there right now.  It's cool.  I'm also using the app SoloLearn.  I like that one a lot, but I know that not having to type every part of the code makes it a bit easier to get through.

 
It looks like Coursera has put a nice paywall on a lot of things, but you can take this course for free here: https://www.py4e.com/

Just started it this AM

 
