Fantasy Football - Footballguys Forums

Data Guys, Aggregate!

What exactly is a data lake?  I keep hearing it mentioned in some large, very slow project at my company but nobody seems to know what they are actually building.
A big data store (Hadoop probably being the most common platform) that you put a lot of your data into, so it's all accessible in one place.  The data is also kept in its native format rather than being reshaped into a data model.  I don't even know how many other databases we ingest from into our data lake, but it's easily several dozen, maybe a lot more than that.

For instance, our data that came from Oracle, SQL Server, and other sources is now stored in the Data Lake as flat files.  Essentially comma-delimited, though we use control characters instead of commas to avoid having to deal with escaping.  The data is "schema on read," meaning you just have a flat file and you decide what table structure the data should be treated as having when you read it.  That's as opposed to "schema on write," which means you build a data model up front and reformat the data to fit it when you first write it into the store.  There are technologies (Hive, Spark, etc.) that let you define the schema you want so the data behaves more like a relational database, without requiring it to actually be stored in that form.
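To make that concrete, here's a rough sketch in Python of what "schema on read" looks like. The column names, data, and delimiter choice are all made up for illustration; the point is that the file itself carries no schema, and the reader supplies one when it consumes the bytes:

```python
import io

# Hypothetical raw extract: a "flat file" using the ASCII unit separator
# (control character \x1f) as the delimiter instead of commas, so field
# values never need escaping. No schema is stored with the data itself.
RAW = ("1001\x1fSmith\x1f2023-04-01\n"
       "1002\x1fJones\x1f2023-04-02\n")

# Schema on read: the column names and types live in the reader, not the
# file. A different consumer could apply a different schema to the same bytes.
SCHEMA = [("customer_id", int), ("last_name", str), ("signup_date", str)]

def read_with_schema(f, schema, sep="\x1f"):
    """Apply a schema to a schemaless delimited file at read time."""
    for line in f:
        fields = line.rstrip("\n").split(sep)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(read_with_schema(io.StringIO(RAW), SCHEMA))
```

Tools like Hive do essentially this at scale: the `CREATE TABLE` statement is just a schema declaration layered over files that stay in their native form.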

By not forcing it into other structures you keep the maximum amount of usability for it.  But that can also make it tougher to integrate data from various sources, so we do a little of both.  The data lands in native form, but for some data we also have integration data models and transformations to make it easier to consume for users who don't need to start from the original form.  Data scientists often want the native data, though.
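An "integration transformation" might be as simple as mapping two source systems' differing conventions onto one consumer-friendly shape. This is a toy sketch with invented column names and status codes, not our actual models:

```python
# Hypothetical native rows: two source systems representing the same idea
# (an active customer) in different ways. The native extracts stay in the
# lake untouched; only the integrated view is reshaped.
ORACLE_ROW = {"CUST_ID": "1001", "STAT_CD": "A"}
SQLSERVER_ROW = {"CustomerId": "2001", "IsActive": "1"}

def integrate_oracle(row):
    # Map Oracle's conventions to the integrated model.
    return {"customer_id": int(row["CUST_ID"]),
            "active": row["STAT_CD"] == "A"}

def integrate_sqlserver(row):
    # Map SQL Server's conventions to the same integrated model.
    return {"customer_id": int(row["CustomerId"]),
            "active": row["IsActive"] == "1"}

integrated = [integrate_oracle(ORACLE_ROW), integrate_sqlserver(SQLSERVER_ROW)]
```

Analysts query the uniform `integrated` shape; data scientists who want the raw source conventions still have the native rows.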

 
@GregR

When you say you're storing DBs as a flat file... is each table from each DB a separate flat file?  Or are you storing everything in a single file?

The latter seems space prohibitive, so I'm guessing it's got to be the former?

 
Very helpful and clear explanation, thank you.  

 
Each table from a DB ends up as potentially multiple flat files within a folder structure.  Generally each folder is a different table (so if you create, say, a Hive table on top, you just tell Hive that everything in the folder is data for that table).  If we're doing a full refresh, we'd generally have just one flat file for the table.  But if you're doing incremental updates, you might end up with one file per refresh containing just the changes.
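Here's a small sketch of that folder-per-table layout and how a reader merges a full refresh with an incremental file. File names, keys, and the "later files win" rule are illustrative assumptions, not our exact setup:

```python
import glob
import os
import tempfile

# Folder-per-table layout: everything under customers/ is data for the
# "customers" table. File names here are hypothetical.
root = tempfile.mkdtemp()
table_dir = os.path.join(root, "customers")
os.makedirs(table_dir)

with open(os.path.join(table_dir, "part-000.txt"), "w") as f:
    f.write("1001\x1fSmith\n1002\x1fJones\n")   # full refresh
with open(os.path.join(table_dir, "part-001.txt"), "w") as f:
    f.write("1002\x1fJones-Smith\n")            # incremental: one changed row

# Reading the table = reading every file in the folder (as Hive would).
# For incremental loads, a later file's row replaces the earlier one
# for the same key.
table = {}
for path in sorted(glob.glob(os.path.join(table_dir, "*.txt"))):
    with open(path) as f:
        for line in f:
            key, name = line.rstrip("\n").split("\x1f")
            table[key] = name
```

After the merge, customer 1002 reflects the incremental change while 1001 keeps its full-refresh value.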

Unstructured data (Word documents, Excel, images, sound files, etc.) pretty much just gets copied into a folder structure mirroring where it came from.

A Data Lake can be very useful.  But it's also the kind of thing a CIO seizes on as a hot technology for big data and analytics, which so many companies want to get into.  It's easy to spend a lot of time building one without a clear vision for how to get the benefit out of it.  I think my company is probably only about the middle of the spectrum on getting value out of ours.  But I'm hopeful that some things we're doing to build data science experience within the company are steps in the right direction.

 
The other thing I could have highlighted there: a focus of these technologies is massive scale and parallel processing, being able to handle petabytes of data and process it efficiently.  Most enterprise relational databases weren't made for those volumes.
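The scaling trick is conceptually simple even if the engineering isn't: split the data into partitions, aggregate each partition independently, then combine the partial results. This toy Python version (made-up numbers, threads standing in for a cluster of machines) shows the split-apply-combine pattern that Hadoop and Spark apply across thousands of nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset; in a real lake this would be petabytes spread over many files.
data = list(range(1, 1001))

# Split into 4 partitions, one per "worker" (a cluster would use thousands).
partitions = [data[i::4] for i in range(4)]

def partial_sum(part):
    # Each worker aggregates only its own partition, independently.
    return sum(part)

# Apply in parallel, then combine the partial results into the final answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, partitions))
```

Because each partition is processed without looking at the others, adding more workers scales the aggregation almost linearly, which is exactly what traditional single-node relational databases struggle to do.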

 
So I've been working with what amounts to our data lake infrastructure and ingestion group.  It's mainly a keep-it-running and build-data-pipelines group.  I'm more of a data guy, though, and as one of the only people on the team with much domain knowledge, I'd kind of carved out a niche on projects where I was the technical interface with the business users, doing transformations and data integration, even though that's not our focus.  I'm going to try to move back to a more data-centric role, as I still often feel like a data fish in a pond of CS majors.  We're really growing our data analytics positions, so I'm tooling up for that.  Taking some Python courses at present via Coursera; I can get the first couple for free via work.

Anyone built up their statistics skills for data science, and have any recommendations there?
I am taking the Python For Everybody course there right now.  It's cool.  I'm also using the app SoloLearn.  I like that one a lot, but I know that not having to type every part of the code makes it a bit easier to get through.

 
It looks like Coursera has put a nice paywall on a lot of things, but you can take this course for free here: https://www.py4e.com/

Just started it this AM

 
