Data Guys, Aggregate!

We have a few in house, but I'm frankly a tad less familiar with them than others. I'd answer this with two questions:

1. How many attributes are you providing the end user for the analytics/reporting aspect? Hadoop will not solve performance issues for ad-hoc needs to any great degree because there is too much variability to consider in design.

2. Who's your end user? If the user is an analyst who's used to waiting a bit for an answer, what is the intent of leveraging Hadoop? If this is someone who wants push-button, I presume it would be solely for the analytics application?
1. Typically a lot - our usual approach in the past was large models where the users could select whatever dimensions they wanted in a tool like MicroStrategy - we have the opportunity to scale it back, but not always

2. Mix - what I've said as a strategy for now is to get all the data in Hadoop, stage data marts in Redshift for consumption by MicroStrategy (or Tableau), and leave direct access to Hadoop for the analysts/data science teams using Tableau and direct access tools.

 
We have some going directly out of Hadoop, though there is a range of use cases. For reports, KPIs, etc. that don't need to be as responsive, it isn't an issue. Spotfire, which is our main in-house analytics tool, can also apparently help with some caching for some of the cases.

Have another situation though where performance is more of an issue, wanting second or sub-second responsiveness, and staging the data somewhere was looking like a likely possibility.  However, we have some promising results from initial tests using Hawq instead of Hive with it. Still some checking to do on it though.
We have some folks evaluating Hawq - my biggest concerns are performance and the limitations we may encounter; based on some of my reading, it seems there are a few.

Curious why you guys went with Spotfire - cost?

 
So our company ended up going with Qlik over MicroStrategy and Tableau. I am actually excited about the possibilities from what I've been able to grasp so far. Any Qlik users out here?
We evaluated it against Tableau - we found Tableau a little easier to use and a little further down the road for enterprise use at the time (3 years ago).  I haven't kept up with Qlik's newer versions, but Gartner still gives Tableau the lead and our users have been happy, so we haven't really questioned the decision.  Qlik was a good tool though, so you should be happy.

 
Have any of you guys looked at DOMO?  We did a pretty exhaustive eval on them - they have an interesting product.  I loved some features (built-in collaboration) and hated others (ingesting all your data in their solution).

They frame themselves as business management software and not just an analytics tool.  Anybody test it out and/or using it?

 
Curious what the name of the department is that you guys are in.  I am trying to re-organize and re-orient some of our work and I think we should be under a different heading officially.  

 
So our company ended up going with Qlik over MicroStrategy and Tableau. I am actually excited about the possibilities from what I've been able to grasp so far. Any Qlik users out here?
We use them all... along with SAS, Pentaho, OBIEE, and we're now implementing a Hadoop data lake.

I support them all on the infrastructure side...

 
Have any of you guys looked at DOMO?  We did a pretty exhaustive eval on them - they have an interesting product.  I loved some features (built-in collaboration) and hated others (ingesting all your data in their solution).

They frame themselves as business management software and not just an analytics tool.  Anybody test it out and/or using it?
Currently looking to replace Qlik with DOMO... not sure where they are in the process.

 
We have some folks evaluating Hawq - my biggest concerns are performance and the limitations we may encounter; based on some of my reading, it seems there are a few.

Curious why you guys went with Spotfire - cost?
I'm not sure of the reasoning for it; I wasn't involved in this arena at the time. Will see if I can find out, though I'm not even sure where the decision was made.

 
Cheap mapping software? Just need a program that can overlay zip/postal codes on a map.
Tableau can do this IIRC.  Fairly sure there's a package in R that can do the same, but that's a big learning curve unless you already know how to use R.

 
Tableau can do this IIRC.  Fairly sure there's a package in R that can do the same, but that's a big learning curve unless you already know how to use R.
Tableau does it out of the box. Other software I've seen asks for Lat and Long.

 
any tips for a finance guy looking to transition to an analytics role?  I majored in math and econ, have a master's in finance, have worked ~4 years in finance.  very good in excel and solid in VBA, some light usage of SQL, have taken a couple SAS workshops but there just aren't opportunities in my current role to get practice with it.  

online courses?  any particular places to look for analytics roles in my area?

 
any tips for a finance guy looking to transition to an analytics role?  I majored in math and econ, have a master's in finance, have worked ~4 years in finance.  very good in excel and solid in VBA, some light usage of SQL, have taken a couple SAS workshops but there just aren't opportunities in my current role to get practice with it.  

online courses?  any particular places to look for analytics roles in my area?
bump for the morning crowd.

 
We have any ~serious R users in this joint?

I've been working in R most days for the last four months now.  To the point where I'm moving out of the "Hey! I know how to get answers!" phase and working to really understand data structures and how to use what's being called the Tidyverse.

More or less readr, tidyr, dplyr, purrr, magrittr, ggplot2, shiny, rvest and probably a couple other packages I'm forgetting.

I'd like to be able to stop dropping my R output in Excel and keep it all in R all the way through to the final product -- but that's a stretch right now.

Anyone else use/want to use these?  

 
We have any ~serious R users in this joint?

I've been working in R most days for the last four months now.  To the point where I'm moving out of the "Hey! I know how to get answers!" phase and working to really understand data structures and how to use what's being called the Tidyverse.

More or less readr, tidyr, dplyr, purrr, magrittr, ggplot2, shiny, rvest and probably a couple other packages I'm forgetting.

I'd like to be able to stop dropping my R output in Excel and keep it all in R all the way through to the final product -- but that's a stretch right now.

Anyone else use/want to use these?  
:kicksrock:

 
I moved to a more strategy role a few years ago, but I was a heavy SAS, Enterprise Miner, SQL, and Tableau user prior to that.  I still use Tableau and Base SAS at times.  Tableau is pretty handy for on-the-fly data visualization when building a presentation deck.

I see a lot more R users now as well.

 
any tips for a finance guy looking to transition to an analytics role?  I majored in math and econ, have a master's in finance, have worked ~4 years in finance.  very good in excel and solid in VBA, some light usage of SQL, have taken a couple SAS workshops but there just aren't opportunities in my current role to get practice with it.  

online courses?  any particular places to look for analytics roles in my area?
still in the final stages, but it's looking like I made this happen  :excited:

 
I've got a Hive view, 6 columns something like:

ID ... DATE ... Value A ... Value B ... Value C ... Value D

Where the combo of ID and Date are unique.

For each ID I would like to retrieve for each of A/B/C/D the most recent Date that each had a value (i.e. was not NULL), and what that value was.  So I'd end up with 9 columns... ID, Date A, Value A, Date B, Value B, etc.

I could do each of A/B/C/D in separate subqueries and join them back together.  But is there a more performant way to do it, that can return what I want in a single pass through the data instead of accessing the original view 4 times?

 
Finance guy that does the TPS reports. Sounds like we are going to have the opportunity to use Tableau in our area eventually. Most of our reports are from the GL and created in Excel using an OLAP cube and SSAS. Would also like to hone my tech skills. Any recommendations appreciated.

 
GregR said:
I've got a Hive view, 6 columns something like:

ID ... DATE ... Value A ... Value B ... Value C ... Value D

Where the combo of ID and Date are unique.

For each ID I would like to retrieve for each of A/B/C/D the most recent Date that each had a value (i.e. was not NULL), and what that value was.  So I'd end up with 9 columns... ID, Date A, Value A, Date B, Value B, etc.

I could do each of A/B/C/D in separate subqueries and join them back together.  But is there a more performant way to do it, that can return what I want in a single pass through the data instead of accessing the original view 4 times?
Is the need recurring or just once? If recurring, why not set up another view?

 
cap'n grunge said:
Finance guy that does the TPS reports. Sounds like we are going to have the opportunity to use Tableau in our area eventually. Most of our reports are from the GL and created in Excel using an OLAP cube and SSAS. Would also like to hone my tech skills. Any recommendations appreciated.
MicroStrategy now offers their Tableau-like local-install client, Desktop, for free. It was a direct shot at Tableau, but it also lets anyone play with a locally installed BI tool that'll give you an idea of what Tableau will be like.

 
GregR said:
I've got a Hive view, 6 columns something like:

ID ... DATE ... Value A ... Value B ... Value C ... Value D

Where the combo of ID and Date are unique.

For each ID I would like to retrieve for each of A/B/C/D the most recent Date that each had a value (i.e. was not NULL), and what that value was.  So I'd end up with 9 columns... ID, Date A, Value A, Date B, Value B, etc.

I could do each of A/B/C/D in separate subqueries and join them back together.  But is there a more performant way to do it, that can return what I want in a single pass through the data instead of accessing the original view 4 times?
Is the need recurring or just once? If recurring, why not set up another view?
It is going to end up being a view, yes.  I'm trying to write the query for the view.

Getting the query to work isn't the problem. I'm trying to see if there is a more performant way to do it than I would have in Oracle with a bunch of GROUP BYs to get the max dates.  Actually, I know that RANK() is more performant in Hive than GROUP BY for this.  But what I'm asking is whether, instead of reading my big data set 4 times (once to get the max date for each of A/B/C/D), there's a way to process all four in one pass through the data.

Or maybe it will already do that behind the scenes, I don't know.  I haven't looked at an execution plan yet; I was just kind of mapping out the query in my head so far.
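Roughly the shape I was picturing for a single-pass version - view and column names are made up (my_view, dt, value_a..value_d), and it assumes Hive will accept MAX() over a struct and compare it field-by-field with the date first, which I haven't actually verified:

    SELECT
      id,
      agg_a.dt AS date_a, agg_a.val AS value_a,
      agg_b.dt AS date_b, agg_b.val AS value_b,
      agg_c.dt AS date_c, agg_c.val AS value_c,
      agg_d.dt AS date_d, agg_d.val AS value_d
    FROM (
      SELECT
        id,
        -- keep only rows where the column is populated, then take the latest (date, value) pair
        MAX(IF(value_a IS NOT NULL, named_struct('dt', dt, 'val', value_a), NULL)) AS agg_a,
        MAX(IF(value_b IS NOT NULL, named_struct('dt', dt, 'val', value_b), NULL)) AS agg_b,
        MAX(IF(value_c IS NOT NULL, named_struct('dt', dt, 'val', value_c), NULL)) AS agg_c,
        MAX(IF(value_d IS NOT NULL, named_struct('dt', dt, 'val', value_d), NULL)) AS agg_d
      FROM my_view
      GROUP BY id
    ) t;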

 
cap'n grunge said:
Finance guy that does the TPS reports. Sounds like we are going to have the opportunity to use Tableau in our area eventually. Most of our reports are from the GL and created in Excel using an OLAP cube and SSAS. Would also like to hone my tech skills. Any recommendations appreciated.
Tableau has nice visualization, but it doesn't sound like you'll be designing its reports.  Much won't change for you.

 
Been on severance for a couple months now. I've been casually looking for jobs, but not aggressively.  I have another 4 months, but it's time to go at it full bore.

I have Tableau skills but it's been almost 2 years since I've used them daily. Anywhere I can brush up on them a bit to jog my memory?

Also any new tools out there that people are looking for their data guys to have?

 
It is going to end up being a view, yes.  I'm trying to write the query for the view.

Getting the query to work isn't the problem. I'm trying to see if there is a more performant way to do it than I would have in Oracle with a bunch of GROUP BYs to get the max dates.  Actually, I know that RANK() is more performant in Hive than GROUP BY for this.  But what I'm asking is whether, instead of reading my big data set 4 times (once to get the max date for each of A/B/C/D), there's a way to process all four in one pass through the data.

Or maybe it will already do that behind the scenes, I don't know.  I haven't looked at an execution plan yet; I was just kind of mapping out the query in my head so far.
Could you unpivot to ID, Value Type and Date; rank by descending Date grouped by ID and Value Type; then pivot back out the top-ranked rows?
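Something like this, maybe - view and column names are made up, and it assumes A/B/C/D share a type (or can all be cast to one) so they can sit in a single unpivoted column:

    WITH unpivoted AS (
      SELECT id, dt, metric, val
      FROM my_view
      -- stack() unpivots the four value columns into (metric, val) pairs
      LATERAL VIEW stack(4,
        'A', value_a,
        'B', value_b,
        'C', value_c,
        'D', value_d) s AS metric, val
      WHERE val IS NOT NULL
    ),
    ranked AS (
      SELECT id, dt, metric, val,
             RANK() OVER (PARTITION BY id, metric ORDER BY dt DESC) AS rnk
      FROM unpivoted
    )
    -- pivot the latest row per (id, metric) back out to one row per id
    SELECT id,
           MAX(IF(metric = 'A', dt, NULL))  AS date_a,
           MAX(IF(metric = 'A', val, NULL)) AS value_a,
           MAX(IF(metric = 'B', dt, NULL))  AS date_b,
           MAX(IF(metric = 'B', val, NULL)) AS value_b,
           MAX(IF(metric = 'C', dt, NULL))  AS date_c,
           MAX(IF(metric = 'C', val, NULL)) AS value_c,
           MAX(IF(metric = 'D', dt, NULL))  AS date_d,
           MAX(IF(metric = 'D', val, NULL)) AS value_d
    FROM ranked
    WHERE rnk = 1
    GROUP BY id;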

 
What training/ classes did you do? 
I took the intro and intermediate courses for R on edx.org.  I also took a Python course that MIT offered on the same website.  My knowledge of these programs is still on the beginner side.  

I would say the biggest factors in my securing this position were a) interviewing skills, b) I have a friend who works closely with the hiring manager who put in a strong recommendation for me, and c) I was able to sell the hiring manager on my general intelligence, ability to learn quickly, and passion for D&A

 
I took the intro and intermediate courses for R on edx.org.  I also took a Python course that MIT offered on the same website.  My knowledge of these programs is still on the beginner side.  

I would say the biggest factors in my securing this position were a) interviewing skills, b) I have a friend who works closely with the hiring manager who put in a strong recommendation for me, and c) I was able to sell the hiring manager on my general intelligence, ability to learn quickly, and passion for D&A
That's usually 95% of the hire. Then you just hope whatever they're asking is within your realm of learning quickly.  At least that's been my experience.  It usually is, as the hiring people like to make the job description overly complicated.

 
GregR said:
I've got a Hive view, 6 columns something like:

ID ... DATE ... Value A ... Value B ... Value C ... Value D

Where the combo of ID and Date are unique.

For each ID I would like to retrieve for each of A/B/C/D the most recent Date that each had a value (i.e. was not NULL), and what that value was.  So I'd end up with 9 columns... ID, Date A, Value A, Date B, Value B, etc.

I could do each of A/B/C/D in separate subqueries and join them back together.  But is there a more performant way to do it, that can return what I want in a single pass through the data instead of accessing the original view 4 times?
Is the need recurring or just once? If recurring, why not set up another view?
It is going to end up being a view, yes.  I'm trying to write the query for the view.

Getting the query to work isn't the problem. I'm trying to see if there is a more performant way to do it than I would have in Oracle with a bunch of GROUP BYs to get the max dates.  Actually, I know that RANK() is more performant in Hive than GROUP BY for this.  But what I'm asking is whether, instead of reading my big data set 4 times (once to get the max date for each of A/B/C/D), there's a way to process all four in one pass through the data.

Or maybe it will already do that behind the scenes, I don't know.  I haven't looked at an execution plan yet; I was just kind of mapping out the query in my head so far.

Have you tried using a materialized view or populating a GTT to pre-collect your data and then query that object at runtime for improved performance? 
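For what it's worth, a minimal sketch of the pre-collect idea in Hive terms - table/view names are hypothetical, and as far as I know Hive has no Oracle-style GTTs (materialized views only arrived in Hive 3), so a plain staging table rebuilt on a schedule is probably the nearest equivalent:

    -- A CTAS staging table means the big view is only read once per refresh;
    -- the 9-column query/view then runs against the staged copy.
    CREATE TABLE my_view_stage AS
    SELECT id, dt, value_a, value_b, value_c, value_d
    FROM my_view;

    -- On Hive 3.x, a materialized view would be the closer analogue:
    -- CREATE MATERIALIZED VIEW my_view_mv AS
    -- SELECT id, dt, value_a, value_b, value_c, value_d FROM my_view;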

 
Anyone looking for an analytics job in the NFL?

Current available jobs in Football Operations:

» Football Analytics Coordinator - The Houston Texans (Houston, TX)


Football Operations: Statistics

Football Analytics Coordinator - The Houston Texans (Houston, TX)
Reports to: Director of Football Information Systems

Education/Experience:

  • Bachelor’s or master’s degree in Data Science, Analytics or other analytical field preferred.
  • Two (2) years of analytical and technical experience.
  • High-level proficiency in statistical programming languages R and/or Python.
Skills Required:

  • Demonstrated strong mathematical and computational acumen.
  • Experience in writing SQL queries and reports utilizing SQL Server 2008 or later strongly preferred.
  • Working knowledge in data discovery and new data acquisition.
  • Working knowledge of scripting languages and of working with large data sets.
  • Demonstrated working knowledge of statistics and commonly used statistical and analytical tools, econometrics, data visualization and football analytics.
  • Interest in football and familiarity with football terminology.
  • Proficiency in use of all Microsoft Office software applications with high-level proficiency in Excel.
  • Strong organizational and time management skills with ability to prioritize and manage multiple tasks in a high-energy environment.
  • Effective verbal, presentation and written communication skills.
  • Strong interpersonal skills and the ability to create and maintain solid working relationships at all levels across the organization.
  • Possess excellent attention to detail and an ability to produce high-quality, accurate work within designated deadlines.
  • Ability to maintain confidential and/or proprietary information.
  • Ability and internal drive to demonstrate a winning attitude and a strong work ethic in the performance of all job responsibilities.
Basic Function:

  • Responsible for providing direct analytical support to Football Operations personnel.
  • Provide Football Operations users with easily digestible summaries of internal data warehouse information. 
  • Develop new methods/tools to answer business intelligence questions for sports science, football operations and coaching departments. 
Job Function (duties and responsibilities):

  • Conduct ad-hoc research and create analytical reports for sports science, football operations and coaches.
  • Import, analyze, verify, and draw useful conclusions from non-documented data.
  • Perform data ETL (extract, transform and load) and quality control work.
  • Create integrations with sports science third-party application program interface (API).
  • Research, recommend and implement business intelligence solutions with the goal of making data more user friendly for a variety of internal clients.
  • Formulate creative and insightful internal metrics to gauge a variety of football data points.
  • Perform various other tasks that may be assigned from time to time by Director of Football Information Systems and the General Manager and Executive Vice President, Football Operations.
  • Position requires routine face-to-face personal interaction with other Club personnel; therefore, job responsibilities must be physically performed in the Club offices and not in a telecommuting manner.
Travel Requirements:

Domestic U.S. travel associated with team road games and Training Camp practices as may be requested or required.



Note: When you apply for this job online, you will be required to answer the following questions:


1. Do you have a Bachelor's and/or Master's degree in data science, analytics, applied mathematics, or other analytical field?
2. Are you proficient with Microsoft SQL 2008 or later?
3. Are you proficient with statistical programming languages R and/or Python?

 
Any machine learning experts here? I have a project to take text descriptions of products/goods and classify them into a hierarchical coding structure. At the last minute, I was able to get an intern this summer with the skills to create a model. He did a great job; however, the results aren't good enough yet, and the biggest reason is likely that we didn't do the proper work to give him a good enough training set. We simply didn't have the time to do that before the internship started, but I wasn't going to turn down free help. At the most detailed level of the product structure, he's only getting a 48% match rate with the test data. At the highest level, he's up to 65%.

The data is survey response data, so we know it can be messy. We took some quick, easy steps to eliminate data we were confident was bad and then made some assumptions about what data is good to create the training set. We learned that a lot of bad data remained in the training set.

My goal right now is to get a sense of what's involved with creating an adequate training set so I can get a cost estimate and try to fund that task. Then we'll look to get someone else to pick up the intern's program/model with a better training set and hopefully see better results. My first thought is to take our current training set (230,000 rows) and have people go through, row by row, to either verify or correct the text/code combinations. We also need to figure out how to deal with the misspellings, abbreviations, and other nonsense that comes along with survey response data. We tried a spell correction program, but it didn't improve the model much. And I'm thinking maybe we shouldn't correct spelling, because misspellings are a reality of the data and common misspellings are, I assume, just as good predictors as properly spelled words (it's not like the model knows or cares about proper spelling; it only cares about correlation).

If anyone has knowledge in this area, I'd love to hear your thoughts and experiences. Or, if anyone knows of some good online training about creating training sets, that would be great. I can expand more, if needed, about the data; just ask questions so I know what's relevant to share rather than rambling on about irrelevant stuff.

 
not sure this is the right place for this, but what the hell.  

i am looking for a program (or code suggestion) that can re-calculate financial histories.  there could be hundreds of transactions with charges, penalties, interest and payments.  the penalties and interest were assessed at specific rates, and the penalty and interest amounts assessed depended on the running balance of the various components at the time.  problem is that the rates need to be changed retroactively, which is really annoying to deal with because all the payments have to be re-allocated in accordance with the adjusted rate.  so i am trying to figure out a more automated way than individually re-calculating all of the transactions.

any thoughts?  does quickbooks or something similar have a function where you can do that (especially if you can import)?

 
not sure this is the right place for this, but what the hell.  

i am looking for a program (or code suggestion) that can re-calculate financial histories.  there could be hundreds of transactions with charges, penalties, interest and payments.  the penalties and interest were assessed at specific rates, and the penalty and interest amounts assessed depended on the running balance of the various components at the time.  problem is that the rates need to be changed retroactively, which is really annoying to deal with because all the payments have to be re-allocated in accordance with the adjusted rate.  so i am trying to figure out a more automated way than individually re-calculating all of the transactions.

any thoughts?  does quickbooks or something similar have a function where you can do that (especially if you can import)?
Not 100% sure what you're looking for, but this sounds like something that would need to be coded.

 
So I've been working with what amounts to our data lake infrastructure and ingestion group.  It's mainly a keep-it-running and create-data-pipelines group.  I'm more of a data guy though, and as one of the only people on the team with much domain knowledge, I'd kind of carved out a niche working on projects where I was a technical interface with the business users, doing transformations and data integration, though it's not our focus. I am going to try to make a change back to a more data-centric role though, as I still often feel like a data fish in a pond of CS majors.  We are really growing our data analytics positions, so I'm tooling up for that.  Taking some Python courses at present via Coursera, the first couple of which I can get for free via work.

Anyone built up their statistics skills for data science, and have any recommendations there?

 
So I've been working with what amounts to our data lake infrastructure and ingestion group.  It's mainly a keep-it-running and create-data-pipelines group.  I'm more of a data guy though, and as one of the only people on the team with much domain knowledge, I'd kind of carved out a niche working on projects where I was a technical interface with the business users, doing transformations and data integration, though it's not our focus. I am going to try to make a change back to a more data-centric role though, as I still often feel like a data fish in a pond of CS majors.  We are really growing our data analytics positions, so I'm tooling up for that.  Taking some Python courses at present via Coursera, the first couple of which I can get for free via work.

Anyone built up their statistics skills for data science, and have any recommendations there?
What exactly is a data lake?  I keep hearing it mentioned in some large, very slow project at my company but nobody seems to know what they are actually building.

 
