Search the Community
Showing results for tags 'data science'.
-
How to become a data science “unicorn”This is the first article in a larger series on “Full Stack Data Science” (FSDS). Although there are distinct roles for different aspects of a machine learning (ML) project, there is often a need for someone who can manage and implement projects end-to-end. This is what we can call a full-stack data scientist. In this article, I will introduce FSDS and discuss its 4 Hats. Photo by Amanda Jones on UnsplashWhat is a Full Stack Data Scientist?When I first learned data science (5+ years ago), data engineering and ML engineering were not as widespread as they are today. Consequently, the role of a data scientist was often more broadly defined than what we may see these days. For example, data scientists may have written ETL scripts, set up databases, performed feature engineering, trained ML models, and deployed models into production. Google trends over time for data science, data engineering, and ML engineering—screenshot from Google trends.Although it is becoming more common to split these tasks across multiple roles (e.g., data engineers, data scientists, and ML engineers), many situations still call for contributors who are well-versed in all aspects of ML model development. I call these contributors full-stack data scientists. More specifically, I see a full-stack data scientist as someone who can manage and implement an ML solution end-to-end. This involves formulating business problems, designing ML solutions, sourcing and preparing data for development, training ML models, and deploying models so their value can be realized. Why do we need them?Given the rise of specialized roles for implementing ML projects, this notion of FSDS may seem outdated. At least, that was what I thought in my first corporate data science role. These days, however, the value of learning the full tech stack is becoming increasingly obvious to me. This all started last year when I interviewed top data science freelancers from Upwork. Almost everyone I spoke to fit the full stack data scientist definition given above. This wasn’t just out of fun and curiosity but from necessity. I Spent $675.92 Talking to Top Data Scientists on Upwork — Here’s what I learned A key takeaway from these interviews was data science skills (alone) are limited in their potential business impact. To generate real-world value (that a client will pay for), building solutions end-to-end is a must. But this isn’t restricted to freelancing. Here are a few other contexts where FSDS can be beneficial An SMB (small-medium business) with only 1 dedicated resource for AI/ML projectsA lone AI/ML contributor is embedded in a business teamFounder who wants to build an ML productIndividual contributor at a large enterprise who can explore projects outside established teamsIn other words, full-stack data scientists are generalists who can see the big picture and dive into specific aspects of a project as needed. This makes them a valuable resource for any business looking to generate value via AI and machine learning. 4 Hats of FSDSWhile FSDS requires several skills, the role can be broken down into four key hats: Project Manager, Data Engineer, Data Scientist, and ML Engineer. Of course, no one can be world-class in all hats (probably). But one can certainly be above average across the board (it just takes time). Here, I’ll break down each of these hats based on my experience as a data science consultant and interviews with 27 data/ML professionals. Hat 1: Project ManagerThe key role of a project manager (IMO) is to answer 3 questions: what, why, and how. In other words, what are we building? Why are we building it? How will we do it? While it might be easy to skip over this work (and start coding), failing to put on the PM hat properly risks spending a lot of time (and money) solving the wrong problem. Or solving the right problem in an unnecessarily complex and expensive way. The starting point for this is defining the business problem. In most contexts, the full-stack data scientist isn’t solving their problem, so this requires the ability to work with stakeholders to uncover the problem's root causes. I discussed some tips on this in a previous article. Once the problem is clearly defined, one can identify how AI can solve it. This sets the target from which to work backward to estimate project costs, timelines, and requirements. Key skillsCommunication and managing relationshipsDiagnose problems and design solutionsEstimating project timelines, costs, and requirementsHat 2: Data EngineerIn the context of FSDS, data engineering is concerned with making data readily available for model development or inference (or both). Since this is inherently product-focused, the DE hat may be more limited than a typical data engineering role. More specifically, this likely won’t require optimizing data architectures for several business use cases. Instead, the focus will be on building data pipelines. This involves designing and implementing ETL (or ELT) processes for specific use cases. ETL stands for extract, transform, and load. It involves extracting data from their raw sources, transforming it into a meaningful form (e.g., data cleaning, deduplication, exception handling, feature engineering), and loading it into a database (e.g., data modeling and database design). Another important area here is data monitoring. While the details of this will depend on the specific use case, the ultimate goal is to give ongoing visibility to data pipelines via alerting systems, dashboards, or the like. Key skillsPython, SQL, CLI (e.g. bash)Data pipelines, ETL/ELT (Airflow, Docker)A cloud platform (AWS, GCP, or Azure)Hat 3: Data ScientistI define a data scientist as someone who uses data to uncover regularities in the world that can be used to drive impact. In practice, this often boils down to training a machine learning model (because computers are much better than humans at finding regularities in data). For most projects, one must switch between this Hat and Hats 1 and 2. During model development, it is common to encounter insights that require revisiting the data preparation or project scoping. For example, one might discover that an exception was not properly handled for a particular field or that the extracted fields do not have the predictive power that was assumed at the project's outset. An essential part of model training is model validation. This consists of defining performance metrics that can be used to evaluate models. Bonus points if this metric can be directly translated into a business performance metric. With a performance metric, one can programmatically experiment with and evaluate several model configurations by adjusting, for example, train-test splits, hyperparameters, predictor choice, and ML approach. If no model training is required, one may still want to compare the performance of multiple pre-trained models. Key SkillsPython (pandas/polars, sklearn, TensorFlow/PyTorch)Exploratory Data Analysis (EDA)Model Development (feature engineering, experiment tracking, hyperparameter tuning)Hat 4: ML EngineerThe final hat involves taking the ML model and turning it into an ML solution—that is, integrating the model into business workflows so its value can be realized. A simple way to do this is to containerize the model and set up an API so external systems can make inference calls. For example, the API could be connected to an internal website that allows business users to run a calculation. Some use cases, however, may not be so simple and require more sophisticated solutions. This is where an orchestration tool can help define complex workflows. For example, if the model requires monthly updates as new data become available, the whole model development process, from ETL to training to deployment, may need to be automated. Another important area of consideration is model monitoring. Like data monitoring, this involves tracking model predictions and performance over time and making them visible through automated alerts or other means. While many of these processes can run on local machines, deploying these solutions using a cloud platform is common practice. Every ML engineer (MLE) I have interviewed uses at least 1 cloud platform and recommended cloud deployments as a core skill of MLEs. Key SkillsContainerize scripts (Docker), build APIs (FastAPI)Orchestration — connecting data and ML pipelines (AirFlow)A cloud platform (AWS, GCP, or Azure)Becoming the UnicornWhile a full-stack data scientist may seem like a technical unicorn, the point (IMO) isn’t to become a guru of all aspects of the tech stack. Rather, it is to learn enough to be dangerous. In other words, it’s not about mastering everything but being able to learn anything you need to get the job done. From this perspective, I surmise that most data scientists will become “full stack” given enough time. Toward this end, here are 3 principles I am using to accelerate my personal FSDS development. Have a reason to learn new skills — e.g. build end-to-end projectsJust learn enough to be dangerousKeep things as simple as possible — i.e. don’t overengineer solutionsWhat’s next?A full-stack data scientist can manage and implement an ML solution end-to-end. While this may seem like overkill for contexts where specialized roles exist for key stages of model development, this generalist skillset is still valuable in many situations. As part of my journey toward becoming a full-stack data scientist, future articles of this series will walk through each of the 4 FSDS Hats via the end-to-end implementation of a real-world ML project. In the spirit of learning, if you feel anything is missing here, I invite you to drop a comment (they are appreciated) ResourcesConnect: My website | Book a call Socials: YouTube | LinkedIn | Twitter Support: Buy me a coffee ️ The Data Entrepreneurs The 4 Hats of a Full-Stack Data Scientist was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
-
- full-stack
- full stack
-
(and 2 more)
Tagged with:
-
Want to learn math for data science? Check out these three courses to learn linear algebra, calculus, statistics, and more.View the full article
-
- maths
- data science
-
(and 2 more)
Tagged with:
-
Are you looking to make a career in data science? Start by learning SQL with these free courses.View the full article
-
- sql
- data science
-
(and 5 more)
Tagged with:
-
This article introduces six top-notch, free data science resources ideal for aspiring data analysts, data scientists, or anyone aiming to enhance their analytical skills.View the full article
-
- data science
- free
-
(and 2 more)
Tagged with:
-
go Data Science and the Go Programming Language
KDnuggets posted a topic in Development & Programming
Northwestern’s School of Professional Studies uses Go in Its Master of Science in Data Science Program.View the full article -
This week on KDnuggets: Go from learning what large language models are to building and deploying LLM apps in 7 steps • Check this list of free books for learning Python, statistics, linear algebra, machine learning and deep learning • And much, much more! View the full article
-
- books
- data science
-
(and 1 more)
Tagged with:
-
An overview of the most sought-after skills in 2023 based on the rise of generative AI.View the full article
-
- data science
- data scientists
-
(and 1 more)
Tagged with:
-
Here are 5 trends that startups should keep an eye on ... https://www.snowflake.com/blog/five-trends-changing-startup-ecosystem/
-
- trends
- ecosystems
- (and 6 more)
-
Data scientists and machine learning engineers are often looking for tools that could ease their work. Kubeflow and MLFlow are two of the most popular open-source tools in the machine learning operations (MLOps) space. They are often considered when kickstarting a new AI/ML initiative, so comparisons between them are not surprising. This blog covers a very controversial topic, answering a question that many people from the industry have: Kubeflow vs MLFlow: Which one is better? Both products have powerful capabilities but their initial goal was very different. Kubeflow was designed as a tool for AI at scale, and MLFlow for experiment tracking. In this article, you will learn about the two solutions, including the similarities, differences, benefits and how to choose between them. Kubeflow vs MLFlow: which one is right for you? Watch our webinar What is Kubeflow? Kubeflow is an open-source end-to-end MLOps platform started by Google a couple of years ago. It runs on any CNCF-compliant Kubernetes and enables professionals to develop and deploy machine learning models. Kubeflow is a suite of tools that automates machine learning workflows, in a portable, reproducible and scalable manner. Kubeflow gives a platform to perform MLOps practices, providing tooling to: spin up a notebook do data preparation build pipelines to automate the entire ML process perform AutoML and training on top of Kubernetes. serve machine learning models using Kserve Kubeflow added KServe to the default bundle, offering a wide range of serving frameworks, such as NVIDIA Triton Inference Server are available. Whether you use Tensorflow, PyTorch, or PaddlePaddle, Kubeflow enables you to identify the best suite of parameters for getting the best model performance. Kubeflow has an end-to-end approach to handling machine learning processes on Kubernetes. It provides capabilities that help big teams also work proficiently together, using concepts like namespace isolation. Charmed Kubeflow is Canonical’s official distribution. Charmed Kubeflow facilitates faster project delivery, enables reproducibility and uses the hardware at its fullest potential. With the ability to run on any cloud, the MLOps platform is compatible with both public clouds, such as AWS or Azure, as well as private clouds. Furthermore, it is compatible with legacy HPC clusters, as well as high-end AI-dedicated hardware, such as NVIDIA’s GPUs or DGX. Charmed Kubeflow benefits from a wide range of integrations with various tools such as Prometheus and Grafana, as part of Canonical Observability Stack, Spark or NVIDIA Triton. It is a modular solution that can decompose into different applications, such that professionals can run AI at scale or at the edge. What is MLFlow? MLFlow is an open-source platform, started by DataBricks a couple of years ago. It is used for managing machine learning workflows. It has various functions, such as experiment tracking. MLFlow can be integrated within any existing MLOps process, but it can also be used to build new ones. It provides standardised packaging, to be able to reuse the models in different environments. However, the most important part is the model registry component, which can be used with different ML tools. It provides guidance on how to use machine learning workloads, without being an opinionated tool that constrains users in any manner. Charmed MLFlow is Canonical’s distribution of MLFlow. At the moment, it is available in Beta. We welcome all data scientists, machine learning engineers or AI enthusiasts to try it out and share feedback. It is a chance to become an open source contributor while simplifying your work in the industry. Kubeflow vs MLFlow Both Kubeflow and MLFlow are open source solutions designed for the machine learning landscape. They received massive support from industry leaders, as well as a striving community whose contributions are making a difference in the development of the project. The main purpose of Kubeflow and MLFlow is to create a collaborative environment for data scientists and machine learning engineers, to develop and deploy machine learning models in a scalable, portable and reproducible manner. However, comparing Kubeflow and MLFlow is like comparing apples to oranges. From the very beginning, they were designed for different purposes. The projects evolved over time and now have overlapping features. But most importantly, they have different strengths. On one hand, Kubeflow is proficient when it comes to machine learning workflow automation, using pipelines, as well as model development. On the other hand, MLFlow is great for experiment tracking and model registry. Also, from a user perspective, MLFlow requires fewer resources and is easier to deploy and use by beginners, whereas Kubeflow is a heavier solution, ideal for scaling up machine learning projects. Overall, Kubeflow and MLFlow should not be compared on a one-to-one basis. Kubeflow allows users to use Kubernetes for machine learning in a proper way and MLFlow is an agnostic platform that can be used with anything, from VSCode to JupyterLab, from SageMake to Kubeflow. The best way, if the layer underneath is Kubernetes, is to integrate Kubeflow and MLFlow and use them together. Charmed Kubeflow and Charmed MLFlow, for instance, are integrated, providing the best of both worlds. The process of getting them together is easy and smooth since we already prepared a guide for you. Kubeflow vs MLFlow: which one is right for you? Follow our guide How to choose between Kubeflow and MLFlow? Choosing between Kubeflow and MLFlow is quite simple once you understand the role of each of them. MLFlow is recommended to track machine learning models and parameters, or when data scientists or machine learning engineers deploy models into different platforms. Kubeflow is ideal when you need a pipeline engine to automate some of your workflows. It is a production-grade tool, very good for enterprises looking to scale their AI initiatives and cover the entire machine learning lifecycle within one tool and validate its integrations. Watch our webinar Future of Kubeflow and MLFlow Kubeflow and MLFlow are two of the most exciting open-source projects in the ML world. While they have overlapping features, they are best suited for different purposes and they work well when integrated. Long term, they are very likely going to evolve, with Kubeflow and MLFlow working closely in the upstream community to offer a smooth experience to the end user. MLFlow is going to stay the tool of choice for beginners. With the transition to scaled-up AI initiatives, MLFlow is also going to improve, and we’re likely to see a better-defined journey between the tools. Will they compete with each other head-to-head eventually and fulfil the same needs? Only time will tell. Start your MLOps journey with Canonical Canonical has both Charmed Kubeflow and Charmed MLFlow as part of a growing MLOps ecosystem. It offers security patching, upgrades and updates of the stack, as well as a widely integrated set of tools that goes beyond machine learning, including observability capabilities or big data tools. Canonical MLOps stack, you can be tried for free, but we also have enterprise support and managed services. If you need consultancy services, check out our 4 lanes, available in the datasheet. Get in touch for more details Learn more about Canonical MLOps Ubuntu AI publicationA guide to MLOpsAI in retail: use case, benefits, toolsHow to secure MLOps tooling? View the full article
-
- mlflow
- kubernetes
-
(and 4 more)
Tagged with:
-
GitLab launched its next major iteration, GitLab 15, starting with its first release version, 15.0, which the company said pulls together new DevOps and data science capabilities into the platform. With GitLab 15, GitLab says it provides (or soon will provide) continuous security and compliance, enterprise Agile planning, visibility and observability, workflow automation and increased […] The post GitLab Gets an Overhaul appeared first on DevOps.com. View the full article
-
- gitlab
- data science
-
(and 4 more)
Tagged with:
-
As the machine learning market matures, new tools are evolving that better match data science and machine learning teams’ needs. Vendors, both open source and private, have been quick to introduce new products that better meet requirements making it easier and faster to develop models and enable collaboration. These new offerings range from cloud-based platforms […] View the full article
-
Forum Statistics
67.4k
Total Topics65.3k
Total Posts