DataBrew
Definition
AWS Glue DataBrew is a visual data preparation service for cleaning, normalizing, and preparing data for analysis without writing code.
Use Cases
- Amazon: Preparing and standardizing product and operational datasets for analytics by fixing inconsistent fields and formatting issues before downstream reporting. Teams can use AWS Glue DataBrew to profile datasets in Amazon S3, apply visual transformations (e.g., trimming whitespace, standardizing date formats, splitting columns), and write cleaned outputs back to S3 for querying with Amazon Athena or loading into Amazon Redshift. Benefit: faster time-to-analysis, with less manual spreadsheet work and more consistent data for dashboards and ad-hoc queries.
- Netflix: Cleaning and normalizing event/log-derived datasets used for internal analytics, such as enforcing consistent schemas and handling missing or malformed values. A common pattern is to land raw data in object storage, run a visual, interactive preparation step to iterate quickly on cleaning rules, and then publish curated datasets for analytics tools and batch pipelines. In AWS environments, DataBrew can fill the preparation step, writing curated outputs to S3 for query engines and ETL jobs. Benefit: better data quality and less analyst time spent on repetitive cleanup, which supports more reliable metrics and faster iteration.
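The transformations named in the use cases above (trimming whitespace, standardizing date formats, splitting columns) are applied visually in DataBrew, with no code involved. Purely to make the logic concrete, here is a minimal plain-Python sketch of the same three steps. It does not call DataBrew; the sample data, the column names, and the list of accepted date formats are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: stray whitespace, mixed date formats, a combined name column.
RAW_CSV = """order_id,customer,order_date
 1001 ,Ada Lovelace,2024-01-05
1002,  Alan Turing ,05/02/2024
1003,Grace Hopper,2024/03/07
"""

# Assumed input date formats, tried in order; extend this for your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def standardize_date(value: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), or '' if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return ""


def clean_rows(raw_csv: str) -> list:
    """Apply the three cleaning steps to every row of a CSV string."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Step 1: trim whitespace on every header and field.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Step 2: split the combined name column into two columns.
        first, _, last = row.pop("customer").partition(" ")
        row["first_name"], row["last_name"] = first, last
        # Step 3: standardize the date format.
        row["order_date"] = standardize_date(row["order_date"])
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    for cleaned_row in clean_rows(RAW_CSV):
        print(cleaned_row)
```

In DataBrew, each of these steps would instead be a saved transformation that can be re-run against new data in S3. Unparseable dates are mapped to an empty string here so they stay visible as missing values rather than being silently dropped.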
Provider Equivalents
- AWS: AWS Glue DataBrew
- Azure: Power Query (Azure Data Factory data wrangling, Microsoft Fabric Dataflows) / Microsoft Fabric Data Wrangler (notebook-based)
- GCP: Cloud Dataprep by Trifacta (legacy) / BigQuery data preparation (where available)
- OCI: OCI Data Integration (data preparation via mappings/transformations)
Frequently Asked Questions
- What's the difference between AWS Glue DataBrew and AWS Glue (ETL)?
- DataBrew is a visual, no-code tool for exploring, profiling, and cleaning data using point-and-click transformations. AWS Glue (ETL) is a broader service for building and running scalable ETL jobs (typically code-based, using Spark or Python scripts) and managing a data catalog. Use DataBrew for interactive data prep and quick cleaning; use Glue ETL for production pipelines, complex transformations, and large-scale scheduled processing.
- When should I use AWS Glue DataBrew?
- Use DataBrew when you need to quickly understand a dataset (profiling), clean messy files (duplicates, inconsistent formats, nulls), and create a repeatable set of transformations (a recipe) without writing code. It’s especially useful for analysts and data engineers who want to prototype cleaning steps interactively and then run them as scheduled jobs on data stored in Amazon S3.
- How much does AWS Glue DataBrew cost?
- Pricing is usage-based. Jobs are billed for the node-hours they consume (number of nodes multiplied by run time), and interactive sessions are billed per session. Your total cost also depends on related AWS resources you use (for example, Amazon S3 storage, AWS Glue Data Catalog, and any downstream services like Athena or Redshift). For exact rates and regional differences, check the AWS Glue DataBrew pricing page and estimate based on job duration, frequency, and dataset size.
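The estimate suggested in the pricing answer above can be sketched as simple arithmetic. This is a rough sketch that assumes jobs are billed per node-hour and interactive sessions per session; the rates in the example are placeholders, not published prices, and the class and function names are hypothetical. Related costs such as S3 storage or Athena queries are not included.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """Hypothetical usage profile for one month."""
    job_runs: int               # number of job runs in the month
    nodes_per_run: int          # nodes allocated to each run
    minutes_per_run: float      # average run duration in minutes
    interactive_sessions: int   # billable interactive sessions


def estimate_monthly_cost(usage: MonthlyUsage,
                          node_hour_rate: float,
                          session_rate: float) -> float:
    """Estimate DataBrew-only cost: job node-hours plus interactive sessions."""
    node_hours = usage.job_runs * usage.nodes_per_run * usage.minutes_per_run / 60
    return node_hours * node_hour_rate + usage.interactive_sessions * session_rate


if __name__ == "__main__":
    usage = MonthlyUsage(job_runs=30, nodes_per_run=5,
                         minutes_per_run=12, interactive_sessions=20)
    # Placeholder rates; replace with the values for your region from the pricing page.
    print(estimate_monthly_cost(usage, node_hour_rate=0.50, session_rate=1.00))
```

With these placeholder inputs, the 30 runs consume 30 node-hours, so the job portion is 15.00 and the session portion is 20.00. Keeping the rates as parameters makes it easy to rerun the estimate when job frequency or regional pricing changes.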
Category: ai-ml
Difficulty: intermediate
Related Terms
See Also