Showing results for tags 'logging'.

  1. You can use Amazon Data Firehose to aggregate and deliver log events from your applications and services captured in Amazon CloudWatch Logs to your Amazon Simple Storage Service (Amazon S3) bucket and Splunk destinations, for use cases such as data analytics, security analysis, and application troubleshooting. By default, CloudWatch Logs are delivered as gzip-compressed objects. You might want the data to be decompressed, or want logs to be delivered to Splunk, which requires decompressed data input, for application monitoring and auditing.

AWS released a feature to support decompression of CloudWatch Logs in Firehose. With this feature, you can specify an option in Firehose to decompress CloudWatch Logs. You no longer have to perform additional processing with AWS Lambda or post-processing to get decompressed logs, and you can deliver decompressed data to Splunk. Additionally, you can use optional Firehose features such as record format conversion to convert CloudWatch Logs to Parquet or ORC, and dynamic partitioning to automatically group streaming records based on keys in the data (for example, by month) and deliver the grouped records to corresponding Amazon S3 prefixes.

In this post, we look at how to enable the decompression feature for Splunk and Amazon S3 destinations. We start with Splunk and then Amazon S3 for new streams, then we cover the migration steps to take advantage of this feature and simplify your existing pipeline.

Decompress CloudWatch Logs for Splunk

You can use a subscription filter in CloudWatch log groups to ingest data directly into Firehose or through Amazon Kinesis Data Streams. Note: For the CloudWatch Logs decompression feature, you need an HTTP Event Collector (HEC) data input created in Splunk, with indexer acknowledgement enabled and a source type configured. This is required to map the decompressed logs to the right source type. When creating the HEC input, include the source type mapping (for example, aws:cloudtrail).

To create a Firehose delivery stream with the decompression feature, complete the following steps (a boto3 sketch of the equivalent configuration appears at the end of this section):

- Provide your destination settings and select Raw endpoint as the endpoint type. You can use a raw endpoint with the decompression feature to ingest both raw and JSON-formatted event data to Splunk. For example, VPC Flow Logs data is raw data, and AWS CloudTrail data is in JSON format.
- Enter the HEC token for Authentication token.
- To enable the decompression feature, deselect Transform source records with AWS Lambda under Transform records.
- Select Turn on decompression and Turn on message extraction under Decompress source records from Amazon CloudWatch Logs.

Message extraction feature

After decompression, CloudWatch Logs are in JSON format, as shown in the following figure. The decompressed data has metadata such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (the following example shows CloudTrail events in the CloudWatch Logs). When you enable message extraction, Firehose extracts just the contents of the message fields and concatenates them with a newline between them, as shown in the following figure. With the CloudWatch Logs metadata filtered out by this feature, Splunk can parse the actual log data and map it to the source type configured in the HEC token.
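
For readers who prefer the API to the console, the steps in the Splunk section above correspond roughly to a single CreateDeliveryStream call. The following boto3 sketch is illustrative only: the endpoint, token, role, and bucket values are placeholders, and the processor type and parameter names mirror the console options described in the post, so verify them against the current Firehose API reference before using.

```python
# Hypothetical sketch: create a Firehose stream that delivers decompressed
# CloudWatch Logs to Splunk. The HEC endpoint, token, role, and bucket ARNs are
# placeholders; the processor types/parameters reflect the console options
# described above and may differ from the current API.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="cwlogs-to-splunk-decompressed",
    DeliveryStreamType="DirectPut",  # fed by a CloudWatch Logs subscription filter
    SplunkDestinationConfiguration={
        "HECEndpoint": "https://splunk.example.com:8088",  # placeholder
        "HECEndpointType": "Raw",      # raw endpoint, as recommended above
        "HECToken": "00000000-0000-0000-0000-000000000000",  # placeholder HEC token
        "ProcessingConfiguration": {
            "Enabled": True,           # no Lambda processor needed
            "Processors": [
                {   # decompress gzip-compressed CloudWatch Logs records
                    "Type": "Decompression",
                    "Parameters": [
                        {"ParameterName": "CompressionFormat", "ParameterValue": "GZIP"}
                    ],
                },
                {   # message extraction: keep only the logEvents message bodies
                    "Type": "CloudWatchLogProcessing",
                    "Parameters": [
                        {"ParameterName": "DataMessageExtraction", "ParameterValue": "true"}
                    ],
                },
            ],
        },
        "S3BackupMode": "FailedEventsOnly",
        "S3Configuration": {           # backup bucket for failed deliveries
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
    },
)
```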
Additionally, if you want to deliver these CloudWatch events to your Splunk destination in real time, you can use zero buffering, a feature recently launched in Firehose. With it, you can set the buffer interval to 0 seconds, or to any value between 0–60 seconds, to deliver data to the Splunk destination within seconds. With these settings, you can seamlessly ingest decompressed CloudWatch log data into Splunk using Firehose.

Decompress CloudWatch Logs for Amazon S3

The CloudWatch Logs decompression feature for an Amazon S3 destination works similarly to Splunk: you turn off data transformation using Lambda and turn on the decompression and message extraction options. You can use the decompression feature to write the log data as a text file to the Amazon S3 destination, or combine it with other Amazon S3 destination features such as record format conversion to Parquet or ORC, or dynamic partitioning to partition the data.

Dynamic partitioning with decompression

For the Amazon S3 destination, Firehose supports dynamic partitioning, which enables you to continuously partition streaming data by using keys within the data, and then deliver the data grouped by these keys into corresponding Amazon S3 prefixes. This enables you to run high-performance, cost-efficient analytics on streaming data in Amazon S3 using services such as Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and Amazon QuickSight. Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces the cost of your analytics queries on Amazon S3.

With the new decompression feature, you can perform dynamic partitioning without any Lambda function for mapping the partitioning keys on CloudWatch Logs. You can enable the Inline parsing for JSON option, scan the decompressed log data, and select the partitioning keys; a configuration sketch appears below, after the Pricing section. The following screenshot shows an example where inline parsing is enabled for CloudTrail log data with a partitioning schema selected for account ID and AWS Region in the CloudTrail record.

Record format conversion with decompression

For CloudWatch Logs data, you can use the record format conversion feature on decompressed data for the Amazon S3 destination. Firehose can convert the input data format from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. You can use the record format conversion settings under Transform and convert records to convert the CloudWatch log data to Parquet or ORC format. The following screenshot shows an example of record format conversion settings for Parquet format using an AWS Glue schema and table for CloudTrail log data. When dynamic partitioning is configured, record format conversion works alongside it to create the files in the output format with a partition folder structure in the target S3 bucket.

Migrate existing delivery streams for decompression

If you want to migrate an existing Firehose stream that uses Lambda for decompression to this new decompression feature of Firehose, refer to the steps outlined in Enabling and disabling decompression.

Pricing

The Firehose decompression feature decompresses the data and charges per GB of decompressed data. To understand decompression pricing, refer to Amazon Data Firehose pricing.
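
The dynamic partitioning setup described above (inline JSON parsing with account ID and Region as partition keys) can likewise be expressed as an API call. This is a hedged sketch: the ARNs are placeholders, the JQ expression assumes CloudTrail's recipientAccountId and awsRegion fields, and the processor and parameter names should be checked against the current Firehose documentation.

```python
# Hypothetical sketch: a Firehose stream that decompresses CloudWatch-delivered
# CloudTrail logs and dynamically partitions them by account ID and Region
# before writing to S3. ARNs and parameter names are illustrative assumptions.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="cwlogs-to-s3-partitioned",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-cloudtrail-logs-bucket",
        "DynamicPartitioningConfiguration": {"Enabled": True},
        # the partition keys extracted below become part of the S3 prefix
        "Prefix": "cloudtrail/account=!{partitionKeyFromQuery:accountId}/"
                  "region=!{partitionKeyFromQuery:region}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {"Type": "Decompression",
                 "Parameters": [{"ParameterName": "CompressionFormat", "ParameterValue": "GZIP"}]},
                {"Type": "CloudWatchLogProcessing",
                 "Parameters": [{"ParameterName": "DataMessageExtraction", "ParameterValue": "true"}]},
                {   # inline JSON parsing: pull partition keys out of each CloudTrail record
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": "{accountId:.recipientAccountId,region:.awsRegion}"},
                        {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
                    ],
                },
            ],
        },
    },
)
```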
Clean up

To avoid incurring future charges, delete the resources you created, in the following order:

- Delete the CloudWatch Logs subscription filter.
- Delete the Firehose delivery stream.
- Delete the S3 buckets.

Conclusion

The decompression and message extraction feature of Firehose simplifies delivery of CloudWatch Logs to Amazon S3 and Splunk destinations without requiring any code development or additional processing. For an Amazon S3 destination, you can use Parquet or ORC conversion and dynamic partitioning capabilities on decompressed data. For more information, refer to the following resources:

- Record Transformation and Format Conversion
- Enabling and disabling decompression
- Message extraction after decompression of CloudWatch Logs

About the Authors

Ranjit Kalidasan is a Senior Solutions Architect with Amazon Web Services based in Boston, Massachusetts. He is a Partner Solutions Architect helping security ISV partners co-build and co-market solutions with AWS. He brings over 25 years of experience in information technology helping global customers implement complex solutions for security and analytics. You can connect with Ranjit on LinkedIn.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

View the full article
  2. Amazon WorkMail now supports Audit Logging, which allows you to get insights into mailbox access patterns. Using Audit Logging, you can choose to deliver authentication, access control, and mailbox access logs to Amazon CloudWatch Logs, Amazon S3, and Amazon Data Firehose. You will also receive new mailbox metrics in CloudWatch for your WorkMail organizations. View the full article
  3. Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view them. In this post, we present the design behind Logarithm and show how it powers AI training debugging use cases.

Logarithm indexes 100+ GB/s of logs in real time and serves thousands of queries a second. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Users can emit logs using their choice of logging library (the common library at Meta is the Google Logging Library, glog). Users can query using regular expressions on log lines, arbitrary metadata fields attached to logs, and across log files of hosts and services. Logarithm is written in C++20 and the codebase follows modern C++ patterns, including coroutines and async execution. This has supported both performance and maintainability, and helped the team move fast, developing Logarithm in just three years.

Logarithm's data model

Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured text, corresponding to a single log file. A process can emit multiple log streams (stdout, stderr, and custom log files). Each log line can have zero or more metadata key-value pairs attached to it. A common example of metadata is rank ID in machine learning (ML) training, when multiple sequences of log lines are multiplexed into a single log stream (e.g., in PyTorch). Logarithm supports typed structures in two ways: via typed APIs (ints, floats, and strings), and via extraction from a log line using regex-based parse-and-extract rules; a common example is metrics of tensors in ML model logging. The extracted key-value pairs are added to the log line's metadata.

Figure 1: Logarithm data model. The boxes on text represent typed structures.

AI training debugging with Logarithm

Before looking at Logarithm's internals, we present its support for debugging training systems and model issues, one of the prominent use cases of Logarithm at Meta. ML model training workflows tend to have a wide range of failure modes, spanning data inputs, model code and hyperparameters, and systems components (e.g., PyTorch, data readers, checkpointing, framework code, and hardware). Further, failure root causes evolve over time faster than in traditional service architectures due to rapidly evolving workloads, from scale to architectures to sharding and optimizations. In order to triage such a dynamic set of failures, it is necessary to collect detailed telemetry on the systems and the model.

Since training jobs run for extended periods of time, training systems and model telemetry and state need to be continuously captured in order to be able to debug a failure without reproducing it with additional logging (which may not be deterministic and wastes GPU resources). Given the scale of training jobs, systems and model telemetry tend to be detailed and very high-throughput: logs are relatively cheap to write (compared to, for example, metrics, relational tables, and traces) and have the information content to power debugging use cases. We stream, index, and query high-throughput logs from the systems and model layers using Logarithm.
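
The data model above is internal C++ code at Meta, so there is nothing public to copy; the following Python sketch is purely illustrative of the concepts described (a log stream of time-ordered immutable lines, per-line metadata, and a regex parse-and-extract rule). Every name in it is hypothetical.

```python
# Illustrative-only sketch of the data model described above; Logarithm itself
# is internal C++ and all names here are hypothetical.
from __future__ import annotations

import re
from dataclasses import dataclass, field

@dataclass
class LogLine:
    timestamp_ns: int                                  # host-local ordering key
    text: str                                          # immutable unstructured log text
    metadata: dict[str, str | int | float] = field(default_factory=dict)

@dataclass
class LogStream:
    name: str                                          # e.g. "trainer.stderr"
    lines: list[LogLine] = field(default_factory=list)

# A regex-based parse-and-extract rule: matched groups are added to metadata,
# e.g. pulling a tensor-metric value out of a model-logging line.
GRAD_NORM_RULE = re.compile(r"grad_norm=(?P<grad_norm>[0-9.]+)")

def apply_extract_rule(line: LogLine, rule: re.Pattern) -> None:
    m = rule.search(line.text)
    if m:
        line.metadata.update({k: float(v) for k, v in m.groupdict().items()})

stream = LogStream("trainer.stderr")
line = LogLine(1_700_000_000_000, "step=42 grad_norm=0.173", {"rank": 3})
apply_extract_rule(line, GRAD_NORM_RULE)
stream.lines.append(line)
```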
Logarithm ingests both systems logs from the training stack and model telemetry from the training jobs that the stack executes. In our setup, each host runs multiple PyTorch ranks (processes), one per GPU, and the processes write their output streams to a single log file. Debugging distributed job failures is ambiguous without rank information in the log lines, and adding it at the source would mean modifying every logging site (including third-party code). With the Logarithm metadata API, process context such as rank ID is attached to every log line: the API adds it to thread-local context and attaches a glog handler.

We added UI tools to enable common log-based interactive debugging primitives. The following figures show screenshots of two such features (built on top of Logarithm's filtering operations). Filter-by-callsite enables hiding known log lines or verbose/noisy logging sites when walking through a log stream. Walking through multiple log streams side by side enables finding rank state that differs from the other ranks (i.e., additional or missing lines), which is typically a symptom or root cause. This follows directly from the single-program, multiple-data nature of production training jobs, where every rank iterates on data batches with the same code (with batch-level barriers).

Figure 2: Logarithm UI features for training systems debugging (logs shown are for demonstration purposes).

Logarithm ingests continuous model telemetry and summary statistics that span model input and output tensors, model properties (e.g., learning rate), model internal state tensors (e.g., neuron activations), and gradients during training. This powers live training model monitoring dashboards, such as an internal deployment of TensorBoard, and is used by ML engineers to debug model convergence issues and training failures (due to gradient/loss explosions) using notebooks on the raw telemetry. Model telemetry tends to be iteration-based tensor timeseries with dimensions (e.g., model architecture, neuron, or module names), and tends to be high-volume and high-throughput (which makes low-cost ingestion in Logarithm a natural choice). Collocating systems and model telemetry enables debugging issues that cascade from one layer to the other. The model telemetry APIs internally write timeseries and dimensions as typed key-value pairs using the Logarithm metadata API. Multimodal data (e.g., images) are captured as references to files written to an external blob store.

Model telemetry dashboards typically consist of a large number of timeseries visualizations arranged in a grid; this enables ML engineers to eyeball the spatial and temporal dynamics of the model's external and internal state over time and find anomalies and correlation structure. A single dashboard hence needs to fetch a significantly large number of timeseries and their tensors. In order to render at interactive latencies, dashboards batch and fan out queries to Logarithm using the streaming API. The streaming API returns results in random order, which enables dashboards to incrementally render all plots in parallel: within hundreds of milliseconds to the first set of samples and within seconds to the full set of points.

Figure 3: TensorBoard model telemetry dashboard powered by Logarithm. Renders 722 metric time series at once (a total of 450k samples).
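
The rank-ID mechanism described above (process context kept in thread-local state and attached by a log handler, so logging call sites stay untouched) can be mimicked with standard library tools. The sketch below is illustrative only, using Python's logging and contextvars in place of Meta's internal metadata API and glog handler.

```python
# Illustrative-only sketch (stdlib logging in place of glog) of attaching
# per-process context such as rank ID to every emitted log line via a filter.
import contextvars
import logging

log_context: contextvars.ContextVar[dict] = contextvars.ContextVar("log_context", default={})

class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # copy context key-value pairs onto the record so formatters can use them
        for key, value in log_context.get().items():
            setattr(record, key, value)
        return True

handler = logging.StreamHandler()
handler.addFilter(ContextFilter())
handler.setFormatter(logging.Formatter("rank=%(rank)s %(levelname)s %(message)s"))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

# At process start-up, each rank would register its context once...
log_context.set({"rank": 3})
# ...and every later log line carries it without touching the logging call sites.
logging.info("starting data loader for shard %d", 17)
```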
Logarithm's system architecture

Our goal behind Logarithm is to build a highly scalable and fault-tolerant system that supports high-throughput ingestion and interactive query latencies, and provides strong guarantees on availability, durability, freshness, completeness, and query latency.

Figure 4: Logarithm's system architecture.

At a high level, Logarithm comprises the following components:

- Application processes emit logs using logging APIs. The APIs support emitting unstructured log lines along with typed metadata key-value pairs (per line).
- A host-side agent discovers the format of lines and parses them for common fields, such as timestamp, severity, process ID, and callsite. The resulting object is buffered and written to a distributed queue (for that log stream) that provides durability guarantees with days of object lifetime.
- Ingestion clusters read objects from the queues and support additional parsing based on any user-defined regex extraction rules; the extracted key-value pairs are written to the line's metadata.
- Query clusters support interactive and bulk queries on one or more log streams, with predicate filters on log text and metadata.

Logarithm stores the locality of data blocks in a central locality service. We implement this on a hosted, highly partitioned and replicated collection of MySQL instances. Every block that is generated at the ingestion clusters is written as a set of locality rows (one for each log stream in the block) to a deterministic shard, and reads are distributed across the replicas for a shard. For scalability, we do not use distributed transactions, since the workload is append-only. Note that since ingestion processing across log streams is not coordinated by design (for scalability), federated queries across log streams may not return the same last-logged timestamps between log streams.

Our design choices center around layering storage, query, and log analytics, and simplicity in state distribution. We design for two common properties of logs: they are written more than they are queried, and recent logs tend to be queried more than older ones.

Design decisions

Logarithm stores logs as blocks of text and metadata and maintains secondary indices to support low-latency lookups on text and/or metadata. Since logs rapidly lose query likelihood with time, Logarithm tiers the storage of logs and secondary indices across physical memory, local SSD, and a remote durable and highly available blob storage service (at Meta we use Manifold). In addition to secondary indices, tiering also ensures the lowest latencies for the most-accessed (recent) logs.

Lightweight disaggregated secondary indices. Maintaining secondary indices on disaggregated blob storage magnifies data lookup costs at query time. Logarithm's secondary indices are designed to be lightweight, using Bloom filters. The Bloom filters are prefetched (or loaded on query) into a distributed cache on the query clusters when blocks are published to disaggregated storage, to hide network latencies on index lookups. We later added support for caching data blocks in the query cache when executing a query. The system tries to collocate data from the same log stream in order to reduce fan-out and stragglers during query processing. The logs and metadata are implemented as ORC files. The Bloom filters currently index log stream locality and metadata key-value information (i.e., min-max values and Bloom filters for each column of the ORC stripes).
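
To make the Bloom-filter index idea concrete, here is an illustrative-only sketch of the lookup path: each published block carries a small filter over its metadata values, and the query tier consults cached filters to decide which blocks are worth fetching from remote storage. This is not Logarithm's implementation; the sizes, hashing scheme, and indexed keys are arbitrary choices for the example.

```python
# Illustrative-only sketch of a per-block Bloom filter used to skip blocks that
# cannot contain a metadata value, avoiding reads from remote blob storage.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1   # double hashing
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# At ingestion time, each published block gets a filter over its metadata values;
# at query time, only blocks whose filter matches are fetched and scanned.
block_filters = {}                       # block_id -> BloomFilter (kept in the query cache)
f = BloomFilter()
for value in ["stream=trainer.stderr", "rank=3", "host=host0421"]:
    f.add(value)
block_filters["block-000123"] = f

candidate_blocks = [
    block_id for block_id, bf in block_filters.items() if bf.might_contain("rank=3")
]
```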
Logarithm separates compute (ingestion and query) and storage to rapidly scale out the volume of log blocks and secondary indices. The exception to this is the in-memory memtable on the ingestion clusters that buffers time-ordered lists of log streams, which is a staging area for both writes and reads. The memtable is a bounded per-log-stream buffer covering the most recent, long enough time window of logs that are expected to be queried. The ingestion implementation is designed to be I/O-bound rather than compute- or memory-bandwidth-heavy, in order to handle close to a GB/s of per-host ingestion streaming. To minimize memtable contention, we implement multiple memtables: one for staging, and an immutable prior version for serializing to disk. The ingestion implementation follows zero-copy semantics.

Similarly, Logarithm separates ingestion and query resources to ensure bulk processing (ingestion) and interactive workloads do not impact each other. Note that Logarithm's design uses schema-on-write, but the data model and parsing computation is distributed between the logging hosts (which scales ingestion compute) and, optionally, the ingestion clusters (for user-defined parsing). Customers can add additional anticipated capacity for storage (e.g., increased retention limits), ingestion, and query workloads.

Logarithm pushes down distributed state maintenance to the disaggregated storage layers (instead of replicating compute at the ingestion layer). The disaggregated storage in Manifold uses read-write quorums to provide strong consistency, durability, and availability guarantees. The distributed queues in Scribe use LogDevice to maintain objects as a durable replicated log. This simplifies fault tolerance in the ingestion and query tiers. Ingestion nodes stream serialized objects from local SSDs to Manifold in 20-minute epochs, and checkpoint Scribe offsets on Manifold. When a failed ingestion node is replaced, the new node downloads the last epoch of data from Manifold and restarts ingesting raw logs from the last Scribe checkpoint.

Ingestion elasticity. The Logarithm control plane (based on Shard Manager) tracks ingestion node health and log-stream shard-level hotspots, and relocates shards to other nodes when it finds issues or load. When there is an increase in logs written to a log stream, the control plane scales out the shard count and allocates new shards on ingestion nodes with available resources. The system is designed to provide resource isolation between log streams at ingestion time. If there is a significant surge on very short timescales, the distributed queues in Scribe absorb the spikes, but when the queues are full, the log stream can lose logs (until the elasticity mechanisms increase shard counts). Such spikes typically result from logging bugs (e.g., verbosity) in application code.

Query processing. Queries are routed randomly across the query clusters. When a query node receives a request, it assumes the role of an aggregator and partitions the request across a bounded subset of query cluster nodes (balancing between cluster load and query latency). The aggregator pushes down filter and sort operators to the query nodes and returns sorted results (an end-to-end blocking operation). The query nodes read their partitions of logs by looking up locality, followed by secondary indices and data blocks; the read can span the query cache, ingestion nodes (for the most recent logs), and disaggregated storage.
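
The aggregator behavior in the query-processing paragraph above (fan out to a bounded set of query nodes, push down filter and sort, and, as described just below, skip stragglers after a latency budget while flagging incompleteness) is sketched here in illustrative Python; the node names, budget, and result format are hypothetical.

```python
# Illustrative-only sketch of an aggregator that fans a query out to partition
# nodes, merges sorted partial results, and skips stragglers after a budget
# while flagging that results may be incomplete.
import heapq
from concurrent.futures import ThreadPoolExecutor, wait

def query_partition(node: str, filter_expr: str) -> list[tuple[int, str]]:
    # placeholder for the real per-node filter+sort pushdown
    return []

def aggregate_query(nodes: list[str], filter_expr: str, budget_s: float = 2.0):
    pool = ThreadPoolExecutor(max_workers=max(len(nodes), 1))
    futures = [pool.submit(query_partition, n, filter_expr) for n in nodes]
    done, not_done = wait(futures, timeout=budget_s)

    partials = [f.result() for f in done if f.exception() is None]
    merged = list(heapq.merge(*partials))          # partitions return time-sorted rows
    complete = not not_done and len(partials) == len(nodes)

    # don't block on stragglers; let their threads wind down in the background
    pool.shutdown(wait=False, cancel_futures=True)
    return merged, complete

results, complete = aggregate_query(["query-node-1", "query-node-2"], "rank=3 AND ERROR")
if not complete:
    print("warning: result set may be incomplete (straggler skipped)")
```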
We added 2x replication of the query cache to support query cluster load distribution and fast failover (without waiting for cache shard movement). Logarithm also provides a streaming query API with randomized and incremental sampling that returns filtered logs (an end-to-end non-blocking operation) for lower-latency reads and time-to-first-log. Logarithm paginates result sets.

Logarithm can trade off query result completeness or ordering to maintain query latency (and flags to the client when it does so). For example, this can happen when a partition of a query is slow or when the number of blocks to be read is too high. In the former case, it times out and skips the straggler. In the latter scenario, it starts from the skipped blocks (or offsets) when processing the next result page. In practice, we provide guarantees for both result completeness and query latency. This is primarily feasible because the system has mechanisms to reduce the likelihood of the root causes that lead to stragglers. Logarithm also does query admission control at the client or user level.

The following figures characterize Logarithm's aggregate production performance and scalability across all log streams. They highlight scalability as a result of design choices that make the system simpler (spanning disaggregation, ingestion-query separation, indexes, and fault tolerance design). We present our production service-level objectives (SLOs) over a month, which are defined as the fraction of time they violate thresholds on availability, durability (including completeness), freshness, and query latency.

Figure 5: Logarithm's ingestion-query scalability for the month of January 2024 (one point per day).

Figure 6: Logarithm SLOs for the month of January 2024 (one point per day).

Logarithm supports strong security and privacy guarantees. Access control can be enforced at per-log-line granularity at ingestion and query time. Log streams can have configurable retention windows with line-level deletion operations.

Next steps

Over the last few years, several use cases have been built over the foundational log primitives that Logarithm implements. Systems such as relational algebra on structured data and log analytics are being layered on top with Logarithm's query latency guarantees, using pushdowns of search-filter-sort and federated retrieval operations. Logarithm supports a native UI for interactive log exploration, search, and filtering to aid debugging use cases. This UI is embedded as a widget in service consoles across Meta services. Logarithm also supports a CLI for bulk download of service logs for scripting analyses.

The Logarithm design has centered around simplicity for scalability guarantees. We are continuously building domain-specific and domain-agnostic log analytics capabilities within or layered on Logarithm, with appropriate pushdowns for performance optimizations. We continue to invest in storage and query-time improvements, such as lightweight disaggregated inverted indices for text search, storage layouts optimized for queries, and distributed debugging UI primitives for AI systems.

Acknowledgements

We thank the Logarithm team's current and past members, particularly our leads: Amir Alon, Stavros Harizopoulos, Rukmani Ravisundaram, and Laurynas Sukys, and our leadership: Vinay Perneti, Shah Rahman, Nikhilesh Reddy, Gautam Shanbhag, Girish Vaitheeswaran, and Yogesh Upadhay. Thank you to our partners and customers: Sergey Anpilov, Jenya (Eugene) Lee, Aravind Ram, Vikram Srivastava, and Mik Vyatskov.
The post Logarithm: A logging engine for AI training workflows and services appeared first on Engineering at Meta. View the full article
  4. Sumo Logic will no longer charge a fee for ingesting log data into its observability platform, to encourage DevOps teams to apply analytics more deeply. View the full article
  5. Docker Swarm is a popular container orchestration technology that makes containerized application administration easier. While Docker Swarm provides strong capabilities for deploying and scaling applications, it’s also critical to monitor and report the performance and health of your Swarm clusters. In this post, we will look at logging and monitoring in a Docker Swarm environment, as well as best practices, tools, and tactics for keeping your cluster working smoothly. The Importance of Logging and Monitoring Before we delve into the technical aspects of logging and monitoring in a Docker Swarm environment, let’s understand why these activities are crucial in a containerized setup. View the full article
  6. Amazon Web Services (AWS) is a popular cloud platform that provides a variety of services for developing, deploying, and managing applications. It is critical to develop good logging and monitoring practices while running workloads on AWS to ensure the health, security, and performance of your cloud-based infrastructure. In this post, we will look at the significance of logging and monitoring in AWS, the available options and best practices, and prominent AWS services and tools that can help you achieve these goals. The Importance of Logging and Monitoring in AWS Before we dive into the technical aspects of logging and monitoring in AWS, it's essential to understand why these activities are critical in a cloud-based environment. View the full article
  7. Amazon CloudWatch Logs now offers customers the ability to use Internet Protocol version 6 (IPv6) addresses for their new and existing domains. Customers moving to IPv6 can simplify their network stack by running their CloudWatch log groups on a dual-stack network that supports both IPv4 and IPv6. View the full article
  8. The Five Pillars of Red Hat OpenShift Observability

It is with great pleasure that we announce additional Observability features coming up as part of the OpenShift Monitoring 4.14, Logging 5.8, and Distributed Tracing 2.9 releases. Red Hat OpenShift Observability's plan continues to move forward as our teams tackle key data collection, storage, delivery, visualization, and analytics features, with the goal of turning your data into answers. View the full article
  9. We are excited to announce the general availability of Snowflake Event Tables for logging and tracing, an essential feature to boost application observability and supportability for Snowflake developers. In our conversations with developers over the last year, we've heard that monitoring and observability are paramount to effectively developing and monitoring applications. But previously, developers didn't have a centralized, straightforward way to capture application logs and traces.

Enter the new Event Tables feature, which helps developers and data engineers easily instrument their code to capture and analyze logs and traces for all supported languages: Java, Scala, JavaScript, Python, and Snowflake Scripting. With Event Tables, developers can instrument logs and traces from their UDFs, UDTFs, stored procedures, Snowflake Native Apps, and Snowpark Container Services, then seamlessly route them to a secure, customer-owned Event Table. Developers can then query Event Tables to troubleshoot their applications or gain insights into performance and code behavior. Logs and traces are collected and propagated via Snowflake's telemetry APIs, then automatically ingested into your Snowflake Event Table. Logs and traces are captured in the active Event Table for the account.

Simplify troubleshooting in Native Apps

Event Tables are also supported for Snowflake Native Apps. When a Snowflake Native App runs, it runs in the consumer's account, generating telemetry data that's ingested into their active Event Table. Once the consumer enables event sharing, new telemetry data is ingested into both the consumer and provider Event Tables. The provider can then debug the application that's running in the consumer's account. The provider only sees the telemetry data that is being shared from this data application, nothing else. For native applications, events and logs are shared with the provider only if the consumer enables event sharing.

Improve reliability across a variety of use cases

You can use Event Tables to capture and analyze logs for various use cases:

- As a data engineer building UDFs and stored procedures within queries and tasks, you can instrument your code to analyze its behavior based on input data.
- As a Snowpark developer, you can instrument logs and traces for your Snowflake applications to troubleshoot and improve their performance and reliability.
- As a Snowflake Native App provider, you can analyze logs and traces from various consumers of your applications to troubleshoot and improve performance.

Snowflake customers ranging from Capital One to phData are already using Event Tables to unlock value in their organizations. “The Event Tables feature simplifies capturing logs in the observability solution we built to monitor the quality and performance of Snowflake data pipelines in Capital One Slingshot,” says Yudhish Batra, Distinguished Engineer, Capital One Software. “Event Tables has abstracted the complexity associated with logging from our data pipelines—specifically, the central Event Table gives us the ability to monitor and alert from a single location.”

As phData migrates its Spark and Hadoop applications to Snowpark, the Event Tables feature has helped architects save time and hassle. “When working with Snowpark UDFs, some of the logic can become quite complex. In some instances, we had thousands of lines of Java code that needed to be monitored and debugged,” says Nick Pileggi, Principal Solutions Architect at phData.
“Before Event Tables, we had almost no way to see what was happening inside the UDF and correct issues. Once we rolled out Event Tables, the amount of time we spent testing dropped significantly and allowed us to have debug and info-level access to the logs we were generating in Java.”

One large communications service provider also uses logs in Event Tables to capture and analyze failed records during data ingestion from various external services into Snowflake. And a Snowflake Native App provider offering geolocation data services uses Event Tables to capture logs and traces from their UDFs to improve application reliability and performance.

With Event Tables, you now have a built-in place to easily and consistently manage logging and tracing for your Snowflake applications. And in conjunction with other features such as Snowflake Alerts and Email Notifications, you can be notified of new events and errors in your applications.

Try Event Tables today

To learn more about Event Tables, join us at BUILD, Snowflake's developer conference. Or get started with Event Tables today with a tutorial and quickstarts for logging and tracing. For further information about how Event Tables work, visit the Snowflake product documentation.

The post Collect Logs and Traces From Your Snowflake Applications With Event Tables appeared first on Snowflake. View the full article
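
To make the instrumentation workflow in the item above concrete, here is a minimal Snowpark-style Python sketch. It assumes the standard logging module and the snowflake.telemetry helpers (set_span_attribute, add_event) are captured into the account's active Event Table, and the commented query assumes the documented event table columns; verify both against current Snowflake documentation before relying on them.

```python
# A minimal sketch of instrumenting a handler so its logs and trace events land
# in the active Event Table. The telemetry helper names and the event-table
# columns referenced in the query are assumptions to verify against the docs.
import logging
from snowflake import telemetry  # available inside Snowflake Python runtimes

logger = logging.getLogger("my_app")

def enrich_order(order_id: str) -> str:
    """UDF/stored-procedure handler body."""
    telemetry.set_span_attribute("order_id", order_id)   # attached to the current span
    logger.info("enriching order %s", order_id)           # captured as a log record
    try:
        result = order_id.upper()                          # stand-in for real logic
        telemetry.add_event("enrichment_done", {"status": "ok"})
        return result
    except Exception:
        logger.exception("enrichment failed for %s", order_id)
        raise

# Querying the captured records later (column names per the EVENT TABLE schema):
#   SELECT timestamp, record_type, value
#   FROM my_db.my_schema.my_event_table
#   WHERE resource_attributes:"snow.executable.name" ILIKE '%ENRICH_ORDER%'
#   ORDER BY timestamp DESC;
```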
  10. You can now run OpenSearch version 2.9 in Amazon OpenSearch Service. With OpenSearch 2.9, we have made several improvements to Search, Observability, Security analytics, and Machine Learning (ML) capabilities in OpenSearch Service. View the full article
  11. Looks like it is my turn once again to write the AWS Weekly Roundup. I wrote and published the first one on April 16, 2012 — just 4,165 short days ago!

Last Week's Launches

Here are some of the launches that caught my eye last week:

R7iz Instances – Optimized for high CPU performance and designed for your memory-intensive workloads, these instances are powered by the fastest 4th Generation Intel Xeon Scalable-based (Sapphire Rapids) instances in the cloud. They are available in eight sizes, with 2 to 128 vCPUs and 16 to 1024 GiB of memory, along with generous allocations of network and EBS bandwidth:

  Instance         vCPUs   Memory (GiB)   Network Bandwidth   EBS Bandwidth
  r7iz.large          2         16        Up to 12.5 Gbps     Up to 10 Gbps
  r7iz.xlarge         4         32        Up to 12.5 Gbps     Up to 10 Gbps
  r7iz.2xlarge        8         64        Up to 12.5 Gbps     Up to 10 Gbps
  r7iz.4xlarge       16        128        Up to 12.5 Gbps     Up to 10 Gbps
  r7iz.8xlarge       32        256        12.5 Gbps           10 Gbps
  r7iz.12xlarge      48        384        25 Gbps             19 Gbps
  r7iz.16xlarge      64        512        25 Gbps             20 Gbps
  r7iz.32xlarge     128       1024        50 Gbps             40 Gbps

As Veliswa shared in her post, the R7iz instances also include four built-in accelerators, and are available in two AWS Regions.

Amazon Connect APIs for View Resources – A new set of View APIs allows you to programmatically create and manage the view resources (UI templates) used in the step-by-step guides that are displayed in the agent's UI.

Daily Disbursements to Marketplace Sellers – Sellers can now set disbursement preferences and opt in to receiving outstanding balances on a daily basis for increased flexibility, including the ability to match payments to existing accounting processes.

Enhanced Error Handling for AWS Step Functions – You can now construct detailed error messages in Step Functions Fail states, and you can set a maximum limit on retry intervals.

Amazon CloudWatch Logs RegEx Filtering – You can now use regular expressions in your Amazon CloudWatch Logs filter patterns. You can, for example, define a single filter that matches multiple IP subnets or HTTP status codes instead of having to use multiple filters, as was previously the case. Each filter pattern can have up to two regular expression patterns.

Amazon SageMaker – There's a new (and quick) Studio setup experience, support for Multi-Model Endpoints for PyTorch, and the ability to use SageMaker's geospatial capabilities on GPU-based instances when using Notebooks.

X in Y – We launched existing services and instance types in new Regions:

- ROSA in the Europe (Spain) Region.
- AWS NAT Gateway in the Los Angeles Local Zone us-west-2-lax-1a.
- Amazon Simple Email Service (SES) email receiving service in the US East (Ohio), Asia Pacific (Singapore, Sydney, Tokyo), Canada (Central), and Europe (Frankfurt, London) Regions.
- IAM roles last used and last accessed in AWS GovCloud (US) Regions.
- AWS Security Hub findings consolidation in AWS GovCloud (US) Regions.
- C6id instances in the Europe (London) Region.
- Amazon Location Service in the AWS GovCloud (US-West) Region.
- Amazon Managed Streaming for Apache Kafka and Amazon FSx for NetApp ONTAP in the Israel (Tel Aviv) Region.
- M6i and R6i instances in the Europe (Zurich) Region.
- C6gd and R6gd instances in the AWS GovCloud (US-West) Region.
- VPC DNS Query Logging in the Asia Pacific (Hyderabad, Melbourne), Europe (Spain, Zurich), and Middle East (UAE) Regions.
- High Memory instances in the Asia Pacific (Seoul) Region.

Other AWS News

Here are some other AWS updates and news:

AWS Fundamentals – The second edition of this awesome book, AWS for the Real World, Not for Certifications, is now available.
In addition to more than 400 pages that cover 16 vital AWS services, each chapter includes a detailed and attractive infographic. Here's a small-scale sample:

More posts from AWS blogs – Here are a few posts from some of the other AWS and cloud blogs that I follow:

- AWS DevOps Blog – Using AWS CloudFormation and AWS Cloud Development Kit to provision multicloud resources.
- AWS Big Data Blog – Managing Amazon EBS volume throughput limits in Amazon OpenSearch Service domains.
- AWS Contact Center Blog – How contact center leaders can prepare for generative AI.
- AWS Desktop and Application Streaming Blog – Selecting the right AWS End User Computing service for your needs.
- AWS Containers Blog – Migrate existing Amazon ECS services from an internal Application Load Balancer to Amazon ECS Service Connect.
- AWS Community Builders – Level up your Lambda Game with Canary Deployments using SST.
- Cloudonaut – Self-hosted GitHub runners on AWS.
- Trek10 – How and When to Use Amazon EventBridge Pipes.

Upcoming AWS Events

Check your calendars and sign up for these AWS events:

- AWS End User Computing Innovation Day, Sept. 13 – This one-day virtual event is designed to help IT teams tasked with providing the tools employees need to do their jobs, especially in today's challenging times. Learn more.
- AWS Global Summits, Sept. 26 – The last in-person AWS Summit will be held in Johannesburg on Sept. 26. You can also watch on-demand videos of the latest Summit events such as Berlin, Bogotá, Paris, Seoul, Sydney, Tel Aviv, and Washington DC on the AWS YouTube channels.
- CDK Day, Sept. 29 – A community-led, fully virtual event with tracks in English and Spanish about CDK and related projects. Learn more at the website.
- AWS re:Invent, Nov. 27-Dec. 1 – Ready to start planning your re:Invent? Browse the session catalog now. Join us to hear the latest from AWS, learn from experts, and connect with the global cloud community.
- AWS Community Days, multiple dates – Join a community-led conference run by AWS user group leaders in your region: Munich (Sept. 14), Argentina (Sept. 16), Spain (Sept. 23), Peru (Sept. 30), and Chile (Sept. 30). Visit the landing page to check out all the upcoming AWS Community Days.

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

— Jeff;

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS! View the full article
  12. In today’s fast-paced digital landscape, containerization has become the norm, and Kubernetes has emerged as the de facto standard for container orchestration. However, with the increasing complexity of Kubernetes deployments, it has become more critical than ever to monitor and secure those environments. View the full article
  13. Amazon OpenSearch Service, with the availability of OpenSearch 1.3, now gives customers the ability to organize their logs, traces, and visualizations in an application-centric view. Customers can also benefit from enhanced log monitoring support with live tailing of logs, the ability to see surrounding log data, and the ability to do powerful ad hoc analysis of unformatted log data at query time. View the full article
  14. AWS Control Tower now includes AWS CloudTrail organization logging as part of landing zone version 3.0. With this new feature, an organization-level AWS CloudTrail trail will be deployed in your organization’s management account to automatically log the actions of all member accounts in your organizations. AWS Control Tower does not configure any parameters for logging other than a mandatory detective guardrail that checks logging is configured for all AWS Control Tower governed accounts. AWS Control Tower with organization logging offers users the latest standard and best practice for unified account logging. View the full article
  15. Starting today, Amazon VPC Flow Logs adds support for Transit Gateway. With this feature, Transit Gateway can export detailed telemetry information such as source/destination IP addresses, ports, protocol, traffic counters, timestamps and various metadata for all of its network flows. This feature provides you with an AWS native tool to centrally export and inspect flow-level telemetry for all network traffic that is traversing between Amazon VPCs and your on-premises networks via your Transit Gateway. View the full article
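
As a rough illustration of enabling the feature announced above, the boto3 sketch below creates flow logs with a Transit Gateway as the resource. The resource ID and bucket are placeholders, and the exact parameter combination accepted for transit gateway flow logs is an assumption; check the EC2 CreateFlowLogs reference before using it.

```python
# Hedged sketch: enabling flow logs on a Transit Gateway with boto3. Resource
# ID and bucket ARN are placeholders; parameter support for the TransitGateway
# resource type is an assumption to verify against the EC2 API reference.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_flow_logs(
    ResourceType="TransitGateway",              # flow logs at the TGW level
    ResourceIds=["tgw-0123456789abcdef0"],      # placeholder Transit Gateway ID
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-central-flowlogs-bucket/tgw/",
)
print(response["FlowLogIds"])
```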
  16. Today, we are announcing the general availability of a new feature, Log Anomaly Detection and Recommendations for Amazon DevOps Guru. As part of this feature, DevOps Guru will ingest Amazon CloudWatch Logs for the AWS resources that make up your application, starting with Lambda. Logs provide new enrichment data within an insight, enabling a more accurate understanding of the root cause behind an application issue and more precise remediation steps. View the full article
  17. Amazon ECS now fully supports multiline logging powered by AWS for Fluent Bit for both AWS Fargate and Amazon EC2. AWS for Fluent Bit is an AWS distribution of the open-source project Fluent Bit, a fast and lightweight log forwarder. Amazon ECS users can use this feature to re-combine partial log messages produced by their containerized applications running on AWS Fargate or Amazon EC2 into a single message for easier troubleshooting and analytics. View the full article
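
For intuition about what the multiline feature announced above does (independent of the actual AWS for Fluent Bit configuration), the short Python illustration below re-combines continuation lines such as stack-trace frames into the record that started them; the timestamp heuristic is just an example.

```python
# Concept illustration only (not the AWS for Fluent Bit configuration): fold
# container log lines that do not start a new record (no leading timestamp)
# back into the previous record, the way a multiline parser re-combines
# partial messages such as stack traces.
import re

STARTS_NEW_RECORD = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")  # e.g. 2024-01-31 12:00:00

def recombine(lines: list[str]) -> list[str]:
    records: list[str] = []
    for line in lines:
        if STARTS_NEW_RECORD.match(line) or not records:
            records.append(line)            # start of a new log record
        else:
            records[-1] += "\n" + line      # continuation of the previous record
    return records

raw = [
    "2024-01-31 12:00:00 ERROR unhandled exception",
    "Traceback (most recent call last):",
    '  File "app.py", line 42, in handler',
    "ValueError: bad input",
    "2024-01-31 12:00:01 INFO request completed",
]
for record in recombine(raw):
    print(record, end="\n---\n")
```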
  18. At its .conf22 event, Splunk today announced it is making it easier to both onboard data and then manage it across hybrid IT environments via the Splunk Cloud Platform. In addition, Splunk Enterprise is being extended to add support for Microsoft Azure with SmartStore for Azure to store cold data alongside existing support for Amazon […] View the full article
  19. Logging is a crucial function to monitor and provide observability and insight into the activities of an application in distributed systems like Kubernetes. We’ve curated some of the best tools to help you achieve this, alongside a simple guide on how to get started with each of them. View the full article
  20. You can now send logs from AWS Lambda functions directly to a destination of your choice by using AWS Lambda Extensions. AWS Lambda Extensions are a new way for monitoring, observability, security, and governance tools to integrate with Lambda, and today, you can use extensions that send logs to the following providers: Datadog, New Relic, Sumo Logic, Honeycomb, Lumigo, and Coralogix. View the full article
  21. Scalyr has moved to reduce the cost of observability by enabling DevOps teams to retain log and event data in S3-compatible object storage systems, then use a pay-per-TB-scanned pricing model to interrogate that data. Company CEO Christine Heckart said the Scalyr Hindsight service will make it possible for IT teams to more cost-effectively store historical […] The post Scalyr Adds Hindsight Service to IT Analytics Portfolio appeared first on DevOps.com. View the full article
  22. Logz.io announced today at its online ScaleUp! 2020 conference that it has added support for the open source Jaeger distributed tracing software to its observability platform, which is based on a curated instance of the open source Elasticsearch, Logstash and Kibana (ELK) software. Logz.io CTO Jonah Kowall said support for Jaeger extends the reach of the open source platform […] The post Logz.io Embraces Jaeger for Distributed Tracing appeared first on DevOps.com. View the full article
  23. This is the first post of a two-part series where we will set up production-grade Kubernetes logging for applications deployed in the cluster and for the cluster itself. View the full article
  24. Amazon Elastic Kubernetes Service (Amazon EKS) now allows you to forward container logs from pods running on AWS Fargate to AWS services for log storage and analytics, including Amazon CloudWatch, Amazon Elasticsearch, Amazon Kinesis Data Firehose, and Amazon Kinesis Streams. View the full article