fix(telemetry): images
lmeyerov committed Oct 28, 2024
1 parent 751e9ae commit ff306d4
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions docs/tools/telemetry.md
@@ -187,27 +187,29 @@ To use telemetry with Graphistry, you need to:
1. **Jaeger Dashboard:** Access the [Jaeger Dashboard URL](#jaeger-dashboard).
2. **Key Tracing Information:**
- List of traces generated by the system for the graph rendering flow (for example, a trace list that includes a trace with errors).
![List of traces generated by the system for the graph rendering flow](../_static/img/jaeger-trace-list-including-trace-with-error.png)

<img alt="List of traces generated by the system for the graph rendering flow" src="../_static/img/jaeger-trace-list-including-trace-with-error.png"/>

- The root span for the graph rendering flow is `streamgl-viz: handleFalcorSocketConnections`.
- The service that generates the root span for the graph rendering flow is `streamgl-viz`.
- ETL dataset fetch spans from the Python ETL service.
- Detailed spans for actions by the visualization service and GPU workers (for example, inspecting a trace with an error).
![Detailed spans for actions by the visualization service and GPU workers](../_static/img/jaeger-inspecting-trace-with-error.png)
<img alt="Detailed spans for actions by the visualization service and GPU workers" src="../_static/img/jaeger-inspecting-trace-with-error.png"/>
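
When many traces are involved, it can help to scan an exported trace for failing spans instead of clicking through each one. The sketch below assumes a trace downloaded as JSON from the Jaeger UI and assumes the usual layout of that export (`data`, `spans`, `tags`, `processes`); the function name `spans_with_errors` and the file name `trace.json` are hypothetical and not part of Graphistry.

```python
import json


def spans_with_errors(trace_export):
    """Return a list of (trace_id, service, operation) for spans tagged error=true.

    Assumes the JSON layout produced by the Jaeger UI trace download:
    {"data": [{"traceID": ..., "spans": [...], "processes": {...}}]}
    """
    found = []
    for trace in trace_export.get("data", []):
        processes = trace.get("processes", {})
        for span in trace.get("spans", []):
            # Flatten the span's tag list into a key -> value mapping.
            tags = {t["key"]: t.get("value") for t in span.get("tags", [])}
            if tags.get("error") in (True, "true"):
                process = processes.get(span.get("processID"), {})
                service = process.get("serviceName", "unknown")
                found.append((trace["traceID"], service, span["operationName"]))
    return found


# Example usage (hypothetical file name):
# with open("trace.json") as fh:
#     for trace_id, service, operation in spans_with_errors(json.load(fh)):
#         print(trace_id, service, operation)
```

If the export layout differs in your Jaeger version, only the key names in `spans_with_errors` should need adjusting.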

### Accessing Metrics
1. **Prometheus Dashboard:** Access the [Prometheus Dashboard URL](#prometheus-dashboard).
2. **Critical Metrics to Monitor:**
- `worker_read_crashes_total`: Monitor GPU worker crashes.
- File upload and dataset creation metrics in the Python ETL service (all metrics named `forge_etl_python_upload_*`; for example, `forge_etl_python_upload_datasets_request_total`).
![File upload and dataset creation metrics in the Python ETL service](../_static/img/prometheus-forge-etl-python-metric-example.png)
<img alt="File upload and dataset creation metrics in the Python ETL service" src="../_static/img/prometheus-forge-etl-python-metric-example.png"/>
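
The metrics listed above can also be read programmatically through the Prometheus HTTP API (`GET /api/v1/query`). The following is a minimal sketch, not part of Graphistry: the base URL `http://localhost:9090` is an assumption to replace with your Prometheus Dashboard URL, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with the Prometheus URL of your deployment.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    # Each result holds a label set and a [timestamp, "value"] pair.
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def query(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query against Prometheus and return parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Example usage (requires a reachable Prometheus instance):
# for labels, value in query("worker_read_crashes_total"):
#     print(labels, value)
# for labels, value in query("forge_etl_python_upload_datasets_request_total"):
#     print(labels, value)
```

Keeping URL construction and response parsing separate from the network call makes the helpers easy to check without a running Prometheus instance.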

### GPU Monitoring with Grafana and NVIDIA Data Center GPU Manager

To monitor GPU performance, we use Grafana together with NVIDIA Data Center GPU Manager (DCGM). These tools provide real-time visualization and analysis of GPU metrics, which helps with performance tuning and troubleshooting.
- **NVIDIA Data Center GPU Manager (DCGM):** [DCGM](https://developer.nvidia.com/dcgm) is a suite of tools for managing and monitoring NVIDIA GPUs in data centers. It provides detailed metrics on GPU performance, health, and utilization.
- **Grafana:** Grafana is an open-source platform for monitoring and observability. It allows users to query, visualize, alert on, and explore metrics from a variety of data sources, including Prometheus. By default, the Grafana instance includes the metrics and GPU dashboard from the `DCGM exporter` service (see `DCGM Exporter Dashboards` on the Grafana main page).

![grafana-import-dcgm-dashboard-6](../_static/img/grafana-import-dcgm-dashboard-6.png)
<img alt="grafana-import-dcgm-dashboard-6" src="../_static/img/grafana-import-dcgm-dashboard-6.png"/>

## Advanced configuration
