diff --git a/docs/tools/telemetry.md b/docs/tools/telemetry.md
index b8709e4..c455963 100644
--- a/docs/tools/telemetry.md
+++ b/docs/tools/telemetry.md
@@ -187,19 +187,21 @@ To use telemetry with Graphistry, you need to:
 1. **Jaeger Dashboard:** Access the [Jaeger Dashboard URL](#jaeger-dashboard).
 2. **Key Tracing Information:**
    - List of traces generated by the system for the graph rendering flow (for instance: show the trace list including a trace with errors).
-![List of traces generated by the system for the graph rendering flow](../_static/img/jaeger-trace-list-including-trace-with-error.png)
+
+List of traces generated by the system for the graph rendering flow
+
    - The root span for the graph rendering flow is `streamgl-viz: handleFalcorSocketConnections`.
    - The service that generates the root span for the graph rendering flow is `streamgl-viz`.
    - ETL dataset fetch spans from the Python ETL Service service.
    - Detailed spans for actions by the visualization service and GPU workers (for instance: inspecting trace with error).
-![Detailed spans for actions by the visualization service and GPU workers](../_static/img/jaeger-inspecting-trace-with-error.png)
+Detailed spans for actions by the visualization service and GPU workers
 
 ### Accessing Metrics
 1. **Prometheus Dashboard:** Access the [Prometheus Dashboard URL](#prometheus-dashboard).
 2. **Critical Metrics to Monitor:**
    - `worker_read_crashes_total`: Monitor GPU worker crashes.
    - File upload and dataset creation metrics in the Python ETL service (all metrics with the name `forge_etl_python_upload_*`, for instance: `forge_etl_python_upload_datasets_request_total`).
-![File upload and dataset creation metrics in the Python ETL service](../_static/img/prometheus-forge-etl-python-metric-example.png)
+File upload and dataset creation metrics in the Python ETL service
 
 ### GPU Monitoring with Grafana and NVIDIA Data Center GPU Manager
 
@@ -207,7 +209,7 @@ To provide comprehensive monitoring of GPU performance, we utilize Grafana in co
 
 - **NVIDIA Data Center GPU Manager (DCGM):** [DCGM](https://developer.nvidia.com/dcgm) is a suite of tools for managing and monitoring NVIDIA GPUs in data centers. It provides detailed metrics on GPU performance, health, and utilization.
 - **Grafana:** Grafana is an open-source platform for monitoring and observability. It allows users to query, visualize, alert on, and explore metrics from a variety of data sources, including Prometheus. By default the Grafana instance has the metrics and GPU dashboard from the `DCGM exporter` service (see `DCGM Exporter Dashboards` in the Grafana main page).
-![grafana-import-dcgm-dashboard-6](../_static/img/grafana-import-dcgm-dashboard-6.png)
+grafana-import-dcgm-dashboard-6
 
 ## Advanced configuration