Experiment: Unified Repository Analysis Pipeline¶
This page describes the reference experiment included with GreenMining. The Jupyter notebook
(experiment/beta/experiment.ipynb) demonstrates every feature of the library in a single,
end-to-end pipeline applied to 15 repositories.
Overview¶
- 5 cloud repositories found via GraphQL search
- 5 S2-group research repositories manually selected (SAF-Toolkit, UPISAS, experiment-runner, robot-runner, android-runner)
- Total: 10 repositories — all analyzed with the same pipeline and ALL features enabled
- Commits per repository: 50 (max)
- Min stars: 5
- Languages: 21 top programming languages
- Date range: 2020-01-01 to 2026-12-31
The experiment follows a 16-step pipeline:
| Step | Purpose | Library Features Used |
|---|---|---|
| 1-2 | Setup & Configuration | greenmining package, GSF_PATTERNS, GREEN_KEYWORDS |
| 3-4 | Data Gathering | fetch_repositories, analyze_repositories |
| 5-12 | Analysis | All analyzers, energy measurement |
| 13-15 | Visualization & Export | matplotlib, Plotly, JSON/CSV export |
Step 1: Import Libraries¶
Install and import all GreenMining modules:
%pip install greenmining[energy] --upgrade --quiet
import greenmining
from greenmining import (
fetch_repositories,
analyze_repositories,
GSF_PATTERNS,
GREEN_KEYWORDS,
is_green_aware,
get_pattern_by_keywords,
)
from greenmining.analyzers import (
StatisticalAnalyzer,
TemporalAnalyzer,
CodeDiffAnalyzer,
MetricsPowerCorrelator,
)
from greenmining.energy import get_energy_meter, CPUEnergyMeter
Step 2: Configuration¶
Shared parameters used across the entire pipeline:
| Parameter | Value | Description |
|---|---|---|
MAX_COMMITS |
50 | Commits analyzed per repository |
MIN_STARS |
5 | Minimum GitHub stars for search |
PARALLEL_WORKERS |
5 | Concurrent repository analysis |
LANGUAGES |
21 | Top programming languages |
CREATED_AFTER |
2020-01-01 |
Repository creation lower bound |
CREATED_BEFORE |
2026-12-31 |
Repository creation upper bound |
PUSHED_AFTER |
2020-01-01 |
Recent activity filter |
COMMIT_DATE_FROM |
2020-01-01 |
Commit analysis start date |
COMMIT_DATE_TO |
2026-12-31 |
Commit analysis end date |
The GitHub token is loaded from the environment or a .env file.
Step 3: Search Repositories¶
Uses fetch_repositories() with the GraphQL API v4 to find 10 mobile app repositories:
search_repos = fetch_repositories(
github_token=GITHUB_TOKEN,
max_repos=10,
min_stars=MIN_STARS,
languages=LANGUAGES,
keywords='mobile app',
created_after=CREATED_AFTER,
created_before=CREATED_BEFORE,
pushed_after=PUSHED_AFTER,
)
Found 10 mobile app repositories:
1. example/repo-1 (1000 stars, TypeScript)
2. example/repo-2 (500 stars, Dart)
...
Step 4: Analyze Repositories¶
Combines the 10 searched repositories with 5 manually selected S2-group repositories, then runs the full analysis pipeline on all of them:
# 5 manually selected repositories
manual_urls = [
'https://github.com/S2-group/green-lab',
'https://github.com/S2-group/UPISAS',
'https://github.com/S2-group/experiment-runner',
'https://github.com/S2-group/robot-runner',
'https://github.com/S2-group/android-runner',
]
# Combine all URLs
all_urls = search_urls + manual_urls # Total: 15 repositories
# Analyze ALL repositories with ALL features
raw_results = analyze_repositories(
urls=all_urls,
max_commits=MAX_COMMITS,
parallel_workers=PARALLEL_WORKERS,
output_format='dict',
energy_tracking=True,
energy_backend='auto',
method_level_analysis=True,
include_source_code=True,
github_token=GITHUB_TOKEN,
since_date=COMMIT_DATE_FROM,
to_date=COMMIT_DATE_TO,
)
results = [r.to_dict() for r in raw_results]
Library features applied per repository:
- GSF pattern detection (124 patterns, 15 categories, 332 keywords)
- Process metrics (DMM unit size, complexity, interfacing)
- Method-level analysis (per-function complexity via Lizard)
- Source code capture (before/after for each modified file)
- Energy measurement (auto-detected backend)
Step 5: Results Overview¶
Summary of the analysis across all repositories:
======================================================================
ANALYSIS SUMMARY
======================================================================
Repositories analyzed: 15
Total commits: ~700
Green-aware commits: ~200
Overall green rate: ~30%
Step 6: GSF Pattern Analysis¶
Counts pattern frequency across all commits and groups patterns by category:
# Pattern frequency across all commits
pattern_counts = {}
for commit in all_commits:
for pattern in commit.get('gsf_patterns_matched', []):
pattern_counts[pattern] = pattern_counts.get(pattern, 0) + 1
Unique patterns detected: 46
Top 10 GSF Patterns:
Pattern Count % of Commits
-----------------------------------------------------------------
Keep Request Counts Low 17 9.1%
Containerize Your Workload 13 7.0%
Use Compiled Languages 13 7.0%
...
GSF Categories (15):
ai, async, caching, cloud, code, data, database,
general, infrastructure, microservices, monitoring,
network, networking, resource, web
Step 7: Process Metrics¶
Examines DMM scores, structural complexity, and method-level data:
| Metric | Description |
|---|---|
dmm_unit_size |
Delta Maintainability Model -- size (0-1) |
dmm_unit_complexity |
Delta Maintainability Model -- complexity (0-1) |
dmm_unit_interfacing |
Delta Maintainability Model -- interfacing (0-1) |
total_nloc |
Total non-comment lines of code |
total_complexity |
Cyclomatic complexity sum |
max_complexity |
Highest single-function complexity |
methods_count |
Number of methods analyzed |
Step 8: Statistical Analysis¶
Applies statistical methods to the combined dataset:
stat_analyzer = StatisticalAnalyzer()
stat_analyzer.analyze_pattern_correlations(commits_df)
stat_analyzer.temporal_trend_analysis(commits_df)
stat_analyzer.effect_size_analysis(green_cx, non_green_cx)
| Method | Output |
|---|---|
analyze_pattern_correlations() |
Significant co-occurrence pairs |
temporal_trend_analysis() |
Trend direction, significance, correlation |
effect_size_analysis() |
Cohen's d, magnitude, mean difference |
Step 9: Temporal Analysis¶
Groups commits by quarter and tracks green awareness evolution:
temporal = TemporalAnalyzer(granularity='quarter')
temporal_results = temporal.analyze_trends(all_commits, analysis_results_fmt)
Output includes period-by-period green commit rates, unique pattern counts, overall trend direction, and peak period identification.
Step 10: Code Diff Pattern Signatures¶
Inspects the green pattern categories that CodeDiffAnalyzer detects in code diffs:
Code Diff Pattern Signatures: 15 types
caching, resource_optimization, database_optimization,
async_processing, lazy_loading, serverless_computing,
cdn_edge, compression, model_optimization,
efficient_protocols, container_optimization, green_regions,
auto_scaling, code_splitting, green_ml_training
Step 11: Energy Measurement¶
Demonstrates multiple energy measurement approaches:
| Backend | Platform | What It Measures |
|---|---|---|
| RAPL | Linux (Intel/AMD) | Hardware energy counters (Joules) |
| CPU Meter | Universal | CPU utilization + TDP estimate (Joules) |
| tracemalloc | Universal | Peak memory allocation (bytes) |
| CodeCarbon | Cross-platform | CO2 emissions (kg CO2eq) |
# RAPL (Linux Intel/AMD)
from greenmining.energy import RAPLEnergyMeter
rapl = RAPLEnergyMeter()
result, energy = rapl.measure(sample_workload)
# CPU Meter (universal)
meter = CPUEnergyMeter()
result, energy = meter.measure(sample_workload)
# CodeCarbon
from codecarbon import EmissionsTracker
tracker = EmissionsTracker()
tracker.start()
# ... workload ...
emissions = tracker.stop()
Step 12: Metrics-to-Power Correlation¶
Correlates code metrics with energy consumption:
correlator = MetricsPowerCorrelator(significance_level=0.05)
correlator.fit(metric_names, metrics_values, power_measurements)
summary = correlator.summary()
Computes Pearson and Spearman coefficients with significance testing, plus feature importance ranking across all metrics.
Step 13: Visualization (matplotlib)¶
Static 2x2 chart grid:
- Green commit rate per repository (horizontal bar)
- Top 10 GSF patterns (horizontal bar)
- Total vs green commit breakdown (grouped bar)
- Complexity distribution (histogram)
Saved to data/analysis_plots.png.
Step 14: Interactive Visualization (Plotly)¶
Interactive charts:
- Sunburst: Repository → Green/Non-Green → Pattern
- Scatter: Complexity vs Lines of Code, colored by green awareness
Step 15: Export Results¶
Exports the analysis in multiple formats:
Output Structure¶
data/
├── repositories/ # Individual repo analyses (full data)
│ ├── repo-name-1.json
│ ├── repo-name-2.json
│ └── ...
├── analysis_summary.json # All repos, no source code (~2-5 MB)
├── analysis_results.csv # Flattened commit-level (~200 KB)
└── statistics.json # High-level overview (~10 KB)
File Descriptions¶
| File | Size | Content |
|---|---|---|
repositories/*.json |
Large | Full analysis per repo (includes source code) |
analysis_summary.json |
Medium | All repos without source_changes and methods |
analysis_results.csv |
Small | Flattened commit-level data for spreadsheets |
statistics.json |
Tiny | Overall stats, top patterns, repo summary |
Feature Coverage¶
The experiment applies every library feature to every repository:
| Feature | Module | Applied |
|---|---|---|
| GSF Pattern Detection (124 patterns, 15 categories) | greenmining |
✅ |
| Process Metrics (DMM size, complexity, interfacing) | greenmining |
✅ |
| Method-Level Analysis (per-function complexity) | greenmining |
✅ |
| Source Code Capture (before/after) | greenmining |
✅ |
| Energy Measurement (RAPL + CPU Meter) | greenmining.energy |
✅ |
| Memory Profiling (tracemalloc) | tracemalloc |
✅ |
| CO2 Emissions (CodeCarbon) | codecarbon |
✅ |
| Statistical Analysis (correlations, effect sizes) | greenmining.analyzers |
✅ |
| Temporal Analysis (quarterly trends) | greenmining.analyzers |
✅ |
| Code Diff Pattern Signatures | greenmining.analyzers |
✅ |
| Metrics-to-Power Correlation (Pearson/Spearman) | greenmining.analyzers |
✅ |
| Visualization (matplotlib + Plotly) | External | ✅ |
| Export (JSON, CSV, DataFrame) | Built-in | ✅ |
Running the Experiment¶
# Install the library
pip install greenmining[energy]
# Set your GitHub token
export GITHUB_TOKEN=ghp_your_token_here
# Open the notebook
jupyter notebook experiment/beta/experiment.ipynb
Run all cells sequentially. The data gathering steps require a valid GitHub token and internet access.