# Data Pipeline & ETL Rules for Cline

Cline coding rules for data pipeline and ETL development: deep, specific guidance covering architecture, patterns, and best practices.
## Pipeline Design
- Idempotent pipelines — re-running must produce the same result
- Incremental over full loads — never reprocess 5 years of data when you have deltas
- Checkpointing: resume from last successful point, not from the beginning
- Data lineage: track where every column came from — required for debugging and compliance
- Separate raw, staging, and transformed layers (medallion architecture: bronze/silver/gold)
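The idempotency, incremental-load, and checkpointing rules above can be sketched together. This is a minimal illustration, assuming a JSON high-water-mark file and caller-supplied `extract`/`load` functions; all names are illustrative, not a fixed API:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # illustrative location

def load_checkpoint() -> str:
    """Return the high-water mark of the last successful run."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_loaded_at"]
    return "1970-01-01T00:00:00"  # first run: load everything

def save_checkpoint(watermark: str) -> None:
    # Written only after the load commits, so a failed run re-reads the same delta
    CHECKPOINT.write_text(json.dumps({"last_loaded_at": watermark}))

def run_incremental(extract, load) -> None:
    since = load_checkpoint()
    rows = extract(since)  # only the delta, never the full history
    if rows:
        load(rows)         # load itself must be idempotent (e.g. MERGE/upsert)
        save_checkpoint(max(r["updated_at"] for r in rows))
```

Because the checkpoint advances only after a successful load, a crashed run resumes from the last committed watermark, and re-running a completed run is a no-op.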
## Data Quality
- Validate at every stage transition — fail fast, not at the end
- Schema enforcement: explicitly define and validate schemas (not "infer from data")
- Null handling policy defined per column — nulls are not errors unless they should be
- Row count checks: input rows vs output rows — unexplained drops are bugs
- Referential integrity checks before loading to final tables
```python
import pandas as pd

# Lightweight assertion checks; use Great Expectations or Soda for
# declarative, reportable data-quality suites
def validate_users_data(df: pd.DataFrame) -> None:
    assert df['user_id'].notna().all(), "user_id cannot be null"
    assert df['user_id'].is_unique, "user_id must be unique"
    assert df['email'].str.contains('@', na=False).all(), "invalid email format"
    assert (df['created_at'] <= pd.Timestamp.now()).all(), "future dates not allowed"
```
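The row-count and referential-integrity rules can be expressed in the same assertion style. A sketch with illustrative names (the function signatures are assumptions, not a standard API):

```python
import pandas as pd

def check_row_counts(input_rows: int, output_rows: int,
                     dropped_reasons: dict[str, int]) -> None:
    """Every dropped row must be accounted for by an explicit filter."""
    explained = sum(dropped_reasons.values())
    unexplained = input_rows - output_rows - explained
    assert unexplained == 0, f"unexplained drop: {unexplained} rows"

def check_referential_integrity(orders: pd.DataFrame,
                                users: pd.DataFrame) -> None:
    """Run before loading to final tables: no orphaned foreign keys."""
    orphans = ~orders['user_id'].isin(users['user_id'])
    assert not orphans.any(), f"{orphans.sum()} orders reference missing users"
```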
## Orchestration (Airflow/Prefect/Dagster)
- DAGs are code — version controlled, tested, reviewed like application code
- Task retries with exponential backoff for transient failures
- SLAs on critical pipelines — alert before business users notice
- Backfill strategy defined upfront — not figured out during an incident
- Dependencies between pipelines explicit in the orchestrator — not via cron timing
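Orchestrators handle retries natively (in Airflow, via `retries`, `retry_delay`, and `retry_exponential_backoff` in `default_args`). For pipeline code that runs outside an orchestrator, a minimal backoff sketch (names and defaults are illustrative):

```python
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Retry a task prone to transient failures, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure for alerting
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

Capping the delay with `max_delay` keeps a long outage from pushing waits into hours; injecting `sleep` makes the helper testable.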
## Performance
- Partition tables by date for all time-series data — queries scan less data
- Columnar formats (Parquet, ORC) for analytical workloads — not CSV
- Cluster on join keys after partitioning — Spark/BigQuery/Snowflake specific
- Broadcast joins for small lookup tables — avoid shuffle for large tables
- Sample before aggregating when developing — don't run the full dataset locally
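To see why date partitioning limits scans: with a hive-style `dt=YYYY-MM-DD` directory layout, a query engine prunes to only the partitions inside the query's date window. An illustrative sketch of that pruning logic:

```python
from datetime import date, timedelta

def partitions_for_range(root: str, start: date, end: date) -> list[str]:
    """Hive-style date partitions (dt=YYYY-MM-DD) a query must scan.

    Partition pruning: the engine reads only these directories,
    not the whole table.
    """
    days = (end - start).days + 1
    return [f"{root}/dt={(start + timedelta(d)).isoformat()}"
            for d in range(days)]
```

A three-day query over a five-year table touches three directories instead of ~1,800, which is the whole point of the partitioning rule above.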
## How to use with Cline

Create a `.clinerules` file in your project root. Cline reads this file and applies the rules to all AI-assisted coding.