Architecture Philosophy
Cloud Architecture Operating Principles
These principles guide how I design, assess, and operate modern Microsoft cloud environments. They reflect 20+ years of lessons from enterprise infrastructure, hybrid identity, and multi-tenant deployments.
Daniel Lepel
Principal Microsoft Cloud Architect
Principle One
Identity Is the Security Boundary
Zero Trust Foundation
  • Centralized identity through Microsoft Entra ID
    Microsoft Entra ID: Microsoft's cloud identity platform (formerly Azure Active Directory). It manages who can sign in, what they can access, and under what conditions.
    Why it matters
    When identity is fragmented across systems, there is no single place to enforce policy or detect compromise. Centralizing in Entra ID gives you one control plane for every access decision across M365, Azure, and connected applications.
  • Privileged access with just-in-time elevation
    Why it matters
    Persistent admin accounts are a primary attack target. Just-in-time elevation through Privileged Identity Management (PIM) means there is nothing standing to compromise outside of an active session.
    Privileged Identity Management (PIM): grants admin access only when needed, for a defined time window, with optional approval - then removes it automatically.
  • Strong authentication - FIDO2 or certificate-based
    Why it matters
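    SMS and app-based MFA are better than passwords alone, but both can still be phished, and SMS is also exposed to SIM-swap attacks. Phishing-resistant methods such as FIDO2 security keys and certificate-based authentication close those vectors because the credential cannot be replayed from a look-alike sign-in page.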
  • Continuous evaluation of identity risk signals
    Why it matters
    A valid login at 9am doesn't mean the session is safe at 2pm. Conditional Access with continuous access evaluation re-checks risk throughout a session, not just at authentication time.
    Conditional Access: a policy engine that evaluates sign-in risk, device health, and user context continuously - not just at login - to decide whether to allow, block, or challenge each request.
"Identity is the new perimeter. Everything else depends on getting this right first."
  1. Network perimeters are gone - every request must be authenticated and authorized regardless of where it originates
  2. Zero standing access - admin privilege is granted only when needed and removed automatically when tasks are done
  3. Risk is continuous - authentication is a point-in-time check; trust must be re-evaluated throughout every session
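The idea of re-evaluating trust during a session can be sketched in a few lines. This is an illustrative model only, not the Conditional Access API: the `SessionSignals` type, its fields, and the decision thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionSignals:
    """Hypothetical snapshot of the signals known about one active session."""
    user_risk: str          # "low", "medium", or "high"
    device_compliant: bool  # device still meets the compliance baseline
    ip_changed: bool        # network location changed since sign-in


def evaluate_session(signals: SessionSignals) -> str:
    """Return "allow", "challenge", or "block" for the current request."""
    if signals.user_risk == "high" or not signals.device_compliant:
        return "block"
    if signals.user_risk == "medium" or signals.ip_changed:
        return "challenge"
    return "allow"


# The same session, evaluated at sign-in and again later in the day.
at_9am = SessionSignals(user_risk="low", device_compliant=True, ip_changed=False)
at_2pm = SessionSignals(user_risk="high", device_compliant=True, ip_changed=True)

print(evaluate_session(at_9am))
print(evaluate_session(at_2pm))
```

The point of the sketch is that the decision function runs on every request with current signals, so a session that was acceptable at sign-in can be blocked once the signals change.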
Principle Two
Least Privilege Is an Operational Discipline
Minimize Blast Radius
  • Role-based access design across all workloads
    Why it matters
    When everyone has broad access "just in case," a single compromised account can cause catastrophic damage. Role-based design limits what any one account can touch to only what it legitimately needs.
  • Privileged Identity Management (PIM) for elevation workflows
    PIM: the Entra ID feature that enables just-in-time privileged access. Users request elevation, it is approved, they get access for a defined window, then it is removed automatically.
    Why it matters
    Standing Global Admin accounts are one of the most common audit findings and one of the most targeted attack vectors. PIM eliminates them without eliminating the ability to perform privileged tasks when legitimately needed.
  • Regular review cycles for administrative assignments
    Why it matters
    Access tends to accumulate. People change roles, projects end, contractors leave - but their access often stays. Scheduled access reviews catch this before it becomes a security or compliance issue.
  • Elimination of standing Global Administrator access
    Why it matters
    A Global Admin account that exists all the time is a high-value target at all times. With PIM, that target only exists during an active, time-limited, logged session - dramatically reducing the attack surface.
"Privilege should exist only long enough to do the job - then disappear."
  1. Scope access tightly - minimum required permissions for the specific task, nothing broader
  2. Time-limit everything elevated - no access should persist past the window it was needed for
  3. Review and recertify regularly - access accumulates silently; scheduled reviews catch it before it becomes a liability
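Time-limited elevation is easy to reason about as data: a grant carries its own expiry, so nothing has to remember to revoke it. The sketch below is a simplified model of that idea, not how PIM is implemented; the `Elevation` type, the `active_roles` helper, and the sample user are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Elevation:
    """Hypothetical record of one approved, time-boxed role activation."""
    user: str
    role: str
    granted_at: datetime
    duration: timedelta

    def is_active(self, now: datetime) -> bool:
        # Active only inside the approved window; expiry needs no cleanup job.
        return self.granted_at <= now < self.granted_at + self.duration


def active_roles(elevations, user, now):
    """Roles the user holds right now; the empty set means no standing access."""
    return {e.role for e in elevations if e.user == user and e.is_active(now)}


start = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
grants = [Elevation("alice", "Global Administrator", start, timedelta(hours=2))]

during = active_roles(grants, "alice", start + timedelta(hours=1))
after = active_roles(grants, "alice", start + timedelta(hours=3))
print(during, after)
```

Because the privilege is derived from the grant and the current time, an account outside its window holds no admin role at all, which is the "zero standing access" property described above.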
Principle Three
Platform Consistency Reduces Risk
Govern Before You Grow
  • Azure landing zone structure and management groups
    Azure landing zone: a pre-configured Azure environment with defined management groups, policies, networking, and governance built in before workloads are deployed - the foundation for consistent, governable cloud operations.
    Why it matters
    Starting without a landing zone means every workload team makes its own infrastructure decisions. The result is a patchwork of inconsistent configurations that is expensive to secure and nearly impossible to audit.
  • Policy-driven configuration baselines across environments
    Why it matters
    Azure Policy enforces the baseline continuously - not just at deployment time. Resources that drift from the standard are flagged or automatically remediated before they cause problems.
    Azure Policy: enforces organizational rules on Azure resources. It can prevent non-compliant configurations from being created, audit existing ones, or automatically remediate deviations.
  • Standardized network and identity patterns
    Why it matters
    When every workload uses the same network topology and identity integration pattern, troubleshooting is faster, security reviews are simpler, and new team members can get up to speed without decoding a different architecture every time.
  • Consistent monitoring and logging across all workloads
    Why it matters
    You can't correlate a security event across environments that log differently. Consistent log format, retention, and destination are what make investigation possible - and what make compliance audits manageable.
"Configuration drift is slow, quiet, and one of the most common root causes of cloud security incidents."
  1. Build governance in early - a landing zone established before workloads arrive costs a fraction of retrofitting it later
  2. Enforce continuously - policy that runs at deployment time only misses everything that changes afterward
  3. Consistency enables speed - teams move faster when they don't have to re-solve infrastructure decisions that have already been made
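A policy-driven baseline can be pictured as a list of rules, each with a check and an effect, evaluated against every resource on every run rather than only at deployment. The sketch below borrows the "audit" and "deny" effect names from Azure Policy, but the rule format, rule names, and resource fields are invented for illustration and are not the Azure Policy definition schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """Hypothetical baseline rule: a name, a compliance check, and an effect."""
    name: str
    is_compliant: Callable[[dict], bool]
    effect: str  # "deny" blocks the configuration, "audit" only reports it


BASELINE = (
    Rule("storage-https-only", lambda r: r.get("https_only") is True, "deny"),
    Rule("has-owner-tag", lambda r: "owner" in r.get("tags", {}), "audit"),
)


def evaluate(resource: dict, rules=BASELINE):
    """Return (rule name, effect) for every rule the resource violates."""
    return [(rule.name, rule.effect) for rule in rules if not rule.is_compliant(resource)]


compliant = {"name": "stlogs01", "https_only": True, "tags": {"owner": "platform"}}
drifted = {"name": "stlogs02", "https_only": False, "tags": {}}

print(evaluate(compliant))
print(evaluate(drifted))
```

Running the same evaluation on a schedule, not just at creation time, is what catches a resource that was compliant when deployed and changed afterward.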
Principle Four
Security Must Be Integrated into Operations
Security as Daily Practice
  • Unified security telemetry through Microsoft Defender XDR
    Microsoft Defender XDR: an Extended Detection and Response platform that correlates signals across endpoints, identity, email, and cloud apps into a single investigation and response experience.
    Why it matters
    Security tools that don't talk to each other force analysts to manually correlate signals across consoles. Defender XDR surfaces the full attack chain in one place - from the initial phishing email through lateral movement to the target resource.
  • Centralized logging and investigation workflows
    Why it matters
    When logs live in different systems with different formats and different retention windows, incident response slows to a crawl. Centralized logging through Microsoft Sentinel turns hours of manual log hunting into minutes.
    Microsoft Sentinel: Microsoft's cloud-native SIEM and SOAR platform. It ingests signals from across the environment and enables automated detection, investigation, and response workflows.
  • Automated alert correlation where appropriate
    Why it matters
    Security teams are drowning in alerts. Correlation rules that group related low-fidelity signals into high-fidelity incidents dramatically reduce alert fatigue and surface what actually matters - without requiring human review of every individual event.
  • Clear incident response procedures, tested regularly
    Why it matters
    An incident response plan that has never been tested is a document, not a capability. Tabletop exercises and regular reviews of actual incident timelines are what build the muscle memory that matters when something real happens.
"Security that lives in a separate lane from operations will always be too slow to matter."
  1. Visibility across the full stack - identity, endpoints, email, cloud apps, and infrastructure in one correlated view
  2. Reduce signal noise - automated correlation surfaces real incidents from thousands of individual alerts
  3. Response must be practiced - plans that have never been tested are not plans, they are intentions
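Alert correlation, at its simplest, means grouping alerts that share an entity and occur close together in time. The sketch below shows that one technique under stated assumptions: the alert fields (`entity`, `time`, `title`), the 30-minute window, and the sample alerts are hypothetical, and real correlation in Defender XDR or Sentinel uses far richer logic.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=30)):
    """Group alerts per entity; a gap longer than `window` starts a new incident."""
    by_entity = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        by_entity[alert["entity"]].append(alert)

    incidents = []
    for entity, items in by_entity.items():
        current = [items[0]]
        for alert in items[1:]:
            if alert["time"] - current[-1]["time"] <= window:
                current.append(alert)  # close in time: same incident
            else:
                incidents.append({"entity": entity, "alerts": current})
                current = [alert]      # gap too large: new incident
        incidents.append({"entity": entity, "alerts": current})
    return incidents


t0 = datetime(2024, 1, 15, 9, 0)
alerts = [
    {"entity": "alice", "time": t0, "title": "Phishing email delivered"},
    {"entity": "alice", "time": t0 + timedelta(minutes=5), "title": "Sign-in from unfamiliar location"},
    {"entity": "alice", "time": t0 + timedelta(minutes=20), "title": "Mailbox forwarding rule created"},
    {"entity": "bob", "time": t0 + timedelta(hours=3), "title": "Unfamiliar sign-in properties"},
]

incidents = correlate(alerts)
for incident in incidents:
    print(incident["entity"], [a["title"] for a in incident["alerts"]])
```

Here four individual alerts collapse into two incidents, and the three alerts for one user read as a single attack chain instead of three unrelated events.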
Principle Five
Infrastructure Must Be Observable
You Cannot Fix What You Cannot See
  • Centralized logging and diagnostics across all layers
    Why it matters
    Problems in cloud environments rarely announce themselves cleanly. Centralized diagnostics across identity, network, compute, and application layers let you see what actually happened - not just what the affected service reported.
  • Monitoring of configuration drift from defined baselines
    Why it matters
    Environments change - sometimes deliberately, sometimes not. Drift monitoring alerts when a setting deviates from the approved baseline before it becomes a vulnerability, a compliance finding, or an outage root cause.
  • Operational dashboards for key platform services
    Why it matters
    Dashboards aren't just for executives. An operations team with a clear daily view of platform health - identity, endpoint compliance, security posture, resource utilization - catches problems proactively rather than reactively.
  • Log retention sufficient for investigation and compliance
    Why it matters
    Incidents are rarely discovered in real time. When investigation begins weeks or months later, retention policies determine whether the data you need still exists. Compliance frameworks specify minimums; security best practice often requires more.
"Observability isn't about collecting everything. It's about knowing where to look when something goes wrong."
  1. Visibility across all layers - identity, endpoints, applications, and cloud resources, not just uptime dashboards
  2. Ongoing drift detection - catch configuration changes before they become incidents or audit findings
  3. Retain for investigation - you cannot investigate an incident with logs that have already been purged
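The retention point reduces to simple date arithmetic: if the gap between an event and the start of the investigation exceeds the retention window, the evidence is gone. The dates and retention values below are made up for illustration and are not recommendations.

```python
from datetime import date, timedelta


def logs_available(event_date: date, investigation_date: date, retention_days: int) -> bool:
    """True if logs written on event_date still exist on investigation_date."""
    return investigation_date - event_date <= timedelta(days=retention_days)


intrusion = date(2024, 1, 10)
discovered = date(2024, 4, 20)  # investigation starts 101 days after the event

print(logs_available(intrusion, discovered, retention_days=90))
print(logs_available(intrusion, discovered, retention_days=365))
```

With 90 days of retention the logs from the day of the intrusion have already been purged; with 365 days they are still there to investigate.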
Principle Six
Documentation Enables Continuity
Architecture Decisions Have a Memory
  • Architecture diagrams and current-state design standards
    Why it matters
    Architecture documentation that reflects aspirational state rather than actual state is actively harmful - it leads teams to make decisions based on a picture of the environment that doesn't exist. Current-state documentation is the only kind that has operational value.
  • Security and governance policies in accessible form
    Why it matters
    Policies that exist only as PDFs in a SharePoint library are not operational. Security and governance policies need to be findable, readable, and maintained - otherwise teams make judgment calls that contradict them without realizing it.
  • Operational runbooks for routine and emergency procedures
    Why it matters
    I write runbooks the team can actually follow under pressure - not architecture documents that require deep context to interpret. If the on-call engineer has never seen the system before, the runbook should still get them through it.
  • Incident review summaries that inform future decisions
    Why it matters
    Incidents that are resolved without a written review get repeated. A concise post-incident summary - what happened, why, what changed - turns a painful event into an institutional learning that prevents the next one.
"Documentation that reflects what was planned, not what was built, is worse than no documentation at all."
  1. Current state only - aspirational architecture diagrams mislead; accurate ones enable
  2. Runbooks over reference docs - operational documentation must work under pressure, not just on a good day
  3. Incidents are a curriculum - every unreviewed outage is a lesson the organization paid for but never learned
Principle Seven
Automation Should Assist Human Operators
Speed With Judgment
  • Infrastructure deployment through templates and policy
    Why it matters
    Manual infrastructure deployments introduce variability. Every environment deployed from a validated template or policy baseline starts in a known-good state - reducing the surface area for configuration errors from the first moment.
  • Automated configuration validation and drift detection
    Why it matters
    Checking 25 tenants manually for configuration drift is not operationally viable. PowerShell-driven validation against a defined baseline runs in minutes and surfaces deviations before they compound into larger problems.
  • AI-assisted investigation and incident summarization
    Why it matters
    Tools like Security Copilot can synthesize a complex incident timeline in seconds. That doesn't replace analyst judgment - it eliminates the data-gathering work so analysts can focus on decision-making.
    Microsoft Security Copilot: an AI-powered security analysis tool that synthesizes signals from Defender, Sentinel, and Entra ID to accelerate investigation - summarizing incidents, explaining findings, and suggesting remediation steps.
  • Human review preserved for decisions with material impact
    Why it matters
    Automation is a force multiplier, not a replacement for judgment. Routine, well-understood tasks are candidates for automation. Decisions with significant security, compliance, or operational consequences require a human in the loop - full stop.
"Automation should make good operators faster, not replace the judgment that makes them good."
  1. Deploy from baselines - every environment starts in a known-good state, not wherever manual steps happened to land
  2. AI as analyst support - Security Copilot and Defender AI accelerate investigation; they don't replace the analyst making the call
  3. Humans own material decisions - full transparency and traceability in every automated action, with human review where consequences are significant
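Baseline validation across many tenants comes down to one comparison repeated per tenant. The text above describes doing this with PowerShell; the sketch below uses Python purely to show the shape of the logic, and the setting names, tenant names, and configuration values are hypothetical.

```python
def find_drift(baseline: dict, actual: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that deviates.

    A setting missing from `actual` is reported with an actual value of None.
    """
    drift = {}
    for key, expected in baseline.items():
        value = actual.get(key)
        if value != expected:
            drift[key] = (expected, value)
    return drift


# Hypothetical approved baseline and per-tenant configuration exports.
BASELINE = {
    "legacy_auth_blocked": True,
    "mfa_required_for_admins": True,
    "audit_log_enabled": True,
}
tenants = {
    "tenant-a": {"legacy_auth_blocked": True, "mfa_required_for_admins": True, "audit_log_enabled": True},
    "tenant-b": {"legacy_auth_blocked": False, "mfa_required_for_admins": True},
}

# Keep only tenants that deviate, so the report is a short list for human review.
report = {name: find_drift(BASELINE, cfg) for name, cfg in tenants.items()}
report = {name: drift for name, drift in report.items() if drift}
print(report)
```

The script only surfaces deviations; whether to remediate a given tenant stays a human decision, in line with the principle above.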
In Practice
How These Principles Work Together
These principles are not a checklist. They are a lens for evaluating tradeoffs. When an organization is under pressure to move fast, they are what prevent speed from becoming debt. When an environment is mature, they are what keep it that way.
  • Identity - the security boundary everything else depends on
  • Least Privilege - limits the damage any single compromise can cause
  • Consistency - prevents the drift that erodes both security and manageability
  • Security Ops - makes threats visible and usable in daily operations
  • Observability - makes the environment knowable and investigations possible
  • Documentation - preserves institutional knowledge across people and time
  • Automation - amplifies human capability without replacing human judgment
Good architecture isn't about the technology - it's about whether the people relying on it can do their jobs without thinking about it.