15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),

## [Unreleased]

## [3.3.0]

Serialization and ergonomic polish for the typed value objects, a canonical display-form method, and an opt-in local-part normalizer callback. All additions are non-breaking for v3.2 callers.

### Added
- `ParsedEmailAddress::toArray(): array` — round-trips to the legacy array shape produced by `Parse::parse()`. Useful when mixing typed and array-based code.
- `ParsedEmailAddress::toJson(int $flags = 0): string` — convenience wrapper over `json_encode` with `JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES`. `ParseErrorCode` serializes to its backing string value.
- `ParseResult::toArray()` and `ParseResult::toJson()` — same for the multi-address container; each entry is serialized via `ParsedEmailAddress::toArray()`.
- `ParsedEmailAddress implements \Stringable` — `(string) $parsed` returns the `simpleAddress` for valid addresses, empty string otherwise. Lets a parsed address drop directly into string contexts (logging, templates, etc.).
- `ParsedEmailAddress::canonical(): string` — canonical RFC 5322 display form with minimal quoting per §3.2.4 (local-part) and §3.2.5 (phrase). Drops unnecessary quotes that `$address` may preserve from the input, and adds quotes only where required. Returns empty string for invalid addresses.
- `ParseOptions::$localPartNormalizer` (readonly `?\Closure`) + `withLocalPartNormalizer(?callable)` fluent builder. The callback `fn(string $localPart, string $domain): string` is invoked after local-part validation succeeds; the returned string replaces `local_part_parsed` in the output. Typical uses: Gmail dot-insensitivity, `+tag` plus-addressing, or any domain-specific canonicalization. `originalAddress` still preserves the verbatim input.
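
The minimal-quoting rule described for `canonical()` can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `isDotAtom()`, `isPhrase()`, `quoteString()` and `canonicalForm()` are hypothetical helpers, the inputs are assumed to be already-unescaped values, and only ASCII `atext` is treated as safe to leave bare (so non-ASCII names get quoted here, which the real method may handle differently).

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration only; not part of the library.

// One or more RFC 5322 atext characters (ASCII only in this sketch).
const ATEXT = '[A-Za-z0-9!#$%&\'*+\/=?^_`{|}~-]+';

/** True when $s is a dot-atom: atext runs separated by single dots. */
function isDotAtom(string $s): bool
{
    return preg_match('/^' . ATEXT . '(?:\.' . ATEXT . ')*$/D', $s) === 1;
}

/** True when $s is a sequence of atoms separated by spaces or tabs. */
function isPhrase(string $s): bool
{
    return preg_match('/^' . ATEXT . '(?:[ \t]+' . ATEXT . ')*$/D', $s) === 1;
}

/** Wrap $s in double quotes, escaping backslashes and double quotes. */
function quoteString(string $s): string
{
    return '"' . strtr($s, ['\\' => '\\\\', '"' => '\\"']) . '"';
}

/** Build a display form that quotes only the parts that require it. */
function canonicalForm(string $displayName, string $localPart, string $domain): string
{
    $local = isDotAtom($localPart) ? $localPart : quoteString($localPart);
    $addrSpec = $local . '@' . $domain;

    if ($displayName === '') {
        return $addrSpec;
    }

    $name = isPhrase($displayName) ? $displayName : quoteString($displayName);

    return $name . ' <' . $addrSpec . '>';
}

echo canonicalForm('J Doe', 'j', 'example.com'), "\n";        // J Doe <j@example.com>
echo canonicalForm('Doe, J', 'j.doe', 'example.com'), "\n";   // "Doe, J" <j.doe@example.com>
echo canonicalForm('', 'john smith', 'example.com'), "\n";    // "john smith"@example.com
```

The first call mirrors the README example: the quotes around `J Doe` in the input are unnecessary because both words are plain atoms, so the sketch drops them.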

### Changed
- None — all additions; no behavior changes for existing callers.

## [3.2.0]

Streaming batch parsing, severity classification for validation errors, RFC 5322 §4.4 obs-route support, and broader CFWS tolerance around addr-spec boundaries. All additions are non-breaking for v3.1 callers.
41 changes: 26 additions & 15 deletions README.md
@@ -51,6 +51,13 @@ foreach (Parse::getInstance()->parseStream($csvRows) as $addr) {
if ($addr->invalid) continue;
// ...
}

// Serialization (v3.3+)
$parsed = Parse::getInstance()->parseSingle('"J Doe" <j@example.com>');
(string) $parsed; // "j@example.com" — Stringable returns simple_address
$parsed->canonical(); // 'J Doe <j@example.com>' — minimal RFC 5322 quoting
$parsed->toArray(); // legacy array shape, for mixed-API code
$parsed->toJson(); // JSON string
```
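
To make the serialization behavior concrete without depending on the library, here is a small stand-in. Everything in it is an assumption for illustration: `DemoAddress`, `DemoErrorCode`, and the `error_code` key are hypothetical names, and the real `ParsedEmailAddress` carries many more fields. It shows the three documented behaviors: the string cast yields the simple address (or an empty string when invalid), `toJson()` always ORs in `JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES`, and a backed enum encodes as its backing value.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins; names and fields are illustrative only.
enum DemoErrorCode: string
{
    case MissingDomain = 'missing_domain';
}

final class DemoAddress implements \Stringable
{
    public function __construct(
        public readonly string $simpleAddress,
        public readonly bool $invalid = false,
        public readonly ?DemoErrorCode $errorCode = null,
    ) {
    }

    /** Valid addresses cast to their simple address; invalid ones to ''. */
    public function __toString(): string
    {
        return $this->invalid ? '' : $this->simpleAddress;
    }

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return [
            'simple_address' => $this->simpleAddress,
            'invalid' => $this->invalid,
            'error_code' => $this->errorCode,
        ];
    }

    /** Caller flags are combined with the two always-on flags. */
    public function toJson(int $flags = 0): string
    {
        $encoded = json_encode(
            $this->toArray(),
            $flags | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES,
        );

        return $encoded === false ? '{}' : $encoded;
    }
}

$ok = new DemoAddress('müller@münchen.de');
$bad = new DemoAddress('', true, DemoErrorCode::MissingDomain);

echo $ok, "\n";            // müller@münchen.de
echo $ok->toJson(), "\n";  // {"simple_address":"müller@münchen.de","invalid":false,"error_code":null}
echo $bad->toJson(), "\n"; // {"simple_address":"","invalid":true,"error_code":"missing_domain"}
```

Because the enum is backed by a string, `json_encode` emits `"missing_domain"` with no custom serializer.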

### Advanced Usage with ParseOptions
@@ -94,7 +101,7 @@ $parser = new Parse(null, $options);
// RFC 6531 — Strict Internationalized (full UTF-8 + NFC normalization)
$options = ParseOptions::rfc6531();
$parser = new Parse(null, $options);
$result = $parser->parse('müller@münchen.de', false); // Valid UTF-8 address
$result = $parser->parseSingle('müller@münchen.de'); // Valid UTF-8 address

// RFC 5322 — Standard with obsolete syntax support (recommended)
$options = ParseOptions::rfc5322();
@@ -133,10 +140,10 @@ $parser = new Parse(null, $options);
$options = ParseOptions::rfc6531();
$parser = new Parse(null, $options);

$result = $parser->parse('José.García@españa.es', false);
$result = $parser->parseSingle('José.García@españa.es');
// Valid: UTF-8 characters allowed in rfc6531() preset

$result = $parser->parse('.user@example.com', false);
$result = $parser->parseSingle('.user@example.com');
// Invalid: Leading dot not allowed (dot-atom restrictions still apply)
```

@@ -173,6 +180,7 @@ $parser = new Parse(null, $options);
| `validateDisplayNamePhrase` | `false` | Enforce RFC 5322 §3.2.5 phrase syntax on unquoted display names |
| `strictIdna` | `false` | Apply full IDNA2008 conformance on U-label domains (RFC 5891/5892/5893) |
| `allowObsRoute` | `false` | Accept RFC 5322 §4.4 obs-route source-routes like `<@host1,@host2:user@host3>` |
| `localPartNormalizer` | `null` | `?\Closure` with signature `fn(string $local, string $domain): string`; domain-specific canonicalization hook (v3.3+). Set via `withLocalPartNormalizer()`, which accepts any callable |
| **Length & Output** | | |
| `enforceLengthLimits` | `true` | Enforce RFC 5321 length limits (64/254/63) |
| `includeDomainAscii` | `false` | Include punycode `domain_ascii` in output |
@@ -244,9 +252,9 @@ The `domain_ascii` field is included in the output when `includeDomainAscii` is
```php
$options = ParseOptions::rfc6531();
$parser = new Parse(null, $options);
$result = $parser->parse('user@bücher.de', false);
// $result['domain'] = 'bücher.de'
// $result['domain_ascii'] = 'xn--bcher-kva.de'
$result = $parser->parseSingle('user@bücher.de');
// $result->domain === 'bücher.de'
// $result->domainAscii === 'xn--bcher-kva.de'
```
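
For readers curious where a value like `xn--bcher-kva.de` comes from, the conversion can be reproduced with `idn_to_ascii()` from ext-intl. This is a sketch of the general technique, not a claim about how the parser computes `domainAscii` internally; `domainToAscii()` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Convert a U-label domain to its ASCII (punycode) form using ext-intl.
 * Returns null when ext-intl is unavailable or the conversion fails.
 */
function domainToAscii(string $domain): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null;
    }

    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? null : $ascii;
}

// With ext-intl loaded this yields 'xn--bcher-kva.de'.
$ascii = domainToAscii('bücher.de');
echo $ascii ?? '(ext-intl not available)', "\n";
```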

### Comment Extraction
@@ -257,20 +265,20 @@ RFC 5322 allows comments in email addresses using parentheses. The parser automa
use Email\Parse;

// Single comment
$result = Parse::getInstance()->parse('john@example.com (home address)', false);
// $result['comments'] = ['home address']
$result = Parse::getInstance()->parseSingle('john@example.com (home address)');
// $result->comments === ['home address']

// Multiple comments
$result = Parse::getInstance()->parse('test(comment1)(comment2)@example.com', false);
// $result['comments'] = ['comment1', 'comment2']
$result = Parse::getInstance()->parseSingle('test(comment1)(comment2)@example.com');
// $result->comments === ['comment1', 'comment2']

// Nested comments
$result = Parse::getInstance()->parse('test@example.com (comment with (nested) parens)', false);
// $result['comments'] = ['comment with (nested) parens']
$result = Parse::getInstance()->parseSingle('test@example.com (comment with (nested) parens)');
// $result->comments === ['comment with (nested) parens']

// No comments
$result = Parse::getInstance()->parse('test@example.com', false);
// $result['comments'] = []
$result = Parse::getInstance()->parseSingle('test@example.com');
// $result->comments === []
```

Comments are stripped from the `address` field but preserved in `original_address`.
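
The nesting behavior shown above can be pictured as a depth counter over the input. The sketch below is a simplified, standalone approximation and not the parser's state machine: `extractComments()` is a hypothetical function, it only skips quoted strings and honors backslash escapes, and it assumes escaped characters are stored unescaped.

```php
<?php
declare(strict_types=1);

/**
 * Collect top-level parenthesized comments, keeping nested parentheses
 * inside the enclosing comment. Hypothetical helper for illustration.
 *
 * @return list<string>
 */
function extractComments(string $input): array
{
    $comments = [];
    $current = '';
    $depth = 0;
    $inQuotes = false;
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $ch = $input[$i];

        // A backslash escapes the next byte inside quoted strings and comments.
        if ($ch === '\\' && ($inQuotes || $depth > 0) && $i + 1 < $length) {
            if ($depth > 0) {
                $current .= $input[$i + 1];
            }
            $i++;
            continue;
        }

        // Parentheses inside a quoted string are not comments.
        if ($depth === 0 && $ch === '"') {
            $inQuotes = !$inQuotes;
            continue;
        }
        if ($inQuotes) {
            continue;
        }

        if ($ch === '(') {
            if ($depth > 0) {
                $current .= $ch; // nested opener stays in the text
            }
            $depth++;
        } elseif ($ch === ')' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                $comments[] = $current; // outermost comment closed
                $current = '';
            } else {
                $current .= $ch; // nested closer stays in the text
            }
        } elseif ($depth > 0) {
            $current .= $ch;
        }
    }

    return $comments;
}

print_r(extractComments('test(comment1)(comment2)@example.com'));
print_r(extractComments('test@example.com (comment with (nested) parens)'));
```

Scanning byte by byte is safe for UTF-8 input here because the four significant characters are ASCII and never occur inside a multi-byte sequence.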
@@ -304,7 +312,7 @@ $parser = new Parse(null, $options);
// Use the rfc6531() preset for full internationalized email support
$options = ParseOptions::rfc6531();
$parser = new Parse(null, $options);
$result = $parser->parse('müller@münchen.de', false);
$result = $parser->parseSingle('müller@münchen.de');
```

#### Function Spec ####
@@ -357,6 +365,9 @@ $result = $parser->parse('müller@münchen.de', false);

Other Examples:
---------------

The following examples use the legacy array-returning `parse()` method to document its full output shape. New code should prefer `parseSingle()` / `parseMultiple()` (see Basic Usage) for typed return values; both APIs expose the same underlying fields.

```php
$email = '"J Doe" <johndoe@xyz.com>';
$result = Email\Parse::getInstance()->parse($email, false);
54 changes: 49 additions & 5 deletions ROADMAP.md
@@ -57,6 +57,50 @@ Future plans by version. Items here are intent, not commitment — priority and
- [x] `obs-domain-list` — the `*("," [CFWS] ["@" domain])` shape is consumed inside `STATE_OBS_ROUTE`.
- [x] CFWS (comments / folding whitespace) improvements — look-ahead in the whitespace handler now absorbs CFWS at dot-atom boundaries (`local @domain`, `local@ domain`, `local @ domain`) and around angle-addr delimiters (`< local@domain >`, `<local @ domain>`), including folded whitespace (LF + WSP). Comments in these positions were already supported in v3.0.

## v3.3 — Polish, Ergonomics — shipped

Non-breaking follow-on to v3.2.

**Serialization ergonomics:**
- [x] `ParsedEmailAddress::toArray(): array<string, mixed>` — round-trips to the legacy array shape for callers mixing typed and array-based code.
- [x] `ParsedEmailAddress::toJson(int $flags = 0): string` — convenience wrapper over `json_encode` with `JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES`.
- [x] `implements \Stringable` on `ParsedEmailAddress` — returns `simpleAddress` for valid addresses; empty string otherwise. Drops directly into string contexts.
- [x] `ParseResult::toArray()` and `toJson()` counterparts.

**Canonicalization (pulled forward from v4.0):**
- [x] `ParsedEmailAddress::canonical(): string` — minimal-quoting RFC 5322 display form per §3.2.4 (local-part) and §3.2.5 (phrase).
- [x] Optional local-part normalizer callback on `ParseOptions` for domain-specific rules (Gmail dot-insensitivity, `+tag` plus-addressing). Attached via `withLocalPartNormalizer(?callable)`.

**Ecosystem bridges:** *(deferred — out of scope for v3.3 per user direction)*
- [ ] `mmucklo/email-parse-symfony` — Symfony `Constraint` + `ConstraintValidator` attribute. Wraps existing `ParseOptions` presets.
- [ ] `mmucklo/email-parse-laravel` — Laravel validation rule, service provider for DI.
- [ ] PSR-14 event dispatcher integration — emit a `ParsedAddressEvent` per result for observability.

## Quality and Infrastructure (ongoing)

Not tied to a specific release; picked up as time allows.

**Testing depth:**
- [ ] Mutation testing with Infection. Surfaces tests whose assertions are too weak to catch small code mutations. Target ≥85% MSI (mutation score indicator).
- [ ] Property-based testing (Eris or Pest plugin): generate random valid addresses, assert `parseSingle(parseSingle($x)->simpleAddress)` round-trips; perturb bytes and assert error codes.
- [ ] Parse.php line coverage 86.69% → ≥95% — remaining gaps are obscure error branches and the "shouldn't ever get here" default case.
- [ ] CI matrix: add PHP 8.5 once released.

**Static analysis:**
- [ ] PHPStan level 6 → 8 (or `max`) — tighter generics and inference on the state machine. Likely requires additional docblock array shapes.
- [ ] Add Psalm alongside PHPStan for cross-tool coverage; keep both green.

**Performance:**
- [ ] PhpBench suite: parsing throughput for realistic inputs (single ASCII, multi-address batch, UTF-8, IDN, obs-route). Establishes a baseline before any optimization.
- [ ] Profile the state machine under mailing-list-sized inputs. Likely hot path: `mb_substr` in the main loop — investigate byte iteration for pure-ASCII inputs.

**Community / documentation:**
- [ ] `CONTRIBUTING.md` with dev setup, CI expectations, and commit-style guidance.
- [ ] GitHub issue + pull-request templates.
- [ ] `CODE_OF_CONDUCT.md`.
- [ ] Examples directory or GitHub Pages cookbook (UTF-8 addresses, obs-route in practice, custom local-part normalizers, Symfony/Laravel integration snippets).
- [ ] README cleanup — split the large reference tables into `docs/` sub-pages if the top-level README grows further.

## v4.0 — Breaking Modernization

**API cleanup:**
@@ -65,8 +109,8 @@ Future plans by version. Items here are intent, not commitment — priority and
- [ ] Deprecate or remove the `getInstance()` singleton (recommend explicit instantiation).
- [ ] Constructor promotion on `ParseOptions` with named arguments.

**New capabilities:**
- [ ] Optional DNS/MX validation via callback interface (`DnsValidator`).
- [ ] Group syntax support (RFC 6854: `Group Name: addr1, addr2;`).
- [ ] `canonicalize(ParsedEmailAddress): string` — standard display form.
- [ ] Optional local-part normalizer callback for domain-specific rules (e.g. Gmail dot-insensitivity, plus-addressing).
**New capabilities (genuinely breaking or late-binding):**
- [ ] Optional DNS/MX validation via callback interface (`DnsValidator`). Breaking because the Parse constructor signature grows, and because synchronous DNS lookups change performance characteristics meaningfully.
- [ ] Group syntax support (RFC 6854: `Group Name: addr1, addr2;`). Breaking because it introduces a new output-container shape for grouped results.

*Note: `canonicalize()` (shipped as `ParsedEmailAddress::canonical()`) and the local-part normalizer callback were moved to v3.3 as additive (non-breaking) features.*
30 changes: 30 additions & 0 deletions UPGRADE.md
@@ -1,5 +1,35 @@
# Upgrade Guide

## v3.2 → v3.3

v3.3 is fully additive — no breaking changes, no behavior changes for existing callers. Everything listed here is opt-in.

### Additions

- **Serialization**: `ParsedEmailAddress::toArray()`, `ParsedEmailAddress::toJson()`, and the corresponding methods on `ParseResult`. Use these to round-trip typed objects back to the legacy array shape or emit JSON:
```php
$result = $parser->parseSingle('user@example.com');
$result->toArray(); // legacy array shape
$result->toJson(); // JSON string; ParseErrorCode serializes to its backing value
```
- **`implements \Stringable`** on `ParsedEmailAddress` — `(string) $parsed` returns `simpleAddress` for valid addresses, empty string otherwise. Lets a parsed address drop into string contexts (logging, templating, concatenation).
- **`ParsedEmailAddress::canonical()`** — minimal-quoting RFC 5322 display form. Drops unnecessary quotes that the `$address` field may preserve from the input; adds quotes only when §3.2.4 / §3.2.5 require them.
- **Local-part normalizer callback** — configure with `withLocalPartNormalizer(fn(string $local, string $domain): string)`. Invoked only after successful validation; the returned string replaces `local_part_parsed`. `originalAddress` still preserves the verbatim input. Example (Gmail):
```php
$opts = ParseOptions::rfc5322()->withLocalPartNormalizer(
    fn (string $local, string $domain): string =>
        $domain === 'gmail.com'
            ? (($plus = strpos(str_replace('.', '', $local), '+')) === false
                ? str_replace('.', '', $local)
                : substr(str_replace('.', '', $local), 0, $plus))
            : $local,
);
```
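
The same Gmail rule can be written as a named closure with one step per line, which is easier to unit-test in isolation. This is a sketch under assumptions rather than a recommended canonical rule: `$gmailNormalizer` is a hypothetical name, the domain comparison is made case-insensitive, and a local-part that would normalize to an empty string is returned unchanged.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical Gmail-style normalizer: strips any "+tag" suffix and all
 * dots for gmail.com, and leaves every other domain untouched.
 */
$gmailNormalizer = static function (string $local, string $domain): string {
    // Domain names are case-insensitive, so compare without regard to case.
    if (strcasecmp($domain, 'gmail.com') !== 0) {
        return $local;
    }

    $base = explode('+', $local, 2)[0];      // drop "+tag" and everything after it
    $base = str_replace('.', '', $base);     // drop dots

    // Never hand back an empty local-part; keep the input instead.
    return $base === '' ? $local : $base;
};

echo $gmailNormalizer('john.doe+news', 'gmail.com'), "\n";   // johndoe
echo $gmailNormalizer('john.doe+news', 'example.com'), "\n"; // john.doe+news
```

A closure like this can be passed to `withLocalPartNormalizer()` in place of the inline arrow function.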

### Minimum Requirements (Unchanged)

PHP `^8.1`, `ext-mbstring`, `ext-intl`.

## v3.1 → v3.2

v3.2 is fully additive — no breaking changes. Two behavior changes are worth noting for callers who depended on them:
18 changes: 18 additions & 0 deletions src/Parse.php
@@ -1079,6 +1079,24 @@ private function addAddress(
? "\"{$emailAddress['local_part_parsed']}\""
: $emailAddress['local_part_parsed'];
}

// Optional caller-supplied local-part normalizer — invoked after structural
// validation so the callback only sees addresses that already conform to
// the configured ParseOptions rules. Typical uses: Gmail dot-insensitivity
// (`john.doe` → `johndoe`), plus-addressing (`user+tag` → `user`), or any
// domain-specific canonicalization. The returned string replaces
// local_part_parsed and the display form is re-derived; `original_address`
// still preserves the verbatim input.
if (!$emailAddress['invalid'] && $this->options->localPartNormalizer !== null) {
$normalizer = $this->options->localPartNormalizer;
$normalized = $normalizer($emailAddress['local_part_parsed'], $emailAddress['domain']);
if ($normalized !== $emailAddress['local_part_parsed']) {
$emailAddress['local_part_parsed'] = $normalized;
$localPart = $emailAddress['local_part_quoted']
? "\"{$emailAddress['local_part_parsed']}\""
: $emailAddress['local_part_parsed'];
}
}
}

// FQDN check
28 changes: 28 additions & 0 deletions src/ParseOptions.php
@@ -43,6 +43,7 @@ class ParseOptions
* @param bool $validateDisplayNamePhrase Enforce RFC 5322 §3.2.5 phrase syntax for unquoted display names (atext + WSP only).
* @param bool $strictIdna Apply full IDNA2008 conformance on U-label domains (CONTEXTJ/O, Bidi rule, STD3, nontransitional mapping).
* @param bool $allowObsRoute Accept RFC 5322 §4.4 obs-route source-route prefix inside angle-addr (e.g. `<@host1,@host2:user@host3>`); the route is captured and the real addr-spec is used ("accept and discard" per spec).
* @param ?\Closure $localPartNormalizer Optional callback `fn(string $localPart, string $domain): string` invoked after local-part validation succeeds. The returned string replaces `local_part_parsed` in the output (and is wrapped in double quotes again when the original local-part was quoted). Typical uses: Gmail dot-insensitivity, `+tag` plus-addressing.
*/
public function __construct(
array $bannedChars = [],
@@ -66,6 +67,7 @@ public function __construct(
public readonly bool $validateDisplayNamePhrase = false,
public readonly bool $strictIdna = false,
public readonly bool $allowObsRoute = false,
public readonly ?\Closure $localPartNormalizer = null,
) {
foreach ($bannedChars as $char) {
$this->bannedChars[$char] = true;
@@ -304,6 +306,29 @@ public function withAllowObsRoute(bool $value): self
return $this->cloneWith(['allowObsRoute' => $value]);
}

/**
* Supply a local-part normalizer callback, or `null` to clear any current one.
*
* The callback is invoked after local-part validation succeeds with
* `fn(string $localPart, string $domain): string`. Its return value
* replaces `local_part_parsed` in the output — typical uses are Gmail
* dot-insensitivity (`john.doe` → `johndoe`) and plus-addressing
* (`user+tag` → `user`), typically gated on the domain.
*
* $opts = ParseOptions::rfc5322()->withLocalPartNormalizer(
* fn(string $local, string $domain): string =>
* $domain === 'gmail.com'
* ? strtolower(strstr(str_replace('.', '', $local), '+', true) ?: str_replace('.', '', $local))
* : $local,
* );
*/
public function withLocalPartNormalizer(?callable $normalizer): self
{
return $this->cloneWith([
'localPartNormalizer' => $normalizer === null ? null : \Closure::fromCallable($normalizer),
]);
}

/**
* Build a new ParseOptions preserving every current value except those
* listed in $overrides.
@@ -336,6 +361,9 @@ private function cloneWith(array $overrides): self
validateDisplayNamePhrase: $get('validateDisplayNamePhrase', $this->validateDisplayNamePhrase),
strictIdna: $get('strictIdna', $this->strictIdna),
allowObsRoute: $get('allowObsRoute', $this->allowObsRoute),
localPartNormalizer: array_key_exists('localPartNormalizer', $overrides)
? $overrides['localPartNormalizer']
: $this->localPartNormalizer,
);
}

31 changes: 31 additions & 0 deletions src/ParseResult.php
@@ -38,4 +38,35 @@ public static function fromArray(array $arr): self
),
);
}

/**
* Round-trip to the array shape produced by {@see Parse::parse()} in
* multi-address mode. Each address is serialized via
* {@see ParsedEmailAddress::toArray()}.
*
* @return array{success: bool, reason: ?string, email_addresses: array<int, array<string, mixed>>}
*/
public function toArray(): array
{
return [
'success' => $this->success,
'reason' => $this->reason,
'email_addresses' => array_map(
fn (ParsedEmailAddress $a) => $a->toArray(),
$this->emailAddresses,
),
];
}

/**
* JSON-encoded representation. Convenience wrapper over {@see toArray()}.
*
* @param int $flags Flags passed through to `json_encode` (e.g. `JSON_PRETTY_PRINT`).
*/
public function toJson(int $flags = 0): string
{
$encoded = json_encode($this->toArray(), $flags | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);

return $encoded === false ? '{}' : $encoded;
}
}