
Commit 2191e22

[Blueprints] Support Data Liberation importer in the importWxr step (#2058)
## Description

Adds the Data Liberation WXR importer as an option in the `importWxr` step. The new importer is turned on by including the `"importer": "data-liberation"` option:

```json
{
  "steps": [
    {
      "step": "importWxr",
      "file": {
        "resource": "url",
        "url": "https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml"
      },
      "importer": "data-liberation"
    }
  ]
}
```

When the `importer` option is missing or set to `"default"`, the step behaves exactly as before and continues using the https://github.com/humanmade/WordPress-Importer importer.

The new importer:

* Rewrites links in the imported content
* Downloads assets through Playground's CORS proxy
* Parallelizes the downloads
* Communicates progress

This PR is a part of #1894

## Implementation details

This `importWxr` step fetches and includes the `data-liberation-core.phar` file. The phar file is built with [Box](https://box-project.github.io/box/configuration/) and contains the importer library with its dependencies: a subset of the Data Liberation library, a subset of the Blueprints library, and a few vendor libraries.

This, unfortunately, means that any change in the PHP files requires rebuilding the .phar file. Here's how you can do it:

```bash
nx build:phar playground-data-liberation
```

You can also build the entire Data Liberation package as a WordPress plugin complete with a wp-admin page:

```bash
nx build:plugin playground-data-liberation
```

Both commands output the built files to `packages/playground/data-liberation/dist`.

The progress updates are a first-class feature of the new importer. The updated `importWxr` step receives them in real time via a `post_message_to_js()` call that runs after every import step. Then, it passes them on to the progress bar UI.
### Other changes

* **TLS traffic now goes through the CORS proxy.** Since the new importer uses `AsyncHTTP\Client`, which deals with raw sockets, Playground's [TLS-based network bridge](#1926) runs the outbound traffic through a CORS proxy. Technically, `TCPOverFetchWebsocket` gets the `corsProxy` URL passed to the `playground.boot()` call.
* A few Composer dependencies were forked, downgraded to PHP 7.2 using Rector, and bundled with this PR to keep the Data Liberation importer working.

## Remaining work

- [x] PHP 7.2 compatibility. Done by forking and Rector-downgrading dependencies that were incompatible with PHP 7.2.
- [x] Report the importer's progress on the overall Blueprint progress bar
- [x] Enqueue the data liberation plugin files for downloading at the blueprint compilation stage
- [x] Don't eagerly rewrite attachment URLs in `WP_Stream_Importer`. Exposing this information to the API consumer requires an explicit decision. Do we rewrite it? Or do we ignore it?
- [x] Fix the TLS errors at the intersection of Playground network transport and the async HTTP client library
- [x] Separate the markdown importer and its dependencies (md parser, frontmatter parser, Symfony libraries) from the core plugin
- [x] Ship the importer and its tree-shaken deps (URL parser) as a minified zip/phar

## Follow-up work

- [ ] Reconsider the `WP_Import_Session` API: do we need so many verbosely named methods? Can we achieve the same outcomes with fewer methods?
- [ ] Investigate why there's a significant delay before media downloads start on PHP 7.2 – 7.4. It's likely a PHP.wasm issue.
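To illustrate the CORS proxy routing described under "Other changes": the diff in this commit builds the proxied request by concatenating the proxy URL with the target URL. The sketch below isolates that one step. `withCorsProxy` is a hypothetical helper name and `https://example.com/image.png` is a placeholder; only the concatenation and the proxy URL (taken from the CI configuration in this commit) come from the source.

```typescript
// Hypothetical helper, not part of the Playground API. It mirrors the
// `${corsProxyUrl}${request.url}` concatenation used in this commit.
function withCorsProxy(targetUrl: string, corsProxyUrl?: string): string {
  // Without a proxy URL, the request goes straight to the target.
  return corsProxyUrl ? `${corsProxyUrl}${targetUrl}` : targetUrl;
}

// The proxy URL from the CI workflow ends with "?", so plain
// concatenation places the target URL in the query string.
const proxied = withCorsProxy(
  'https://example.com/image.png',
  'http://127.0.0.1:5263/cors-proxy.php?'
);
const direct = withCorsProxy('https://example.com/image.png');

console.log(proxied);
console.log(direct);
```

Because the target URL is appended as-is, the proxy URL presumably has to end with a separator such as `?`; the sketch does not add one or encode the target.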
## Testing instructions

* Default importer: [Open this link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}) and confirm it does what the current `importWxr` step does: it stays at "Importing content" for a moment, fails to fetch media files (CORS issues in network tools), but inserts posts and pages.
* Data Liberation: [Open this link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22importer%22:%20%22data-liberation%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}), then confirm the import progress is visible and that the content and media indeed get imported:

![CleanShot 2024-12-08 at 14 54 49@2x](https://github.com/user-attachments/assets/a7da3244-a10f-43d2-8e94-43d305220a7e)

## Related issues

* #1211
* #2012
* #1477
* #1250
* #1780
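The testing links carry the whole Blueprint as JSON in the URL fragment. A minimal sketch of building such a link follows, assuming the app reads the Blueprint from the fragment as those links suggest. The `Blueprint` and `ImportWxrStep` types and the `blueprintToUrl` helper are illustrative names only, not the official Blueprint schema or API.

```typescript
// Illustrative types covering only the fields used in the testing links.
type ImportWxrStep = {
  step: 'importWxr';
  file: { resource: 'url'; url: string };
  importer?: 'default' | 'data-liberation';
};

type Blueprint = {
  plugins: string[];
  steps: ImportWxrStep[];
  preferredVersions: { php: string; wp: string };
  features: { networking: boolean };
  login: boolean;
};

// Hypothetical helper: serializes the Blueprint and appends it as the fragment.
function blueprintToUrl(baseUrl: string, blueprint: Blueprint): string {
  return `${baseUrl}#${encodeURI(JSON.stringify(blueprint))}`;
}

const dataLiberationBlueprint: Blueprint = {
  plugins: [],
  steps: [
    {
      step: 'importWxr',
      importer: 'data-liberation',
      file: {
        resource: 'url',
        url: 'https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml',
      },
    },
  ],
  preferredVersions: { php: '8.3', wp: '6.7' },
  features: { networking: true },
  login: true,
};

const testUrl = blueprintToUrl(
  'http://localhost:5400/website-server/',
  dataLiberationBlueprint
);
console.log(testUrl);
```

`encodeURI` also percent-encodes braces and brackets, so the output will not match the links above character for character, but it decodes to the same JSON. Dropping the `importer` field from the step gives the default-importer variant.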
1 parent 23ffd14 commit 2191e22

300 files changed: +38836 -3456 lines changed

.github/workflows/ci.yml

+1-1
@@ -103,7 +103,7 @@ jobs:
       - name: Install Playwright Browsers
         run: sudo npx playwright install --with-deps
       - name: Prepare app deploy and offline mode
-        run: npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
+        run: CORS_PROXY_URL=http://127.0.0.1:5263/cors-proxy.php? npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
       - name: Zip dist
         run: zip -r dist.zip dist
       - name: Upload dist

packages/php-wasm/universal/src/lib/php-worker.ts

+2-2
@@ -229,8 +229,8 @@ export class PHPWorker implements LimitedPHPApi {
     }
 
     /** @inheritDoc @php-wasm/universal!/PHP.onMessage */
-    onMessage(listener: MessageListener): void {
-        _private.get(this)!.php!.onMessage(listener);
+    onMessage(listener: MessageListener) {
+        return _private.get(this)!.php!.onMessage(listener);
     }
 
     /** @inheritDoc @php-wasm/universal!/PHP.defineConstant */

packages/php-wasm/universal/src/lib/php.ts

+5
@@ -160,6 +160,11 @@ export class PHP implements Disposable {
      */
     onMessage(listener: MessageListener) {
         this.#messageListeners.push(listener);
+        return async () => {
+            this.#messageListeners = this.#messageListeners.filter(
+                (l) => l !== listener
+            );
+        };
     }
 
     async setSpawnHandler(handler: SpawnHandler | string) {
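With this change, `onMessage()` returns a function that removes the listener again. The sketch below shows that subscribe/unsubscribe pattern in isolation. `ListenerRegistry`, `emit`, and the string-based `MessageListener` type are stand-ins invented for illustration, and the unsubscribe function is synchronous here, whereas the one in the diff is `async`.

```typescript
// Simplified stand-in for the real MessageListener type (an assumption).
type MessageListener = (data: string) => void;

// Hypothetical registry that mimics only the listener bookkeeping.
class ListenerRegistry {
  private listeners: MessageListener[] = [];

  // Registers a listener and returns a function that removes it.
  onMessage(listener: MessageListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Delivers a message to every listener that is still registered.
  emit(data: string): void {
    this.listeners.forEach((l) => l(data));
  }
}

const registry = new ListenerRegistry();
const received: string[] = [];

const unsubscribe = registry.onMessage((data) => {
  received.push(data);
});

registry.emit('progress: 50%'); // delivered
unsubscribe();
registry.emit('progress: 100%'); // not delivered, the listener is gone

console.log(received);
```

A caller that listens only for the duration of one import can use the returned function to detach afterwards, so that stale listeners do not keep receiving messages.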

packages/php-wasm/web/src/lib/tcp-over-fetch-websocket.ts

+46-10
@@ -45,6 +45,7 @@ import { ContentTypes } from './tls/1_2/types';
 
 export type TCPOverFetchOptions = {
     CAroot: GeneratedCertificate;
+    corsProxyUrl?: string;
 };
 
 /**
@@ -67,6 +68,7 @@ export const tcpOverFetchWebsocket = (tcpOptions: TCPOverFetchOptions) => {
         constructor(url: string, wsOptions: string[]) {
             super(url, wsOptions, {
                 CAroot: tcpOptions.CAroot,
+                corsProxyUrl: tcpOptions.corsProxyUrl,
             });
         }
     };
@@ -85,6 +87,7 @@ export interface TCPOverFetchWebsocketOptions {
      * clientDownstream stream and tracking the closure of that stream.
      */
     outputType?: 'messages' | 'stream';
+    corsProxyUrl?: string;
 }
 
 export class TCPOverFetchWebsocket {
@@ -101,6 +104,7 @@ export class TCPOverFetchWebsocket {
     port = 0;
     listeners = new Map<string, any>();
     CAroot?: GeneratedCertificate;
+    corsProxyUrl?: string;
 
     clientUpstream = new TransformStream();
     clientUpstreamWriter = this.clientUpstream.writable.getWriter();
@@ -111,13 +115,18 @@ export class TCPOverFetchWebsocket {
     constructor(
         public url: string,
         public options: string[],
-        { CAroot, outputType = 'messages' }: TCPOverFetchWebsocketOptions = {}
+        {
+            CAroot,
+            corsProxyUrl,
+            outputType = 'messages',
+        }: TCPOverFetchWebsocketOptions = {}
     ) {
         const wsUrl = new URL(url);
         this.host = wsUrl.searchParams.get('host')!;
         this.port = parseInt(wsUrl.searchParams.get('port')!, 10);
         this.binaryType = 'arraybuffer';
 
+        this.corsProxyUrl = corsProxyUrl;
         this.CAroot = CAroot;
         if (outputType === 'messages') {
             this.clientDownstream.readable
@@ -307,9 +316,10 @@ export class TCPOverFetchWebsocket {
                 'https'
             );
             try {
-                await RawBytesFetch.fetchRawResponseBytes(request).pipeTo(
-                    tlsConnection.serverEnd.downstream.writable
-                );
+                await RawBytesFetch.fetchRawResponseBytes(
+                    request,
+                    this.corsProxyUrl
+                ).pipeTo(tlsConnection.serverEnd.downstream.writable);
             } catch (e) {
                 // Ignore errors from fetch()
                 // They are handled in the constructor
@@ -327,9 +337,10 @@ export class TCPOverFetchWebsocket {
                 'http'
             );
             try {
-                await RawBytesFetch.fetchRawResponseBytes(request).pipeTo(
-                    this.clientDownstream.writable
-                );
+                await RawBytesFetch.fetchRawResponseBytes(
+                    request,
+                    this.corsProxyUrl
+                ).pipeTo(this.clientDownstream.writable);
             } catch (e) {
                 // Ignore errors from fetch()
                 // They are handled in the constructor
@@ -409,7 +420,11 @@ class RawBytesFetch {
     /**
      * Streams a HTTP response including the status line and headers.
      */
-    static fetchRawResponseBytes(request: Request) {
+    static fetchRawResponseBytes(request: Request, corsProxyUrl?: string) {
+        const targetRequest = corsProxyUrl
+            ? new Request(`${corsProxyUrl}${request.url}`, request)
+            : request;
+
         // This initially used a TransformStream and piped the response
         // body to the writable side of the TransformStream.
         //
@@ -419,13 +434,34 @@ class RawBytesFetch {
             async start(controller) {
                 let response: Response;
                 try {
-                    response = await fetch(request);
-                    controller.enqueue(RawBytesFetch.headersAsBytes(response));
+                    response = await fetch(targetRequest);
                 } catch (error) {
+                    /**
+                     * Pretend we've got a 400 Bad Request response whenever
+                     * the fetch() call fails.
+                     *
+                     * Just propagating an error and closing a WebSocket does
+                     * not make PHP aware the socket closed abruptly. This means
+                     * the AsyncHttp\Client will keep polling the socket indefinitely
+                     * until the request times out. This isn't perfect, as we want
+                     * to close the socket as soon as possible to avoid, e.g., 10 seconds
+                     * of unnecessary waiting for the timeout.
+                     *
+                     * The root cause is unknown and likely related to the low-level
+                     * implementation of polling file descriptors. The following
+                     * workaround is far from ideal, but it must suffice until we
+                     * have a platform-level resolution.
+                     */
+                    controller.enqueue(
+                        new TextEncoder().encode(
+                            'HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n'
+                        )
+                    );
                     controller.error(error);
                     return;
                 }
 
+                controller.enqueue(RawBytesFetch.headersAsBytes(response));
                 const reader = response.body?.getReader();
                 if (!reader) {
                     controller.close();
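The failure path in the last hunk writes a synthetic HTTP response into the stream so that the PHP-side client reads a complete response instead of polling until its timeout. The sketch below is an illustration under stated assumptions, not the Playground implementation: it encodes the same response text and parses it back the way a client reading raw bytes might. `syntheticBadRequestBytes` and `parseStatusCode` are hypothetical helper names; only the response text comes from the diff.

```typescript
// The exact response text enqueued by the diff when fetch() fails.
const SYNTHETIC_BAD_REQUEST =
  'HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n';

// Hypothetical helper: encodes the synthetic response as raw bytes.
function syntheticBadRequestBytes(): Uint8Array {
  return new TextEncoder().encode(SYNTHETIC_BAD_REQUEST);
}

// Hypothetical helper: extracts the status code from raw HTTP response
// bytes by reading the second token of the status line.
function parseStatusCode(raw: Uint8Array): number {
  const text = new TextDecoder().decode(raw);
  const statusLine = text.split('\r\n')[0];
  return parseInt(statusLine.split(' ')[1], 10);
}

const fallbackBytes = syntheticBadRequestBytes();
const fallbackStatus = parseStatusCode(fallbackBytes);

console.log(fallbackStatus);
```

`Content-Length: 0` followed by the blank line marks the response as complete with no body, which is presumably what lets the client stop waiting without a further close signal.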
