Fastest/best performing way to load all files? #586

LegendaryB · 2021-10-18T13:23:10Z

We need to query several document libraries and retrieve all files from them, along with some custom metadata. The metadata is not available directly on the IFile object, so I suppose we need to go through IList instead, correct?

Is there a performant way to do that with as few requests as possible? I'm concerned that throttling could become a problem here.

Our current approach is: create a context for a site -> get the wanted document libraries -> loop through the Items property -> if the ListItem is a file, download it, then grab and store its metadata, and so on.

I would also like to know how to handle throttling. Does the library handle it in any way? Even a custom exception or something like that would be enough for us; we would then retry at a later time.

Answered by jansenbe

jansenbe · 2021-10-18T14:22:14Z

Maintainer

Hi @LegendaryB,

I've just made a change to PnP Core SDK that enables the approach below. It takes one request to find all applicable lists, one request per list per 500 items to find the downloadable items, and then one request per file to download it. Besides downloading the file, you also have the list item's metadata to work with.

You'll have to wait until tomorrow's nightly build for the code below to work.

// Namespaces this snippet relies on
using System;
using System.IO;
using System.Linq;
using PnP.Core.Model.SharePoint;
using PnP.Core.QueryModel;

// grab all document libraries that are not hidden
var lists = await context.Web.Lists.QueryProperties(p => p.Fields.QueryProperties(f => f.InternalName,
                                                                                  f => f.FieldTypeKind,
                                                                                  f => f.TypeAsString,
                                                                                  f => f.Title))
                                   .Where(p => p.TemplateType == ListTemplateType.DocumentLibrary && p.Hidden == false)
                                   .ToListAsync();

// iterate over the found libraries
foreach (var list in lists)
{
    // Query the library, filter on the files only and load the needed metadata (FieldRef's) using a paged approach
    // Use orderby to make the CAML query work for large libraries (avoids table scan in SQL backend)
    // OrderBy is a child of Query in the CAML schema, so it sits next to Where
    // Note: as written the view covers the library's root folder; Scope='RecursiveAll' on <View> also includes subfolders
    string viewXml = @"<View>
                        <ViewFields>
                          <FieldRef Name='Title' />
                          <FieldRef Name='FileLeafRef' />
                          <FieldRef Name='FSObjType'/>
                          <FieldRef Name='FileDirRef'/>
                        </ViewFields>
                        <Query>
                          <Where>
                            <Eq>
                              <FieldRef Name='FSObjType'/>
                              <Value Type='Integer'>0</Value>
                            </Eq>
                          </Where>
                          <OrderBy Override='TRUE'><FieldRef Name='ID' Ascending='FALSE' /></OrderBy>
                        </Query>
                        <RowLimit Paged='TRUE'>500</RowLimit>
                       </View>";

    bool paging = true;
    string nextPage = null;
    while (paging)
    {
        // Each call loads the next page of up to 500 items into list.Items
        var output = await list.LoadListDataAsStreamAsync(new RenderListDataOptions()
        {
            ViewXml = viewXml,
            RenderOptions = RenderListDataOptionsFlags.ListData,
            Paging = nextPage,
        }).ConfigureAwait(false);

        if (output.ContainsKey("NextHref"))
        {
            // Drop the first character of NextHref before passing it back as paging information
            nextPage = output["NextHref"].ToString().Substring(1);
        }
        else
        {
            paging = false;
        }
    }

    // Iterate over the retrieved list items and process them
    foreach (var listItem in list.Items.AsRequested())
    {
        // Use your metadata
        if (listItem["FileLeafRef"].ToString().EndsWith(".docx", StringComparison.InvariantCultureIgnoreCase))
        {
            // do something Word specific
        }

        // Download the file behind the list item, use an async streaming approach to speed up things
        using (Stream downloadedContentStream = await listItem.File.GetContentAsync(true))
        {
            var bufferSize = 2 * 1024 * 1024;  // 2 MB buffer
            using (var content = System.IO.File.Create($"e:\\temp\\downloadtest\\{listItem["FileLeafRef"]}.downloaded"))
            {
                var buffer = new byte[bufferSize];
                int read;
                // Copy the download to disk chunk by chunk without blocking on the write
                while ((read = await downloadedContentStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
                {
                    await content.WriteAsync(buffer, 0, read);
                }
            }
        }
    }
}
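
To get a feel for how many requests the approach above issues (one to find the libraries, one per page of 500 items per library, one per file download), here is a rough estimate in code. The class and method names are hypothetical and not part of PnP Core SDK, and the page count is an approximation: it assumes every library needs at least one paged request.

```csharp
using System;

public static class RequestEstimate
{
    // Hypothetical helper: estimates the number of requests for the approach above.
    // pageSize mirrors the <RowLimit> of 500 used in the CAML query.
    public static long Estimate(int[] filesPerLibrary, int pageSize = 500)
    {
        long total = 1; // one request to find the document libraries

        foreach (int files in filesPerLibrary)
        {
            // one paged request per started block of pageSize items (assumed: at least one)
            long pages = Math.Max(1, (files + pageSize - 1) / pageSize);

            // plus one download request per file
            total += pages + files;
        }

        return total;
    }
}
```

For two libraries holding 1200 and 300 files, `RequestEstimate.Estimate(new[] { 1200, 300 })` gives 1 + (3 + 1200) + (1 + 300) = 1505 requests, so the per-file downloads dominate the total.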


LegendaryB · 2021-10-18T14:49:03Z

Author

Thank you @jansenbe,

I would also like to know how to handle throttling. Does the library handle it in any way? Even a custom exception or something like that would be enough for us; we would then retry at a later time.

Also, I can't get this to work:

var itemsEnumerable = library.Items
                .QueryProperties(
                    p => p.UniqueId,
                    p => p.FileSystemObjectType,
                    p => p.Properties)
                .Where(p => p.FileSystemObjectType == FileSystemObjectType.File)
                .AsAsyncEnumerable();

I also tried this:

var itemsEnumerable = library.Items
                .Where(p => p.FileSystemObjectType == FileSystemObjectType.File)
                .AsAsyncEnumerable();

As you can see, I would like to retrieve just the files in this call so that I don't need to handle folders, for example, in my loop.


jansenbe · 2021-10-18T15:02:04Z

Maintainer

@LegendaryB: throttling is handled automatically; the library will wait and retry. See https://pnp.github.io/pnpcore/using-the-sdk/basics-settings.html#settings-overview for more details on the throttling configuration options.

About your other question: please use the approach I've outlined above. It already excludes folders via the filter on FSObjType.
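
To illustrate what "wait and retry" means in general, here is a minimal sketch of a retry helper with an incrementally growing delay. It is not how PnP Core SDK implements its throttling handling, and with the SDK you should not normally need it; the `Retry` class and its method are hypothetical names. For simplicity it retries on any exception, whereas real throttling handling would react only to throttling responses.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Hypothetical helper, not part of PnP Core SDK.
    // Runs the operation and, on failure, waits and tries again up to maxRetries times.
    public static async Task<T> WithIncrementalDelayAsync<T>(
        Func<Task<T>> operation, int maxRetries, TimeSpan baseDelay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxRetries)
            {
                // Incremental delay: baseDelay, 2 x baseDelay, 3 x baseDelay, ...
                await Task.Delay(TimeSpan.FromTicks(baseDelay.Ticks * (attempt + 1)));
            }
        }
    }
}
```

Once the retries are used up, the exception filter no longer matches and the last exception propagates to the caller, which can then decide to try again at a later time.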

3 replies

LegendaryB Oct 18, 2021
Author

Thanks, will take a look. Okay, so it's only possible via CAML and not with the LINQ API then?

jansenbe Oct 18, 2021
Maintainer

@LegendaryB, the thing is that you cannot filter on FileSystemObjectType via a regular REST query, which is what our LINQ provider builds. Luckily, doing this filter via a CAML query works. If you don't want to use the above approach, then just load all list items with the FileSystemObjectType property and filter client side.
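
A minimal sketch of that client-side alternative, reusing the calls from the question earlier in this thread (`QueryProperties`, `AsAsyncEnumerable`, `FileSystemObjectType`). The `LibraryFiles` class and `GetFileItemsAsync` method are hypothetical names, `library` is assumed to be an `IList` obtained from an existing context, and the namespaces are assumed to be the usual PnP Core SDK ones.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using PnP.Core.Model.SharePoint;
using PnP.Core.QueryModel;

public static class LibraryFiles
{
    // Hypothetical helper: loads all items of a library and keeps only the files.
    public static async Task<List<IListItem>> GetFileItemsAsync(IList library)
    {
        var files = new List<IListItem>();

        // No server-side Where on FileSystemObjectType: every item is requested
        await foreach (var item in library.Items
                                          .QueryProperties(p => p.FileSystemObjectType)
                                          .AsAsyncEnumerable())
        {
            // Client-side filter: skip folders and anything else that is not a file
            if (item.FileSystemObjectType == FileSystemObjectType.File)
            {
                files.Add(item);
            }
        }

        return files;
    }
}
```

Because folders are also transferred and only discarded afterwards, this costs more than the CAML query, which filters on FSObjType on the server. Add any further properties your processing needs to the `QueryProperties` call.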

LegendaryB Oct 18, 2021
Author

Thanks for the very quick responses. Really a life saver 👍
Will consider using the CAML query instead.
