
Blob search using the glob (Unix-style pathname pattern) syntax #40269

Open
martinResearch opened this issue Mar 28, 2025 · 4 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. feature-request This issue requires a new behavior in the product in order be resolved. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Service Attention Workflow: This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files)

Comments

@martinResearch

I would like to be able to efficiently list all the blobs in a container that match a Unix-style pathname pattern, as implemented in the Python glob module.

Describe the solution you'd like
I would like to be able to list blobs using for example

client.glob("dataset/sequence_*/image*_color.png")

or

client.glob("dataset/**/image*_color.png", recursive=True)

We could list all the blobs in dataset and do the filtering locally, but that is very slow when the number of blobs is large and we are interested in only a small subset. Ideally this feature would be implemented on the server side to avoid retrieving large lists of names, but I guess that would be out of scope for the Python SDK.
I implemented my own solution based on the Python glob module, but would like this feature to be part of the SDK instead.
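For context, these are the semantics of the standard library's glob module that the request refers to; a quick, self-contained illustration on a local filesystem (the file names are hypothetical, chosen to mirror the patterns above):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in [
        "sequence_01/image0_color.png",
        "sequence_01/image0_depth.png",
        "sequence_02/sub/image1_color.png",
    ]:
        path = os.path.join(root, "dataset", name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # '*' stays within one path segment, so the nested file is not matched...
    flat = glob.glob(os.path.join(root, "dataset/sequence_*/image*_color.png"))
    # ...while '**' with recursive=True matches any number of directories.
    deep = glob.glob(os.path.join(root, "dataset/**/image*_color.png"), recursive=True)
```

Here `flat` finds only the file directly under `sequence_01`, while `deep` also finds the one nested under `sequence_02/sub`.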

@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files) labels Mar 28, 2025

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.


Hello @martinResearch. I'm an AI assistant for the azure-sdk-for-python repository. I have some suggestions that you can try out while the team gets back to you.

• Currently, blob listing via the SDK supports filtering by prefix (using the name_starts_with parameter) which is implemented on the server side, whereas glob-style pattern matching requires local filtering after retrieving the full list.

• A glob feature request would imply adding a new method (e.g. glob) that combines server-side prefix filtering with client-side pattern matching to filter blob names more efficiently. If implemented server side, it could reduce latency and improve performance for large containers.

• Consideration should be given to whether the pattern matching can be delegated to the service. At this time, the service only supports prefix filters, so implementing a true glob syntax may require local processing.

• As a workaround, using existing methods like list_blobs(name_starts_with='dataset/') combined with Python’s glob module for filtering might be the current optimal solution.

• Further discussion on this feature might benefit from clarifying use cases and performance expectations to guide a possible server-side or optimized client-side enhancement.
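A minimal sketch of the workaround described in the bullets above, assuming only server-side prefix filtering (name_starts_with) is available. `FakeContainerClient` and `glob_blobs` are hypothetical names for illustration; the real `azure.storage.blob.ContainerClient.list_blobs` accepts the same `name_starts_with` parameter:

```python
import fnmatch
import re
from typing import Iterator, List

# Same magic characters the stdlib glob module checks for.
_magic = re.compile(r"([*?[])")

class FakeContainerClient:
    """Offline stand-in for ContainerClient, holding blob names in memory."""

    def __init__(self, blob_names: List[str]):
        self._names = blob_names

    def list_blobs(self, name_starts_with: str = "") -> Iterator[str]:
        # The real client yields BlobProperties objects; names suffice here.
        return (n for n in self._names if n.startswith(name_starts_with))

def glob_blobs(client, pattern: str) -> List[str]:
    # Send the longest literal prefix of the pattern to the service, then
    # filter the (smaller) result set locally with fnmatch.
    match = _magic.search(pattern)
    prefix = pattern[: match.start()] if match else pattern
    return fnmatch.filter(client.list_blobs(name_starts_with=prefix), pattern)

client = FakeContainerClient([
    "dataset/sequence_01/image0_color.png",
    "dataset/sequence_01/image0_depth.png",
    "other/readme.txt",
])
hits = glob_blobs(client, "dataset/sequence_*/image*_color.png")
```

One caveat: fnmatch's `*` also matches `/`, so unlike true glob semantics a single pattern segment can cross directory boundaries.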

The team will get back to you shortly, hopefully this helps in the meantime.

@jalauzon-msft
Member

Hi @martinResearch, thanks for reaching out. As you mention, for this type of feature to have a real impact on efficiency, this is something that would need to be implemented server-side. You would need to reach out to the service team for that.

What did you have in mind for client-side changes that could improve this scenario? Keep in mind, the SDK does not load all Blobs at once and so filtering after the fact is not something we would be likely to do and would fall to the user application.

@jalauzon-msft jalauzon-msft added feature-request This issue requires a new behavior in the product in order be resolved. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 28, 2025
@martinResearch
Author

My current implementation:

"""Filename globbing utility for Azure Container Clients.

Copied from the original python glob file and modified to work with an azure container client instead
of files on the drive.
The original python glob code can be found here
https://github.com/python/cpython/blob/3.7/Lib/glob.py
The logic has been modified in some places to accelerate the search by reducing
the number of queries to the azure storage server.
"""

import fnmatch
import re
from typing import Iterable, Iterator, List, Tuple
from azure.storage.blob import ContainerClient


# regular expressions used in original glob implementation
magic_check = re.compile("([*?[])")
magic_check_bytes = re.compile(b"([*?[])")

HTTPS_PREFIX = "https://"

class AzureGlob:
    """Class to perform globbing on azure storage clients."""

    def __init__(self, client: ContainerClient):
        """Constructor for AzureGlob class."""
        self.client = client

    def glob(self, pathname: str, recursive: bool = False) -> List[str]:
        """Return a list of paths matching a pathname pattern.

        The pattern may contain simple shell-style wildcards a la
        fnmatch. However, unlike fnmatch, filenames starting with a
        dot are special cases that are not matched by '*' and '?'
        patterns.

        If recursive is true, the pattern '**' will match any files and
        zero or more directories and subdirectories.
        """
        return list(self.iglob(pathname, recursive=recursive))

    def iglob(self, pathname: str, recursive: bool = False) -> Iterator[str]:
        """Return an iterator which yields the paths matching a pathname pattern.

        The pattern may contain simple shell-style wildcards a la
        fnmatch. However, unlike fnmatch, filenames starting with a
        dot are special cases that are not matched by '*' and '?'
        patterns.

        If recursive is true, the pattern '**' will match any files and
        zero or more directories and subdirectories.
        """
        it = self._iglob(pathname, recursive, dironly=False)
        if recursive and self.isrecursive(pathname):
            s = next(it)  # skip empty string
            assert not s
        return it

    def _iglob(self, pathname: str, recursive: bool, dironly: bool) -> Iterator[str]:
        """Function similar to the original glob implementation.

        We replaced
        * os.path.split with self.path_split
        * os.path.lexist with self.lexists
        * os.path.join with self.path_join
        * _glob2 with self._glob2
        * _iglob with self._iglob
        """
        dirname, basename = self.path_split(pathname)
        if not self.has_magic(pathname):
            assert not dironly
            if basename:
                if self.lexists(pathname):
                    yield pathname
            else:
                # Patterns ending with a slash should match only directories
                if self.isdir(dirname):
                    yield pathname
            return
        if not dirname:
            if recursive and self.isrecursive(basename):
                yield from self._glob2(dirname, basename, dironly)
            else:
                yield from self._glob1(dirname, basename, dironly)
            return
        # `os.path.split()` returns the argument itself as a dirname if it is a
        # drive or UNC path. Prevent an infinite recursion if a drive or UNC path
        # contains magic characters (i.e. r'\\?\C:').
        if dirname != pathname and self.has_magic(dirname):
            dirs = list(self._iglob(dirname, recursive, True))
        elif self.isdir(dirname):
            dirs = [dirname]
        else:
            dirs = []
        if self.has_magic(basename):
            if recursive and self.isrecursive(basename):
                glob_in_dir = self._glob2
            else:
                glob_in_dir = self._glob1
        else:
            glob_in_dir = self._glob0
        for dirname in dirs:
            for name in glob_in_dir(dirname, basename, dironly):
                yield self.path_join(dirname, name)

    # These 2 helper functions non-recursively glob inside a literal directory.
    # They return a list of basenames. _glob1 accepts a pattern while _glob0
    # takes a literal basename (so it only has to check for its existence).

    def _glob1(self, dirname: str, pattern: str, dironly: bool) -> Iterable[str]:
        """Helper function that non-recursively globs inside a literal directory.

        Return a list of basenames. Unlike _glob0 it accepts a pattern.

        Function similar to the original glob implementation, but unlike the
        original we pass a literal prefix to _iterdir to speed up queries to
        Azure storage by retrieving fewer names before filtering with fnmatch.
        """
        prefix = self.pattern_prefix(pattern)
        names = list(self._iterdir(dirname, dironly, prefix))

        return fnmatch.filter(names, pattern)

    def _glob0(self, dirname: str, pattern: str, dironly: bool) -> Iterable[str]:
        """Helper function that non-recursively globs inside a literal directory.

        Return a list of basenames. Unlike _glob1 it takes a literal basename
        (so it only has to check for its existence).

        Function similar to the original glob implementation, but unlike in the
        original glob implementation we add the case "if not pattern"
        """
        if not pattern:
            # `os.path.split()` returns an empty basename for paths ending with a
            # directory separator. 'q*x/' should match only directories.
            return [pattern]
        elif dironly:
            if self.isdir(self.path_join(dirname, pattern)):
                return [pattern]
            else:
                return []
        else:
            if self.lexists(self.path_join(dirname, pattern)):
                return [pattern]
        return []

    def glob0(self, dirname: str, pattern: str) -> Iterable[str]:
        """Function not public but can be used by third-party code.

        Function similar to the original glob implementation
        replacing _glob0 with self._glob0
        """
        return self._glob0(dirname, pattern, False)

    def glob1(self, dirname: str, pattern: str) -> Iterable[str]:
        """Function not public but can be used by third-party code.

        Function similar to the original glob implementation
        replacing _glob1 with self._glob1
        """
        return self._glob1(dirname, pattern, False)

    def _glob2(self, dirname: str, pattern: str, dironly: bool) -> Iterable[str]:
        """Helper function that recursively yields relative pathnames inside a literal directory.

        Function similar to the original glob implementation
        replacing _rlistdir with self._rlistdir
        """
        assert self.isrecursive(pattern)
        yield pattern[:0]
        yield from self._rlistdir(dirname, dironly)

    def _iterdir(self, dirname: str, dironly: bool, prefix: str = "") -> Iterable[str]:
        """Iterate over directory.

        If dironly is false, yields all file names inside a directory.
        If dironly is true, yields only directory names.
        The prefix argument does not exist in the original glob implementation;
        it allows a narrower search on Azure storage, retrieving fewer names
        from the service before filtering.
        """
        if not dirname:
            for entry in self.client.walk_blobs(name_starts_with=prefix):
                if not dironly or entry.name.endswith("/"):
                    yield entry.name.rstrip("/")
        else:
            for entry in self.client.walk_blobs(name_starts_with=dirname + "/" + prefix):
                if not dironly or entry.name.endswith("/"):
                    yield entry.name[len(dirname) + 1 :].rstrip("/")

    def _rlistdir(self, dirname: str, dironly: bool) -> Iterator[str]:
        """Recursively yields relative pathnames inside a literal directory.

        Function similar to the original glob implementation.
        We replaced  os.path.join with self.path_join.
        """
        names = list(self._iterdir(dirname, dironly))
        for x in names:
            yield x
            path = self.path_join(dirname, x) if dirname else x
            for y in self._rlistdir(path, dironly):
                yield self.path_join(x, y)

    def path_join(self, x: str, y: str) -> str:
        """Equivalent to os.path.join for azure blob paths."""
        return x + "/" + y

    def isdir(self, name: str) -> bool:
        """Equivalent to os.path.isdir on azure blob."""
        for _ in self.client.walk_blobs(name_starts_with=name.rstrip("/") + "/"):
            return True
        return False

    def lexists(self, name: str) -> bool:
        """Equivalent to os.path.lexists on azure blob."""
        for _ in self.client.walk_blobs(name_starts_with=name):
            return True
        return False

    def path_split(self, name: str) -> Tuple[str, str]:
        """Split path."""
        s = name.split("/")
        head = "/".join(s[:-1])
        tail = s[-1]
        return head, tail

    def pattern_prefix(self, s: str) -> str:
        """Return the part of the string appearing before any magic character.

        This is used to accelerate queries on Azure storage.
        """
        match = magic_check.search(s)
        assert match is not None
        return s[: match.span()[0]]

    def has_magic(self, s: str) -> bool:
        """Detect if the string contains any magic character.

        Same implementation as in the original glob module
        """
        match = magic_check.search(s)
        return match is not None

    def isrecursive(self, pattern: str) -> bool:
        """Detect whether the pattern is exactly '**' (as in the original glob's _isrecursive)."""
        return pattern == "**"



def split_azure_url(url: str) -> Tuple[str, str, str]:
    """Split an azure blob url into account_url, container_name and blob_rel_path.

    Expected URL: https://<storage account URL>/<container>/<blob path>
    """
    assert url.startswith(HTTPS_PREFIX)
    assert "\\" not in url, f"You should not have any backslash in URL {url}."
    url_tokens = url[len(HTTPS_PREFIX) :].split("/")
    account_url = HTTPS_PREFIX + url_tokens[0]
    container_name = url_tokens[1]
    blob_rel_path = "/".join(url_tokens[2:])
    return account_url, container_name, blob_rel_path

def glob(glob_str: str, recursive: bool = False, return_urls: bool = True) -> List[str]:
    """Clone of the glob.glob function that runs on Azure storage.

    Args:
        glob_str: glob-like pattern used for the search. It should contain the container URL as a prefix.
        recursive: controls whether the search is recursive when there is "**" in the pattern.
        return_urls: controls whether the search returns full URLs or paths relative to the container URL.
    """
    account_url, container_name, glob_rel = split_azure_url(glob_str)
    with ContainerClient(account_url=account_url, container_name=container_name) as client:
        azure_glob = AzureGlob(client)
        results = azure_glob.glob(glob_rel, recursive=recursive)
        if return_urls:
            results = ["/".join((account_url, container_name, rel_path)) for rel_path in results]
    return results
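The prefix optimization that _glob1/_iterdir rely on can be checked without an Azure connection. This is a sketch, slightly relaxed from the assert-based pattern_prefix above: it returns the whole string when no magic character is present, so it is safe on literal paths too:

```python
import re

# Same magic-character set as in the implementation above (and the glob module).
magic_check = re.compile("([*?[])")

def pattern_prefix(s: str) -> str:
    """Return the literal part of s before the first glob metacharacter."""
    match = magic_check.search(s)
    return s if match is None else s[: match.start()]

p = pattern_prefix("image*_color.png")    # everything before '*'
q = pattern_prefix("sequence_0[12]/img")  # everything before '['
r = pattern_prefix("plain/path.png")      # no magic: whole string
```

Passing this literal prefix as name_starts_with is what lets the server return a narrow candidate set before any local fnmatch filtering.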
