Commit 369c05e

add docx metadata extractor tutorial
1 parent e086cab commit 369c05e

5 files changed: 44 additions, 0 deletions

5 files changed

+44
-0
lines changed

README.md

+1
@@ -63,6 +63,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Build a Username Search Tool in Python](https://thepythoncode.com/code/social-media-username-finder-in-python). ([code](ethical-hacking/username-finder))
 - [How to Find Past Wi-Fi Connections on Windows in Python](https://thepythoncode.com/article/find-past-wifi-connections-on-windows-in-python). ([code](ethical-hacking/find-past-wifi-connections-on-windows))
 - [How to Remove Metadata from PDFs in Python](https://thepythoncode.com/article/how-to-remove-metadata-from-pdfs-in-python). ([code](ethical-hacking/pdf-metadata-remover))
+- [How to Extract Metadata from Docx Files in Python](https://thepythoncode.com/article/docx-metadata-extractor-in-python). ([code](ethical-hacking/docx-metadata-extractor))
 
 - ### [Machine Learning](https://www.thepythoncode.com/topic/machine-learning)
 - ### [Natural Language Processing](https://www.thepythoncode.com/topic/nlp)
@@ -0,0 +1 @@
# [How to Extract Metadata from Docx Files in Python](https://thepythoncode.com/article/docx-metadata-extractor-in-python)
@@ -0,0 +1,41 @@
import docx  # python-docx: library for working with Word (.docx) documents.
from pprint import pprint  # Pretty-print the resulting dictionary.


def extract_metadata(docx_file):
    doc = docx.Document(docx_file)  # Load the Word document.
    core_properties = doc.core_properties  # Core properties of the document.

    metadata = {}  # Maps each property name to its value.

    # Extract core properties.
    for prop in dir(core_properties):
        # Skip dunder and private attributes (e.g., _element); they are not metadata.
        if prop.startswith('_'):
            continue
        value = getattr(core_properties, prop)
        if callable(value):  # Skip methods.
            continue
        if prop in ('created', 'modified', 'last_printed'):  # Datetime properties.
            # Convert datetimes to strings; unset values become None.
            value = value.strftime('%Y-%m-%d %H:%M:%S') if value else None
        metadata[prop] = value

    # Extract custom properties (if available).
    try:
        custom_properties = core_properties.custom_properties
        if custom_properties:  # Only add the key when custom properties exist.
            metadata['custom_properties'] = {}
            for prop in custom_properties:
                metadata['custom_properties'][prop.name] = prop.value
    except AttributeError:
        # The installed python-docx version does not expose custom_properties
        # on core_properties, so skip this step.
        pass

    return metadata


if __name__ == '__main__':
    docx_path = 'test.docx'  # Path to the Word document file.
    metadata = extract_metadata(docx_path)
    pprint(metadata)
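The script above discovers properties by walking `dir(core_properties)`. A possible variant, sketched here rather than taken from the commit, reads an explicit list of attribute names instead, so nothing but metadata ends up in the result. The names in `CORE_PROPERTY_NAMES` are the attributes python-docx documents for its `CoreProperties` object; `collect_core_properties` and `fake_props` are hypothetical names introduced for illustration, and the stand-in object exists only so the sketch runs without a `.docx` file.

```python
from datetime import datetime
from types import SimpleNamespace

# Attribute names documented for python-docx's CoreProperties object.
CORE_PROPERTY_NAMES = (
    "author", "category", "comments", "content_status", "created",
    "identifier", "keywords", "language", "last_modified_by",
    "last_printed", "modified", "revision", "subject", "title", "version",
)


def collect_core_properties(core_properties, names=CORE_PROPERTY_NAMES):
    """Read the named attributes and format datetime values as strings."""
    metadata = {}
    for name in names:
        # Fall back to None when the object lacks the attribute.
        value = getattr(core_properties, name, None)
        if isinstance(value, datetime):
            value = value.strftime("%Y-%m-%d %H:%M:%S")
        metadata[name] = value
    return metadata


# Hypothetical stand-in for doc.core_properties, used only for this demo.
fake_props = SimpleNamespace(
    author="Jane Doe",
    created=datetime(2023, 5, 1, 9, 30, 0),
    revision=2,
)
result = collect_core_properties(fake_props)
print(result["author"], result["created"], result["title"])
```

With python-docx installed, passing `docx.Document(path).core_properties` in place of `fake_props` should give the same kind of dictionary. Checking `isinstance(value, datetime)` rather than matching property names keeps the date handling independent of which fields happen to hold dates.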
@@ -0,0 +1 @@
python-docx
Binary file not shown.
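The binary file is presumably the sample document the script reads, although the diff does not name it. A `.docx` file is a ZIP archive, and its core metadata is conventionally stored in the `docProps/core.xml` part, so a few of the same fields can be read with the standard library alone. The sketch below is an alternative to python-docx, not what the commit does; `read_core_xml`, `SAMPLE_CORE_XML`, and the in-memory archive are hypothetical and exist only to make the example self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element path inside docProps/core.xml (a small subset).
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_xml(docx_source):
    """Return selected core properties from a .docx path or file-like object."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    # findtext returns None when an element is absent.
    return {key: root.findtext(path, namespaces=NS) for key, path in FIELDS.items()}


# Build a tiny in-memory archive holding only the core properties part.
SAMPLE_CORE_XML = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:title>Demo</dc:title>'
    '<dc:creator>Jane Doe</dc:creator>'
    '<dcterms:created>2023-05-01T09:30:00Z</dcterms:created>'
    '</cp:coreProperties>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", SAMPLE_CORE_XML)
buffer.seek(0)

sample = read_core_xml(buffer)
print(sample)
```

Dates come back as the raw ISO 8601 strings stored in the XML, so any reformatting is left to the caller. A real document may omit `docProps/core.xml`, in which case `archive.read` raises `KeyError`.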