2
\$\begingroup\$

I'm working on a project where the file name consists of actual dates but the data for the dates are split into multiple files.

Developed the following program to count the number of files for each date (part of the filename) and also the total size. Wondering, if the same can be written in a better way.

import os
import glob
import os
import collections
directory_name = "\\SpeicifDir\\"
# Get a list of files (file paths) in the given directory 
list_of_files = filter( os.path.isfile,
                        glob.glob(directory_name + '*.txt') )
mapOfDateFileSize = collections.defaultdict(list)
# For all the files
for file_path in list_of_files:
    file_size =  os.stat(file_path).st_size
    filename = os.path.basename(file_path)
    splitFilename = filename.split('-')
    # Extract the file and split the file using - as a separator
    dtForFile = splitFilename[1] + "-" + splitFilename[2] + "-" + splitFilename[3]
    # Get the file name and size
    if dtForFile in mapOfDateFileSize:
        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize =  dataFromDictionary[1]
        totalCount = totalCount + 1 
        totalSize = totalSize + file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1,file_size))

# For each date get the total size, total file count
for dt,elements in mapOfDateFileSize.items():
    dataFromDictionary = elements[0]
    totalCount = dataFromDictionary[0]
    totalSize =  dataFromDictionary[1]
    print (dt, ",", totalCount , ",", totalSize)
\$\endgroup\$
1
  • \$\begingroup\$ Can you give an example of the text files you are looking for? \$\endgroup\$ Jan 29, 2022 at 11:37

2 Answers 2

6
\$\begingroup\$

Default Dictionary

You are using this wrong.

...
mapOfDateFileSize = collections.defaultdict(list)
for file_path in list_of_files:
    ...
    if dtForFile in mapOfDateFileSize:
        ...
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1,file_size))
...

The whole point of a default dictionary is so you can use the mapOfDateFileSize[key] without worrying about whether the key exists or not. Since you explicitly test if dtForFile in mapOfDateFileSize, you could use a regular dictionary, and create the list yourself in the else: clause with:

...
    else:
        mapOfDateFileSize[dtForFile] = [(1, file_size)]
...

Unnecessary nesting

        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize =  dataFromDictionary[1]

You are storing a tuple, in a one-element list, in a dictionary. What is the purpose of this one-element list? It is not adding anything, and making things way more complicated.

    if dtForFile in mapOfDateFileSize:
        total_count, total_size = mapOfDateFileSize[dtForFile]
        total_count += 1 
        total_size += file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = (total_count, total_size)
    else:
        mapOfDateFileSize[dtForFile] = (1, file_size)

# Tuple unpacking even works when iterating through the dictionary!
for dt, (total_count, total_size) in mapOfDateFileSize.items():
    print (dt, ",", total_count, ",", total_size)

A Real Default Dictionary

The if dtForFile in mapOfDateFileSize is still annoying. You wanted to use a defaultdict. What you need is a default dict that returns the tuple (0, 0) if no entry exists.

...
mapOfDateFileSize = collections.defaultdict(lambda: (0, 0))
for file_path in list_of_files:
    ...
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1 
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
...

PEP 8

The Style Guide for Python Code has many guidelines you should follow. Of particular note, variable names should be snake_case, not bumpyWordsCommonlyUsedInOtherProgrammingLanguages since they are really hard to read.

\$\endgroup\$
0
1
\$\begingroup\$

An eye for detail

Importing your file causes several lines to change, and PEP warnings poping up. I recommend switching to an editor that provides black on save. For instance

    totalSize =  dataFromDictionary[1]

Should really be

    totalSize = dataFromDictionary[1]

This indicates you being a bit lazy. There are several instances of this. Your imports are this

import os
import glob
import os
import collections

Which, after running isort over it, becomes this

import collections
import glob
import os

Similarly directory_name = "\\SpeicifDir\\" should probably be directory_name = "\\SpecificfDir\\" no? And speaking of,here you suddenly use snake_case when throughout the code camelCase is used?

Comments

If you find yourself using multiple comments, it is a good indication that your structure needs to change. Use more functions, docstrings etc. Comments should explain why not how.

Modern solutions

The standard way to handle files and paths today is pathlib. However, I think that a pandas dataframe would be a more natural container for this type of information.

A pandas solution might look like this

import pandas as pd

directory = Path.cwd()  # current working dir
textfiles = get_files(directory, suffix=".txt")
size_by_date, count_by_date = filesizes_and_counts(textfiles)
dates = list(size_by_date.keys())
sizes = list(size_by_date.values())
counts = list(count_by_date.values())

system_info = pd.DataFrame(list(zip(dates, sizes, counts)))
system_info.columns = ["Date", "Filesize", "Files"]

See below for how the functions are defined.

All in all a modern solution might look like this

from collections import defaultdict
from pathlib import Path
from datetime import datetime
from typing import Generator, Optional, Iterable


MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[datetime]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return
    try:
        *_, day, month, year = split_file
        return datetime(day=int(day), month=int(month), year=int(year))
    except ValueError:
        return None


def filesizes_and_counts(files: Iterable[Path]) -> tuple[Filesizes, Filecounts]:
    file_sizes = defaultdict(float)
    file_counts = defaultdict(int)
    for path in files:
        if (date := filename_date(path.name)) is None:
            continue
        file_sizes[date] += path.stat().st_size
        file_counts[date] += 1
    return file_sizes, file_counts


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    sizes, counts = filesizes_and_counts(textfiles)
    dates = sizes.keys()
    for date in dates:
        print(
            f"{date.strftime('%Y-%m-%d')}: "
            f"{sizes[date]:>20} bytes, "
            f"files: {counts[date]}"
        )

EDIT: Here is a full blown pandas solution which I think works very nicely

enter image description here

Most of the code is the same, but we let pandas filter our data.

from pathlib import Path
from datetime import datetime
from typing import Generator, Optional, Iterable
import pandas as pd

MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[tuple[str, datetime]]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return
    try:
        *name, day, month, year = split_file
        dt = datetime(day=int(day), month=int(month), year=int(year))
        return delimiter.join(name), dt
    except ValueError:
        return None


def info(files: Iterable[Path]) -> Optional[pd.DataFrame]:
    info_list = []
    for path in files:
        if (m := filename_date(path.stem)) is None:
            continue
        name, date = m
        file_info = (date, name, path.stat().st_size, str(path))
        info_list.append(file_info)
    if not info_list:
        return None
    data = pd.DataFrame(info_list)
    data.columns = ["Date", "Name", "Filesize", "Path"]
    data = data.sort_values(by=["Date", "Filesize"], ascending=False)
    return data.reset_index(drop=True)


def group_info_by_date(data: pd.DataFrame):
    group_by_dates = data.groupby(["Date"], as_index=False)
    df_by_dates = group_by_dates["Filesize"].sum()
    df_by_dates["Files"] = group_by_dates.size()["size"]
    df_by_dates = df_by_dates.sort_values(by=["Filesize", "Date"], ascending=False)
    return df_by_dates.reset_index(drop=True)


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")

    if (data := info(textfiles)) is not None:

        df_by_dates = group_info_by_date(data)
        print(df_by_dates)
        print()
        print(data[["Date", "Name", "Filesize"]])
\$\endgroup\$
5
  • \$\begingroup\$ for -> Generator[Path, None, None] you can also write -> Iterator[Path], which is a bit less specific, but looks simpler and is enough in most case \$\endgroup\$
    – njzk2
    Jan 29, 2022 at 20:43
  • \$\begingroup\$ I'm not yet convinced about using the morse operator at all \$\endgroup\$
    – njzk2
    Jan 29, 2022 at 20:44
  • \$\begingroup\$ the whole split + *name + delimiter.join can be simplifier by using split_file = filename.rsplit(delimiter, maxsplit=3) \$\endgroup\$
    – njzk2
    Jan 29, 2022 at 20:48
  • \$\begingroup\$ @njzk2 Morse operator? Otherwise I agree! \$\endgroup\$ Jan 29, 2022 at 21:06
  • \$\begingroup\$ I meant walrus, sorry \$\endgroup\$
    – njzk2
    Jan 29, 2022 at 21:50

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.

Not the answer you're looking for? Browse other questions tagged or ask your own question.