Python 2.7, comparing 3 columns of a CSV

August 15, 2011

What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?

I am a total beginner and have only completed a few online courses. I have managed to use the CSV reader to do basic stats on a CSV file, but nothing that compares groups within each other.

The data set is as follows:

    group   sub-group   processed
    1       a           y
    1       a           y
    1       a           y
    1       b
    1       b
    1       b
    1       c           y
    1       c           y
    1       c
    2       d           y
    2       d           y
    2       d           y
    2       e           y
    2       e
    2       e
    2       f           y
    2       f           y
    2       f           y
    3       g
    3       g
    3       g
    3       h           y
    3       h
    3       h

Everything belongs to a group, and within each group there are sub-groups of 3 rows (replicates). As we work through the samples we add to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 are processed out of a potential 3.

I'm trying to work towards a statistic showing the % completeness of each group, with a sub-group counting as "complete" if it has at least 1 row processed (it doesn't have to have all 3).

I've managed to get halfway there, using the following:

    for row in reader:
        group, subgroup, processed = row  # assumed: unpack the three columns
        all_groups[group] = all_groups.get(group, 0) + 1
        if not processed == "":
            processed_groups[group] = processed_groups.get(group, 0) + 1

    result = {}
    for group in (processed_groups.viewkeys() | all_groups.keys()):
        if group in processed_groups: result.setdefault(group, []).append(processed_groups[group])
        if group in processed_groups: result.setdefault(group, []).append(all_groups[group])

    for group, v1 in result.items():
        todo = float(v1[0])
        done = float(v1[1])
        progress = round((100 / done * todo), 2)
        print group, "--", progress, "%"

The problem is that the above code counts rows, so it doesn't take into account the fact that sub-groups may not be totally processed. As a result, the statistic will never read 100% unless the processed column is complete.

What I get:

    group 1 -- 55.56%
    group 2 -- 77.78%
    group 3 -- 16.67%

What I want:

    group 1 -- 66.67%
    group 2 -- 100%
    group 3 -- 50%
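To make the gap between the two sets of numbers concrete, here is a small sketch of the arithmetic for group 1. The counts are read off the sample data (assuming the first sub-group's three rows are all marked y), and the variable names are purely illustrative.

```python
# Counts for group 1, read off the sample data in the question.
rows_total = 9          # 3 sub-groups x 3 replicate rows
rows_processed = 5      # rows marked 'y'
subgroups_total = 3
subgroups_started = 2   # sub-groups with at least one 'y' row

# Row-level counting: what the current code computes.
row_level = round(100.0 * rows_processed / rows_total, 2)

# Sub-group-level counting: the statistic actually wanted.
subgroup_level = round(100.0 * subgroups_started / subgroups_total, 2)

print(row_level)       # 55.56
print(subgroup_level)  # 66.67
```

The only difference is the unit being counted: rows in the first case, distinct sub-groups in the second.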

How do I make it so that it only looks to see whether the first row of each sub-group is complete, and just uses that, before continuing on to the next sub-group?

One way to do this is with a couple of defaultdicts of sets. The first keeps track of all of the subgroups seen; the second keeps track of the subgroups that have been processed. Using a set simplifies the code somewhat, as does using a defaultdict when compared to using a standard dictionary (although it's still possible).

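    import csv
    from collections import defaultdict

    subgroups = defaultdict(set)            # group -> every subgroup seen
    processed_subgroups = defaultdict(set)  # group -> subgroups with a processed row

    # Assumes data.csv is comma-separated and has no header row.
    with open('data.csv') as csvfile:
        for group, subgroup, processed in csv.reader(csvfile):
            subgroups[group].add(subgroup)
            if processed == 'y':
                processed_subgroups[group].add(subgroup)

    # Iterate over every group seen, so a group with nothing processed reports 0.00%.
    for group in sorted(subgroups):
        done = len(processed_subgroups[group])
        total = float(len(subgroups[group]))
        print("group {} -- {:.2f}%".format(group, done / total * 100))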

output

    group 1 -- 66.67%
    group 2 -- 100.00%
    group 3 -- 50.00%
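The answer mentions that a standard dictionary would also work. A minimal sketch of that variant follows, assuming the rows arrive as (group, subgroup, processed) tuples. The function name `completeness`, the dictionaries `seen` and `started`, and the in-memory `sample` list are illustrative and not part of the original answer.

```python
def completeness(rows):
    """Return {group: percent of sub-groups with at least one processed row}."""
    seen = {}     # group -> set of every sub-group encountered
    started = {}  # group -> set of sub-groups with at least one 'y' row
    for group, subgroup, processed in rows:
        # setdefault does by hand what defaultdict(set) does automatically.
        seen.setdefault(group, set()).add(subgroup)
        started.setdefault(group, set())
        if processed == 'y':
            started[group].add(subgroup)
    return dict(
        (group, 100.0 * len(started[group]) / len(seen[group]))
        for group in seen
    )


# Hypothetical in-memory copy of the question's sample data.
sample = (
    [('1', 'a', 'y')] * 3
    + [('1', 'b', '')] * 3
    + [('1', 'c', 'y'), ('1', 'c', 'y'), ('1', 'c', '')]
    + [('2', 'd', 'y')] * 3
    + [('2', 'e', 'y'), ('2', 'e', ''), ('2', 'e', '')]
    + [('2', 'f', 'y')] * 3
    + [('3', 'g', '')] * 3
    + [('3', 'h', 'y'), ('3', 'h', ''), ('3', 'h', '')]
)

for group, pct in sorted(completeness(sample).items()):
    print("group {} -- {:.2f}%".format(group, pct))
```

Passing `csv.reader(csvfile)` in place of `sample` should work the same way, as long as each row has exactly three fields and there is no header row.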
