{"id":309,"date":"2022-04-20T15:25:45","date_gmt":"2022-04-20T08:25:45","guid":{"rendered":"http:\/\/thnkandgrow.com\/?p=309"},"modified":"2022-04-24T18:52:36","modified_gmt":"2022-04-24T11:52:36","slug":"find-something-in-csv-with-ruby","status":"publish","type":"post","link":"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/","title":{"rendered":"Counting occurrences of a user_name in a 13M-line CSV file"},"content":{"rendered":"\n
 Problem statement: I benchmarked a few approaches:<\/p>\n\n\n\n Result: <\/p>\n\n\n\n So reading the CSV file as a single string and scanning it with a regex is the fastest of the three. Note that the two Ruby timings exclude the time to load the file, while the grep timing includes it. When a count is needed, I read the value from the new file.<\/p>\n\n\n\n A worker is needed to refresh the new file every night or on some other schedule<\/p>\n\n\n\n Trigger a callback when a record is inserted into the db (this may be difficult if the data comes from a different database)<\/p>\n\n\n\n Risk: <\/p>\n\n\n\n Do you know a better approach?<\/p>\n\n\n\n <\/p>\n","protected":false},"excerpt":{"rendered":" Problem statement: Count how many times a user_name appears in the CSV file. The CSV file has just over 13 million rows of data (about 500 MB). I benchmarked a few approaches: Result: So reading the CSV file as a single string and scanning it with a regex is the fastest. Some risks: Many requests and the […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[60,58],"yoast_head":"\n
Count how many times a user_name<\/code> appears in the CSV file
The CSV file has just over 13 million rows of data (about 500 MB)<\/p>\n\n\n\n# CSV file\n\n\"user_name\",\"suspect_flag\"\naadilab,suspect_conversations_flag\naadilab,suspect_searches_flag<\/code><\/pre>\n\n\n\n
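require 'benchmark'\nrequire 'csv'\n\n# Parse the whole file into an array of rows, then filter on the first column.\ndef ben_csv\n  @csv = CSV.read(\"db\/suspect_workers_flag_super_fake.csv\")\n  Benchmark.bmbm do |x|\n    x.report(\"select array:\") {\n      p @csv.select { |name, _flag| name == \"zylafl\" }.count\n    }\n  end\nend\n\n# Read the file as one string and count regex matches.\n# The pattern is unanchored, so it also matches any name that ends in \"zylafl\".\ndef ben_file\n  @csv_str = File.read(\"db\/suspect_workers_flag_super_fake.csv\")\n  Benchmark.bmbm do |x|\n    x.report(\"string regex:\") {\n      p @csv_str.scan(\/zylafl,\/).count\n    }\n  end\nend\n\n# Shell out to grep. Unlike the two methods above, this timing includes reading the file.\ndef ben_shellscript\n  Benchmark.bmbm do |x|\n    x.report(\"shel script:\") {\n      p `grep -o 'zylafl,' db\/suspect_workers_flag_super_fake.csv | wc -l`\n    }\n  end\nend<\/code><\/pre>\n\n\n\n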
Rehearsal -------------------------------------------------\nstring regex: 3\n 0.273001 0.001638 0.274639 ( 0.275356)\n---------------------------------------- total: 0.274639sec\n\nRehearsal ------------------------------------------------\nshel script: \" 3\\n\"\n 3.452150 0.595941 13.451730 ( 9.518582)\n-------------------------------------- total: 13.451730sec\n\nRehearsal -------------------------------------------------\nselect array: 3\n 4.287471 0.338820 4.626291 ( 4.680256)\n---------------------------------------- total: 4.626291sec<\/code><\/pre>\n\n\n\n
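The benchmarked variants either load the whole file into memory or match the pattern anywhere in a line. As a minimal sketch of a further option, the file can be read line by line and only rows whose first column is exactly the given name counted. The helper name `count_user_rows` is hypothetical, the sketch assumes data rows are unquoted as in the sample above, and it was not part of the benchmark, so no timing is claimed for it.

```ruby
# Hypothetical helper (not from the original benchmark): count rows whose
# first column is exactly user_name, streaming the file line by line so the
# whole ~500 MB file is never held in memory at once.
# Assumption: data rows are unquoted, e.g. "aadilab,suspect_searches_flag".
def count_user_rows(path, user_name)
  prefix = "#{user_name},"
  # File.foreach without a block returns an Enumerator over the lines.
  File.foreach(path).count { |line| line.start_with?(prefix) }
end
```

Calling `count_user_rows("db/suspect_workers_flag_super_fake.csv", "zylafl")` would count only rows that start with `zylafl,`, so a name such as `xzylafl` is not counted by mistake.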
Some risks:<\/p>\n\n\n\nNot responding<\/code> on the server<\/li><\/ul>\n\n\n\n
Another solution from HaVS:
Create a new CSV file that summarizes how many times each user_name appears.<\/p>\n\n\n\n# CSV file\n\n\"user_name\",\"count\"\nabc,3\nabd,2\nabf,2<\/code><\/pre>\n\n\n\n
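The summary-file idea could be sketched roughly as follows: one pass over the source CSV tallies rows per `user_name`, the tallies are written to a two-column file, and lookups then read the small file instead of the large one. The helper names `build_user_counts`, `write_summary`, and `load_summary` are hypothetical, and the sketch assumes the source file has a `user_name` header as in the sample shown earlier.

```ruby
require 'csv'

# Hypothetical helper: stream the source CSV once and tally rows per user_name.
# Assumption: the file has a header row containing "user_name".
def build_user_counts(source_path)
  counts = Hash.new(0)
  CSV.foreach(source_path, headers: true) do |row|
    counts[row['user_name']] += 1
  end
  counts
end

# Hypothetical helper: write the tallies as a "user_name,count" summary CSV.
def write_summary(counts, summary_path)
  CSV.open(summary_path, 'w') do |csv|
    csv << %w[user_name count]
    counts.each { |name, count| csv << [name, count] }
  end
end

# Hypothetical helper: load the summary file into a Hash for fast lookups.
def load_summary(summary_path)
  CSV.foreach(summary_path, headers: true).each_with_object({}) do |row, result|
    result[row['user_name']] = row['count'].to_i
  end
end
```

A scheduled job could call `build_user_counts` and `write_summary` to refresh the file, and request handlers would only call `load_summary` (or keep its result cached), which avoids scanning the full source file on every request.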